# Load packages
library(tidyverse)
library(bupaverse)
library(patchwork)
library(viridis)
library(DT)
library(psmineR)
# Global settings
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE
)
theme_set(theme_minimal(base_family = 'serif'))
# Get data
nba_2024_g5_pm_data <- read_csv('./input/nba-final-2024-game-5.csv', col_names = TRUE) %>%
  mutate(
    resource_id = case_when(
      resource_id == 'Luka Doncic' ~ 'Luka_Dončić',
      resource_id == 'Kristaps Porzingis' ~ 'Kristaps_Porziņģis',
      TRUE ~ resource_id)
  )
Project Introduction 👋🏽
Inspired by Kröckel and Bodendorf (2020), this analysis will use R to explore process mining on a subject I am fond of - basketball. I have played basketball all my life, although my failing left knee means I play infrequently in my older age.
I will be analysing game 5 of the 2024 NBA final between Dončić's Dallas Mavericks and Tatum's Boston Celtics. Well done Celtics 🍀, who last won a championship 16 years ago. I will avoid findings which could just as easily be found in a game's summary statistics and focus on the unique insights process mining provides. Let us start with the data.
Data Treatment
The data collation was a manual effort, a labour of love. After watching game 5 of the 2024 NBA finals I painstakingly noted down each activity along with the timestamp.
To add assurance to this dataset I checked that each activity matched on timestamp against the play-by-play available on nba.com, and cross-checked the summary player stats against the nba.com box score.
A few self-imposed rules during data creation (a quick look at the resulting data follows the list):
- To define the process under analysis, one end-to-end run or 'case' ends with a shot
- For free throws, the case ends at the last free throw attempt
- Unless specified on the NBA website, I have timestamped the beginning of each activity
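For orientation, the snippet below inspects the collated working data. It is a minimal sketch and assumes only the columns referenced later in this post (case_id, seq, activity_id, resource_id, teams_play, player_team, quarter, timestamp).
# Hedged sketch: inspect the manually collated play-by-play data.
nba_2024_g5_pm_data %>%
  select(case_id, seq, activity_id, resource_id,
         teams_play, player_team, quarter, timestamp) %>%
  glimpse()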
Create Eventlog
An eventlog in process mining is a structured collection of recorded events that represent the execution of activities, usually within a business process; in this context, an NBA game.
It typically contains information such as the event type, timestamp, case identifier, and other relevant attributes, serving as the primary input for process mining techniques to discover, monitor, and improve actual processes.
Once an event log object has been created, the object can be used across multiple analyses.
nba_final_g5_eventlog <- nba_2024_g5_pm_data %>%
  mutate(lifecycle_id = 'complete',
         timestamp = ymd_hms(paste0('20240617 ', timestamp))) %>%
  eventlog(case_id = 'case_id',
           activity_id = 'activity_id',
           activity_instance_id = 'seq',
           lifecycle_id = 'lifecycle_id',
           timestamp = 'timestamp',
           resource_id = 'resource_id',
           order = 'seq',
           validate = TRUE)
Analyses
Control-Flow
Control flow in process mining refers to the sequence and decision points of activities within a process. This perspective explores the flow of the game.
Traces
Traces are the distinct activity sequences of an eventlog; in this context, a play which starts at the beginning of a quarter or after a shot attempt and runs until the next shot attempt.
Coverage
Reviewing the trace coverage relative to cases lets us see whether the same plays are used throughout or whether plays are variable. The plots display relative frequencies above 0.02 to remove the plot tail and make for better viewing.
log_trace <- nba_final_g5_eventlog %>%
  trace_coverage('trace') %>%
  filter(relative > 0.02) %>%
  plot() +
  theme(legend.position = 'top') +
  scale_fill_viridis('Relative Frequency',
                     option = 'D',
                     direction = -1) +
  scale_y_continuous(limits = c(0, 0.45),
                     breaks = seq(0, 0.45, 0.05)) +
  scale_x_continuous(breaks = seq(0, 1, 0.2)) +
  labs(title = 'Play variance') +
  ylab('')
united_log_trace <- nba_final_g5_eventlog %>%
  act_unite(
    pass = c(
      'pass-to-assist',
      'pass',
      'pass-turnover',
      'pass-to-start-quarter'
    ),
    dribble = c('dribble', 'dribble-turnover'),
    rebound = c('defensive-rebound', 'offensive-rebound'),
    shot = c('shot-miss', 'shot-make')
  ) %>%
  trace_coverage('trace') %>%
  filter(relative > 0.02) %>%
  plot() +
  theme(legend.position = 'top') +
  scale_fill_viridis('Relative Frequency',
                     option = 'D',
                     direction = -1) +
  scale_y_continuous(limits = c(0, 0.45),
                     breaks = seq(0, 0.45, 0.05)) +
  scale_x_continuous(breaks = seq(0, 1, 0.2)) +
  labs(title = 'Play variance with unified activities') +
  ylab('')
(log_trace + united_log_trace) +
  plot_layout(axes = 'collect')
As the left plot shows a maximum relative frequency of only 0.03, the plays of this game varied considerably. Even when grouping activities (e.g. dribble and dribble-turnover both set to dribble), the maximum relative frequency only reaches 0.06, as shown in the right plot.
With more detailed data around off-the-ball movements and data on many more games, perhaps you could use trace coverage analysis to (see the sketch after this list):
- Indicate whether systematic plays are used, shown by an increase in the relative frequency
- Compare across teams and games to see if this metric shares patterns with traditional metrics, e.g. win/loss, plus/minus, etc.
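As a starting point for that kind of comparison, trace coverage can already be computed per team on this single game by grouping the event log first. This is a minimal sketch, assuming the grouped edeaR metric facets cleanly and reusing the teams_play attribute from above.
# Hedged sketch: trace coverage split by the team making the play.
nba_final_g5_eventlog %>%
  group_by(teams_play) %>%
  trace_coverage('trace') %>%
  filter(relative > 0.02) %>%
  plot()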
Team Trace Length
Reviewing the trace lengths for each team, i.e. the number of activities used until a shot attempt.
shot_made_plot <- nba_final_g5_eventlog %>%
  filter_activity_presence('shot-make') %>%
  group_by(teams_play) %>%
  trace_length('log') %>%
  plot() +
  theme(strip.background = element_rect(fill = viridis::viridis(1, direction = 1))) +
  geom_jitter(alpha = 0.7, size = 2) +
  geom_boxplot(fill = viridis::viridis(1, alpha = 0.6, direction = -1),
               colour = '#000000') +
  stat_boxplot(geom = 'errorbar', linetype = 'dashed', width = 0.1) +
  scale_y_continuous(limits = c(0, 18)) +
  labs(title = 'Teams Traces by Shot Made')
shot_miss_plot <- nba_final_g5_eventlog %>%
  filter_activity_presence('shot-miss') %>%
  group_by(teams_play) %>%
  trace_length('log') %>%
  plot() +
  theme(strip.background = element_rect(fill = viridis::viridis(1, direction = 1))) +
  geom_jitter(alpha = 0.7, size = 2) +
  geom_boxplot(fill = viridis::viridis(1, alpha = 0.6, direction = -1),
               colour = '#000000') +
  stat_boxplot(geom = 'errorbar', linetype = 'dashed', width = 0.1) +
  scale_y_continuous(limits = c(0, 18)) +
  labs(title = 'Teams Traces by Shot Miss')
(shot_made_plot + shot_miss_plot) +
  plot_layout(axes = 'collect')
All distributions look similar except Dallas's for plays which end in a miss. Let's look into this further by splitting Dallas's traces by quarter.
nba_final_g5_eventlog %>%
  filter(teams_play == 'Dallas Mavericks') %>%
  filter_activity_presence('shot-miss') %>%
  group_by(quarter) %>%
  trace_length('log') %>%
  plot() +
  theme(strip.background = element_rect(fill = viridis::viridis(1, direction = 1))) +
  geom_jitter(alpha = 0.7, size = 2) +
  geom_boxplot(fill = viridis::viridis(1, alpha = 0.6, direction = -1),
               colour = '#000000') +
  stat_boxplot(geom = 'errorbar', linetype = 'dashed', width = 0.1) +
  scale_y_continuous(limits = c(0, 18)) +
  labs(title = 'Dallas Traces by Shot Miss & Quarter')
The 4th quarter has higher average and maximum trace lengths, at a time when Dallas needed to make a push.
There could be multiple reasons why; perhaps Boston played brilliant defence, forcing Dallas to pass more in order to shift the defence and make space.
Regardless, this ran down the clock, limiting the number of available opportunities to reduce the deficit.
Let us view that 4th quarter outlier process to see why.
longest_trace <- nba_final_g5_eventlog %>%
  filter(teams_play == 'Dallas Mavericks') %>%
  filter_activity_presence('shot-miss') %>%
  filter_trace_length(percentage = 0.01) %>%
  count(case_id, sort = TRUE) %>%
  head(1) %>%
  pull(case_id)
nba_final_g5_eventlog %>%
  filter_case(cases = longest_trace) %>%
  as.data.frame() %>%
  select(case_id, seq, activity_id, player_team, teams_play) %>%
  datatable(class = c('compact', 'hover', 'row-border'),
            rownames = FALSE,
            options = list(dom = 't'))
This outlier is explained by how a play is defined: a case only ends with a shot attempt. Boston held possession at the onset but turned the ball over to Dallas, and shortly after, Dallas made a shot attempt.
Precedences
Reviewing the activity process matrix allows for investigation into antecedent and consequent activities. Each activity row will sum to 100%.
nba_final_g5_eventlog %>%
  process_matrix(type = frequency('relative-antecedent')) %>%
  plot() +
  theme(legend.position = 'top') +
  scale_fill_viridis(option = 'D', name = 'Relative Antecedence', direction = -1)
Across the game, ~70% of all passes were followed by a dribble, whilst ~20% of the time a pass was followed by another pass. Additional data and further analysis could shed light on whether these figures typify finals or the playing styles of NBA teams.
~55% of the time a defensive rebound was followed by a dribble, and ~38% of the time by a pass. You're taught to grab the rebound and look for the outlet (at least in the United Kingdom), yet in this game the dribble was preferred. With player positions you might investigate whether a lack of outlet options was the reason, or whether the rebound was caught closer to the 3pt line than under the rim, so it made more sense to move the ball up the court for a fast break.
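The same matrix can be read from the other direction. processmapR's frequency() helper also accepts a 'relative-consequent' value, where percentages are computed relative to the consequent activity, i.e. for each activity, which activities most often preceded it. A minimal sketch, assuming that option behaves as documented:
# Hedged sketch: the complementary view, relative to the consequent activity.
nba_final_g5_eventlog %>%
  process_matrix(type = frequency('relative-consequent')) %>%
  plot() +
  theme(legend.position = 'top') +
  scale_fill_viridis(option = 'D', name = 'Relative Consequence', direction = -1)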
Performance
Performance analysis typically involves metrics such as time to complete activity, time in between each activity, and total time of process to understand the factors affecting process performance.
Throughput Time
This analysis is limited by the timestamps collected. If I had start and end times for each activity, more granular analysis could be performed. As it stands, only the total time across a process is available, also known as throughput time.
throughput_make_plot <- nba_final_g5_eventlog %>%
  group_by(teams_play) %>%
  filter_activity_presence('shot-make', method = 'none') %>%
  throughput_time('log', units = 'secs') %>%
  plot() +
  theme(strip.background = element_rect(fill = viridis::viridis(1, direction = 1))) +
  geom_jitter(alpha = 0.7, size = 2) +
  geom_boxplot(fill = viridis::viridis(1, alpha = 0.6, direction = -1),
               colour = '#000000') +
  stat_boxplot(geom = 'errorbar', linetype = 'dashed', width = 0.1) +
  scale_y_continuous(limits = c(-5, 45)) +
  labs(title = 'Teams Traces by Shot Make')
throughput_miss_plot <- nba_final_g5_eventlog %>%
  group_by(teams_play) %>%
  filter_activity_presence('shot-miss', method = 'none') %>%
  throughput_time('log', units = 'secs') %>%
  plot() +
  theme(strip.background = element_rect(fill = viridis::viridis(1, direction = 1))) +
  geom_jitter(alpha = 0.7, size = 2) +
  geom_boxplot(fill = viridis::viridis(1, alpha = 0.6, direction = -1),
               colour = '#000000') +
  stat_boxplot(geom = 'errorbar', linetype = 'dashed', width = 0.1) +
  scale_y_continuous(limits = c(-5, 45)) +
  labs(title = 'Teams Traces by Shot Miss')
(throughput_make_plot + throughput_miss_plot) +
  plot_layout(axes = 'collect')
Boston's interquartile range is wider than Dallas's: the middle 50% of Boston's plays are more varied in duration than the middle 50% of Dallas's.
For shots made, Dallas's interquartile range is half that of Boston's: the middle 50% of Dallas's plays were quick, perhaps rushed due to the score deficit.
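To put numbers on those interquartile ranges rather than reading them off the boxplots, the case-level throughput times can be summarised directly. This is a minimal sketch that assumes the case-level output keeps a throughput_time column and the teams_play grouping, as in current edeaR.
# Hedged sketch: median and IQR of play durations (in seconds) per team.
nba_final_g5_eventlog %>%
  group_by(teams_play) %>%
  throughput_time('case', units = 'secs') %>%
  as.data.frame() %>%
  group_by(teams_play) %>%
  summarise(median_secs = median(as.numeric(throughput_time)),
            iqr_secs    = IQR(as.numeric(throughput_time)))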
Process Map
Process mining isn’t complete without a process map. Split by team plays, you could dig for many insights looking across both maps.
Boston Celtics
nba_final_g5_eventlog %>%
  filter(teams_play == 'Boston Celtics') %>%
  process_map(
    type_nodes = frequency('absolute'),
    type_edges = performance(mean, 'secs'),
    rankdir = 'TB'
  )
Dallas Mavericks
nba_final_g5_eventlog %>%
  filter(teams_play == 'Dallas Mavericks') %>%
  process_map(
    type_nodes = frequency('absolute'),
    type_edges = performance(mean, 'secs'),
    rankdir = 'TB'
  )
A few noteworthy insights:
Boston was more industrious than Dallas, having completed more passes, dribbles, rebounds, blocks, etc.
Boston was 1 second quicker on average to make a shot attempt after a defensive rebound.
Both teams passed the ball with similar average times: for passes not leading to an assist, Boston took 1.2 seconds and Dallas 1.51 seconds; for passes leading to an assist, Boston took 2.31 seconds and Dallas 2.5 seconds. In both cases Boston were on average quicker and sharper in moving the ball (see the sketch below).
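Those edge timings can also be pulled out as a matrix rather than read off the maps, since process_matrix() accepts the same performance() type used for the maps above. A minimal sketch for Boston, under that assumption:
# Hedged sketch: mean time in seconds between consecutive activities for Boston.
nba_final_g5_eventlog %>%
  filter(teams_play == 'Boston Celtics') %>%
  process_matrix(type = performance(mean, 'secs')) %>%
  plot()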
Spectra
Referenced in @bupaverse_spectrum, Denisov, Fahland, and Aalst (2018) provide another avenue to analyse performance. What the authors term the 'performance spectrum', defined as 'a fully detailed data structure and visualization of all cases over all segments over time', provides a taxonomy of performance patterns.
nba_final_g5_eventlog %>%
  ps_detailed(segment_coverage = 0.15, classification = 'teams_play') +
  scale_colour_viridis_d(name = 'Team Plays') +
  theme(
    strip.background = element_rect(fill = viridis::viridis(1, direction = 1)),
    strip.text = element_text(colour = '#ffffff'),
    legend.position = "top"
  )
I have opted for 15% coverage of plays (or cases), as most segments would otherwise be sparse and beyond evaluating. In line with the taxonomy, this game generally shows two elementary patterns:
For dribble to pass and vice versa: the pattern follows a single segment with globally occurring instances, regular repetitions, and a continuous workload across the course of the event, the full 45 mins.
The rest of the pairwise segments had a similar taxonomy but with a sparse workload.
This is interesting analysis when applied to administrative data, as in Denisov, Fahland, and Aalst (2018), but it offers little when applied to this dataset. If this dataset included off-the-ball movement and/or multiple games, perhaps analysis of performance spectra would be more fruitful.
Organisational
Organisational analysis refers to the examination of the social and organisational aspects of a business process, such as the resources, responsibilities, and interactions of the individuals involved.
Resource Industry
Let us look at the top 10 most industrious players, i.e. the players with the highest frequency of activities. You would expect to see the star players at the top.
nba_final_g5_eventlog %>%
  resource_frequency(level = 'resource') %>%
  head(10) %>%
  plot() +
  theme(legend.position = 'none') +
  scale_fill_viridis('Frequency',
                     option = 'D',
                     direction = -1) +
  labs(title = 'Player Industriousness') +
  xlab('')
As you would expect, Tatum, Dončić, and Brown rank in the top 3. Irving was not as involved as Dallas would have liked. Boston hold 4 of the top 5 spots; they certainly brought the game to Dallas.
Specialisation
nba_final_g5_eventlog %>%
  group_by(player_team) %>%
  resource_frequency(level = 'resource-activity') %>%
  filter(!resource_id %in% c('Boston Celtics', 'Dallas Mavericks'),
         absolute > 1) %>%
  plot() +
  theme(legend.position = 'top') +
  scale_fill_viridis(option = 'D', name = 'Relative Antecedence', direction = -1) +
  ylab('')
One game doesn't make a specialist, but for the purposes of exploration let us see who was a one-game specialist, i.e. who excels at a particular type of activity. Note that I have removed any player-activity combinations that occurred only once, to reduce dimensions and make for easier viewing.
A few noteworthy insights (see the sketch after this list):
There are more Dallas players shown than Boston players: the specialisation of activities was spread wider for Dallas.
Dončić held the pass master title for Dallas and was certainly his team's passing specialist, whilst Boston shared that specialisation across 4 players.
Dončić also specialised in defensive rebounds for his team, whereas Boston shared this responsibility across 4 players.
Although there were more Dallas specialists, Dončić was the main specialist in passing and defensive rebounding compared to his team, whilst Boston often had several specialists in any given activity. This would have hindered Dallas when Dončić was subbed and benefited Boston during bench rotation.
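To quantify this with a single number per player, edeaR also offers a resource specialisation metric, i.e. the number of distinct activities each resource performs. A minimal sketch, assuming edeaR's resource_specialisation() (also exported as resource_specialization()):
# Hedged sketch: how many distinct activity types each player performed,
# a rough proxy for generalist (high) versus specialist (low) roles.
nba_final_g5_eventlog %>%
  group_by(player_team) %>%
  resource_specialisation(level = 'resource')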
Handover work
Handover analysis explores how work is handed off from one person or step to the next.
In the following resource maps, I will be looking at the 20% most frequent plays or traces.
Boston Celtics
nba_final_g5_eventlog %>%
  mutate(resource_id = if_else(
    resource_id == 'Jaylen Brown', 'J. Brown', resource_id)) %>%
  filter(teams_play == 'Boston Celtics') %>%
  filter_trace_frequency(percentage = 0.2) %>%
  resource_map()
Jayson Tatum was passed to most by Derrick White (9), but only by one more pass than the next pairing, Jrue Holiday passing to Derrick White (8). Handover of work is generally even across the most industrious players in Boston.
Dallas Mavericks
nba_final_g5_eventlog %>%
  filter(teams_play == 'Dallas Mavericks') %>%
  filter_trace_frequency(percentage = 0.2) %>%
  resource_map()
Compared to Boston, Dallas rotated the basketball less, the highest handover score between players being 3.
For at least the 20% most frequent plays, Boston moved the ball more (see the sketch after this list). This is almost always a good sign in basketball, as it often equates to:
- Improved player engagement
- Players better attuned to the game tempo
- Players warming up quicker, essential for a sport which has constant substitutions
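The handover counts behind these maps can also be pulled out as a table, assuming processmapR's resource_matrix() with the same frequency() helper. A minimal sketch under the same filters used for the Boston map:
# Hedged sketch: absolute handover-of-work counts between Boston players
# for the 20% most frequent traces shown in the map above.
nba_final_g5_eventlog %>%
  filter(teams_play == 'Boston Celtics') %>%
  filter_trace_frequency(percentage = 0.2) %>%
  resource_matrix(type = frequency('absolute'))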
Conclusion
Game
Game 5 of the 2024 NBA finals had high variance in plays, with Boston winning the game. Boston were more industrious; their star players were more effective, quicker at getting a shot up after defensive rebounds, faster at moving the ball, and better at rotating the ball than Dallas.
Dallas really were never in the game, being thwarted whenever they mounted an attack. Dallas were unable to get Irving into the game, relied too much on Dončić, and were second best.
Method
Process mining applied to basketball has shown clear and easily digestible insights into the behaviours of both teams. That being said, there is no open dataset well suited to this analysis available for public use; I had to heavily enrich traditional play-by-play data. This time- and resource-intensive approach, albeit as a personal project, means I am unlikely to gather more data despite the analysis benefitting from it.
Further analysis into evasion sports like basketball could look to include off-the-ball activity for process mining; in this vein, perhaps individual sports would prove easier, requiring less data to generate insight. Either way, automating data collection that is specific to, and open for, process mining would spearhead this type of analysis.
Specific to this case, it would have been interesting to include, as an activity, when players hold the ball without moving it. I suspect you would see the time the Celtics held the ball increase in quarter 4 as they attempted to run the clock.
Acknowledgements
r-bloggers.com for the reach, platform, and content
Packages and package maintainer(s):
- tidyverse | Hadley Wickham
- bupaverse | Gert Janssenswillen
- patchwork | Thomas Lin Pedersen
- viridis | Simon Garnier
- DT | Joe Cheng
- psmineR | Gert Janssenswillen