The goal of fcscrapR is to give R users quick access to the commentary for each soccer game available on ESPN. The commentary data includes basic events such as shot attempts, substitutions, fouls, cards, corners, and video reviews, along with information about the players involved. The data can be accessed in-game as ESPN updates its match commentary. This package was created to help put data into the hands of soccer fans so they can do their own analysis and contribute to reproducible metrics.
You can install fcscrapR from GitHub with:
# install.packages("devtools")
devtools::install_github("ryurko/fcscrapR")
Here’s an example of how to scrape a game using fcscrapR. The workhorse
function of the package is scrape_commentary(), which takes in a game id.
This game id is located in the url for a game, such as the group stage
match between Serbia and Costa Rica in the 2018 World Cup:
http://www.espn.com/soccer/commentary?gameId=498194
Using this game id, we can easily grab the commentary data frame:
library(fcscrapR)
#> Loading required package: magrittr
srb_crc_commentary <- scrape_commentary(498194)
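If you are starting from a copied url rather than a bare id, the id can be pulled out with a regular expression. The following is a minimal base R sketch; `extract_game_id()` is a hypothetical helper written for this example, not a function provided by fcscrapR, and it assumes the url carries the id in a `gameId=` query parameter as in the link above.

```r
# Hypothetical helper (not part of fcscrapR): pull the numeric game id
# out of an ESPN url of the form ...?gameId=<digits>
extract_game_id <- function(url) {
  # regexec() returns the full match followed by the captured digits
  match <- regmatches(url, regexec("gameId=([0-9]+)", url))[[1]]
  if (length(match) < 2) {
    stop("No gameId found in url: ", url)
  }
  as.integer(match[2])
}

game_id <- extract_game_id("http://www.espn.com/soccer/commentary?gameId=498194")
game_id
```

The resulting value should be usable wherever the examples here pass a game id by hand.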
Check out the documentation for scrape_commentary() for a description
of all of the columns in the commentary data:
colnames(srb_crc_commentary)
#> [1] "game_id" "commentary"
#> [3] "match_time" "team_one"
#> [5] "team_two" "team_one_score"
#> [7] "team_two_score" "half_end"
#> [9] "match_end" "half_begins"
#> [11] "shot_attempt" "penalty_shot"
#> [13] "shot_result" "shot_by_player"
#> [15] "shot_by_team" "shot_with"
#> [17] "shot_where" "net_location"
#> [19] "assist_by_player" "foul"
#> [21] "foul_by_player" "foul_by_team"
#> [23] "follow_set_piece" "assist_type"
#> [25] "follow_corner" "offside"
#> [27] "offside_team" "offside_player"
#> [29] "offside_pass_from" "shown_card"
#> [31] "card_type" "card_player"
#> [33] "card_team" "video_review"
#> [35] "video_review_event" "video_review_result"
#> [37] "delay_in_match" "delay_team"
#> [39] "free_kick_won" "free_kick_player"
#> [41] "free_kick_team" "free_kick_where"
#> [43] "corner" "corner_team"
#> [45] "corner_conceded_by" "substitution"
#> [47] "sub_injury" "sub_team"
#> [49] "sub_player" "replaced_player"
#> [51] "penalty" "team_drew_penalty"
#> [53] "player_drew_penalty" "player_conceded_penalty"
#> [55] "team_conceded_penalty" "half"
#> [57] "comment_id" "stoppage_time"
#> [59] "team_one_penalty_score" "team_two_penalty_score"
#> [61] "match_time_numeric"
We can quickly make a chart comparing each team's shot attempts, broken down by result:
# install.packages("ggplot2")
library(ggplot2)
# Keep only the commentary rows that record a shot, then count them by team
srb_crc_commentary %>%
  dplyr::filter(!is.na(shot_result)) %>%
  ggplot(aes(x = shot_by_team, fill = shot_result)) +
  geom_bar() +
  labs(x = "Team", y = "Count",
       fill = "Shot result",
       title = "Distribution of shot attempts for Costa Rica vs Serbia by result",
       caption = "Data from ESPN, accessed with fcscrapR") +
  scale_fill_manual(values = c("darkorange", "darkblue", "darkred", "darkcyan")) +
  theme_bw()
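The same breakdown can be produced as a table instead of a chart. This is a rough base R sketch, assuming only that the commentary data frame has the `shot_by_team` and `shot_result` columns listed above. `count_shots()` is a hypothetical helper, and the toy data frame uses made-up team names and result labels purely so the example runs without scraping anything.

```r
# Hypothetical helper (not part of fcscrapR): count shot attempts
# by team and result from a commentary data frame
count_shots <- function(commentary) {
  # Rows without a shot have NA in shot_result, so drop them first
  shots <- commentary[!is.na(commentary$shot_result), ]
  as.data.frame(
    table(team = shots$shot_by_team, result = shots$shot_result),
    responseName = "n"
  )
}

# Toy stand-in for scraped commentary; the values are placeholders,
# not real ESPN data
toy_commentary <- data.frame(
  shot_by_team = c("Team A", "Team A", "Team B", NA),
  shot_result = c("goal", "missed", "missed", NA),
  stringsAsFactors = FALSE
)

shot_counts <- count_shots(toy_commentary)
shot_counts
```

With real data, passing `srb_crc_commentary` in place of the toy data frame should give one row per team and result combination.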
The only function currently available to get game ids is
scrape_scoreboard_ids(), which pulls the game ids for all soccer
matches on ESPN’s soccer scoreboard given a league or tournament. You
must use a league or tournament that has an associated url in the
league_url_data table provided in fcscrapR:
# install.packages("pander")
league_url_data %>%
head() %>%
pander::pander()
name |
---|
show all leagues |
fifa world cup |
uefa champions league |
uefa europa league |
english premier league |
spanish primera división |
Here’s an example of grabbing the World Cup games from June 20th, 2018:
scrape_scoreboard_ids(scoreboard_name = "fifa world cup",
game_date = "2018-06-20") %>%
pander::pander()
#> Loading required package: XML
#> Loading required package: RCurl
#> Loading required package: bitops
game_id | team_one | team_two |
---|---|---|
498185 | Portugal | Morocco |
498184 | Uruguay | Saudi Arabia |
498183 | Iran | Spain |
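Once you have a set of game ids, a natural next step is to scrape each game and stack the results. The sketch below shows one way this might be done in base R. `scrape_many()` is a hypothetical helper, not part of fcscrapR, and `fake_scraper()` is a stand-in that lets the example run offline. It assumes every successful call returns a data frame with the same columns.

```r
# Hypothetical helper (not part of fcscrapR): apply a scraping function
# to several game ids and stack the results into one data frame
scrape_many <- function(game_ids, scraper, pause = 0) {
  results <- lapply(game_ids, function(id) {
    Sys.sleep(pause)  # optional pause between requests
    # Return NULL instead of stopping if a single game fails
    tryCatch(scraper(id), error = function(e) NULL)
  })
  # Drop failed games, then row-bind what is left
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Stand-in scraper so the sketch runs offline; it is NOT the real function
# and fails on purpose for one id to show the error handling
fake_scraper <- function(id) {
  if (id == 498184) stop("simulated failure")
  data.frame(game_id = id, commentary = "placeholder")
}

all_games <- scrape_many(c(498185, 498184, 498183), fake_scraper)
all_games
```

With the package loaded, the equivalent call would presumably pass the `game_id` column returned by scrape_scoreboard_ids() as `game_ids` and scrape_commentary as `scraper`, with a nonzero `pause` to avoid sending requests to ESPN too quickly.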
Many thanks to the sports analytics community on Twitter for guiding me to various resources of soccer data. Big thanks to Brendan Kent for pointing me to the commentary data.