Transform data from web pages into usable information
Automate the process
rvest + xml2: Easy Web Scraping

read_html gets the full set of HTML markup from a URL:

library(rvest)
url <- "https://www.nytimes.com/elections/2016/results/iowa"
html <- read_html(url)
html
## {html_document}
## <html lang="en" itemscope="" xmlns:og="//opengraphprotocol.org/schema/" itemtype="//schema.org/NewsArticle">
## [1] <head>\n<title>Iowa Election Results 2016 – The New Y ...
## [2] <body class="eln-general-state-results eln-state-iowa ...
Use html_attr, html_node, html_table, and html_text to extract useful information from the markup.

html_table extracts all tables from the sourced HTML into a list of data frames:
library(tidyverse)   # for purrr::map(), glimpse(), mutate(), parse_number(), ...

tables <- html %>% html_table(fill=TRUE)
tables %>% purrr::map(glimpse)
## Rows: 11## Columns: 7## $ Candidate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …## $ Candidate <chr> "Trump\n \n \n …## $ Party <chr> "Republican\n Rep.", "Democrat\…## $ Votes <chr> "800,983", "653,669", "59,186", "19,992"…## $ Pct. <chr> "51.1%", "41.7%", "3.8%", "1.3%", "0.8%"…## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …## $ E.V. <chr> "6", "—", "—", "—", "—", "—", "—", "—", …## Rows: 99## Columns: 3## $ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson…## $ Trump <chr> "93,492", "48,390", "39,149", "21…## $ Clinton <chr> "119,804", "58,935", "40,440", "5…## Rows: 6## Columns: 6## $ Candidate <lgl> NA, NA, NA, NA, NA, NA## $ Candidate <chr> "Grassley*\n \n \n…## $ Party <chr> "Republican\n Rep.", "Democrat\…## $ Votes <chr> "926,007", "549,460", "41,794", "17,649"…## $ Pct. <chr> "60.2%", "35.7%", "2.7%", "1.1%", "0.3%"…## $ `` <lgl> NA, NA, NA, NA, NA, NA## Rows: 99## Columns: 3## $ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson…## $ Grassley <chr> "118,164", "62,737", "46,415", "2…## $ Judge <chr> "100,317", "47,635", "34,503", "4…## Rows: 4## Columns: 5## $ `District\n Dist.` <int> 1, 2, 3, 4## $ Leader <chr> "54%Blum*\n Rep."…## $ `` <chr> "46%Vernon\n Dem.…## $ Rpt. <chr> "100%", "100%", "100%"…## $ `` <lgl> NA, NA, NA, NA## Rows: 25## Columns: 5## $ `Seat\n Seat` <int> 2, 4, 6, 8, 10, 12, 14, 16,…## $ Leader <chr> "0%Feenstra*\n Rep.", …## $ `` <chr> "Uncontested", "39%Bangert\…## $ Rpt. <chr> "", "100%", "100%", "100%",…## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA,…## Rows: 100## Columns: 5## $ `District\n Dist.` <int> 1, 2, 3, 4, 5, 6, 7, 8…## $ Leader <chr> "0%Wills*\n Rep."…## $ `` <chr> "Uncontested", "Uncont…## $ Rpt. <chr> "", "", "100%", "100%"…## $ `` <lgl> NA, NA, NA, NA, NA, NA…## Rows: 3## Columns: 5## $ Question <chr> "Retain Brent Appel", "Retain Daryl Hecht…## $ Yes <chr> "64%Yes", "64%Yes", "65%Yes"## $ No <chr> "36%No", "36%No", "35%No"## $ Rpt. <chr> "100%", "100%", "100%"## $ `` <lgl> NA, NA, NA
## [[1]]## # A tibble: 11 × 7## Candidate Candidate Party Votes Pct. `` E.V. ## <lgl> <chr> <chr> <chr> <chr> <lgl> <chr>## 1 NA "Trump\n … "Rep… 800,… 51.1% NA 6 ## 2 NA "Clinton\n … "Dem… 653,… 41.7% NA — ## 3 NA "Johnson\n … "Lib… 59,1… 3.8% NA — ## 4 NA "Others\n … "Ind… 19,9… 1.3% NA — ## 5 NA "McMullin\n … "Pet… 12,3… 0.8% NA — ## 6 NA "Stein\n … "Gre… 11,4… 0.7% NA — ## 7 NA "Castle\n … "Con… 5,335 0.3% NA — ## 8 NA "Kahn\n … "Ind… 2,247 0.1% NA — ## 9 NA "De La Fuente\n … "Pet… 451 0.0% NA — ## 10 NA "La Riva\n … "P.S… 323 0.0% NA — ## 11 NA "Others\n … "" 32,2… 2.1% NA — ## ## [[2]]## # A tibble: 99 × 3## `Vote by county` Trump Clinton## <chr> <chr> <chr> ## 1 Polk 93,492 119,804## 2 Linn 48,390 58,935 ## 3 Scott 39,149 40,440 ## 4 Johnson 21,044 50,200 ## 5 Black Hawk 27,476 32,233 ## 6 Story 19,458 25,709 ## 7 Dubuque 23,460 22,850 ## 8 Woodbury 24,727 16,210 ## 9 Pottawattamie 24,447 15,355 ## 10 Dallas 19,339 15,701 ## # ℹ 89 more rows## ## [[3]]## # A tibble: 6 × 6## Candidate Candidate Party Votes Pct. `` ## <lgl> <chr> <chr> <chr> <chr> <lgl>## 1 NA "Grassley*\n … "Rep… 926,… 60.2% NA ## 2 NA "Judge\n \n … "Dem… 549,… 35.7% NA ## 3 NA "Aldrich\n \… "Lib… 41,7… 2.7% NA ## 4 NA "Hennager\n … "Ind… 17,6… 1.1% NA ## 5 NA "Luick-Thrams\n … "Pet… 4,441 0.3% NA ## 6 NA "Others\n \n… "" 22,0… 1.4% NA ## ## [[4]]## # A tibble: 99 × 3## `Vote by county` Grassley Judge ## <chr> <chr> <chr> ## 1 Polk 118,164 100,317## 2 Linn 62,737 47,635 ## 3 Scott 46,415 34,503 ## 4 Johnson 28,914 42,699 ## 5 Black Hawk 33,884 27,245 ## 6 Story 25,475 21,472 ## 7 Dubuque 27,348 19,291 ## 8 Woodbury 27,166 13,909 ## 9 Pottawattamie 25,721 12,943 ## 10 Dallas 24,374 11,876 ## # ℹ 89 more rows## ## [[5]]## # A tibble: 4 × 5## `District\n Dist.` Leader `` Rpt. `` ## <int> <chr> <chr> <chr> <lgl>## 1 1 "54%Blum*\n… "46%… 100% NA ## 2 2 "54%Loebsac… "46%… 100% NA ## 3 3 "54%Young*\… "40%… 100% NA ## 4 4 "61%King*\n… "39%… 100% NA ## ## [[6]]## # A tibble: 25 × 5## `Seat\n Seat` Leader `` Rpt. `` ## <int> <chr> <chr> <chr> <lgl>## 1 2 "0%Feenstra*\n … "Unc… "" NA ## 2 4 "61%Guth*\n … "39%… "100… NA ## 3 6 "83%Segebart*\n… "17%… "100… NA ## 4 8 "54%Dawson\n … "46%… "100… NA ## 5 10 "67%Chapman*\n … "33%… "100… NA ## 6 12 "78%Costello*\n… "22%… "100… NA ## 7 14 "74%Sinclair*\n… "26%… "100… NA ## 8 16 "60%Boulton\n … "35%… "100… NA ## 9 18 "0%Petersen*\n … "Unc… "" NA ## 10 20 "60%Zaun*\n … "41%… "100… NA ## # ℹ 15 more rows## ## [[7]]## # A tibble: 100 × 5## `District\n Dist.` Leader `` Rpt. `` ## <int> <chr> <chr> <chr> <lgl>## 1 1 "0%Wills*\… "Unc… "" NA ## 2 2 "0%Jones*\… "Unc… "" NA ## 3 3 "81%Husema… "19%… "100… NA ## 4 4 "63%Wheele… "37%… "100… NA ## 5 5 "77%Holz*\… "23%… "100… NA ## 6 6 "66%Carlin… "35%… "100… NA ## 7 7 "63%Gassma… "37%… "100… NA ## 8 8 "68%Baxter… "32%… "100… NA ## 9 9 "57%Miller… "43%… "100… NA ## 10 10 "0%Sexton*… "Unc… "" NA ## # ℹ 90 more rows## ## [[8]]## # A tibble: 3 × 5## Question Yes No Rpt. `` ## <chr> <chr> <chr> <chr> <lgl>## 1 Retain Brent Appel 64%Yes 36%No 100% NA ## 2 Retain Daryl Hecht 64%Yes 36%No 100% NA ## 3 Retain Mark Cady 65%Yes 35%No 100% NA
Most tables need a bit of clean-up:
ia_results <- tables[[2]] %>% mutate(
  Trump = parse_number(Trump),
  Clinton = parse_number(Clinton)
)
ia_results
## # A tibble: 99 × 3## `Vote by county` Trump Clinton## <chr> <dbl> <dbl>## 1 Polk 93492 119804## 2 Linn 48390 58935## 3 Scott 39149 40440## 4 Johnson 21044 50200## 5 Black Hawk 27476 32233## 6 Story 19458 25709## 7 Dubuque 23460 22850## 8 Woodbury 24727 16210## 9 Pottawattamie 24447 15355## 10 Dallas 19339 15701## # ℹ 89 more rows
Connect to the website of the NY Times election results at https://www.nytimes.com/elections/2016/results/president
Pick your favorite state (by clicking on the state on the map)
Use rvest to download the election results for the 2016 Presidential Election for all counties of your state.
Clean up the data (numbers should be numbers).
Convert the numbers into row-wise percentages (decide on either %Republican or %Democrat).
How well does the web-scraping work over time? Results for the 2020 Election are available from https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html
Again, pick your favorite state and apply your previous code.
How many things have changed? Which format do you prefer? Why?
ia2020 <- "https://www.nytimes.com/interactive/2020/11/03/us/elections/results-iowa.html"
html <- read_html(ia2020)
tables <- html %>% html_table(fill=TRUE)
tables %>% purrr::map(glimpse)
## Rows: 12## Columns: 6## $ X1 <chr> "", "", "", "", "", "", "", "", "", "", "", "Vi…## $ X2 <chr> "Trump* Donald J. Trump* Winner", "Biden …## $ X3 <chr> "Republican Rep.", "Democrat Dem.", "Libertaria…## $ X4 <chr> "897,672", "759,061", "19,637", "4,337", "3,210…## $ X5 <chr> "53.1%", "44.9%", "1.2%", "0.3%", "0.2%", "0.2%…## $ X6 <chr> "6", "—", "—", "—", "—", "—", "—", "—", "—", "—…## Rows: 100## Columns: 6## $ County <chr> "Allamakee", "Polk", "Linn",…## $ Margin <chr> "Trump +29", "Biden +15", "B…## $ `2016 margin` <chr> "R+24.2", "D+11.4", "D+9", "…## $ `Est. votes reported` <chr> "99%", "100%", "100%", "100%…## $ `Total votes` <chr> "7,422", "258,755", "127,458…## $ Absentee <chr> "4,816", "140,547", "81,691"…## Rows: 12## Columns: 5## $ Candidate <chr> "", "", "", "", "", ""…## $ Candidate <chr> "Biden", "Trump", "Jor…## $ `Absentee/early votesVotes` <chr> "568,294", "410,030", …## $ Pct. <chr> "57.2%", "41.3%", "0.9…## $ `` <chr> "", "", "", "", "", ""…## Rows: 5## Columns: 5## $ X1 <lgl> NA, NA, NA, NA, NA## $ X2 <chr> "Ernst* Joni Ernst* Winner", "Greenfield …## $ X3 <chr> "Republican Rep.", "Democrat Dem.", "Libertaria…## $ X4 <chr> "864,997", "754,859", "36,961", "13,800", "1,67…## $ X5 <chr> "51.8%", "45.2%", "2.2%", "0.8%", ""## Rows: 9## Columns: 2## $ X1 <chr> "1st District ›", "Representative Abby Finkenau…## $ X2 <chr> "WinnerHinson +3", "Representative Abby Finkena…## Rows: 4## Columns: 6## $ `District Dist.` <int> 1, 2, 3, 4## $ Margin <chr> "R+3", "R <0.01", "D+1.4", "R+24"## $ Candidates <chr> "Hinson Rep. Winner", "Miller-Mee…## $ `` <chr> "Finkenauer* Dem.", "Hart Dem.", …## $ `Est. rpt.` <chr> "100%", "100%", "100%", "100%"## $ `` <chr> "›", "›", "›", "›"## Rows: 3## Columns: 2## $ `Constitutional Convention Question: Explore Proposals for Changes to Constitution` <chr> …## $ `Constitutional Convention Question: Explore Proposals for Changes to Constitution` <chr> …## Rows: 26## Columns: 3## $ `District Dist.` <chr> "2", "4", "6", "8", "10", "12", "…## $ Candidates <chr> "Taylor* Rep. 0% Winner", "Guth* …## $ `` <chr> "Uncontested", "Uncontested", "Pe…## Rows: 101## Columns: 3## $ `District Dist.` <chr> "1", "2", "3", "4", "5", "6", "7"…## $ Candidates <chr> "Wills* Rep. 0% Winner", "Jones* …## $ `` <chr> "Uncontested", "Uncontested", "Un…## Rows: 5## Columns: 3## $ Question <chr> "Retain Bower", "Retain May", "Retain Sch…## $ `` <chr> "Yes 73% Winner", "Yes 73% Winner", "Yes …## $ `` <chr> "No 27%", "No 27%", "No 24%", "No 26%", "…## Rows: 5## Columns: 3## $ `Race Race` <chr> "Retain Christensen", "Retain Mansfiel…## $ `` <chr> "Yes 73% Winner", "Yes 69% Winner", "Y…## $ `` <chr> "No 27%", "No 31%", "No 29%", "No 30%"…
## [[1]]## # A tibble: 12 × 6## X1 X2 X3 X4 X5 X6 ## <chr> <chr> <chr> <chr> <chr> <chr>## 1 "" Trum… "Rep… 897,… "53.… "6" ## 2 "" Bide… "Dem… 759,… "44.… "—" ## 3 "" Jorg… "Lib… 19,6… "1.2… "—" ## 4 "" Writ… "" 4,337 "0.3… "—" ## 5 "" West… "Ind… 3,210 "0.2… "—" ## 6 "" Hawk… "Gre… 3,075 "0.2… "—" ## 7 "" Blan… "Con… 1,707 "0.1… "—" ## 8 "" De L… "Oth… 1,082 "0.1… "—" ## 9 "" King… "Ind… 546 "<0.… "—" ## 10 "" Pier… "Ind… 544 "<0.… "—" ## 11 "" Tota… "" 1,69… "" "" ## 12 "View all candidates Colla… View… "Vie… View… "Vie… "Vie…## ## [[2]]## # A tibble: 100 × 6## County Margin `2016 margin` `Est. votes reported`## <chr> <chr> <chr> <chr> ## 1 Allamakee Trump … R+24.2 99% ## 2 Polk Biden … D+11.4 100% ## 3 Linn Biden … D+9 100% ## 4 Scott Biden … D+1.5 100% ## 5 Johnson Biden … D+37.9 100% ## 6 Black Hawk Biden … D+7.4 100% ## 7 Dallas Trump … R+9.5 100% ## 8 Dubuque Trump … R+1.2 100% ## 9 Story Biden … D+12.3 100% ## 10 Pottawattamie Trump … R+21.3 100% ## # ℹ 90 more rows## # ℹ 2 more variables: `Total votes` <chr>, Absentee <chr>## ## [[3]]## # A tibble: 12 × 5## Candidate Candidate Absentee/early votes…¹ Pct. `` ## <chr> <chr> <chr> <chr> <chr>## 1 "" Biden 568,294 "57.… "" ## 2 "" Trump 410,030 "41.… "" ## 3 "" Jorgensen 8,463 "0.9… "" ## 4 "" Hawkins 1,713 "0.2… "" ## 5 "" West 1,405 "0.1… "" ## 6 "" Blankens… 1,062 "0.1… "" ## 7 "" De La Fu… 587 "0.1… "" ## 8 "" Write-ins 522 "0.1… "" ## 9 "" King 341 "<0.… "" ## 10 "" Pierce 305 "<0.… "" ## 11 "" Total re… 992,722 "" "" ## 12 "View all c… View all… View all candidates C… "Vie… "Vie…## # ℹ abbreviated name: ¹`Absentee/early votesVotes`## ## [[4]]## # A tibble: 5 × 5## X1 X2 X3 X4 X5 ## <lgl> <chr> <chr> <chr> <chr>## 1 NA Ernst* Joni Ernst* Winner "Republ… 864,… "51.…## 2 NA Greenfield Theresa Greenfield "Democr… 754,… "45.…## 3 NA Stewart Rick Stewart "Libert… 36,9… "2.2…## 4 NA Herzog Suzanne Herzog "Indepe… 13,8… "0.8…## 5 NA Total reported Total reported "" 1,67… "" ## ## [[5]]## # A tibble: 9 × 2## X1 X2 ## <chr> <chr>## 1 1st District › Winn…## 2 Representative Abby Finkenauer, a Democrat, flipped… Repr…## 3 100% reported 100%…## 4 2nd District › Winn…## 5 Democrats were hoping to hang on to the seat being … Demo…## 6 100% reported 100%…## 7 4th District › Winn…## 8 J.D. Scholten, a former minor league pitcher turned… J.D.…## 9 100% reported 100%…## ## [[6]]## # A tibble: 4 × 6## `District Dist.` Margin Candidates `` `Est. rpt.` `` ## <int> <chr> <chr> <chr> <chr> <chr>## 1 1 R+3 Hinson Re… Fink… 100% › ## 2 2 R <0.… Miller-Me… Hart… 100% › ## 3 3 D+1.4 Axne* Dem… Youn… 100% › ## 4 4 R+24 Feenstra … Scho… 100% › ## ## [[7]]## # A tibble: 3 × 2## Constitutional Convention Questio…¹ Constitutional Conve…²## <chr> <chr> ## 1 "" "" ## 2 "Yes 30%" "No 70% Winner" ## 3 "1,381,676 votes reported" "1,381,676 votes repo…## # ℹ abbreviated names:## # ¹`Constitutional Convention Question: Explore Proposals for Changes to Constitution`,## # ²`Constitutional Convention Question: Explore Proposals for Changes to Constitution`## ## [[8]]## # A tibble: 26 × 3## `District Dist.` Candidates `` ## <chr> <chr> <chr> ## 1 2 Taylor* Rep. 0% Winner Uncontested ## 2 4 Guth* Rep. 0% Winner Uncontested ## 3 6 Williams Rep. 69% Winner Petersen Dem.…## 4 8 Dawson* Rep. 52% Winner Gorman Dem. 4…## 5 10 Chapman* Rep. 63% Winner Varley Dem. 3…## 6 12 Costello* Rep. 69% Winner Norris Dem. 3…## 7 14 Sinclair* Rep. 0% Winner Uncontested ## 8 16 Boulton* Dem. 77% Winner Johnson Lib. …## 9 18 Petersen* Dem. 0% Winner Uncontested ## 10 20 Zaun* Rep. 
51% Winner Martin Dem. 4…## # ℹ 16 more rows## ## [[9]]## # A tibble: 101 × 3## `District Dist.` Candidates `` ## <chr> <chr> <chr> ## 1 1 Wills* Rep. 0% Winner Uncontested ## 2 2 Jones* Rep. 0% Winner Uncontested ## 3 3 Bush Rep. 0% Winner Uncontested ## 4 4 Wheeler* Rep. 82% Winner Johnson Dem. 1…## 5 5 Jeneary* Rep. 0% Winner Uncontested ## 6 6 Bossman* Rep. 0% Winner Uncontested ## 7 7 Stone Rep. 62% Winner Jensen Dem. 38%## 8 8 Baxter* Rep. 75% Winner Kiss Dem. 25% ## 9 9 Meyer* Rep. 57% Winner Clayton Dem. 4…## 10 10 Sexton* Rep. 76% Winner Schultes Dem. …## # ℹ 91 more rows## ## [[10]]## # A tibble: 5 × 3## Question `` `` ## <chr> <chr> <chr> ## 1 Retain Bower Yes 73% Winner No 27%## 2 Retain May Yes 73% Winner No 27%## 3 Retain Schumacher Yes 76% Winner No 24%## 4 Retain Soorholtz Greer Yes 74% Winner No 26%## 5 + View all – Collapse all + View all – Collapse all + Vie…## ## [[11]]## # A tibble: 5 × 3## `Race Race` `` `` ## <chr> <chr> <chr> ## 1 Retain Christensen Yes 73% Winner No 27%## 2 Retain Mansfield Yes 69% Winner No 31%## 3 Retain McDonald Yes 71% Winner No 29%## 4 Retain Waterman Yes 70% Winner No 30%## 5 + View all – Collapse all + View all – Collapse all + Vie…
ia_results_2020 <- tables[[2]][,-4] %>% mutate(
  `Total votes` = parse_number(`Total votes`),
  Absentee = parse_number(Absentee),
  Winner = gsub("(.*) +.*", "\\1", Margin),
  `Margin 2020` = gsub("(.*) (+.*)", "\\2", Margin),
  `Margin 2020` = ifelse(Winner=="Biden", paste0("D", `Margin 2020`), paste0("R", `Margin 2020`))
)
## Warning: There were 2 warnings in `mutate()`.## The first warning was:## ℹ In argument: `Total votes = parse_number(`Total votes`)`.## Caused by warning:## ! 1 parsing failure.## row col expected actual## 100 -- a number View all Collapse## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining## warning.
ia_results_2020
## # A tibble: 100 × 7## County Margin `2016 margin` `Total votes` Absentee Winner## <chr> <chr> <chr> <dbl> <dbl> <chr> ## 1 Allam… Trump… R+24.2 7422 4816 Trump ## 2 Polk Biden… D+11.4 258755 140547 Biden ## 3 Linn Biden… D+9 127458 81691 Biden ## 4 Scott Biden… D+1.5 92599 63662 Biden ## 5 Johns… Biden… D+37.9 83851 60517 Biden ## 6 Black… Biden… D+7.4 66593 40870 Biden ## 7 Dallas Trump… R+9.5 56022 34152 Trump ## 8 Dubuq… Trump… R+1.2 53926 34224 Trump ## 9 Story Biden… D+12.3 51038 30224 Biden ## 10 Potta… Trump… R+21.3 45744 27443 Trump ## # ℹ 90 more rows## # ℹ 1 more variable: `Margin 2020` <chr>
Can you get the data you just scraped onto a map of the counties? Try to map the percentage you calculated to the fill color of the county polygons.
You will need to join the percentages with the county polygons:
counties <- map_data("county")   # map_data() is from ggplot2 and requires the maps package
head(counties)
## long lat group order region subregion## 1 -86.50517 32.34920 1 1 alabama autauga## 2 -86.53382 32.35493 1 2 alabama autauga## 3 -86.54527 32.36639 1 3 alabama autauga## 4 -86.55673 32.37785 1 4 alabama autauga## 5 -86.57966 32.38357 1 5 alabama autauga## 6 -86.59111 32.37785 1 6 alabama autauga
ia_results <- ia_results %>% mutate(
  PercDem = Clinton/(Trump + Clinton)*100,
  subregion = tolower(`Vote by county`)
)

# Which counties don't work?
anti_join(ia_results, counties %>% filter(region=="iowa"), by="subregion")
## # A tibble: 1 × 5## `Vote by county` Trump Clinton PercDem subregion## <chr> <dbl> <dbl> <dbl> <chr> ## 1 O'Brien 5752 1315 18.6 o'brien
anti_join(counties %>% filter(region=="iowa"), ia_results, by="subregion")
## long lat group order region subregion## 1 -95.86156 43.26404 825 26075 iowa obrien## 2 -95.39174 43.26404 825 26076 iowa obrien## 3 -95.39747 42.91454 825 26077 iowa obrien## 4 -95.85583 42.90881 825 26078 iowa obrien## 5 -95.86156 43.26404 825 26079 iowa obrien
ia_results <- ia_results %>% mutate(
  subregion = replace(subregion, subregion=="o'brien", "obrien")
)
ia_full <- left_join(ia_results, counties %>% filter(region=="iowa"), by="subregion")
ia_full %>% ggplot(aes(x = long, y = lat, group = subregion)) +
  geom_polygon(aes(fill = PercDem), color = "black") +
  scale_fill_gradient2(midpoint = 50, mid = "white") +
  ggthemes::theme_map()
Sometimes data on the web is not structured as nicely. Suppose, for example, that we want to get a list of all recently active baseball players from Baseball Reference.
SelectorGadget is a JavaScript bookmarklet for determining the CSS selectors of the pieces of a website we want to extract.
Bookmark the SelectorGadget link, then click on it to use it (or add the Chrome extension).
When SelectorGadget is active, pieces of the website are highlighted in orange/green/red.
Use SelectorGadget on http://www.baseball-reference.com/players/ .
Read more details in vignette("selectorgadget") (or on the rvest website).
If you prefer, you can also read the HTML code and create your own CSS or XPath selectors (an XPath sketch follows the next example).
Select all active baseball players with a last name starting with 'a'
url <- "http://www.baseball-reference.com/players/a/"html <- read_html(url)html %>% html_elements("b") %>% html_text()
## [1] "Fernando Abad (2010-2021)" ## [2] "Cory Abbott (2021-2022)" ## [3] "CJ Abrams (2022-2023)" ## [4] "Albert Abreu (2020-2023)" ## [5] "Bryan Abreu (2019-2023)" ## [6] "José Abreu (2014-2023)" ## [7] "Domingo Acevedo (2021-2023)" ## [8] "Ronald Acuna Jr. (2018-2023)" ## [9] "Jason Adam (2018-2023)" ## [10] "Willy Adames (2018-2023)" ## [11] "Austin Adams (2017-2022)" ## [12] "Matt Adams (2012-2021)" ## [13] "Riley Adams (2021-2022)" ## [14] "Jo Adell (2020-2022)" ## [15] "Joan Adon (2021-2022)" ## [16] "Ehire Adrianza (2013-2022)" ## [17] "Jesús Aguilar (2014-2023)" ## [18] "Ryan Aguilar (2022-2022)" ## [19] "Nick Ahmed (2014-2023)" ## [20] "Keegan Akin (2020-2023)" ## [21] "Hanser Alberto (2015-2023)" ## [22] "Ozzie Albies (2017-2023)" ## [23] "Jorge Alcala (2019-2023)" ## [24] "Sandy Alcantara (2017-2023)" ## [25] "Sergio Alcántara (2020-2022)" ## [26] "Jason Alexander (2022-2022)" ## [27] "Scott Alexander (2015-2023)" ## [28] "Tyler Alexander (2019-2023)" ## [29] "A.J. Alexy (2021-2022)" ## [30] "Jorge Alfaro (2016-2022)" ## [31] "Kolby Allard (2018-2022)" ## [32] "Cam Alldred (2022-2022)" ## [33] "Austin Allen (2019-2022)" ## [34] "Greg Allen (2017-2022)" ## [35] "Logan Allen (2019-2022)" ## [36] "Nick Allen (2022-2023)" ## [37] "Nick Allgeyer (2021-2021)" ## [38] "Abraham Almonte (2013-2022)" ## [39] "Yency Almonte (2018-2023)" ## [40] "Pete Alonso (2019-2023)" ## [41] "Dan Altavilla (2016-2021)" ## [42] "Jose Altuve (2011-2022)" ## [43] "José Alvarado (2017-2023)" ## [44] "Eddy Alvarez (2020-2022)" ## [45] "Francisco Álvarez (2022-2022)"## [46] "Jose Alvarez (2013-2022)" ## [47] "Yordan Alvarez (2019-2023)" ## [48] "Adbert Alzolay (2019-2023)" ## [49] "Brian Anderson (2017-2023)" ## [50] "Chase Anderson (2014-2022)" ## [51] "Ian Anderson (2020-2022)" ## [52] "Nick Anderson (2019-2023)" ## [53] "Tim Anderson (2016-2023)" ## [54] "Tyler Anderson (2016-2023)" ## [55] "Matt Andriese (2015-2021)" ## [56] "Elvis Andrus (2009-2023)" ## [57] "Miguel Andujar (2017-2022)" ## [58] "Tejay Antone (2020-2021)" ## [59] "Jonathan Aranda (2022-2022)" ## [60] "Víctor Arano (2017-2022)" ## [61] "Jonathan Araúz (2020-2022)" ## [62] "Francisco Arcia (2018-2018)" ## [63] "Orlando Arcia (2016-2023)" ## [64] "Nolan Arenado (2013-2023)" ## [65] "Gabriel Arias (2022-2023)" ## [66] "Shawn Armstrong (2015-2022)" ## [67] "Randy Arozarena (2019-2023)" ## [68] "Luis Arraez (2019-2023)" ## [69] "Christian Arroyo (2017-2023)" ## [70] "Aaron Ashby (2021-2022)" ## [71] "Graham Ashcraft (2022-2023)" ## [72] "Javier Assad (2022-2023)" ## [73] "Pedro Avila (2019-2022)" ## [74] "José Azocar (2022-2023)"
We are, in fact, not just interested in the names of the players, but also in the links to each player's page.
html_attr lets us access a single attribute of an HTML node.
html_attrs extracts all attributes of an HTML node (a short html_attrs sketch follows the example below).
html %>% html_elements("b a") %>% html_attr(name="href")
## [1] "/players/a/abadfe01.shtml" ## [2] "/players/a/abbotco01.shtml"## [3] "/players/a/abramcj01.shtml"## [4] "/players/a/abreual01.shtml"## [5] "/players/a/abreubr01.shtml"## [6] "/players/a/abreujo02.shtml"## [7] "/players/a/acevedo01.shtml"## [8] "/players/a/acunaro01.shtml"## [9] "/players/a/adamja01.shtml" ## [10] "/players/a/adamewi01.shtml"## [11] "/players/a/adamsau02.shtml"## [12] "/players/a/adamsma01.shtml"## [13] "/players/a/adamsri03.shtml"## [14] "/players/a/adelljo01.shtml"## [15] "/players/a/adonjo01.shtml" ## [16] "/players/a/adriaeh01.shtml"## [17] "/players/a/aguilje01.shtml"## [18] "/players/a/aguilry01.shtml"## [19] "/players/a/ahmedni01.shtml"## [20] "/players/a/akinke01.shtml" ## [21] "/players/a/alberha01.shtml"## [22] "/players/a/albieoz01.shtml"## [23] "/players/a/alcaljo01.shtml"## [24] "/players/a/alcansa01.shtml"## [25] "/players/a/alcanse01.shtml"## [26] "/players/a/alexaja01.shtml"## [27] "/players/a/alexasc02.shtml"## [28] "/players/a/alexaty01.shtml"## [29] "/players/a/alexyaj01.shtml"## [30] "/players/a/alfarjo01.shtml"## [31] "/players/a/allarko01.shtml"## [32] "/players/a/alldrca01.shtml"## [33] "/players/a/allenau01.shtml"## [34] "/players/a/allengr01.shtml"## [35] "/players/a/allenlo01.shtml"## [36] "/players/a/allenni02.shtml"## [37] "/players/a/allgeni01.shtml"## [38] "/players/a/almonab01.shtml"## [39] "/players/a/almonye01.shtml"## [40] "/players/a/alonspe01.shtml"## [41] "/players/a/altavda01.shtml"## [42] "/players/a/altuvjo01.shtml"## [43] "/players/a/alvarjo03.shtml"## [44] "/players/a/alvared01.shtml"## [45] "/players/a/alvarfr01.shtml"## [46] "/players/a/alvarjo02.shtml"## [47] "/players/a/alvaryo01.shtml"## [48] "/players/a/alzolad01.shtml"## [49] "/players/a/anderbr06.shtml"## [50] "/players/a/anderch01.shtml"## [51] "/players/a/anderia01.shtml"## [52] "/players/a/anderni01.shtml"## [53] "/players/a/anderti01.shtml"## [54] "/players/a/anderty01.shtml"## [55] "/players/a/andrima01.shtml"## [56] "/players/a/andruel01.shtml"## [57] "/players/a/andujmi01.shtml"## [58] "/players/a/antonte01.shtml"## [59] "/players/a/arandjo01.shtml"## [60] "/players/a/aranovi01.shtml"## [61] "/players/a/arauzjo01.shtml"## [62] "/players/a/arciafr01.shtml"## [63] "/players/a/arciaor01.shtml"## [64] "/players/a/arenano01.shtml"## [65] "/players/a/ariasga01.shtml"## [66] "/players/a/armstsh01.shtml"## [67] "/players/a/arozara01.shtml"## [68] "/players/a/arraelu01.shtml"## [69] "/players/a/arroych01.shtml"## [70] "/players/a/ashbyaa01.shtml"## [71] "/players/a/ashcrgr01.shtml"## [72] "/players/a/assadja01.shtml"## [73] "/players/a/avilape01.shtml"## [74] "/players/a/azocajo01.shtml"
Use the SelectorGadget on the website for Fernando Abad
Find the CSS selector to extract his career statistics and load them into your R session.
Does the same code work to extract career statistics for (some of) the other active players?
What other information do we need, and how can we get to it?
url <- "https://www.baseball-reference.com/players/a/abadfe01.shtml"html <- read_html(url)html %>% html_elements(".stats_pullout") %>% html_text()
## [1] "\n\nSUMMARY\n\nCareer\n\n\n\nWAR\n3.1\n\n\nW\n8\n\n\nL\n29\n\n\nERA\n3.77\n\n\nG\n400\n\n\nGS\n6\n\n\nSV\n2\n\n\nIP\n348.1\n\n\nSO\n290\n\n\nWHIP\n1.306\n\n\n\n"
html %>% html_elements(".stats_pullout .poptip") %>% html_text()
## [1] "WAR" "W" "L" "ERA" "G" "GS" "SV" "IP" ## [9] "SO" "WHIP"
html %>% html_elements(".p3 p , .p2 p, .p1 p, .stats_pullout strong") %>% html_text()
## [1] "SUMMARY" "Career" "WAR" "3.1" "W" ## [6] "8" "L" "29" "ERA" "3.77" ## [11] "G" "400" "GS" "6" "SV" ## [16] "2" "IP" "348.1" "SO" "290" ## [21] "WHIP" "1.306"
html %>% html_elements(".p1") %>% html_text()
## [1] "\n\nWAR\n3.1\n\n\nW\n8\n\n\nL\n29\n\n\nERA\n3.77\n"
It's sometimes easier (for data munging after extracting) to extract multiple pieces rather than everything in one go.
(stats <- html %>% html_elements("span strong") %>% html_text())
## [1] "SUMMARY" "WAR" "W" "L" "ERA" ## [6] "G" "GS" "SV" "IP" "SO" ## [11] "WHIP"
(season <- html %>% html_elements(".stats_pullout p:nth-child(2)") %>% html_text())
## [1] "Career" "3.1" "8" "29" "3.77" "400" ## [7] "6" "2" "348.1" "290" "1.306"
(career <- html %>% html_elements(".stats_pullout p:nth-child(3)") %>% html_text())
## character(0)
Clean up the code for extracting the career statistics and write a function getStats that takes a URL and returns a dataset (or tibble).
Does the same code work to extract career statistics for (some of) the other active players?
getStats <- function(url) {
  html <- read_html(url)
  stats <- html %>% html_nodes("span strong") %>% html_text()
  season <- html %>% html_nodes(".stats_pullout p:nth-child(2)") %>% html_text()
  career <- html %>% html_nodes(".stats_pullout p:nth-child(3)") %>% html_text()
  data.frame(
    Statistics = stats[-1],
    Season = parse_number(season[-1]),
    Career = ifelse(is.null(career), NA, parse_number(career[-1]))
  )
}

url <- "http://www.baseball-reference.com/players/a/"
html <- read_html(url)
players <- html %>% html_nodes("strong a") %>% html_text()
links <- html %>% html_nodes("strong a") %>% html_attr(name="href")

getStats("https://www.baseball-reference.com//players/a/abreujo02.shtml") %>% str()
## Warning: 1 parsing failure.## row col expected actual## 1 -- a number -
## 'data.frame': 12 obs. of 3 variables:## $ Statistics: chr "WAR" "AB" "H" "HR" ...## $ Season : num NA 21 8 0 0.381 1 3 0 0.458 0.429 ...## ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame)## .. ..$ row : int 1## .. ..$ col : int NA## .. ..$ expected: chr "a number"## .. ..$ actual : chr "-"## $ Career : num 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 ...
getStats(file.path("https://www.baseball-reference.com", links[2])) %>% str()
## Warning: 1 parsing failure.## row col expected actual## 1 -- a number -
## 'data.frame': 12 obs. of 3 variables:## $ Statistics: chr "WAR" "AB" "H" "HR" ...## $ Season : num NA 21 8 0 0.381 1 3 0 0.458 0.429 ...## ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame)## .. ..$ row : int 1## .. ..$ col : int NA## .. ..$ expected: chr "a number"## .. ..$ actual : chr "-"## $ Career : num 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 ...
# getStats(file.path("https://www.baseball-reference.com", links[3])) %>% str()  # gives an error
getStats(file.path("https://www.baseball-reference.com", links[4])) %>% str()
## 'data.frame': 12 obs. of 3 variables:## $ Statistics: chr "WAR" "AB" "H" "HR" ...## $ Season : num 4.9 2421 624 118 0.258 ...## $ Career : num NA NA NA NA NA NA NA NA NA NA ...
Now apply this to the other players (with last names starting with 'a'). First we get everything tidied up in a dataset:
url <- "http://www.baseball-reference.com/players/a/"html <- read_html(url)players <- html %>% html_nodes("strong a") %>% html_text()links <- html %>% html_nodes("strong a") %>% html_attr(name="href")bb <- tibble(players=players, links = links)head(bb)
## # A tibble: 6 × 2## players links ## <chr> <chr> ## 1 Fernando Abad /players/a/abadfe01.shtml ## 2 José Abreu /players/a/abreujo02.shtml## 3 Ronald Acuna Jr. /players/a/acunaro01.shtml## 4 Matt Adams /players/a/adamsma01.shtml## 5 Ehire Adrianza /players/a/adriaeh01.shtml## 6 Jesús Aguilar /players/a/aguilje01.shtml
bb_head <- bb[1:5,] %>% mutate(
  data = links %>% purrr::map(.f = function(link) {
    cat(link)
    cat("\n")
    getStats(file.path("https://www.baseball-reference.com", link))
  })
)
## /players/a/abadfe01.shtml## /players/a/abreujo02.shtml## /players/a/acunaro01.shtml## /players/a/adamsma01.shtml## /players/a/adriaeh01.shtml
## Warning: There were 2 warnings in `mutate()`.## The first warning was:## ℹ In argument: `data = `%>%`(...)`.## Caused by warning:## ! 1 parsing failure.## row col expected actual## 1 -- a number -## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining## warning.
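The per-player statistics now sit in a list-column. One possible next step (a sketch, not part of the original slides) is to unnest them into a single long dataset:

# flatten the list-column: one row per player and statistic
bb_head %>% tidyr::unnest(data) %>% head()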
Why does getStats(file.path("https://www.baseball-reference.com", "/players/a/adlemti01.shtml")) have NAs in the Career column?
Is this fixable?
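One possible explanation (an assumption, not stated in the slides): for players without a current season, the stats pullout shows only one column of numbers, labelled "Career", so ".stats_pullout p:nth-child(3)" matches nothing; is.null(character(0)) is FALSE, and the ifelse() then recycles a single NA across all rows. A hedged sketch of a more defensive variant, getStats2 (a hypothetical name):

getStats2 <- function(url) {
  html   <- read_html(url)
  stats  <- html %>% html_elements("span strong") %>% html_text()
  season <- html %>% html_elements(".stats_pullout p:nth-child(2)") %>% html_text()
  career <- html %>% html_elements(".stats_pullout p:nth-child(3)") %>% html_text()
  res <- data.frame(
    Statistics = stats[-1],
    Season     = parse_number(season[-1], na = c("", "NA", "-")),  # treat "-" as missing
    Career     = NA_real_
  )
  if (length(career) > 0) {
    res$Career <- parse_number(career[-1], na = c("", "NA", "-"))
  } else if (season[1] == "Career") {
    # only one column is present and it is labelled "Career": move the values over
    res$Career <- res$Season
    res$Season <- NA_real_
  }
  res
}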
How do we get all of the active players? (not just the ones with last names starting with 'a')
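One possible approach (a sketch, not from the slides) is to loop over the 26 letter index pages and collect the bolded, i.e. active, players from each:

# letters is the built-in vector "a".."z"; consider pausing between requests,
# e.g. with Sys.sleep(2), to be polite to the server
all_players <- purrr::map_dfr(letters, function(letter) {
  url  <- paste0("http://www.baseball-reference.com/players/", letter, "/")
  html <- read_html(url)
  tibble(
    letter  = letter,
    players = html %>% html_elements("b a") %>% html_text(),
    links   = html %>% html_elements("b a") %>% html_attr("href")
  )
})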
rvest

The session suite of commands lets us simulate an HTML session without a browser:

Create a session with session(url).
Navigate to a new URL with session_jump_to().
Follow a link with session_follow_link().
Navigate back and forward with session_back() and session_forward().
... and extract the pieces you are interested in using read_html, html_element, and html_elements (see the sketch below).
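A minimal sketch of such a session (the player link text is an assumption about the page):

s <- session("http://www.baseball-reference.com/players/a/")
s <- s %>% session_follow_link("Fernando Abad")   # follow the link by its text
s %>% read_html() %>% html_elements(".stats_pullout .poptip") %>% html_text()
s <- s %>% session_back()                         # back to the player index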
Transform data from web pages into usable information
Automate the process