+ - 0:00:00
Notes for current slide
Notes for next slide

Stat 585 - Web Scraping

Heike Hofmann and Susan Vanderplas

1 / 24

Web Scraping

  • Transform data from web pages into usable information

  • Automate the process

2 / 24

rvest + xml2: Easy Web Scraping

  • read_html gets the full set of HTML markup from a URL
library(rvest)
url <- "https://www.nytimes.com/elections/2016/results/iowa"
html <- read_html(url)
html
## {html_document}
## <html lang="en" itemscope="" xmlns:og="//opengraphprotocol.org/schema/" itemtype="//schema.org/NewsArticle">
## [1] <head>\n<title>Iowa Election Results 2016 – The New Y ...
## [2] <body class="eln-general-state-results eln-state-iowa ...
  • Use html_attr, html_node, html_table, and html_text to extract useful information from the markup
3 / 24

Get a table from an online source

html_table extracts all tables from the sourced html into a list of data frames:

tables <- html %>% html_table(fill=TRUE)
tables %>% purrr::map(glimpse)
## Rows: 11
## Columns: 7
## $ Candidate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Candidate <chr> "Trump\n \n \n …
## $ Party <chr> "Republican\n Rep.", "Democrat\…
## $ Votes <chr> "800,983", "653,669", "59,186", "19,992"…
## $ Pct. <chr> "51.1%", "41.7%", "3.8%", "1.3%", "0.8%"…
## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ E.V. <chr> "6", "—", "—", "—", "—", "—", "—", "—", …
## Rows: 99
## Columns: 3
## $ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson…
## $ Trump <chr> "93,492", "48,390", "39,149", "21…
## $ Clinton <chr> "119,804", "58,935", "40,440", "5…
## Rows: 6
## Columns: 6
## $ Candidate <lgl> NA, NA, NA, NA, NA, NA
## $ Candidate <chr> "Grassley*\n \n \n…
## $ Party <chr> "Republican\n Rep.", "Democrat\…
## $ Votes <chr> "926,007", "549,460", "41,794", "17,649"…
## $ Pct. <chr> "60.2%", "35.7%", "2.7%", "1.1%", "0.3%"…
## $ `` <lgl> NA, NA, NA, NA, NA, NA
## Rows: 99
## Columns: 3
## $ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson…
## $ Grassley <chr> "118,164", "62,737", "46,415", "2…
## $ Judge <chr> "100,317", "47,635", "34,503", "4…
## Rows: 4
## Columns: 5
## $ `District\n Dist.` <int> 1, 2, 3, 4
## $ Leader <chr> "54%Blum*\n Rep."…
## $ `` <chr> "46%Vernon\n Dem.…
## $ Rpt. <chr> "100%", "100%", "100%"…
## $ `` <lgl> NA, NA, NA, NA
## Rows: 25
## Columns: 5
## $ `Seat\n Seat` <int> 2, 4, 6, 8, 10, 12, 14, 16,…
## $ Leader <chr> "0%Feenstra*\n Rep.", …
## $ `` <chr> "Uncontested", "39%Bangert\…
## $ Rpt. <chr> "", "100%", "100%", "100%",…
## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA,…
## Rows: 100
## Columns: 5
## $ `District\n Dist.` <int> 1, 2, 3, 4, 5, 6, 7, 8…
## $ Leader <chr> "0%Wills*\n Rep."…
## $ `` <chr> "Uncontested", "Uncont…
## $ Rpt. <chr> "", "", "100%", "100%"…
## $ `` <lgl> NA, NA, NA, NA, NA, NA…
## Rows: 3
## Columns: 5
## $ Question <chr> "Retain Brent Appel", "Retain Daryl Hecht…
## $ Yes <chr> "64%Yes", "64%Yes", "65%Yes"
## $ No <chr> "36%No", "36%No", "35%No"
## $ Rpt. <chr> "100%", "100%", "100%"
## $ `` <lgl> NA, NA, NA
## [[1]]
## # A tibble: 11 × 7
## Candidate Candidate Party Votes Pct. `` E.V.
## <lgl> <chr> <chr> <chr> <chr> <lgl> <chr>
## 1 NA "Trump\n … "Rep… 800,… 51.1% NA 6
## 2 NA "Clinton\n … "Dem… 653,… 41.7% NA —
## 3 NA "Johnson\n … "Lib… 59,1… 3.8% NA —
## 4 NA "Others\n … "Ind… 19,9… 1.3% NA —
## 5 NA "McMullin\n … "Pet… 12,3… 0.8% NA —
## 6 NA "Stein\n … "Gre… 11,4… 0.7% NA —
## 7 NA "Castle\n … "Con… 5,335 0.3% NA —
## 8 NA "Kahn\n … "Ind… 2,247 0.1% NA —
## 9 NA "De La Fuente\n … "Pet… 451 0.0% NA —
## 10 NA "La Riva\n … "P.S… 323 0.0% NA —
## 11 NA "Others\n … "" 32,2… 2.1% NA —
##
## [[2]]
## # A tibble: 99 × 3
## `Vote by county` Trump Clinton
## <chr> <chr> <chr>
## 1 Polk 93,492 119,804
## 2 Linn 48,390 58,935
## 3 Scott 39,149 40,440
## 4 Johnson 21,044 50,200
## 5 Black Hawk 27,476 32,233
## 6 Story 19,458 25,709
## 7 Dubuque 23,460 22,850
## 8 Woodbury 24,727 16,210
## 9 Pottawattamie 24,447 15,355
## 10 Dallas 19,339 15,701
## # ℹ 89 more rows
##
## [[3]]
## # A tibble: 6 × 6
## Candidate Candidate Party Votes Pct. ``
## <lgl> <chr> <chr> <chr> <chr> <lgl>
## 1 NA "Grassley*\n … "Rep… 926,… 60.2% NA
## 2 NA "Judge\n \n … "Dem… 549,… 35.7% NA
## 3 NA "Aldrich\n \… "Lib… 41,7… 2.7% NA
## 4 NA "Hennager\n … "Ind… 17,6… 1.1% NA
## 5 NA "Luick-Thrams\n … "Pet… 4,441 0.3% NA
## 6 NA "Others\n \n… "" 22,0… 1.4% NA
##
## [[4]]
## # A tibble: 99 × 3
## `Vote by county` Grassley Judge
## <chr> <chr> <chr>
## 1 Polk 118,164 100,317
## 2 Linn 62,737 47,635
## 3 Scott 46,415 34,503
## 4 Johnson 28,914 42,699
## 5 Black Hawk 33,884 27,245
## 6 Story 25,475 21,472
## 7 Dubuque 27,348 19,291
## 8 Woodbury 27,166 13,909
## 9 Pottawattamie 25,721 12,943
## 10 Dallas 24,374 11,876
## # ℹ 89 more rows
##
## [[5]]
## # A tibble: 4 × 5
## `District\n Dist.` Leader `` Rpt. ``
## <int> <chr> <chr> <chr> <lgl>
## 1 1 "54%Blum*\n… "46%… 100% NA
## 2 2 "54%Loebsac… "46%… 100% NA
## 3 3 "54%Young*\… "40%… 100% NA
## 4 4 "61%King*\n… "39%… 100% NA
##
## [[6]]
## # A tibble: 25 × 5
## `Seat\n Seat` Leader `` Rpt. ``
## <int> <chr> <chr> <chr> <lgl>
## 1 2 "0%Feenstra*\n … "Unc… "" NA
## 2 4 "61%Guth*\n … "39%… "100… NA
## 3 6 "83%Segebart*\n… "17%… "100… NA
## 4 8 "54%Dawson\n … "46%… "100… NA
## 5 10 "67%Chapman*\n … "33%… "100… NA
## 6 12 "78%Costello*\n… "22%… "100… NA
## 7 14 "74%Sinclair*\n… "26%… "100… NA
## 8 16 "60%Boulton\n … "35%… "100… NA
## 9 18 "0%Petersen*\n … "Unc… "" NA
## 10 20 "60%Zaun*\n … "41%… "100… NA
## # ℹ 15 more rows
##
## [[7]]
## # A tibble: 100 × 5
## `District\n Dist.` Leader `` Rpt. ``
## <int> <chr> <chr> <chr> <lgl>
## 1 1 "0%Wills*\… "Unc… "" NA
## 2 2 "0%Jones*\… "Unc… "" NA
## 3 3 "81%Husema… "19%… "100… NA
## 4 4 "63%Wheele… "37%… "100… NA
## 5 5 "77%Holz*\… "23%… "100… NA
## 6 6 "66%Carlin… "35%… "100… NA
## 7 7 "63%Gassma… "37%… "100… NA
## 8 8 "68%Baxter… "32%… "100… NA
## 9 9 "57%Miller… "43%… "100… NA
## 10 10 "0%Sexton*… "Unc… "" NA
## # ℹ 90 more rows
##
## [[8]]
## # A tibble: 3 × 5
## Question Yes No Rpt. ``
## <chr> <chr> <chr> <chr> <lgl>
## 1 Retain Brent Appel 64%Yes 36%No 100% NA
## 2 Retain Daryl Hecht 64%Yes 36%No 100% NA
## 3 Retain Mark Cady 65%Yes 35%No 100% NA
4 / 24

Data Munging

Most tables need a bit of clean-up:

ia_results <- tables[[2]] %>% mutate(
Trump = parse_number(Trump),
Clinton = parse_number(Clinton)
)
ia_results
## # A tibble: 99 × 3
## `Vote by county` Trump Clinton
## <chr> <dbl> <dbl>
## 1 Polk 93492 119804
## 2 Linn 48390 58935
## 3 Scott 39149 40440
## 4 Johnson 21044 50200
## 5 Black Hawk 27476 32233
## 6 Story 19458 25709
## 7 Dubuque 23460 22850
## 8 Woodbury 24727 16210
## 9 Pottawattamie 24447 15355
## 10 Dallas 19339 15701
## # ℹ 89 more rows
5 / 24

Your Turn

Connect to the website of the NY Times election results at https://www.nytimes.com/elections/2016/results/president

  • Pick your favorite state (by clicking on the state on the map)

  • Use rvest to download the election results for the 2016 Presidential Election for all counties of your state.

  • Clean up the data (numbers should be numbers).

  • Convert the numbers into row-wise percentages (decide on either %Republican or %Democrat).

6 / 24

Your Turn

How well does the web-scraping work over time? Results for the 2020 Election are available from https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html

Again, pick your favorite state and apply your previous code.

How many things have changed? Which format do you prefer? Why?

7 / 24

Your Turn - Solutions

ia2020 <- "https://www.nytimes.com/interactive/2020/11/03/us/elections/results-iowa.html"
html <- read_html(ia2020)
tables <- html %>% html_table(fill=TRUE)
tables %>% purrr::map(glimpse)
## Rows: 12
## Columns: 6
## $ X1 <chr> "", "", "", "", "", "", "", "", "", "", "", "Vi…
## $ X2 <chr> "Trump* Donald J. Trump* Winner", "Biden …
## $ X3 <chr> "Republican Rep.", "Democrat Dem.", "Libertaria…
## $ X4 <chr> "897,672", "759,061", "19,637", "4,337", "3,210…
## $ X5 <chr> "53.1%", "44.9%", "1.2%", "0.3%", "0.2%", "0.2%…
## $ X6 <chr> "6", "—", "—", "—", "—", "—", "—", "—", "—", "—…
## Rows: 100
## Columns: 6
## $ County <chr> "Allamakee", "Polk", "Linn",…
## $ Margin <chr> "Trump +29", "Biden +15", "B…
## $ `2016 margin` <chr> "R+24.2", "D+11.4", "D+9", "…
## $ `Est. votes reported` <chr> "99%", "100%", "100%", "100%…
## $ `Total votes` <chr> "7,422", "258,755", "127,458…
## $ Absentee <chr> "4,816", "140,547", "81,691"…
## Rows: 12
## Columns: 5
## $ Candidate <chr> "", "", "", "", "", ""…
## $ Candidate <chr> "Biden", "Trump", "Jor…
## $ `Absentee/early votesVotes` <chr> "568,294", "410,030", …
## $ Pct. <chr> "57.2%", "41.3%", "0.9…
## $ `` <chr> "", "", "", "", "", ""…
## Rows: 5
## Columns: 5
## $ X1 <lgl> NA, NA, NA, NA, NA
## $ X2 <chr> "Ernst* Joni Ernst* Winner", "Greenfield …
## $ X3 <chr> "Republican Rep.", "Democrat Dem.", "Libertaria…
## $ X4 <chr> "864,997", "754,859", "36,961", "13,800", "1,67…
## $ X5 <chr> "51.8%", "45.2%", "2.2%", "0.8%", ""
## Rows: 9
## Columns: 2
## $ X1 <chr> "1st District&nbsp;›", "Representative Abby Finkenau…
## $ X2 <chr> "WinnerHinson +3", "Representative Abby Finkena…
## Rows: 4
## Columns: 6
## $ `District Dist.` <int> 1, 2, 3, 4
## $ Margin <chr> "R+3", "R <0.01", "D+1.4", "R+24"
## $ Candidates <chr> "Hinson Rep. Winner", "Miller-Mee…
## $ `` <chr> "Finkenauer* Dem.", "Hart Dem.", …
## $ `Est. rpt.` <chr> "100%", "100%", "100%", "100%"
## $ `` <chr> "›", "›", "›", "›"
## Rows: 3
## Columns: 2
## $ `Constitutional Convention Question: Explore Proposals for Changes to Constitution` <chr> …
## $ `Constitutional Convention Question: Explore Proposals for Changes to Constitution` <chr> …
## Rows: 26
## Columns: 3
## $ `District Dist.` <chr> "2", "4", "6", "8", "10", "12", "…
## $ Candidates <chr> "Taylor* Rep. 0% Winner", "Guth* …
## $ `` <chr> "Uncontested", "Uncontested", "Pe…
## Rows: 101
## Columns: 3
## $ `District Dist.` <chr> "1", "2", "3", "4", "5", "6", "7"…
## $ Candidates <chr> "Wills* Rep. 0% Winner", "Jones* …
## $ `` <chr> "Uncontested", "Uncontested", "Un…
## Rows: 5
## Columns: 3
## $ Question <chr> "Retain Bower", "Retain May", "Retain Sch…
## $ `` <chr> "Yes 73% Winner", "Yes 73% Winner", "Yes …
## $ `` <chr> "No 27%", "No 27%", "No 24%", "No 26%", "…
## Rows: 5
## Columns: 3
## $ `Race Race` <chr> "Retain Christensen", "Retain Mansfiel…
## $ `` <chr> "Yes 73% Winner", "Yes 69% Winner", "Y…
## $ `` <chr> "No 27%", "No 31%", "No 29%", "No 30%"…
## [[1]]
## # A tibble: 12 × 6
## X1 X2 X3 X4 X5 X6
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "" Trum… "Rep… 897,… "53.… "6"
## 2 "" Bide… "Dem… 759,… "44.… "—"
## 3 "" Jorg… "Lib… 19,6… "1.2… "—"
## 4 "" Writ… "" 4,337 "0.3… "—"
## 5 "" West… "Ind… 3,210 "0.2… "—"
## 6 "" Hawk… "Gre… 3,075 "0.2… "—"
## 7 "" Blan… "Con… 1,707 "0.1… "—"
## 8 "" De L… "Oth… 1,082 "0.1… "—"
## 9 "" King… "Ind… 546 "<0.… "—"
## 10 "" Pier… "Ind… 544 "<0.… "—"
## 11 "" Tota… "" 1,69… "" ""
## 12 "View all candidates Colla… View… "Vie… View… "Vie… "Vie…
##
## [[2]]
## # A tibble: 100 × 6
## County Margin `2016 margin` `Est. votes reported`
## <chr> <chr> <chr> <chr>
## 1 Allamakee Trump … R+24.2 99%
## 2 Polk Biden … D+11.4 100%
## 3 Linn Biden … D+9 100%
## 4 Scott Biden … D+1.5 100%
## 5 Johnson Biden … D+37.9 100%
## 6 Black Hawk Biden … D+7.4 100%
## 7 Dallas Trump … R+9.5 100%
## 8 Dubuque Trump … R+1.2 100%
## 9 Story Biden … D+12.3 100%
## 10 Pottawattamie Trump … R+21.3 100%
## # ℹ 90 more rows
## # ℹ 2 more variables: `Total votes` <chr>, Absentee <chr>
##
## [[3]]
## # A tibble: 12 × 5
## Candidate Candidate Absentee/early votes…¹ Pct. ``
## <chr> <chr> <chr> <chr> <chr>
## 1 "" Biden 568,294 "57.… ""
## 2 "" Trump 410,030 "41.… ""
## 3 "" Jorgensen 8,463 "0.9… ""
## 4 "" Hawkins 1,713 "0.2… ""
## 5 "" West 1,405 "0.1… ""
## 6 "" Blankens… 1,062 "0.1… ""
## 7 "" De La Fu… 587 "0.1… ""
## 8 "" Write-ins 522 "0.1… ""
## 9 "" King 341 "<0.… ""
## 10 "" Pierce 305 "<0.… ""
## 11 "" Total re… 992,722 "" ""
## 12 "View all c… View all… View all candidates C… "Vie… "Vie…
## # ℹ abbreviated name: ¹​`Absentee/early votesVotes`
##
## [[4]]
## # A tibble: 5 × 5
## X1 X2 X3 X4 X5
## <lgl> <chr> <chr> <chr> <chr>
## 1 NA Ernst* Joni Ernst* Winner "Republ… 864,… "51.…
## 2 NA Greenfield Theresa Greenfield "Democr… 754,… "45.…
## 3 NA Stewart Rick Stewart "Libert… 36,9… "2.2…
## 4 NA Herzog Suzanne Herzog "Indepe… 13,8… "0.8…
## 5 NA Total reported Total reported "" 1,67… ""
##
## [[5]]
## # A tibble: 9 × 2
## X1 X2
## <chr> <chr>
## 1 1st District&nbsp;› Winn…
## 2 Representative Abby Finkenauer, a Democrat, flipped… Repr…
## 3 100% reported 100%…
## 4 2nd District&nbsp;› Winn…
## 5 Democrats were hoping to hang on to the seat being … Demo…
## 6 100% reported 100%…
## 7 4th District&nbsp;› Winn…
## 8 J.D. Scholten, a former minor league pitcher turned… J.D.…
## 9 100% reported 100%…
##
## [[6]]
## # A tibble: 4 × 6
## `District Dist.` Margin Candidates `` `Est. rpt.` ``
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 1 R+3 Hinson Re… Fink… 100% ›
## 2 2 R <0.… Miller-Me… Hart… 100% ›
## 3 3 D+1.4 Axne* Dem… Youn… 100% ›
## 4 4 R+24 Feenstra … Scho… 100% ›
##
## [[7]]
## # A tibble: 3 × 2
## Constitutional Convention Questio…¹ Constitutional Conve…²
## <chr> <chr>
## 1 "" ""
## 2 "Yes 30%" "No 70% Winner"
## 3 "1,381,676 votes reported" "1,381,676 votes repo…
## # ℹ abbreviated names:
## # ¹​`Constitutional Convention Question: Explore Proposals for Changes to Constitution`,
## # ²​`Constitutional Convention Question: Explore Proposals for Changes to Constitution`
##
## [[8]]
## # A tibble: 26 × 3
## `District Dist.` Candidates ``
## <chr> <chr> <chr>
## 1 2 Taylor* Rep. 0% Winner Uncontested
## 2 4 Guth* Rep. 0% Winner Uncontested
## 3 6 Williams Rep. 69% Winner Petersen Dem.…
## 4 8 Dawson* Rep. 52% Winner Gorman Dem. 4…
## 5 10 Chapman* Rep. 63% Winner Varley Dem. 3…
## 6 12 Costello* Rep. 69% Winner Norris Dem. 3…
## 7 14 Sinclair* Rep. 0% Winner Uncontested
## 8 16 Boulton* Dem. 77% Winner Johnson Lib. …
## 9 18 Petersen* Dem. 0% Winner Uncontested
## 10 20 Zaun* Rep. 51% Winner Martin Dem. 4…
## # ℹ 16 more rows
##
## [[9]]
## # A tibble: 101 × 3
## `District Dist.` Candidates ``
## <chr> <chr> <chr>
## 1 1 Wills* Rep. 0% Winner Uncontested
## 2 2 Jones* Rep. 0% Winner Uncontested
## 3 3 Bush Rep. 0% Winner Uncontested
## 4 4 Wheeler* Rep. 82% Winner Johnson Dem. 1…
## 5 5 Jeneary* Rep. 0% Winner Uncontested
## 6 6 Bossman* Rep. 0% Winner Uncontested
## 7 7 Stone Rep. 62% Winner Jensen Dem. 38%
## 8 8 Baxter* Rep. 75% Winner Kiss Dem. 25%
## 9 9 Meyer* Rep. 57% Winner Clayton Dem. 4…
## 10 10 Sexton* Rep. 76% Winner Schultes Dem. …
## # ℹ 91 more rows
##
## [[10]]
## # A tibble: 5 × 3
## Question `` ``
## <chr> <chr> <chr>
## 1 Retain Bower Yes 73% Winner No 27%
## 2 Retain May Yes 73% Winner No 27%
## 3 Retain Schumacher Yes 76% Winner No 24%
## 4 Retain Soorholtz Greer Yes 74% Winner No 26%
## 5 + View all – Collapse all + View all – Collapse all + Vie…
##
## [[11]]
## # A tibble: 5 × 3
## `Race Race` `` ``
## <chr> <chr> <chr>
## 1 Retain Christensen Yes 73% Winner No 27%
## 2 Retain Mansfield Yes 69% Winner No 31%
## 3 Retain McDonald Yes 71% Winner No 29%
## 4 Retain Waterman Yes 70% Winner No 30%
## 5 + View all – Collapse all + View all – Collapse all + Vie…
ia_results_2020 <- tables[[2]][,-4] %>% mutate(
`Total votes` = parse_number(`Total votes`),
Absentee = parse_number(Absentee),
Winner = gsub("(.*) +.*", "\\1", Margin),
`Margin 2020` = gsub("(.*) (+.*)", "\\2", Margin),
`Margin 2020` = ifelse(Winner=="Biden", paste0("D",`Margin 2020`), paste0("R",`Margin 2020`))
)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Total votes = parse_number(`Total votes`)`.
## Caused by warning:
## ! 1 parsing failure.
## row col expected actual
## 100 -- a number View all Collapse
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining
## warning.
ia_results_2020
## # A tibble: 100 × 7
## County Margin `2016 margin` `Total votes` Absentee Winner
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Allam… Trump… R+24.2 7422 4816 Trump
## 2 Polk Biden… D+11.4 258755 140547 Biden
## 3 Linn Biden… D+9 127458 81691 Biden
## 4 Scott Biden… D+1.5 92599 63662 Biden
## 5 Johns… Biden… D+37.9 83851 60517 Biden
## 6 Black… Biden… D+7.4 66593 40870 Biden
## 7 Dallas Trump… R+9.5 56022 34152 Trump
## 8 Dubuq… Trump… R+1.2 53926 34224 Trump
## 9 Story Biden… D+12.3 51038 30224 Biden
## 10 Potta… Trump… R+21.3 45744 27443 Trump
## # ℹ 90 more rows
## # ℹ 1 more variable: `Margin 2020` <chr>
8 / 24

Your Turn

Can you get the data you just scraped onto a map of the counties? - Try to map the percentage you calculated to the fill-color of the counties' polygons.

You will need to join the percentages with the county polygons:

counties <- map_data("county")
head(counties)
## long lat group order region subregion
## 1 -86.50517 32.34920 1 1 alabama autauga
## 2 -86.53382 32.35493 1 2 alabama autauga
## 3 -86.54527 32.36639 1 3 alabama autauga
## 4 -86.55673 32.37785 1 4 alabama autauga
## 5 -86.57966 32.38357 1 5 alabama autauga
## 6 -86.59111 32.37785 1 6 alabama autauga
9 / 24

Your Turn - Solution

ia_results <- ia_results %>% mutate(
PercDem = Clinton/(Trump + Clinton)*100,
subregion = tolower(`Vote by county`)
)
# Which counties don't work?
anti_join(ia_results, counties %>% filter(region=="iowa"), by="subregion")
## # A tibble: 1 × 5
## `Vote by county` Trump Clinton PercDem subregion
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 O'Brien 5752 1315 18.6 o'brien
anti_join(counties %>% filter(region=="iowa"), ia_results, by="subregion")
## long lat group order region subregion
## 1 -95.86156 43.26404 825 26075 iowa obrien
## 2 -95.39174 43.26404 825 26076 iowa obrien
## 3 -95.39747 42.91454 825 26077 iowa obrien
## 4 -95.85583 42.90881 825 26078 iowa obrien
## 5 -95.86156 43.26404 825 26079 iowa obrien
10 / 24

Your Turn - Solution (cont'd)

ia_results <- ia_results %>% mutate(
subregion = replace(subregion, subregion=="o'brien", "obrien")
)
ia_full <- left_join(ia_results,
counties %>% filter(region=="iowa"),
by="subregion")
11 / 24

Your Turn - Solution (cont'd)

ia_full %>%
ggplot(aes( x = long, y = lat, group = subregion)) +
geom_polygon(aes(fill = PercDem), color = "black") +
scale_fill_gradient2(midpoint=50, mid = "white") +
ggthemes::theme_map()

12 / 24

Beyond tables

Sometimes data on the web is not structured as nicely... e.g. let's assume we want to get a list of all recently active baseball players from Baseball reference

13 / 24

SelectorGadget

  • SelectorGadget is a javascript bookmarklet to determine the css selectors of pieces of a website we want to extract.

  • Bookmark the SelectorGadget link, then click on it to use it (or add the chrome extension)

  • When SelectorGadget is active, pieces of the website are highlighted in orange/green/red.

  • Use SelectorGadget on http://www.baseball-reference.com/players/ .

  • Read more details on vignette("selectorgadget") (or on the rvest website)

If you prefer, you can also read the HTML code and create your own CSS or XPATH selectors.

14 / 24

SelectorGadget Result

Select all active baseball players with a last name starting with 'a'

url <- "http://www.baseball-reference.com/players/a/"
html <- read_html(url)
html %>% html_elements("b") %>% html_text()
## [1] "Fernando Abad (2010-2021)"
## [2] "Cory Abbott (2021-2022)"
## [3] "CJ Abrams (2022-2023)"
## [4] "Albert Abreu (2020-2023)"
## [5] "Bryan Abreu (2019-2023)"
## [6] "José Abreu (2014-2023)"
## [7] "Domingo Acevedo (2021-2023)"
## [8] "Ronald Acuna Jr. (2018-2023)"
## [9] "Jason Adam (2018-2023)"
## [10] "Willy Adames (2018-2023)"
## [11] "Austin Adams (2017-2022)"
## [12] "Matt Adams (2012-2021)"
## [13] "Riley Adams (2021-2022)"
## [14] "Jo Adell (2020-2022)"
## [15] "Joan Adon (2021-2022)"
## [16] "Ehire Adrianza (2013-2022)"
## [17] "Jesús Aguilar (2014-2023)"
## [18] "Ryan Aguilar (2022-2022)"
## [19] "Nick Ahmed (2014-2023)"
## [20] "Keegan Akin (2020-2023)"
## [21] "Hanser Alberto (2015-2023)"
## [22] "Ozzie Albies (2017-2023)"
## [23] "Jorge Alcala (2019-2023)"
## [24] "Sandy Alcantara (2017-2023)"
## [25] "Sergio Alcántara (2020-2022)"
## [26] "Jason Alexander (2022-2022)"
## [27] "Scott Alexander (2015-2023)"
## [28] "Tyler Alexander (2019-2023)"
## [29] "A.J. Alexy (2021-2022)"
## [30] "Jorge Alfaro (2016-2022)"
## [31] "Kolby Allard (2018-2022)"
## [32] "Cam Alldred (2022-2022)"
## [33] "Austin Allen (2019-2022)"
## [34] "Greg Allen (2017-2022)"
## [35] "Logan Allen (2019-2022)"
## [36] "Nick Allen (2022-2023)"
## [37] "Nick Allgeyer (2021-2021)"
## [38] "Abraham Almonte (2013-2022)"
## [39] "Yency Almonte (2018-2023)"
## [40] "Pete Alonso (2019-2023)"
## [41] "Dan Altavilla (2016-2021)"
## [42] "Jose Altuve (2011-2022)"
## [43] "José Alvarado (2017-2023)"
## [44] "Eddy Alvarez (2020-2022)"
## [45] "Francisco Álvarez (2022-2022)"
## [46] "Jose Alvarez (2013-2022)"
## [47] "Yordan Alvarez (2019-2023)"
## [48] "Adbert Alzolay (2019-2023)"
## [49] "Brian Anderson (2017-2023)"
## [50] "Chase Anderson (2014-2022)"
## [51] "Ian Anderson (2020-2022)"
## [52] "Nick Anderson (2019-2023)"
## [53] "Tim Anderson (2016-2023)"
## [54] "Tyler Anderson (2016-2023)"
## [55] "Matt Andriese (2015-2021)"
## [56] "Elvis Andrus (2009-2023)"
## [57] "Miguel Andujar (2017-2022)"
## [58] "Tejay Antone (2020-2021)"
## [59] "Jonathan Aranda (2022-2022)"
## [60] "Víctor Arano (2017-2022)"
## [61] "Jonathan Araúz (2020-2022)"
## [62] "Francisco Arcia (2018-2018)"
## [63] "Orlando Arcia (2016-2023)"
## [64] "Nolan Arenado (2013-2023)"
## [65] "Gabriel Arias (2022-2023)"
## [66] "Shawn Armstrong (2015-2022)"
## [67] "Randy Arozarena (2019-2023)"
## [68] "Luis Arraez (2019-2023)"
## [69] "Christian Arroyo (2017-2023)"
## [70] "Aaron Ashby (2021-2022)"
## [71] "Graham Ashcraft (2022-2023)"
## [72] "Javier Assad (2022-2023)"
## [73] "Pedro Avila (2019-2022)"
## [74] "José Azocar (2022-2023)"
15 / 24

Example, varied

We are, in fact, not just interested in the names of the players, but also in the links to each player's website

  • html_attr let's us access an attribute of an html node

  • html_attrs extracts all attributes of an html node

html %>% html_elements("b a") %>% html_attr(name="href")
## [1] "/players/a/abadfe01.shtml"
## [2] "/players/a/abbotco01.shtml"
## [3] "/players/a/abramcj01.shtml"
## [4] "/players/a/abreual01.shtml"
## [5] "/players/a/abreubr01.shtml"
## [6] "/players/a/abreujo02.shtml"
## [7] "/players/a/acevedo01.shtml"
## [8] "/players/a/acunaro01.shtml"
## [9] "/players/a/adamja01.shtml"
## [10] "/players/a/adamewi01.shtml"
## [11] "/players/a/adamsau02.shtml"
## [12] "/players/a/adamsma01.shtml"
## [13] "/players/a/adamsri03.shtml"
## [14] "/players/a/adelljo01.shtml"
## [15] "/players/a/adonjo01.shtml"
## [16] "/players/a/adriaeh01.shtml"
## [17] "/players/a/aguilje01.shtml"
## [18] "/players/a/aguilry01.shtml"
## [19] "/players/a/ahmedni01.shtml"
## [20] "/players/a/akinke01.shtml"
## [21] "/players/a/alberha01.shtml"
## [22] "/players/a/albieoz01.shtml"
## [23] "/players/a/alcaljo01.shtml"
## [24] "/players/a/alcansa01.shtml"
## [25] "/players/a/alcanse01.shtml"
## [26] "/players/a/alexaja01.shtml"
## [27] "/players/a/alexasc02.shtml"
## [28] "/players/a/alexaty01.shtml"
## [29] "/players/a/alexyaj01.shtml"
## [30] "/players/a/alfarjo01.shtml"
## [31] "/players/a/allarko01.shtml"
## [32] "/players/a/alldrca01.shtml"
## [33] "/players/a/allenau01.shtml"
## [34] "/players/a/allengr01.shtml"
## [35] "/players/a/allenlo01.shtml"
## [36] "/players/a/allenni02.shtml"
## [37] "/players/a/allgeni01.shtml"
## [38] "/players/a/almonab01.shtml"
## [39] "/players/a/almonye01.shtml"
## [40] "/players/a/alonspe01.shtml"
## [41] "/players/a/altavda01.shtml"
## [42] "/players/a/altuvjo01.shtml"
## [43] "/players/a/alvarjo03.shtml"
## [44] "/players/a/alvared01.shtml"
## [45] "/players/a/alvarfr01.shtml"
## [46] "/players/a/alvarjo02.shtml"
## [47] "/players/a/alvaryo01.shtml"
## [48] "/players/a/alzolad01.shtml"
## [49] "/players/a/anderbr06.shtml"
## [50] "/players/a/anderch01.shtml"
## [51] "/players/a/anderia01.shtml"
## [52] "/players/a/anderni01.shtml"
## [53] "/players/a/anderti01.shtml"
## [54] "/players/a/anderty01.shtml"
## [55] "/players/a/andrima01.shtml"
## [56] "/players/a/andruel01.shtml"
## [57] "/players/a/andujmi01.shtml"
## [58] "/players/a/antonte01.shtml"
## [59] "/players/a/arandjo01.shtml"
## [60] "/players/a/aranovi01.shtml"
## [61] "/players/a/arauzjo01.shtml"
## [62] "/players/a/arciafr01.shtml"
## [63] "/players/a/arciaor01.shtml"
## [64] "/players/a/arenano01.shtml"
## [65] "/players/a/ariasga01.shtml"
## [66] "/players/a/armstsh01.shtml"
## [67] "/players/a/arozara01.shtml"
## [68] "/players/a/arraelu01.shtml"
## [69] "/players/a/arroych01.shtml"
## [70] "/players/a/ashbyaa01.shtml"
## [71] "/players/a/ashcrgr01.shtml"
## [72] "/players/a/assadja01.shtml"
## [73] "/players/a/avilape01.shtml"
## [74] "/players/a/azocajo01.shtml"
16 / 24

Your Turn

Use the SelectorGadget on the website for Fernando Abad

Find the css selector to extract his career statistics and load them into your R session.

Does the same code work to extract career statistics for (some) of the other active players?

What other information do we need to know? - and how can we get to that?

17 / 24

Your Turn - Solution

url <- "https://www.baseball-reference.com/players/a/abadfe01.shtml"
html <- read_html(url)
html %>% html_elements(".stats_pullout") %>% html_text()
## [1] "\n\nSUMMARY\n\nCareer\n\n\n\nWAR\n3.1\n\n\nW\n8\n\n\nL\n29\n\n\nERA\n3.77\n\n\nG\n400\n\n\nGS\n6\n\n\nSV\n2\n\n\nIP\n348.1\n\n\nSO\n290\n\n\nWHIP\n1.306\n\n\n\n"
html %>% html_elements(".stats_pullout .poptip") %>% html_text()
## [1] "WAR" "W" "L" "ERA" "G" "GS" "SV" "IP"
## [9] "SO" "WHIP"
html %>% html_elements(".p3 p , .p2 p, .p1 p, .stats_pullout strong") %>% html_text()
## [1] "SUMMARY" "Career" "WAR" "3.1" "W"
## [6] "8" "L" "29" "ERA" "3.77"
## [11] "G" "400" "GS" "6" "SV"
## [16] "2" "IP" "348.1" "SO" "290"
## [21] "WHIP" "1.306"
html %>% html_elements(".p1") %>% html_text()
## [1] "\n\nWAR\n3.1\n\n\nW\n8\n\n\nL\n29\n\n\nERA\n3.77\n"
18 / 24

Your Turn - Solution (cont'd)

It's sometimes easier (for data munging after extracting) to extract multiple pieces rather than everything in one go.

(stats <- html %>% html_elements("span strong") %>% html_text())
## [1] "SUMMARY" "WAR" "W" "L" "ERA"
## [6] "G" "GS" "SV" "IP" "SO"
## [11] "WHIP"
(season <- html %>% html_elements(".stats_pullout p:nth-child(2)") %>% html_text())
## [1] "Career" "3.1" "8" "29" "3.77" "400"
## [7] "6" "2" "348.1" "290" "1.306"
(career <- html %>% html_elements(".stats_pullout p:nth-child(3)") %>% html_text())
## character(0)
19 / 24

Your Turn

Clean up the code for extracting the career statistics and write a function getStats that takes a url and returns a dataset (or tibble).

Does the same code work to extract career statistics for (some) of the other active players?

20 / 24

Your Turn - Solution

getStats <- function(url) {
html <- read_html(url)
stats <- html %>% html_nodes("span strong") %>% html_text()
season <- html %>% html_nodes(".stats_pullout p:nth-child(2)") %>% html_text()
career <- html %>% html_nodes(".stats_pullout p:nth-child(3)") %>% html_text()
data.frame(Statistics = stats[-1],
Season= parse_number(season[-1]),
Career = ifelse(is.null(career), NA, parse_number(career[-1])))
}
url <- "http://www.baseball-reference.com/players/a/"
html <- read_html(url)
players <- html %>% html_nodes("strong a") %>% html_text()
links <- html %>% html_nodes("strong a") %>% html_attr(name="href")
getStats("https://www.baseball-reference.com//players/a/abreujo02.shtml") %>% str()
## Warning: 1 parsing failure.
## row col expected actual
## 1 -- a number -
## 'data.frame': 12 obs. of 3 variables:
## $ Statistics: chr "WAR" "AB" "H" "HR" ...
## $ Season : num NA 21 8 0 0.381 1 3 0 0.458 0.429 ...
## ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
## .. ..$ row : int 1
## .. ..$ col : int NA
## .. ..$ expected: chr "a number"
## .. ..$ actual : chr "-"
## $ Career : num 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 ...
getStats(file.path("https://www.baseball-reference.com", links[2])) %>% str()
## Warning: 1 parsing failure.
## row col expected actual
## 1 -- a number -
## 'data.frame': 12 obs. of 3 variables:
## $ Statistics: chr "WAR" "AB" "H" "HR" ...
## $ Season : num NA 21 8 0 0.381 1 3 0 0.458 0.429 ...
## ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
## .. ..$ row : int 1
## .. ..$ col : int NA
## .. ..$ expected: chr "a number"
## .. ..$ actual : chr "-"
## $ Career : num 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 ...
# getStats(file.path("https://www.baseball-reference.com", links[3])) %>% str() # gives an error
getStats(file.path("https://www.baseball-reference.com", links[4])) %>% str()
## 'data.frame': 12 obs. of 3 variables:
## $ Statistics: chr "WAR" "AB" "H" "HR" ...
## $ Season : num 4.9 2421 624 118 0.258 ...
## $ Career : num NA NA NA NA NA NA NA NA NA NA ...
21 / 24

Your Turn - Solution (cont'd)

Now apply to other players (with 'a' as starting letter) - first we get everything tidied up in a dataset

url <- "http://www.baseball-reference.com/players/a/"
html <- read_html(url)
players <- html %>% html_nodes("strong a") %>% html_text()
links <- html %>% html_nodes("strong a") %>% html_attr(name="href")
bb <- tibble(players=players, links = links)
head(bb)
## # A tibble: 6 × 2
## players links
## <chr> <chr>
## 1 Fernando Abad /players/a/abadfe01.shtml
## 2 José Abreu /players/a/abreujo02.shtml
## 3 Ronald Acuna Jr. /players/a/acunaro01.shtml
## 4 Matt Adams /players/a/adamsma01.shtml
## 5 Ehire Adrianza /players/a/adriaeh01.shtml
## 6 Jesús Aguilar /players/a/aguilje01.shtml
bb_head <- bb[1:5,] %>% mutate(
data = links %>% purrr::map(.f = function(link) {
cat(link)
cat("\n")
getStats(file.path("https://www.baseball-reference.com", link))
})
)
## /players/a/abadfe01.shtml
## /players/a/abreujo02.shtml
## /players/a/acunaro01.shtml
## /players/a/adamsma01.shtml
## /players/a/adriaeh01.shtml
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `data = `%>%`(...)`.
## Caused by warning:
## ! 1 parsing failure.
## row col expected actual
## 1 -- a number -
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining
## warning.
22 / 24

Your Turn

Why does getStats(file.path("https://www.baseball-reference.com", "/players/a/adlemti01.shtml")) have NAs in the Career?

Is this fixable?

How do we get all of the active players? (not just the ones with last names starting with 'a')

23 / 24

Package rvest

The session suite of commands allows to simulate an html session without a browser.

Create a session with session(url)

Navigate: session_jump_to() Follow a link: session_follow_link().

navigate back and forward with session_back() and session_forward().

... and extract the pieces you are interested in using read_html, html_element, html_elements

24 / 24

Web Scraping

  • Transform data from web pages into usable information

  • Automate the process

2 / 24
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow