class: center, middle, inverse, title-slide .title[ # Stat 585 - Web Scraping ] .author[ ### Heike Hofmann and Susan Vanderplas ] --- ## Web Scraping - Transform data from web pages into usable information - Automate the process ![](http://webdata-scraping.com/media/2015/01/Web-Scraping-Process.png) --- ## `rvest` + `xml2`: Easy Web Scraping - `read_html` gets the full set of HTML markup from a URL ```r library(rvest) url <- "https://www.nytimes.com/elections/2016/results/iowa" html <- read_html(url) html ``` ``` ## {html_document} ## <html lang="en" itemscope="" xmlns:og="//opengraphprotocol.org/schema/" itemtype="//schema.org/NewsArticle"> ## [1] <head>\n<title>Iowa Election Results 2016 – The New Y ... ## [2] <body class="eln-general-state-results eln-state-iowa ... ``` - Use `html_attr`, `html_node`, `html_table`, and `html_text` to extract useful information from the markup --- ## Get a *table* from an online source `html_table` extracts all tables from the sourced html into a list of data frames: ```r tables <- html %>% html_table(fill=TRUE) tables %>% purrr::map(glimpse) ``` ``` ## Rows: 11 ## Columns: 7 ## $ Candidate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ Candidate <chr> "Trump\n \n \n … ## $ Party <chr> "Republican\n Rep.", "Democrat\… ## $ Votes <chr> "800,983", "653,669", "59,186", "19,992"… ## $ Pct. <chr> "51.1%", "41.7%", "3.8%", "1.3%", "0.8%"… ## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ E.V. <chr> "6", "—", "—", "—", "—", "—", "—", "—", … ## Rows: 99 ## Columns: 3 ## $ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson… ## $ Trump <chr> "93,492", "48,390", "39,149", "21… ## $ Clinton <chr> "119,804", "58,935", "40,440", "5… ## Rows: 6 ## Columns: 6 ## $ Candidate <lgl> NA, NA, NA, NA, NA, NA ## $ Candidate <chr> "Grassley*\n \n \n… ## $ Party <chr> "Republican\n Rep.", "Democrat\… ## $ Votes <chr> "926,007", "549,460", "41,794", "17,649"… ## $ Pct. <chr> "60.2%", "35.7%", "2.7%", "1.1%", "0.3%"… ## $ `` <lgl> NA, NA, NA, NA, NA, NA ## Rows: 99 ## Columns: 3 ## $ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson… ## $ Grassley <chr> "118,164", "62,737", "46,415", "2… ## $ Judge <chr> "100,317", "47,635", "34,503", "4… ## Rows: 4 ## Columns: 5 ## $ `District\n Dist.` <int> 1, 2, 3, 4 ## $ Leader <chr> "54%Blum*\n Rep."… ## $ `` <chr> "46%Vernon\n Dem.… ## $ Rpt. <chr> "100%", "100%", "100%"… ## $ `` <lgl> NA, NA, NA, NA ## Rows: 25 ## Columns: 5 ## $ `Seat\n Seat` <int> 2, 4, 6, 8, 10, 12, 14, 16,… ## $ Leader <chr> "0%Feenstra*\n Rep.", … ## $ `` <chr> "Uncontested", "39%Bangert\… ## $ Rpt. <chr> "", "100%", "100%", "100%",… ## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA,… ## Rows: 100 ## Columns: 5 ## $ `District\n Dist.` <int> 1, 2, 3, 4, 5, 6, 7, 8… ## $ Leader <chr> "0%Wills*\n Rep."… ## $ `` <chr> "Uncontested", "Uncont… ## $ Rpt. <chr> "", "", "100%", "100%"… ## $ `` <lgl> NA, NA, NA, NA, NA, NA… ## Rows: 3 ## Columns: 5 ## $ Question <chr> "Retain Brent Appel", "Retain Daryl Hecht… ## $ Yes <chr> "64%Yes", "64%Yes", "65%Yes" ## $ No <chr> "36%No", "36%No", "35%No" ## $ Rpt. <chr> "100%", "100%", "100%" ## $ `` <lgl> NA, NA, NA ``` ``` ## [[1]] ## # A tibble: 11 × 7 ## Candidate Candidate Party Votes Pct. `` E.V. ## <lgl> <chr> <chr> <chr> <chr> <lgl> <chr> ## 1 NA "Trump\n … "Rep… 800,… 51.1% NA 6 ## 2 NA "Clinton\n … "Dem… 653,… 41.7% NA — ## 3 NA "Johnson\n … "Lib… 59,1… 3.8% NA — ## 4 NA "Others\n … "Ind… 19,9… 1.3% NA — ## 5 NA "McMullin\n … "Pet… 12,3… 0.8% NA — ## 6 NA "Stein\n … "Gre… 11,4… 0.7% NA — ## 7 NA "Castle\n … "Con… 5,335 0.3% NA — ## 8 NA "Kahn\n … "Ind… 2,247 0.1% NA — ## 9 NA "De La Fuente\n … "Pet… 451 0.0% NA — ## 10 NA "La Riva\n … "P.S… 323 0.0% NA — ## 11 NA "Others\n … "" 32,2… 2.1% NA — ## ## [[2]] ## # A tibble: 99 × 3 ## `Vote by county` Trump Clinton ## <chr> <chr> <chr> ## 1 Polk 93,492 119,804 ## 2 Linn 48,390 58,935 ## 3 Scott 39,149 40,440 ## 4 Johnson 21,044 50,200 ## 5 Black Hawk 27,476 32,233 ## 6 Story 19,458 25,709 ## 7 Dubuque 23,460 22,850 ## 8 Woodbury 24,727 16,210 ## 9 Pottawattamie 24,447 15,355 ## 10 Dallas 19,339 15,701 ## # ℹ 89 more rows ## ## [[3]] ## # A tibble: 6 × 6 ## Candidate Candidate Party Votes Pct. `` ## <lgl> <chr> <chr> <chr> <chr> <lgl> ## 1 NA "Grassley*\n … "Rep… 926,… 60.2% NA ## 2 NA "Judge\n \n … "Dem… 549,… 35.7% NA ## 3 NA "Aldrich\n \… "Lib… 41,7… 2.7% NA ## 4 NA "Hennager\n … "Ind… 17,6… 1.1% NA ## 5 NA "Luick-Thrams\n … "Pet… 4,441 0.3% NA ## 6 NA "Others\n \n… "" 22,0… 1.4% NA ## ## [[4]] ## # A tibble: 99 × 3 ## `Vote by county` Grassley Judge ## <chr> <chr> <chr> ## 1 Polk 118,164 100,317 ## 2 Linn 62,737 47,635 ## 3 Scott 46,415 34,503 ## 4 Johnson 28,914 42,699 ## 5 Black Hawk 33,884 27,245 ## 6 Story 25,475 21,472 ## 7 Dubuque 27,348 19,291 ## 8 Woodbury 27,166 13,909 ## 9 Pottawattamie 25,721 12,943 ## 10 Dallas 24,374 11,876 ## # ℹ 89 more rows ## ## [[5]] ## # A tibble: 4 × 5 ## `District\n Dist.` Leader `` Rpt. `` ## <int> <chr> <chr> <chr> <lgl> ## 1 1 "54%Blum*\n… "46%… 100% NA ## 2 2 "54%Loebsac… "46%… 100% NA ## 3 3 "54%Young*\… "40%… 100% NA ## 4 4 "61%King*\n… "39%… 100% NA ## ## [[6]] ## # A tibble: 25 × 5 ## `Seat\n Seat` Leader `` Rpt. `` ## <int> <chr> <chr> <chr> <lgl> ## 1 2 "0%Feenstra*\n … "Unc… "" NA ## 2 4 "61%Guth*\n … "39%… "100… NA ## 3 6 "83%Segebart*\n… "17%… "100… NA ## 4 8 "54%Dawson\n … "46%… "100… NA ## 5 10 "67%Chapman*\n … "33%… "100… NA ## 6 12 "78%Costello*\n… "22%… "100… NA ## 7 14 "74%Sinclair*\n… "26%… "100… NA ## 8 16 "60%Boulton\n … "35%… "100… NA ## 9 18 "0%Petersen*\n … "Unc… "" NA ## 10 20 "60%Zaun*\n … "41%… "100… NA ## # ℹ 15 more rows ## ## [[7]] ## # A tibble: 100 × 5 ## `District\n Dist.` Leader `` Rpt. `` ## <int> <chr> <chr> <chr> <lgl> ## 1 1 "0%Wills*\… "Unc… "" NA ## 2 2 "0%Jones*\… "Unc… "" NA ## 3 3 "81%Husema… "19%… "100… NA ## 4 4 "63%Wheele… "37%… "100… NA ## 5 5 "77%Holz*\… "23%… "100… NA ## 6 6 "66%Carlin… "35%… "100… NA ## 7 7 "63%Gassma… "37%… "100… NA ## 8 8 "68%Baxter… "32%… "100… NA ## 9 9 "57%Miller… "43%… "100… NA ## 10 10 "0%Sexton*… "Unc… "" NA ## # ℹ 90 more rows ## ## [[8]] ## # A tibble: 3 × 5 ## Question Yes No Rpt. `` ## <chr> <chr> <chr> <chr> <lgl> ## 1 Retain Brent Appel 64%Yes 36%No 100% NA ## 2 Retain Daryl Hecht 64%Yes 36%No 100% NA ## 3 Retain Mark Cady 65%Yes 35%No 100% NA ``` --- ## Data Munging Most tables need a bit of clean-up: ```r ia_results <- tables[[2]] %>% mutate( Trump = parse_number(Trump), Clinton = parse_number(Clinton) ) ia_results ``` ``` ## # A tibble: 99 × 3 ## `Vote by county` Trump Clinton ## <chr> <dbl> <dbl> ## 1 Polk 93492 119804 ## 2 Linn 48390 58935 ## 3 Scott 39149 40440 ## 4 Johnson 21044 50200 ## 5 Black Hawk 27476 32233 ## 6 Story 19458 25709 ## 7 Dubuque 23460 22850 ## 8 Woodbury 24727 16210 ## 9 Pottawattamie 24447 15355 ## 10 Dallas 19339 15701 ## # ℹ 89 more rows ``` --- class:inverse ## Your Turn Connect to the website of the NY Times election results at https://www.nytimes.com/elections/2016/results/president - Pick your favorite state (by clicking on the state on the map) - Use `rvest` to download the election results for the 2016 Presidential Election for all counties of your state. - Clean up the data (numbers should be numbers). - Convert the numbers into row-wise percentages (decide on either %Republican or %Democrat). --- class:inverse ## Your Turn How well does the web-scraping work over time? Results for the 2020 Election are available from https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html Again, pick your favorite state and apply your previous code. How many things have changed? Which format do you prefer? Why? --- ## Your Turn - Solutions ```r ia2020 <- "https://www.nytimes.com/interactive/2020/11/03/us/elections/results-iowa.html" html <- read_html(ia2020) tables <- html %>% html_table(fill=TRUE) tables %>% purrr::map(glimpse) ``` ``` ## Rows: 12 ## Columns: 6 ## $ X1 <chr> "", "", "", "", "", "", "", "", "", "", "", "Vi… ## $ X2 <chr> "Trump* Donald J. Trump* Winner", "Biden … ## $ X3 <chr> "Republican Rep.", "Democrat Dem.", "Libertaria… ## $ X4 <chr> "897,672", "759,061", "19,637", "4,337", "3,210… ## $ X5 <chr> "53.1%", "44.9%", "1.2%", "0.3%", "0.2%", "0.2%… ## $ X6 <chr> "6", "—", "—", "—", "—", "—", "—", "—", "—", "—… ## Rows: 100 ## Columns: 6 ## $ County <chr> "Allamakee", "Polk", "Linn",… ## $ Margin <chr> "Trump +29", "Biden +15", "B… ## $ `2016 margin` <chr> "R+24.2", "D+11.4", "D+9", "… ## $ `Est. votes reported` <chr> "99%", "100%", "100%", "100%… ## $ `Total votes` <chr> "7,422", "258,755", "127,458… ## $ Absentee <chr> "4,816", "140,547", "81,691"… ## Rows: 12 ## Columns: 5 ## $ Candidate <chr> "", "", "", "", "", ""… ## $ Candidate <chr> "Biden", "Trump", "Jor… ## $ `Absentee/early votesVotes` <chr> "568,294", "410,030", … ## $ Pct. <chr> "57.2%", "41.3%", "0.9… ## $ `` <chr> "", "", "", "", "", ""… ## Rows: 5 ## Columns: 5 ## $ X1 <lgl> NA, NA, NA, NA, NA ## $ X2 <chr> "Ernst* Joni Ernst* Winner", "Greenfield … ## $ X3 <chr> "Republican Rep.", "Democrat Dem.", "Libertaria… ## $ X4 <chr> "864,997", "754,859", "36,961", "13,800", "1,67… ## $ X5 <chr> "51.8%", "45.2%", "2.2%", "0.8%", "" ## Rows: 9 ## Columns: 2 ## $ X1 <chr> "1st District ›", "Representative Abby Finkenau… ## $ X2 <chr> "WinnerHinson +3", "Representative Abby Finkena… ## Rows: 4 ## Columns: 6 ## $ `District Dist.` <int> 1, 2, 3, 4 ## $ Margin <chr> "R+3", "R <0.01", "D+1.4", "R+24" ## $ Candidates <chr> "Hinson Rep. Winner", "Miller-Mee… ## $ `` <chr> "Finkenauer* Dem.", "Hart Dem.", … ## $ `Est. rpt.` <chr> "100%", "100%", "100%", "100%" ## $ `` <chr> "›", "›", "›", "›" ## Rows: 3 ## Columns: 2 ## $ `Constitutional Convention Question: Explore Proposals for Changes to Constitution` <chr> … ## $ `Constitutional Convention Question: Explore Proposals for Changes to Constitution` <chr> … ## Rows: 26 ## Columns: 3 ## $ `District Dist.` <chr> "2", "4", "6", "8", "10", "12", "… ## $ Candidates <chr> "Taylor* Rep. 0% Winner", "Guth* … ## $ `` <chr> "Uncontested", "Uncontested", "Pe… ## Rows: 101 ## Columns: 3 ## $ `District Dist.` <chr> "1", "2", "3", "4", "5", "6", "7"… ## $ Candidates <chr> "Wills* Rep. 0% Winner", "Jones* … ## $ `` <chr> "Uncontested", "Uncontested", "Un… ## Rows: 5 ## Columns: 3 ## $ Question <chr> "Retain Bower", "Retain May", "Retain Sch… ## $ `` <chr> "Yes 73% Winner", "Yes 73% Winner", "Yes … ## $ `` <chr> "No 27%", "No 27%", "No 24%", "No 26%", "… ## Rows: 5 ## Columns: 3 ## $ `Race Race` <chr> "Retain Christensen", "Retain Mansfiel… ## $ `` <chr> "Yes 73% Winner", "Yes 69% Winner", "Y… ## $ `` <chr> "No 27%", "No 31%", "No 29%", "No 30%"… ``` ``` ## [[1]] ## # A tibble: 12 × 6 ## X1 X2 X3 X4 X5 X6 ## <chr> <chr> <chr> <chr> <chr> <chr> ## 1 "" Trum… "Rep… 897,… "53.… "6" ## 2 "" Bide… "Dem… 759,… "44.… "—" ## 3 "" Jorg… "Lib… 19,6… "1.2… "—" ## 4 "" Writ… "" 4,337 "0.3… "—" ## 5 "" West… "Ind… 3,210 "0.2… "—" ## 6 "" Hawk… "Gre… 3,075 "0.2… "—" ## 7 "" Blan… "Con… 1,707 "0.1… "—" ## 8 "" De L… "Oth… 1,082 "0.1… "—" ## 9 "" King… "Ind… 546 "<0.… "—" ## 10 "" Pier… "Ind… 544 "<0.… "—" ## 11 "" Tota… "" 1,69… "" "" ## 12 "View all candidates Colla… View… "Vie… View… "Vie… "Vie… ## ## [[2]] ## # A tibble: 100 × 6 ## County Margin `2016 margin` `Est. votes reported` ## <chr> <chr> <chr> <chr> ## 1 Allamakee Trump … R+24.2 99% ## 2 Polk Biden … D+11.4 100% ## 3 Linn Biden … D+9 100% ## 4 Scott Biden … D+1.5 100% ## 5 Johnson Biden … D+37.9 100% ## 6 Black Hawk Biden … D+7.4 100% ## 7 Dallas Trump … R+9.5 100% ## 8 Dubuque Trump … R+1.2 100% ## 9 Story Biden … D+12.3 100% ## 10 Pottawattamie Trump … R+21.3 100% ## # ℹ 90 more rows ## # ℹ 2 more variables: `Total votes` <chr>, Absentee <chr> ## ## [[3]] ## # A tibble: 12 × 5 ## Candidate Candidate Absentee/early votes…¹ Pct. `` ## <chr> <chr> <chr> <chr> <chr> ## 1 "" Biden 568,294 "57.… "" ## 2 "" Trump 410,030 "41.… "" ## 3 "" Jorgensen 8,463 "0.9… "" ## 4 "" Hawkins 1,713 "0.2… "" ## 5 "" West 1,405 "0.1… "" ## 6 "" Blankens… 1,062 "0.1… "" ## 7 "" De La Fu… 587 "0.1… "" ## 8 "" Write-ins 522 "0.1… "" ## 9 "" King 341 "<0.… "" ## 10 "" Pierce 305 "<0.… "" ## 11 "" Total re… 992,722 "" "" ## 12 "View all c… View all… View all candidates C… "Vie… "Vie… ## # ℹ abbreviated name: ¹`Absentee/early votesVotes` ## ## [[4]] ## # A tibble: 5 × 5 ## X1 X2 X3 X4 X5 ## <lgl> <chr> <chr> <chr> <chr> ## 1 NA Ernst* Joni Ernst* Winner "Republ… 864,… "51.… ## 2 NA Greenfield Theresa Greenfield "Democr… 754,… "45.… ## 3 NA Stewart Rick Stewart "Libert… 36,9… "2.2… ## 4 NA Herzog Suzanne Herzog "Indepe… 13,8… "0.8… ## 5 NA Total reported Total reported "" 1,67… "" ## ## [[5]] ## # A tibble: 9 × 2 ## X1 X2 ## <chr> <chr> ## 1 1st District › Winn… ## 2 Representative Abby Finkenauer, a Democrat, flipped… Repr… ## 3 100% reported 100%… ## 4 2nd District › Winn… ## 5 Democrats were hoping to hang on to the seat being … Demo… ## 6 100% reported 100%… ## 7 4th District › Winn… ## 8 J.D. Scholten, a former minor league pitcher turned… J.D.… ## 9 100% reported 100%… ## ## [[6]] ## # A tibble: 4 × 6 ## `District Dist.` Margin Candidates `` `Est. rpt.` `` ## <int> <chr> <chr> <chr> <chr> <chr> ## 1 1 R+3 Hinson Re… Fink… 100% › ## 2 2 R <0.… Miller-Me… Hart… 100% › ## 3 3 D+1.4 Axne* Dem… Youn… 100% › ## 4 4 R+24 Feenstra … Scho… 100% › ## ## [[7]] ## # A tibble: 3 × 2 ## Constitutional Convention Questio…¹ Constitutional Conve…² ## <chr> <chr> ## 1 "" "" ## 2 "Yes 30%" "No 70% Winner" ## 3 "1,381,676 votes reported" "1,381,676 votes repo… ## # ℹ abbreviated names: ## # ¹`Constitutional Convention Question: Explore Proposals for Changes to Constitution`, ## # ²`Constitutional Convention Question: Explore Proposals for Changes to Constitution` ## ## [[8]] ## # A tibble: 26 × 3 ## `District Dist.` Candidates `` ## <chr> <chr> <chr> ## 1 2 Taylor* Rep. 0% Winner Uncontested ## 2 4 Guth* Rep. 0% Winner Uncontested ## 3 6 Williams Rep. 69% Winner Petersen Dem.… ## 4 8 Dawson* Rep. 52% Winner Gorman Dem. 4… ## 5 10 Chapman* Rep. 63% Winner Varley Dem. 3… ## 6 12 Costello* Rep. 69% Winner Norris Dem. 3… ## 7 14 Sinclair* Rep. 0% Winner Uncontested ## 8 16 Boulton* Dem. 77% Winner Johnson Lib. … ## 9 18 Petersen* Dem. 0% Winner Uncontested ## 10 20 Zaun* Rep. 51% Winner Martin Dem. 4… ## # ℹ 16 more rows ## ## [[9]] ## # A tibble: 101 × 3 ## `District Dist.` Candidates `` ## <chr> <chr> <chr> ## 1 1 Wills* Rep. 0% Winner Uncontested ## 2 2 Jones* Rep. 0% Winner Uncontested ## 3 3 Bush Rep. 0% Winner Uncontested ## 4 4 Wheeler* Rep. 82% Winner Johnson Dem. 1… ## 5 5 Jeneary* Rep. 0% Winner Uncontested ## 6 6 Bossman* Rep. 0% Winner Uncontested ## 7 7 Stone Rep. 62% Winner Jensen Dem. 38% ## 8 8 Baxter* Rep. 75% Winner Kiss Dem. 25% ## 9 9 Meyer* Rep. 57% Winner Clayton Dem. 4… ## 10 10 Sexton* Rep. 76% Winner Schultes Dem. … ## # ℹ 91 more rows ## ## [[10]] ## # A tibble: 5 × 3 ## Question `` `` ## <chr> <chr> <chr> ## 1 Retain Bower Yes 73% Winner No 27% ## 2 Retain May Yes 73% Winner No 27% ## 3 Retain Schumacher Yes 76% Winner No 24% ## 4 Retain Soorholtz Greer Yes 74% Winner No 26% ## 5 + View all – Collapse all + View all – Collapse all + Vie… ## ## [[11]] ## # A tibble: 5 × 3 ## `Race Race` `` `` ## <chr> <chr> <chr> ## 1 Retain Christensen Yes 73% Winner No 27% ## 2 Retain Mansfield Yes 69% Winner No 31% ## 3 Retain McDonald Yes 71% Winner No 29% ## 4 Retain Waterman Yes 70% Winner No 30% ## 5 + View all – Collapse all + View all – Collapse all + Vie… ``` ```r ia_results_2020 <- tables[[2]][,-4] %>% mutate( `Total votes` = parse_number(`Total votes`), Absentee = parse_number(Absentee), Winner = gsub("(.*) +.*", "\\1", Margin), `Margin 2020` = gsub("(.*) (+.*)", "\\2", Margin), `Margin 2020` = ifelse(Winner=="Biden", paste0("D",`Margin 2020`), paste0("R",`Margin 2020`)) ) ``` ``` ## Warning: There were 2 warnings in `mutate()`. ## The first warning was: ## ℹ In argument: `Total votes = parse_number(`Total votes`)`. ## Caused by warning: ## ! 1 parsing failure. ## row col expected actual ## 100 -- a number View all Collapse ## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining ## warning. ``` ```r ia_results_2020 ``` ``` ## # A tibble: 100 × 7 ## County Margin `2016 margin` `Total votes` Absentee Winner ## <chr> <chr> <chr> <dbl> <dbl> <chr> ## 1 Allam… Trump… R+24.2 7422 4816 Trump ## 2 Polk Biden… D+11.4 258755 140547 Biden ## 3 Linn Biden… D+9 127458 81691 Biden ## 4 Scott Biden… D+1.5 92599 63662 Biden ## 5 Johns… Biden… D+37.9 83851 60517 Biden ## 6 Black… Biden… D+7.4 66593 40870 Biden ## 7 Dallas Trump… R+9.5 56022 34152 Trump ## 8 Dubuq… Trump… R+1.2 53926 34224 Trump ## 9 Story Biden… D+12.3 51038 30224 Biden ## 10 Potta… Trump… R+21.3 45744 27443 Trump ## # ℹ 90 more rows ## # ℹ 1 more variable: `Margin 2020` <chr> ``` --- class:inverse ## Your Turn Can you get the data you just scraped onto a map of the counties? - Try to map the percentage you calculated to the fill-color of the counties' polygons. You will need to `join` the percentages with the county polygons: ```r counties <- map_data("county") head(counties) ``` ``` ## long lat group order region subregion ## 1 -86.50517 32.34920 1 1 alabama autauga ## 2 -86.53382 32.35493 1 2 alabama autauga ## 3 -86.54527 32.36639 1 3 alabama autauga ## 4 -86.55673 32.37785 1 4 alabama autauga ## 5 -86.57966 32.38357 1 5 alabama autauga ## 6 -86.59111 32.37785 1 6 alabama autauga ``` --- ## Your Turn - Solution ```r ia_results <- ia_results %>% mutate( PercDem = Clinton/(Trump + Clinton)*100, subregion = tolower(`Vote by county`) ) # Which counties don't work? anti_join(ia_results, counties %>% filter(region=="iowa"), by="subregion") ``` ``` ## # A tibble: 1 × 5 ## `Vote by county` Trump Clinton PercDem subregion ## <chr> <dbl> <dbl> <dbl> <chr> ## 1 O'Brien 5752 1315 18.6 o'brien ``` ```r anti_join(counties %>% filter(region=="iowa"), ia_results, by="subregion") ``` ``` ## long lat group order region subregion ## 1 -95.86156 43.26404 825 26075 iowa obrien ## 2 -95.39174 43.26404 825 26076 iowa obrien ## 3 -95.39747 42.91454 825 26077 iowa obrien ## 4 -95.85583 42.90881 825 26078 iowa obrien ## 5 -95.86156 43.26404 825 26079 iowa obrien ``` --- ## Your Turn - Solution (cont'd) ```r ia_results <- ia_results %>% mutate( subregion = replace(subregion, subregion=="o'brien", "obrien") ) ia_full <- left_join(ia_results, counties %>% filter(region=="iowa"), by="subregion") ``` --- ## Your Turn - Solution (cont'd) ```r ia_full %>% ggplot(aes( x = long, y = lat, group = subregion)) + geom_polygon(aes(fill = PercDem), color = "black") + scale_fill_gradient2(midpoint=50, mid = "white") + ggthemes::theme_map() ``` <img src="01_web-scraping_files/figure-html/unnamed-chunk-8-1.png" width="75%" /> --- ## Beyond tables Sometimes data on the web is not structured as nicely... e.g. let's assume we want to get a list of all recently active baseball players from [Baseball reference](http://www.baseball-reference.com/players/) .center[![:scale 80%](baseball_reference.png)] --- ## SelectorGadget - SelectorGadget is a javascript bookmarklet to determine the css selectors of pieces of a website we want to extract. - Bookmark the [SelectorGadget](https://selectorgadget.com/) link, then click on it to use it (or add the chrome extension) - When SelectorGadget is active, pieces of the website are highlighted in orange/green/red. - Use SelectorGadget on http://www.baseball-reference.com/players/ . - Read more details on `vignette("selectorgadget")` (or on the [rvest website](https://rvest.tidyverse.org/articles/selectorgadget.html)) If you prefer, you can also read the HTML code and create your own [CSS](https://www.w3schools.com/cssref/css_selectors.asp) or [XPATH](https://www.w3schools.com/xml/xpath_syntax.asp) selectors. --- ## SelectorGadget Result *Select all active baseball players with a last name starting with 'a'* ```r url <- "http://www.baseball-reference.com/players/a/" html <- read_html(url) html %>% html_elements("b") %>% html_text() ``` ``` ## [1] "Fernando Abad (2010-2021)" ## [2] "Cory Abbott (2021-2022)" ## [3] "CJ Abrams (2022-2023)" ## [4] "Albert Abreu (2020-2023)" ## [5] "Bryan Abreu (2019-2023)" ## [6] "José Abreu (2014-2023)" ## [7] "Domingo Acevedo (2021-2023)" ## [8] "Ronald Acuna Jr. (2018-2023)" ## [9] "Jason Adam (2018-2023)" ## [10] "Willy Adames (2018-2023)" ## [11] "Austin Adams (2017-2022)" ## [12] "Matt Adams (2012-2021)" ## [13] "Riley Adams (2021-2022)" ## [14] "Jo Adell (2020-2022)" ## [15] "Joan Adon (2021-2022)" ## [16] "Ehire Adrianza (2013-2022)" ## [17] "Jesús Aguilar (2014-2023)" ## [18] "Ryan Aguilar (2022-2022)" ## [19] "Nick Ahmed (2014-2023)" ## [20] "Keegan Akin (2020-2023)" ## [21] "Hanser Alberto (2015-2023)" ## [22] "Ozzie Albies (2017-2023)" ## [23] "Jorge Alcala (2019-2023)" ## [24] "Sandy Alcantara (2017-2023)" ## [25] "Sergio Alcántara (2020-2022)" ## [26] "Jason Alexander (2022-2022)" ## [27] "Scott Alexander (2015-2023)" ## [28] "Tyler Alexander (2019-2023)" ## [29] "A.J. Alexy (2021-2022)" ## [30] "Jorge Alfaro (2016-2022)" ## [31] "Kolby Allard (2018-2022)" ## [32] "Cam Alldred (2022-2022)" ## [33] "Austin Allen (2019-2022)" ## [34] "Greg Allen (2017-2022)" ## [35] "Logan Allen (2019-2022)" ## [36] "Nick Allen (2022-2023)" ## [37] "Nick Allgeyer (2021-2021)" ## [38] "Abraham Almonte (2013-2022)" ## [39] "Yency Almonte (2018-2023)" ## [40] "Pete Alonso (2019-2023)" ## [41] "Dan Altavilla (2016-2021)" ## [42] "Jose Altuve (2011-2022)" ## [43] "José Alvarado (2017-2023)" ## [44] "Eddy Alvarez (2020-2022)" ## [45] "Francisco Álvarez (2022-2022)" ## [46] "Jose Alvarez (2013-2022)" ## [47] "Yordan Alvarez (2019-2023)" ## [48] "Adbert Alzolay (2019-2023)" ## [49] "Brian Anderson (2017-2023)" ## [50] "Chase Anderson (2014-2022)" ## [51] "Ian Anderson (2020-2022)" ## [52] "Nick Anderson (2019-2023)" ## [53] "Tim Anderson (2016-2023)" ## [54] "Tyler Anderson (2016-2023)" ## [55] "Matt Andriese (2015-2021)" ## [56] "Elvis Andrus (2009-2023)" ## [57] "Miguel Andujar (2017-2022)" ## [58] "Tejay Antone (2020-2021)" ## [59] "Jonathan Aranda (2022-2022)" ## [60] "Víctor Arano (2017-2022)" ## [61] "Jonathan Araúz (2020-2022)" ## [62] "Francisco Arcia (2018-2018)" ## [63] "Orlando Arcia (2016-2023)" ## [64] "Nolan Arenado (2013-2023)" ## [65] "Gabriel Arias (2022-2023)" ## [66] "Shawn Armstrong (2015-2022)" ## [67] "Randy Arozarena (2019-2023)" ## [68] "Luis Arraez (2019-2023)" ## [69] "Christian Arroyo (2017-2023)" ## [70] "Aaron Ashby (2021-2022)" ## [71] "Graham Ashcraft (2022-2023)" ## [72] "Javier Assad (2022-2023)" ## [73] "Pedro Avila (2019-2022)" ## [74] "José Azocar (2022-2023)" ``` --- ## Example, varied We are, in fact, not just interested in the *names of the players*, but also in the *links* to each player's website - `html_attr` let's us access an attribute of an html node - `html_attrs` extracts all attributes of an html node ```r html %>% html_elements("b a") %>% html_attr(name="href") ``` ``` ## [1] "/players/a/abadfe01.shtml" ## [2] "/players/a/abbotco01.shtml" ## [3] "/players/a/abramcj01.shtml" ## [4] "/players/a/abreual01.shtml" ## [5] "/players/a/abreubr01.shtml" ## [6] "/players/a/abreujo02.shtml" ## [7] "/players/a/acevedo01.shtml" ## [8] "/players/a/acunaro01.shtml" ## [9] "/players/a/adamja01.shtml" ## [10] "/players/a/adamewi01.shtml" ## [11] "/players/a/adamsau02.shtml" ## [12] "/players/a/adamsma01.shtml" ## [13] "/players/a/adamsri03.shtml" ## [14] "/players/a/adelljo01.shtml" ## [15] "/players/a/adonjo01.shtml" ## [16] "/players/a/adriaeh01.shtml" ## [17] "/players/a/aguilje01.shtml" ## [18] "/players/a/aguilry01.shtml" ## [19] "/players/a/ahmedni01.shtml" ## [20] "/players/a/akinke01.shtml" ## [21] "/players/a/alberha01.shtml" ## [22] "/players/a/albieoz01.shtml" ## [23] "/players/a/alcaljo01.shtml" ## [24] "/players/a/alcansa01.shtml" ## [25] "/players/a/alcanse01.shtml" ## [26] "/players/a/alexaja01.shtml" ## [27] "/players/a/alexasc02.shtml" ## [28] "/players/a/alexaty01.shtml" ## [29] "/players/a/alexyaj01.shtml" ## [30] "/players/a/alfarjo01.shtml" ## [31] "/players/a/allarko01.shtml" ## [32] "/players/a/alldrca01.shtml" ## [33] "/players/a/allenau01.shtml" ## [34] "/players/a/allengr01.shtml" ## [35] "/players/a/allenlo01.shtml" ## [36] "/players/a/allenni02.shtml" ## [37] "/players/a/allgeni01.shtml" ## [38] "/players/a/almonab01.shtml" ## [39] "/players/a/almonye01.shtml" ## [40] "/players/a/alonspe01.shtml" ## [41] "/players/a/altavda01.shtml" ## [42] "/players/a/altuvjo01.shtml" ## [43] "/players/a/alvarjo03.shtml" ## [44] "/players/a/alvared01.shtml" ## [45] "/players/a/alvarfr01.shtml" ## [46] "/players/a/alvarjo02.shtml" ## [47] "/players/a/alvaryo01.shtml" ## [48] "/players/a/alzolad01.shtml" ## [49] "/players/a/anderbr06.shtml" ## [50] "/players/a/anderch01.shtml" ## [51] "/players/a/anderia01.shtml" ## [52] "/players/a/anderni01.shtml" ## [53] "/players/a/anderti01.shtml" ## [54] "/players/a/anderty01.shtml" ## [55] "/players/a/andrima01.shtml" ## [56] "/players/a/andruel01.shtml" ## [57] "/players/a/andujmi01.shtml" ## [58] "/players/a/antonte01.shtml" ## [59] "/players/a/arandjo01.shtml" ## [60] "/players/a/aranovi01.shtml" ## [61] "/players/a/arauzjo01.shtml" ## [62] "/players/a/arciafr01.shtml" ## [63] "/players/a/arciaor01.shtml" ## [64] "/players/a/arenano01.shtml" ## [65] "/players/a/ariasga01.shtml" ## [66] "/players/a/armstsh01.shtml" ## [67] "/players/a/arozara01.shtml" ## [68] "/players/a/arraelu01.shtml" ## [69] "/players/a/arroych01.shtml" ## [70] "/players/a/ashbyaa01.shtml" ## [71] "/players/a/ashcrgr01.shtml" ## [72] "/players/a/assadja01.shtml" ## [73] "/players/a/avilape01.shtml" ## [74] "/players/a/azocajo01.shtml" ``` --- class:inverse ## Your Turn Use the SelectorGadget on the website for [Fernando Abad](https://www.baseball-reference.com/players/a/abadfe01.shtml) Find the css selector to extract his career statistics and load them into your R session. Does the same code work to extract career statistics for (some) of the other active players? What other information do we need to know? - and how can we get to that? --- ## Your Turn - Solution ```r url <- "https://www.baseball-reference.com/players/a/abadfe01.shtml" html <- read_html(url) html %>% html_elements(".stats_pullout") %>% html_text() ``` ``` ## [1] "\n\nSUMMARY\n\nCareer\n\n\n\nWAR\n3.1\n\n\nW\n8\n\n\nL\n29\n\n\nERA\n3.77\n\n\nG\n400\n\n\nGS\n6\n\n\nSV\n2\n\n\nIP\n348.1\n\n\nSO\n290\n\n\nWHIP\n1.306\n\n\n\n" ``` ```r html %>% html_elements(".stats_pullout .poptip") %>% html_text() ``` ``` ## [1] "WAR" "W" "L" "ERA" "G" "GS" "SV" "IP" ## [9] "SO" "WHIP" ``` ```r html %>% html_elements(".p3 p , .p2 p, .p1 p, .stats_pullout strong") %>% html_text() ``` ``` ## [1] "SUMMARY" "Career" "WAR" "3.1" "W" ## [6] "8" "L" "29" "ERA" "3.77" ## [11] "G" "400" "GS" "6" "SV" ## [16] "2" "IP" "348.1" "SO" "290" ## [21] "WHIP" "1.306" ``` ```r html %>% html_elements(".p1") %>% html_text() ``` ``` ## [1] "\n\nWAR\n3.1\n\n\nW\n8\n\n\nL\n29\n\n\nERA\n3.77\n" ``` --- ## Your Turn - Solution (cont'd) It's sometimes easier (for data munging after extracting) to extract multiple pieces rather than everything in one go. ```r (stats <- html %>% html_elements("span strong") %>% html_text()) ``` ``` ## [1] "SUMMARY" "WAR" "W" "L" "ERA" ## [6] "G" "GS" "SV" "IP" "SO" ## [11] "WHIP" ``` ```r (season <- html %>% html_elements(".stats_pullout p:nth-child(2)") %>% html_text()) ``` ``` ## [1] "Career" "3.1" "8" "29" "3.77" "400" ## [7] "6" "2" "348.1" "290" "1.306" ``` ```r (career <- html %>% html_elements(".stats_pullout p:nth-child(3)") %>% html_text()) ``` ``` ## character(0) ``` --- class:inverse ## Your Turn Clean up the code for extracting the career statistics and write a function `getStats` that takes a url and returns a dataset (or tibble). Does the same code work to extract career statistics for (some) of the other active players? --- ## Your Turn - Solution ```r getStats <- function(url) { html <- read_html(url) stats <- html %>% html_nodes("span strong") %>% html_text() season <- html %>% html_nodes(".stats_pullout p:nth-child(2)") %>% html_text() career <- html %>% html_nodes(".stats_pullout p:nth-child(3)") %>% html_text() data.frame(Statistics = stats[-1], Season= parse_number(season[-1]), Career = ifelse(is.null(career), NA, parse_number(career[-1]))) } url <- "http://www.baseball-reference.com/players/a/" html <- read_html(url) players <- html %>% html_nodes("strong a") %>% html_text() links <- html %>% html_nodes("strong a") %>% html_attr(name="href") getStats("https://www.baseball-reference.com//players/a/abreujo02.shtml") %>% str() ``` ``` ## Warning: 1 parsing failure. ## row col expected actual ## 1 -- a number - ``` ``` ## 'data.frame': 12 obs. of 3 variables: ## $ Statistics: chr "WAR" "AB" "H" "HR" ... ## $ Season : num NA 21 8 0 0.381 1 3 0 0.458 0.429 ... ## ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame) ## .. ..$ row : int 1 ## .. ..$ col : int NA ## .. ..$ expected: chr "a number" ## .. ..$ actual : chr "-" ## $ Career : num 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 ... ``` ```r getStats(file.path("https://www.baseball-reference.com", links[2])) %>% str() ``` ``` ## Warning: 1 parsing failure. ## row col expected actual ## 1 -- a number - ``` ``` ## 'data.frame': 12 obs. of 3 variables: ## $ Statistics: chr "WAR" "AB" "H" "HR" ... ## $ Season : num NA 21 8 0 0.381 1 3 0 0.458 0.429 ... ## ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame) ## .. ..$ row : int 1 ## .. ..$ col : int NA ## .. ..$ expected: chr "a number" ## .. ..$ actual : chr "-" ## $ Career : num 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 ... ``` ```r # getStats(file.path("https://www.baseball-reference.com", links[3])) %>% str() # gives an error getStats(file.path("https://www.baseball-reference.com", links[4])) %>% str() ``` ``` ## 'data.frame': 12 obs. of 3 variables: ## $ Statistics: chr "WAR" "AB" "H" "HR" ... ## $ Season : num 4.9 2421 624 118 0.258 ... ## $ Career : num NA NA NA NA NA NA NA NA NA NA ... ``` --- ## Your Turn - Solution (cont'd) Now apply to other players (with 'a' as starting letter) - first we get everything tidied up in a dataset ```r url <- "http://www.baseball-reference.com/players/a/" html <- read_html(url) players <- html %>% html_nodes("strong a") %>% html_text() links <- html %>% html_nodes("strong a") %>% html_attr(name="href") bb <- tibble(players=players, links = links) head(bb) ``` ``` ## # A tibble: 6 × 2 ## players links ## <chr> <chr> ## 1 Fernando Abad /players/a/abadfe01.shtml ## 2 José Abreu /players/a/abreujo02.shtml ## 3 Ronald Acuna Jr. /players/a/acunaro01.shtml ## 4 Matt Adams /players/a/adamsma01.shtml ## 5 Ehire Adrianza /players/a/adriaeh01.shtml ## 6 Jesús Aguilar /players/a/aguilje01.shtml ``` ```r bb_head <- bb[1:5,] %>% mutate( data = links %>% purrr::map(.f = function(link) { cat(link) cat("\n") getStats(file.path("https://www.baseball-reference.com", link)) }) ) ``` ``` ## /players/a/abadfe01.shtml ## /players/a/abreujo02.shtml ## /players/a/acunaro01.shtml ## /players/a/adamsma01.shtml ## /players/a/adriaeh01.shtml ``` ``` ## Warning: There were 2 warnings in `mutate()`. ## The first warning was: ## ℹ In argument: `data = `%>%`(...)`. ## Caused by warning: ## ! 1 parsing failure. ## row col expected actual ## 1 -- a number - ## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining ## warning. ``` --- class: inverse ## Your Turn Why does `getStats(file.path("https://www.baseball-reference.com", "/players/a/adlemti01.shtml"))` have NAs in the Career? Is this fixable? How do we get all of the active players? (not just the ones with last names starting with 'a') --- ## Package `rvest` The `session` suite of commands allows to simulate an html session without a browser. Create a session with `session(url)` Navigate: `session_jump_to()` Follow a link: `session_follow_link()`. navigate back and forward with `session_back()` and `session_forward()`. ... and extract the pieces you are interested in using `read_html`, `html_element`, `html_elements`