Stat 585 - Web Scraping

class: center, middle, inverse, title-slide

.title[
# Stat 585 - Web Scraping
]
.author[
### Heike Hofmann and Susan Vanderplas
]

---

## Web Scraping

- Transform data from web pages into usable information

- Automate the process

![](http://webdata-scraping.com/media/2015/01/Web-Scraping-Process.png)

---

## `rvest` + `xml2`: Easy Web Scraping
  
- `read_html` gets the full set of HTML markup from a URL

```r
library(rvest)
url <- "https://www.nytimes.com/elections/2016/results/iowa"
html <- read_html(url)
html
```

```
## {html_document}
## <html lang="en" itemscope="" xmlns:og="//opengraphprotocol.org/schema/" itemtype="//schema.org/NewsArticle">
## [1] <head>\n<title>Iowa Election Results 2016 – The New Y ...
## [2] <body class="eln-general-state-results eln-state-iowa ...
```

- Use `html_attr`, `html_node`, `html_table`, and `html_text` to extract useful information from the markup

---

## Get a *table* from an online source

`html_table` extracts all tables from the sourced html into a list of data frames:

```r
tables <- html %>% html_table(fill=TRUE)
tables %>% purrr::map(glimpse)
```

```
## Rows: 11
## Columns: 7
## $ Candidate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Candidate <chr> "Trump\n            \n            \n    …
## $ Party     <chr> "Republican\n          Rep.", "Democrat\…
## $ Votes     <chr> "800,983", "653,669", "59,186", "19,992"…
## $ Pct.      <chr> "51.1%", "41.7%", "3.8%", "1.3%", "0.8%"…
## $ ``        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ E.V.      <chr> "6", "—", "—", "—", "—", "—", "—", "—", …
## Rows: 99
## Columns: 3
## $ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson…
## $ Trump            <chr> "93,492", "48,390", "39,149", "21…
## $ Clinton          <chr> "119,804", "58,935", "40,440", "5…
## Rows: 6
## Columns: 6
## $ Candidate <lgl> NA, NA, NA, NA, NA, NA
## $ Candidate <chr> "Grassley*\n            \n            \n…
## $ Party     <chr> "Republican\n          Rep.", "Democrat\…
## $ Votes     <chr> "926,007", "549,460", "41,794", "17,649"…
## $ Pct.      <chr> "60.2%", "35.7%", "2.7%", "1.1%", "0.3%"…
## $ ``        <lgl> NA, NA, NA, NA, NA, NA
## Rows: 99
## Columns: 3
## $ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson…
## $ Grassley         <chr> "118,164", "62,737", "46,415", "2…
## $ Judge            <chr> "100,317", "47,635", "34,503", "4…
## Rows: 4
## Columns: 5
## $ `District\n          Dist.` <int> 1, 2, 3, 4
## $ Leader                      <chr> "54%Blum*\n      Rep."…
## $ ``                          <chr> "46%Vernon\n      Dem.…
## $ Rpt.                        <chr> "100%", "100%", "100%"…
## $ ``                          <lgl> NA, NA, NA, NA
## Rows: 25
## Columns: 5
## $ `Seat\n          Seat` <int> 2, 4, 6, 8, 10, 12, 14, 16,…
## $ Leader                 <chr> "0%Feenstra*\n      Rep.", …
## $ ``                     <chr> "Uncontested", "39%Bangert\…
## $ Rpt.                   <chr> "", "100%", "100%", "100%",…
## $ ``                     <lgl> NA, NA, NA, NA, NA, NA, NA,…
## Rows: 100
## Columns: 5
## $ `District\n          Dist.` <int> 1, 2, 3, 4, 5, 6, 7, 8…
## $ Leader                      <chr> "0%Wills*\n      Rep."…
## $ ``                          <chr> "Uncontested", "Uncont…
## $ Rpt.                        <chr> "", "", "100%", "100%"…
## $ ``                          <lgl> NA, NA, NA, NA, NA, NA…
## Rows: 3
## Columns: 5
## $ Question <chr> "Retain Brent Appel", "Retain Daryl Hecht…
## $ Yes      <chr> "64%Yes", "64%Yes", "65%Yes"
## $ No       <chr> "36%No", "36%No", "35%No"
## $ Rpt.     <chr> "100%", "100%", "100%"
## $ ``       <lgl> NA, NA, NA
```

```
## [[1]]
## # A tibble: 11 × 7
##    Candidate Candidate         Party Votes Pct.  ``    E.V. 
##    <lgl>     <chr>             <chr> <chr> <chr> <lgl> <chr>
##  1 NA        "Trump\n        … "Rep… 800,… 51.1% NA    6    
##  2 NA        "Clinton\n      … "Dem… 653,… 41.7% NA    —    
##  3 NA        "Johnson\n      … "Lib… 59,1… 3.8%  NA    —    
##  4 NA        "Others\n       … "Ind… 19,9… 1.3%  NA    —    
##  5 NA        "McMullin\n     … "Pet… 12,3… 0.8%  NA    —    
##  6 NA        "Stein\n        … "Gre… 11,4… 0.7%  NA    —    
##  7 NA        "Castle\n       … "Con… 5,335 0.3%  NA    —    
##  8 NA        "Kahn\n         … "Ind… 2,247 0.1%  NA    —    
##  9 NA        "De La Fuente\n … "Pet… 451   0.0%  NA    —    
## 10 NA        "La Riva\n      … "P.S… 323   0.0%  NA    —    
## 11 NA        "Others\n       … ""    32,2… 2.1%  NA    —    
## 
## [[2]]
## # A tibble: 99 × 3
##    `Vote by county` Trump  Clinton
##    <chr>            <chr>  <chr>  
##  1 Polk             93,492 119,804
##  2 Linn             48,390 58,935 
##  3 Scott            39,149 40,440 
##  4 Johnson          21,044 50,200 
##  5 Black Hawk       27,476 32,233 
##  6 Story            19,458 25,709 
##  7 Dubuque          23,460 22,850 
##  8 Woodbury         24,727 16,210 
##  9 Pottawattamie    24,447 15,355 
## 10 Dallas           19,339 15,701 
## # ℹ 89 more rows
## 
## [[3]]
## # A tibble: 6 × 6
##   Candidate Candidate                Party Votes Pct.  ``   
##   <lgl>     <chr>                    <chr> <chr> <chr> <lgl>
## 1 NA        "Grassley*\n           … "Rep… 926,… 60.2% NA   
## 2 NA        "Judge\n            \n … "Dem… 549,… 35.7% NA   
## 3 NA        "Aldrich\n            \… "Lib… 41,7… 2.7%  NA   
## 4 NA        "Hennager\n            … "Ind… 17,6… 1.1%  NA   
## 5 NA        "Luick-Thrams\n        … "Pet… 4,441 0.3%  NA   
## 6 NA        "Others\n            \n… ""    22,0… 1.4%  NA   
## 
## [[4]]
## # A tibble: 99 × 3
##    `Vote by county` Grassley Judge  
##    <chr>            <chr>    <chr>  
##  1 Polk             118,164  100,317
##  2 Linn             62,737   47,635 
##  3 Scott            46,415   34,503 
##  4 Johnson          28,914   42,699 
##  5 Black Hawk       33,884   27,245 
##  6 Story            25,475   21,472 
##  7 Dubuque          27,348   19,291 
##  8 Woodbury         27,166   13,909 
##  9 Pottawattamie    25,721   12,943 
## 10 Dallas           24,374   11,876 
## # ℹ 89 more rows
## 
## [[5]]
## # A tibble: 4 × 5
##   `District\n          Dist.` Leader       ``    Rpt.  ``   
##                         <int> <chr>        <chr> <chr> <lgl>
## 1                           1 "54%Blum*\n… "46%… 100%  NA   
## 2                           2 "54%Loebsac… "46%… 100%  NA   
## 3                           3 "54%Young*\… "40%… 100%  NA   
## 4                           4 "61%King*\n… "39%… 100%  NA   
## 
## [[6]]
## # A tibble: 25 × 5
##    `Seat\n          Seat` Leader           ``    Rpt.  ``   
##                     <int> <chr>            <chr> <chr> <lgl>
##  1                      2 "0%Feenstra*\n … "Unc… ""    NA   
##  2                      4 "61%Guth*\n    … "39%… "100… NA   
##  3                      6 "83%Segebart*\n… "17%… "100… NA   
##  4                      8 "54%Dawson\n   … "46%… "100… NA   
##  5                     10 "67%Chapman*\n … "33%… "100… NA   
##  6                     12 "78%Costello*\n… "22%… "100… NA   
##  7                     14 "74%Sinclair*\n… "26%… "100… NA   
##  8                     16 "60%Boulton\n  … "35%… "100… NA   
##  9                     18 "0%Petersen*\n … "Unc… ""    NA   
## 10                     20 "60%Zaun*\n    … "41%… "100… NA   
## # ℹ 15 more rows
## 
## [[7]]
## # A tibble: 100 × 5
##    `District\n          Dist.` Leader      ``    Rpt.  ``   
##                          <int> <chr>       <chr> <chr> <lgl>
##  1                           1 "0%Wills*\… "Unc… ""    NA   
##  2                           2 "0%Jones*\… "Unc… ""    NA   
##  3                           3 "81%Husema… "19%… "100… NA   
##  4                           4 "63%Wheele… "37%… "100… NA   
##  5                           5 "77%Holz*\… "23%… "100… NA   
##  6                           6 "66%Carlin… "35%… "100… NA   
##  7                           7 "63%Gassma… "37%… "100… NA   
##  8                           8 "68%Baxter… "32%… "100… NA   
##  9                           9 "57%Miller… "43%… "100… NA   
## 10                          10 "0%Sexton*… "Unc… ""    NA   
## # ℹ 90 more rows
## 
## [[8]]
## # A tibble: 3 × 5
##   Question           Yes    No    Rpt.  ``   
##   <chr>              <chr>  <chr> <chr> <lgl>
## 1 Retain Brent Appel 64%Yes 36%No 100%  NA   
## 2 Retain Daryl Hecht 64%Yes 36%No 100%  NA   
## 3 Retain Mark Cady   65%Yes 35%No 100%  NA
```

---

## Data Munging

Most tables need a bit of clean-up:

```r
ia_results <- tables[[2]] %>% mutate(
  Trump  = parse_number(Trump),
  Clinton = parse_number(Clinton)
)
ia_results
```

```
## # A tibble: 99 × 3
##    `Vote by county` Trump Clinton
##    <chr>            <dbl>   <dbl>
##  1 Polk             93492  119804
##  2 Linn             48390   58935
##  3 Scott            39149   40440
##  4 Johnson          21044   50200
##  5 Black Hawk       27476   32233
##  6 Story            19458   25709
##  7 Dubuque          23460   22850
##  8 Woodbury         24727   16210
##  9 Pottawattamie    24447   15355
## 10 Dallas           19339   15701
## # ℹ 89 more rows
```
---
class:inverse
## Your Turn

Connect to the website of the NY Times election results at https://www.nytimes.com/elections/2016/results/president
  
- Pick your favorite state (by clicking on the state on the map)

- Use `rvest` to download the election results for the 2016 Presidential Election for all counties of your state.

- Clean up the data (numbers should be numbers).

- Convert the numbers into row-wise percentages (decide on either %Republican or %Democrat).

---
class:inverse
## Your Turn

How well does the web-scraping work over time? 
Results for the 2020 Election are available from https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html

Again, pick your favorite state and apply your previous code.

How many things have changed? Which format do you prefer? Why?

---

## Your Turn - Solutions

```r
ia2020 <- "https://www.nytimes.com/interactive/2020/11/03/us/elections/results-iowa.html"
html <- read_html(ia2020)
tables <- html %>% html_table(fill=TRUE)
tables %>% purrr::map(glimpse)
```

```
## Rows: 12
## Columns: 6
## $ X1 <chr> "", "", "", "", "", "", "", "", "", "", "", "Vi…
## $ X2 <chr> "Trump*   Donald J. Trump*   Winner", "Biden   …
## $ X3 <chr> "Republican Rep.", "Democrat Dem.", "Libertaria…
## $ X4 <chr> "897,672", "759,061", "19,637", "4,337", "3,210…
## $ X5 <chr> "53.1%", "44.9%", "1.2%", "0.3%", "0.2%", "0.2%…
## $ X6 <chr> "6", "—", "—", "—", "—", "—", "—", "—", "—", "—…
## Rows: 100
## Columns: 6
## $ County                <chr> "Allamakee", "Polk", "Linn",…
## $ Margin                <chr> "Trump +29", "Biden +15", "B…
## $ `2016 margin`         <chr> "R+24.2", "D+11.4", "D+9", "…
## $ `Est. votes reported` <chr> "99%", "100%", "100%", "100%…
## $ `Total votes`         <chr> "7,422", "258,755", "127,458…
## $ Absentee              <chr> "4,816", "140,547", "81,691"…
## Rows: 12
## Columns: 5
## $ Candidate                   <chr> "", "", "", "", "", ""…
## $ Candidate                   <chr> "Biden", "Trump", "Jor…
## $ `Absentee/early votesVotes` <chr> "568,294", "410,030", …
## $ Pct.                        <chr> "57.2%", "41.3%", "0.9…
## $ ``                          <chr> "", "", "", "", "", ""…
## Rows: 5
## Columns: 5
## $ X1 <lgl> NA, NA, NA, NA, NA
## $ X2 <chr> "Ernst*   Joni Ernst*   Winner", "Greenfield   …
## $ X3 <chr> "Republican Rep.", "Democrat Dem.", "Libertaria…
## $ X4 <chr> "864,997", "754,859", "36,961", "13,800", "1,67…
## $ X5 <chr> "51.8%", "45.2%", "2.2%", "0.8%", ""
## Rows: 9
## Columns: 2
## $ X1 <chr> "1st District ›", "Representative Abby Finkenau…
## $ X2 <chr> "WinnerHinson +3", "Representative Abby Finkena…
## Rows: 4
## Columns: 6
## $ `District Dist.` <int> 1, 2, 3, 4
## $ Margin           <chr> "R+3", "R <0.01", "D+1.4", "R+24"
## $ Candidates       <chr> "Hinson Rep. Winner", "Miller-Mee…
## $ ``               <chr> "Finkenauer* Dem.", "Hart Dem.", …
## $ `Est. rpt.`      <chr> "100%", "100%", "100%", "100%"
## $ ``               <chr> "›", "›", "›", "›"
## Rows: 3
## Columns: 2
## $ `Constitutional Convention Question: Explore Proposals for Changes to Constitution` <chr> …
## $ `Constitutional Convention Question: Explore Proposals for Changes to Constitution` <chr> …
## Rows: 26
## Columns: 3
## $ `District Dist.` <chr> "2", "4", "6", "8", "10", "12", "…
## $ Candidates       <chr> "Taylor* Rep. 0% Winner", "Guth* …
## $ ``               <chr> "Uncontested", "Uncontested", "Pe…
## Rows: 101
## Columns: 3
## $ `District Dist.` <chr> "1", "2", "3", "4", "5", "6", "7"…
## $ Candidates       <chr> "Wills* Rep. 0% Winner", "Jones* …
## $ ``               <chr> "Uncontested", "Uncontested", "Un…
## Rows: 5
## Columns: 3
## $ Question <chr> "Retain Bower", "Retain May", "Retain Sch…
## $ ``       <chr> "Yes 73% Winner", "Yes 73% Winner", "Yes …
## $ ``       <chr> "No 27%", "No 27%", "No 24%", "No 26%", "…
## Rows: 5
## Columns: 3
## $ `Race Race` <chr> "Retain Christensen", "Retain Mansfiel…
## $ ``          <chr> "Yes 73% Winner", "Yes 69% Winner", "Y…
## $ ``          <chr> "No 27%", "No 31%", "No 29%", "No 30%"…
```

```
## [[1]]
## # A tibble: 12 × 6
##    X1                          X2    X3    X4    X5    X6   
##    <chr>                       <chr> <chr> <chr> <chr> <chr>
##  1 ""                          Trum… "Rep… 897,… "53.… "6"  
##  2 ""                          Bide… "Dem… 759,… "44.… "—"  
##  3 ""                          Jorg… "Lib… 19,6… "1.2… "—"  
##  4 ""                          Writ… ""    4,337 "0.3… "—"  
##  5 ""                          West… "Ind… 3,210 "0.2… "—"  
##  6 ""                          Hawk… "Gre… 3,075 "0.2… "—"  
##  7 ""                          Blan… "Con… 1,707 "0.1… "—"  
##  8 ""                          De L… "Oth… 1,082 "0.1… "—"  
##  9 ""                          King… "Ind… 546   "<0.… "—"  
## 10 ""                          Pier… "Ind… 544   "<0.… "—"  
## 11 ""                          Tota… ""    1,69… ""    ""   
## 12 "View all candidates Colla… View… "Vie… View… "Vie… "Vie…
## 
## [[2]]
## # A tibble: 100 × 6
##    County        Margin  `2016 margin` `Est. votes reported`
##    <chr>         <chr>   <chr>         <chr>                
##  1 Allamakee     Trump … R+24.2        99%                  
##  2 Polk          Biden … D+11.4        100%                 
##  3 Linn          Biden … D+9           100%                 
##  4 Scott         Biden … D+1.5         100%                 
##  5 Johnson       Biden … D+37.9        100%                 
##  6 Black Hawk    Biden … D+7.4         100%                 
##  7 Dallas        Trump … R+9.5         100%                 
##  8 Dubuque       Trump … R+1.2         100%                 
##  9 Story         Biden … D+12.3        100%                 
## 10 Pottawattamie Trump … R+21.3        100%                 
## # ℹ 90 more rows
## # ℹ 2 more variables: `Total votes` <chr>, Absentee <chr>
## 
## [[3]]
## # A tibble: 12 × 5
##    Candidate    Candidate Absentee/early votes…¹ Pct.  ``   
##    <chr>        <chr>     <chr>                  <chr> <chr>
##  1 ""           Biden     568,294                "57.… ""   
##  2 ""           Trump     410,030                "41.… ""   
##  3 ""           Jorgensen 8,463                  "0.9… ""   
##  4 ""           Hawkins   1,713                  "0.2… ""   
##  5 ""           West      1,405                  "0.1… ""   
##  6 ""           Blankens… 1,062                  "0.1… ""   
##  7 ""           De La Fu… 587                    "0.1… ""   
##  8 ""           Write-ins 522                    "0.1… ""   
##  9 ""           King      341                    "<0.… ""   
## 10 ""           Pierce    305                    "<0.… ""   
## 11 ""           Total re… 992,722                ""    ""   
## 12 "View all c… View all… View all candidates C… "Vie… "Vie…
## # ℹ abbreviated name: ¹`Absentee/early votesVotes`
## 
## [[4]]
## # A tibble: 5 × 5
##   X1    X2                              X3       X4    X5   
##   <lgl> <chr>                           <chr>    <chr> <chr>
## 1 NA    Ernst*   Joni Ernst*   Winner   "Republ… 864,… "51.…
## 2 NA    Greenfield   Theresa Greenfield "Democr… 754,… "45.…
## 3 NA    Stewart   Rick Stewart          "Libert… 36,9… "2.2…
## 4 NA    Herzog   Suzanne Herzog         "Indepe… 13,8… "0.8…
## 5 NA    Total reported   Total reported ""       1,67… ""   
## 
## [[5]]
## # A tibble: 9 × 2
##   X1                                                   X2   
##   <chr>                                                <chr>
## 1 1st District ›                                       Winn…
## 2 Representative Abby Finkenauer, a Democrat, flipped… Repr…
## 3 100% reported                                        100%…
## 4 2nd District ›                                       Winn…
## 5 Democrats were hoping to hang on to the seat being … Demo…
## 6 100% reported                                        100%…
## 7 4th District ›                                       Winn…
## 8 J.D. Scholten, a former minor league pitcher turned… J.D.…
## 9 100% reported                                        100%…
## 
## [[6]]
## # A tibble: 4 × 6
##   `District Dist.` Margin Candidates ``    `Est. rpt.` ``   
##              <int> <chr>  <chr>      <chr> <chr>       <chr>
## 1                1 R+3    Hinson Re… Fink… 100%        ›    
## 2                2 R <0.… Miller-Me… Hart… 100%        ›    
## 3                3 D+1.4  Axne* Dem… Youn… 100%        ›    
## 4                4 R+24   Feenstra … Scho… 100%        ›    
## 
## [[7]]
## # A tibble: 3 × 2
##   Constitutional Convention Questio…¹ Constitutional Conve…²
##   <chr>                               <chr>                 
## 1 ""                                  ""                    
## 2 "Yes 30%"                           "No 70% Winner"       
## 3 "1,381,676 votes reported"          "1,381,676 votes repo…
## # ℹ abbreviated names:
## #   ¹`Constitutional Convention Question: Explore Proposals for Changes to Constitution`,
## #   ²`Constitutional Convention Question: Explore Proposals for Changes to Constitution`
## 
## [[8]]
## # A tibble: 26 × 3
##    `District Dist.` Candidates                ``            
##    <chr>            <chr>                     <chr>         
##  1 2                Taylor* Rep. 0% Winner    Uncontested   
##  2 4                Guth* Rep. 0% Winner      Uncontested   
##  3 6                Williams Rep. 69% Winner  Petersen Dem.…
##  4 8                Dawson* Rep. 52% Winner   Gorman Dem. 4…
##  5 10               Chapman* Rep. 63% Winner  Varley Dem. 3…
##  6 12               Costello* Rep. 69% Winner Norris Dem. 3…
##  7 14               Sinclair* Rep. 0% Winner  Uncontested   
##  8 16               Boulton* Dem. 77% Winner  Johnson Lib. …
##  9 18               Petersen* Dem. 0% Winner  Uncontested   
## 10 20               Zaun* Rep. 51% Winner     Martin Dem. 4…
## # ℹ 16 more rows
## 
## [[9]]
## # A tibble: 101 × 3
##    `District Dist.` Candidates               ``             
##    <chr>            <chr>                    <chr>          
##  1 1                Wills* Rep. 0% Winner    Uncontested    
##  2 2                Jones* Rep. 0% Winner    Uncontested    
##  3 3                Bush Rep. 0% Winner      Uncontested    
##  4 4                Wheeler* Rep. 82% Winner Johnson Dem. 1…
##  5 5                Jeneary* Rep. 0% Winner  Uncontested    
##  6 6                Bossman* Rep. 0% Winner  Uncontested    
##  7 7                Stone Rep. 62% Winner    Jensen Dem. 38%
##  8 8                Baxter* Rep. 75% Winner  Kiss Dem. 25%  
##  9 9                Meyer* Rep. 57% Winner   Clayton Dem. 4…
## 10 10               Sexton* Rep. 76% Winner  Schultes Dem. …
## # ℹ 91 more rows
## 
## [[10]]
## # A tibble: 5 × 3
##   Question                  ``                        ``    
##   <chr>                     <chr>                     <chr> 
## 1 Retain Bower              Yes 73% Winner            No 27%
## 2 Retain May                Yes 73% Winner            No 27%
## 3 Retain Schumacher         Yes 76% Winner            No 24%
## 4 Retain Soorholtz Greer    Yes 74% Winner            No 26%
## 5 + View all – Collapse all + View all – Collapse all + Vie…
## 
## [[11]]
## # A tibble: 5 × 3
##   `Race Race`               ``                        ``    
##   <chr>                     <chr>                     <chr> 
## 1 Retain Christensen        Yes 73% Winner            No 27%
## 2 Retain Mansfield          Yes 69% Winner            No 31%
## 3 Retain McDonald           Yes 71% Winner            No 29%
## 4 Retain Waterman           Yes 70% Winner            No 30%
## 5 + View all – Collapse all + View all – Collapse all + Vie…
```

```r
ia_results_2020 <- tables[[2]][,-4] %>% mutate(
  `Total votes` = parse_number(`Total votes`),
  Absentee = parse_number(Absentee),
  Winner = gsub("(.*) +.*", "\\1", Margin),
  `Margin 2020` = gsub("(.*) (+.*)", "\\2", Margin),
  `Margin 2020` = ifelse(Winner=="Biden", paste0("D",`Margin 2020`), paste0("R",`Margin 2020`))
)
```

```
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Total votes = parse_number(`Total votes`)`.
## Caused by warning:
## ! 1 parsing failure.
## row col expected            actual
## 100  -- a number View all Collapse
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining
##   warning.
```

```r
ia_results_2020
```

```
## # A tibble: 100 × 7
##    County Margin `2016 margin` `Total votes` Absentee Winner
##    <chr>  <chr>  <chr>                 <dbl>    <dbl> <chr> 
##  1 Allam… Trump… R+24.2                 7422     4816 Trump 
##  2 Polk   Biden… D+11.4               258755   140547 Biden 
##  3 Linn   Biden… D+9                  127458    81691 Biden 
##  4 Scott  Biden… D+1.5                 92599    63662 Biden 
##  5 Johns… Biden… D+37.9                83851    60517 Biden 
##  6 Black… Biden… D+7.4                 66593    40870 Biden 
##  7 Dallas Trump… R+9.5                 56022    34152 Trump 
##  8 Dubuq… Trump… R+1.2                 53926    34224 Trump 
##  9 Story  Biden… D+12.3                51038    30224 Biden 
## 10 Potta… Trump… R+21.3                45744    27443 Trump 
## # ℹ 90 more rows
## # ℹ 1 more variable: `Margin 2020` <chr>
```

---
class:inverse
## Your Turn

Can you get the data you just scraped onto a map of the counties? - Try to map the percentage you calculated to the fill-color of the counties' polygons.

You will need to `join` the percentages with the county polygons:

```r
counties <- map_data("county")
head(counties)
```

```
##        long      lat group order  region subregion
## 1 -86.50517 32.34920     1     1 alabama   autauga
## 2 -86.53382 32.35493     1     2 alabama   autauga
## 3 -86.54527 32.36639     1     3 alabama   autauga
## 4 -86.55673 32.37785     1     4 alabama   autauga
## 5 -86.57966 32.38357     1     5 alabama   autauga
## 6 -86.59111 32.37785     1     6 alabama   autauga
```

---

## Your Turn - Solution

```r
ia_results <- ia_results %>% mutate(
  PercDem = Clinton/(Trump + Clinton)*100,
  subregion = tolower(`Vote by county`)
)

# Which counties don't work?
anti_join(ia_results, counties %>% filter(region=="iowa"), by="subregion")
```

```
## # A tibble: 1 × 5
##   `Vote by county` Trump Clinton PercDem subregion
##   <chr>            <dbl>   <dbl>   <dbl> <chr>    
## 1 O'Brien           5752    1315    18.6 o'brien
```

```r
anti_join(counties %>% filter(region=="iowa"), ia_results,  by="subregion")
```

```
##        long      lat group order region subregion
## 1 -95.86156 43.26404   825 26075   iowa    obrien
## 2 -95.39174 43.26404   825 26076   iowa    obrien
## 3 -95.39747 42.91454   825 26077   iowa    obrien
## 4 -95.85583 42.90881   825 26078   iowa    obrien
## 5 -95.86156 43.26404   825 26079   iowa    obrien
```

---

## Your Turn - Solution (cont'd)

```r
ia_results <- ia_results %>% mutate(
  subregion = replace(subregion, subregion=="o'brien", "obrien")
)

ia_full <- left_join(ia_results, 
                     counties %>% filter(region=="iowa"), 
                     by="subregion")
```

---

## Your Turn - Solution (cont'd)

```r
ia_full %>% 
  ggplot(aes( x = long, y = lat, group = subregion)) +
  geom_polygon(aes(fill = PercDem), color = "black") + 
  scale_fill_gradient2(midpoint=50, mid = "white") +
  ggthemes::theme_map()
```

---

## Beyond tables

Sometimes data on the web is not structured as nicely... e.g. let's assume we want to get a list of all recently active baseball players from [Baseball reference](http://www.baseball-reference.com/players/)

.center[![:scale 80%](baseball_reference.png)]

---

## SelectorGadget

- SelectorGadget is a javascript bookmarklet to determine the css selectors of pieces of a website we want to extract.

- Bookmark the [SelectorGadget](https://selectorgadget.com/) link, then click on it to use it (or add the chrome extension)

- When SelectorGadget is active, pieces of the website are highlighted in orange/green/red.

- Use SelectorGadget on http://www.baseball-reference.com/players/ .

- Read more details on `vignette("selectorgadget")` (or on the [rvest website](https://rvest.tidyverse.org/articles/selectorgadget.html))

If you prefer, you can also read the HTML code and create your own [CSS](https://www.w3schools.com/cssref/css_selectors.asp) or [XPATH](https://www.w3schools.com/xml/xpath_syntax.asp) selectors.

---

## SelectorGadget Result

*Select all active baseball players with a last name starting with 'a'*

```r
url <- "http://www.baseball-reference.com/players/a/"
html <- read_html(url)
html %>% html_elements("b") %>% html_text()
```

```
##  [1] "Fernando Abad  (2010-2021)"    
##  [2] "Cory Abbott  (2021-2022)"      
##  [3] "CJ Abrams  (2022-2023)"        
##  [4] "Albert Abreu  (2020-2023)"     
##  [5] "Bryan Abreu  (2019-2023)"      
##  [6] "José Abreu  (2014-2023)"       
##  [7] "Domingo Acevedo  (2021-2023)"  
##  [8] "Ronald Acuna Jr.  (2018-2023)" 
##  [9] "Jason Adam  (2018-2023)"       
## [10] "Willy Adames  (2018-2023)"     
## [11] "Austin Adams  (2017-2022)"     
## [12] "Matt Adams  (2012-2021)"       
## [13] "Riley Adams  (2021-2022)"      
## [14] "Jo Adell  (2020-2022)"         
## [15] "Joan Adon  (2021-2022)"        
## [16] "Ehire Adrianza  (2013-2022)"   
## [17] "Jesús Aguilar  (2014-2023)"    
## [18] "Ryan Aguilar  (2022-2022)"     
## [19] "Nick Ahmed  (2014-2023)"       
## [20] "Keegan Akin  (2020-2023)"      
## [21] "Hanser Alberto  (2015-2023)"   
## [22] "Ozzie Albies  (2017-2023)"     
## [23] "Jorge Alcala  (2019-2023)"     
## [24] "Sandy Alcantara  (2017-2023)"  
## [25] "Sergio Alcántara  (2020-2022)" 
## [26] "Jason Alexander  (2022-2022)"  
## [27] "Scott Alexander  (2015-2023)"  
## [28] "Tyler Alexander  (2019-2023)"  
## [29] "A.J. Alexy  (2021-2022)"       
## [30] "Jorge Alfaro  (2016-2022)"     
## [31] "Kolby Allard  (2018-2022)"     
## [32] "Cam Alldred  (2022-2022)"      
## [33] "Austin Allen  (2019-2022)"     
## [34] "Greg Allen  (2017-2022)"       
## [35] "Logan Allen  (2019-2022)"      
## [36] "Nick Allen  (2022-2023)"       
## [37] "Nick Allgeyer  (2021-2021)"    
## [38] "Abraham Almonte  (2013-2022)"  
## [39] "Yency Almonte  (2018-2023)"    
## [40] "Pete Alonso  (2019-2023)"      
## [41] "Dan Altavilla  (2016-2021)"    
## [42] "Jose Altuve  (2011-2022)"      
## [43] "José Alvarado  (2017-2023)"    
## [44] "Eddy Alvarez  (2020-2022)"     
## [45] "Francisco Álvarez  (2022-2022)"
## [46] "Jose Alvarez  (2013-2022)"     
## [47] "Yordan Alvarez  (2019-2023)"   
## [48] "Adbert Alzolay  (2019-2023)"   
## [49] "Brian Anderson  (2017-2023)"   
## [50] "Chase Anderson  (2014-2022)"   
## [51] "Ian Anderson  (2020-2022)"     
## [52] "Nick Anderson  (2019-2023)"    
## [53] "Tim Anderson  (2016-2023)"     
## [54] "Tyler Anderson  (2016-2023)"   
## [55] "Matt Andriese  (2015-2021)"    
## [56] "Elvis Andrus  (2009-2023)"     
## [57] "Miguel Andujar  (2017-2022)"   
## [58] "Tejay Antone  (2020-2021)"     
## [59] "Jonathan Aranda  (2022-2022)"  
## [60] "Víctor Arano  (2017-2022)"     
## [61] "Jonathan Araúz  (2020-2022)"   
## [62] "Francisco Arcia  (2018-2018)"  
## [63] "Orlando Arcia  (2016-2023)"    
## [64] "Nolan Arenado  (2013-2023)"    
## [65] "Gabriel Arias  (2022-2023)"    
## [66] "Shawn Armstrong  (2015-2022)"  
## [67] "Randy Arozarena  (2019-2023)"  
## [68] "Luis Arraez  (2019-2023)"      
## [69] "Christian Arroyo  (2017-2023)" 
## [70] "Aaron Ashby  (2021-2022)"      
## [71] "Graham Ashcraft  (2022-2023)"  
## [72] "Javier Assad  (2022-2023)"     
## [73] "Pedro Avila  (2019-2022)"      
## [74] "José Azocar  (2022-2023)"
```

---

## Example, varied

We are, in fact, not just interested in the *names of the players*, but also in the *links* to each player's website

- `html_attr` let's us access an attribute of an html node

- `html_attrs` extracts all attributes of an html node

```r
html %>% html_elements("b a") %>% html_attr(name="href")
```

```
##  [1] "/players/a/abadfe01.shtml" 
##  [2] "/players/a/abbotco01.shtml"
##  [3] "/players/a/abramcj01.shtml"
##  [4] "/players/a/abreual01.shtml"
##  [5] "/players/a/abreubr01.shtml"
##  [6] "/players/a/abreujo02.shtml"
##  [7] "/players/a/acevedo01.shtml"
##  [8] "/players/a/acunaro01.shtml"
##  [9] "/players/a/adamja01.shtml" 
## [10] "/players/a/adamewi01.shtml"
## [11] "/players/a/adamsau02.shtml"
## [12] "/players/a/adamsma01.shtml"
## [13] "/players/a/adamsri03.shtml"
## [14] "/players/a/adelljo01.shtml"
## [15] "/players/a/adonjo01.shtml" 
## [16] "/players/a/adriaeh01.shtml"
## [17] "/players/a/aguilje01.shtml"
## [18] "/players/a/aguilry01.shtml"
## [19] "/players/a/ahmedni01.shtml"
## [20] "/players/a/akinke01.shtml" 
## [21] "/players/a/alberha01.shtml"
## [22] "/players/a/albieoz01.shtml"
## [23] "/players/a/alcaljo01.shtml"
## [24] "/players/a/alcansa01.shtml"
## [25] "/players/a/alcanse01.shtml"
## [26] "/players/a/alexaja01.shtml"
## [27] "/players/a/alexasc02.shtml"
## [28] "/players/a/alexaty01.shtml"
## [29] "/players/a/alexyaj01.shtml"
## [30] "/players/a/alfarjo01.shtml"
## [31] "/players/a/allarko01.shtml"
## [32] "/players/a/alldrca01.shtml"
## [33] "/players/a/allenau01.shtml"
## [34] "/players/a/allengr01.shtml"
## [35] "/players/a/allenlo01.shtml"
## [36] "/players/a/allenni02.shtml"
## [37] "/players/a/allgeni01.shtml"
## [38] "/players/a/almonab01.shtml"
## [39] "/players/a/almonye01.shtml"
## [40] "/players/a/alonspe01.shtml"
## [41] "/players/a/altavda01.shtml"
## [42] "/players/a/altuvjo01.shtml"
## [43] "/players/a/alvarjo03.shtml"
## [44] "/players/a/alvared01.shtml"
## [45] "/players/a/alvarfr01.shtml"
## [46] "/players/a/alvarjo02.shtml"
## [47] "/players/a/alvaryo01.shtml"
## [48] "/players/a/alzolad01.shtml"
## [49] "/players/a/anderbr06.shtml"
## [50] "/players/a/anderch01.shtml"
## [51] "/players/a/anderia01.shtml"
## [52] "/players/a/anderni01.shtml"
## [53] "/players/a/anderti01.shtml"
## [54] "/players/a/anderty01.shtml"
## [55] "/players/a/andrima01.shtml"
## [56] "/players/a/andruel01.shtml"
## [57] "/players/a/andujmi01.shtml"
## [58] "/players/a/antonte01.shtml"
## [59] "/players/a/arandjo01.shtml"
## [60] "/players/a/aranovi01.shtml"
## [61] "/players/a/arauzjo01.shtml"
## [62] "/players/a/arciafr01.shtml"
## [63] "/players/a/arciaor01.shtml"
## [64] "/players/a/arenano01.shtml"
## [65] "/players/a/ariasga01.shtml"
## [66] "/players/a/armstsh01.shtml"
## [67] "/players/a/arozara01.shtml"
## [68] "/players/a/arraelu01.shtml"
## [69] "/players/a/arroych01.shtml"
## [70] "/players/a/ashbyaa01.shtml"
## [71] "/players/a/ashcrgr01.shtml"
## [72] "/players/a/assadja01.shtml"
## [73] "/players/a/avilape01.shtml"
## [74] "/players/a/azocajo01.shtml"
```

---
class:inverse
## Your Turn

Use the SelectorGadget on the website for [Fernando Abad](https://www.baseball-reference.com/players/a/abadfe01.shtml)

Find the css selector to extract his career statistics and load them into your R session.

Does the same code work to extract career statistics for (some)  of the other active players?

What other information do we need to know? - and how can we get to that?

---

## Your Turn  - Solution

```r
url <- "https://www.baseball-reference.com/players/a/abadfe01.shtml"
html <- read_html(url)
html %>% html_elements(".stats_pullout") %>% html_text()
```

```
## [1] "\n\nSUMMARY\n\nCareer\n\n\n\nWAR\n3.1\n\n\nW\n8\n\n\nL\n29\n\n\nERA\n3.77\n\n\nG\n400\n\n\nGS\n6\n\n\nSV\n2\n\n\nIP\n348.1\n\n\nSO\n290\n\n\nWHIP\n1.306\n\n\n\n"
```

```r
html %>% html_elements(".stats_pullout .poptip") %>% html_text()
```

```
##  [1] "WAR"  "W"    "L"    "ERA"  "G"    "GS"   "SV"   "IP"  
##  [9] "SO"   "WHIP"
```

```r
html %>% html_elements(".p3 p , .p2 p, .p1 p, .stats_pullout strong") %>% html_text()
```

```
##  [1] "SUMMARY" "Career"  "WAR"     "3.1"     "W"      
##  [6] "8"       "L"       "29"      "ERA"     "3.77"   
## [11] "G"       "400"     "GS"      "6"       "SV"     
## [16] "2"       "IP"      "348.1"   "SO"      "290"    
## [21] "WHIP"    "1.306"
```

```r
html %>% html_elements(".p1") %>% html_text()
```

```
## [1] "\n\nWAR\n3.1\n\n\nW\n8\n\n\nL\n29\n\n\nERA\n3.77\n"
```
---

## Your Turn - Solution (cont'd)

It's sometimes easier (for data munging after extracting) to extract multiple pieces rather than everything in one go.

```r
(stats <- html %>% html_elements("span strong") %>% html_text())
```

```
##  [1] "SUMMARY" "WAR"     "W"       "L"       "ERA"    
##  [6] "G"       "GS"      "SV"      "IP"      "SO"     
## [11] "WHIP"
```

```r
(season <- html %>% html_elements(".stats_pullout p:nth-child(2)") %>% html_text())
```

```
##  [1] "Career" "3.1"    "8"      "29"     "3.77"   "400"   
##  [7] "6"      "2"      "348.1"  "290"    "1.306"
```

```r
(career <- html %>% html_elements(".stats_pullout p:nth-child(3)") %>% html_text())
```

```
## character(0)
```

---
class:inverse
## Your Turn

Clean up the code for extracting the career statistics and write a function `getStats` that takes a url and returns a dataset (or tibble).

Does the same code work to extract career statistics for (some)  of the other active players?

---

## Your Turn  - Solution

```r
getStats <- function(url) {
  html <- read_html(url)
  stats <- html %>% html_nodes("span strong") %>% html_text()
  season <- html %>% html_nodes(".stats_pullout p:nth-child(2)") %>% html_text()
  career <- html %>% html_nodes(".stats_pullout p:nth-child(3)") %>% html_text()
  
  data.frame(Statistics = stats[-1], 
             Season= parse_number(season[-1]),
             Career = ifelse(is.null(career), NA, parse_number(career[-1])))
}

url <- "http://www.baseball-reference.com/players/a/"
html <- read_html(url)
players <- html %>% html_nodes("strong a") %>% html_text()
links <- html %>% html_nodes("strong a") %>% html_attr(name="href")

getStats("https://www.baseball-reference.com//players/a/abreujo02.shtml") %>% str()
```

```
## Warning: 1 parsing failure.
## row col expected actual
##   1  -- a number      -
```

```
## 'data.frame':	12 obs. of  3 variables:
##  $ Statistics: chr  "WAR" "AB" "H" "HR" ...
##  $ Season    : num  NA 21 8 0 0.381 1 3 0 0.458 0.429 ...
##   ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
##   .. ..$ row     : int 1
##   .. ..$ col     : int NA
##   .. ..$ expected: chr "a number"
##   .. ..$ actual  : chr "-"
##  $ Career    : num  31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 31.8 ...
```

```r
getStats(file.path("https://www.baseball-reference.com", links[2])) %>% str()
```

```
## Warning: 1 parsing failure.
## row col expected actual
##   1  -- a number      -
```

```r
# getStats(file.path("https://www.baseball-reference.com", links[3])) %>% str() # gives an error
getStats(file.path("https://www.baseball-reference.com", links[4])) %>% str()
```

```
## 'data.frame':	12 obs. of  3 variables:
##  $ Statistics: chr  "WAR" "AB" "H" "HR" ...
##  $ Season    : num  4.9 2421 624 118 0.258 ...
##  $ Career    : num  NA NA NA NA NA NA NA NA NA NA ...
```

---

## Your Turn  - Solution (cont'd)

Now apply to other players (with 'a' as starting letter) - first we get everything tidied up in a dataset

```r
url <- "http://www.baseball-reference.com/players/a/"
html <- read_html(url)
players <- html %>% html_nodes("strong a") %>% html_text()
links <- html %>% html_nodes("strong a") %>% html_attr(name="href")

bb <- tibble(players=players, links = links)
head(bb)
```

```
## # A tibble: 6 × 2
##   players          links                     
##   <chr>            <chr>                     
## 1 Fernando Abad    /players/a/abadfe01.shtml 
## 2 José Abreu       /players/a/abreujo02.shtml
## 3 Ronald Acuna Jr. /players/a/acunaro01.shtml
## 4 Matt Adams       /players/a/adamsma01.shtml
## 5 Ehire Adrianza   /players/a/adriaeh01.shtml
## 6 Jesús Aguilar    /players/a/aguilje01.shtml
```

```r
bb_head <- bb[1:5,] %>% mutate(
  data = links %>% purrr::map(.f = function(link) {
    cat(link)
    cat("\n")
    getStats(file.path("https://www.baseball-reference.com", link))
  })
)
```

```
## /players/a/abadfe01.shtml
## /players/a/abreujo02.shtml
## /players/a/acunaro01.shtml
## /players/a/adamsma01.shtml
## /players/a/adriaeh01.shtml
```

```
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `data = `%>%`(...)`.
## Caused by warning:
## ! 1 parsing failure.
## row col expected actual
##   1  -- a number      -
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining
##   warning.
```

---
class: inverse
## Your Turn

Why does 
`getStats(file.path("https://www.baseball-reference.com", "/players/a/adlemti01.shtml"))`
have NAs in the Career?

Is this fixable?

How do we get all of the active players? (not just the ones with last names starting with 'a')

---

## Package `rvest`

The `session` suite of commands allows to simulate an html session without a browser.

Create a session with `session(url)`

Navigate: `session_jump_to()` 
Follow a link: `session_follow_link()`.

navigate back and forward with `session_back()` and `session_forward()`.

... and extract the pieces you are interested in using `read_html`, `html_element`, `html_elements`