Web scraping etiquette …

Author

Parvin Mohammadiarvejeh

Published

April 6, 2023

Prompt:

With great power comes great responsibility - a large part of the web is based on data and services that scrape those data. Now that we start to apply scraping mechanisms, we need to think about how to apply those skills without becoming a burden to the internet society.

Find sources on ethical web scraping - some readings that might help you get started with that are:

After reading through some of the ethics essays write a blog post addressing the following questions:

  1. What are your main three takeaways for ethical web scraping? - Give examples, or cite your sources. Parvin’s answer: 1) I will always state my intentions for the data that I scrape, and I will not scrape data that I do not need. For example, if I am analyzing the relationship between two variables, exercise and lifespan, I will state that goal in my GitHub repository, and I will not scrape diet-related data that my analysis does not use. 2) I will always give credit to the owner of the website, for example by citing it as a reference in the paper. 3) I will always extract data from a website to create value and new, useful information. In other words, I will not copy the data just to put it on my own webpage. For example, I will use the data to fit a prediction model and publish the insights.
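  2. What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents. Parvin’s answer: A robots.txt file is a plain-text file at the root of a website that tells automated clients (crawlers and scrapers) which paths they may or may not request, and it can also suggest a delay between requests. The “polite” package reads this file during web scraping: bow() introduces the client (the person who will use the scraped data) to the host and checks robots.txt for permission to scrape the requested path. It also helps to retrieve the information at a slower rate, and because the result is kept in the session, the client needs to ask for permission only once.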
  3. Identify a website that you would like to scrape (or an example from class) and implement a scrape using the polite package.
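To make the robots.txt answer in question 2 a little more concrete, here is a minimal base-R sketch of how Disallow rules might be matched against a path. The robots.txt content, the `path_allowed()` helper, and the example paths are all hypothetical and invented for illustration; this is not how polite does it internally, and real files also support per-agent groups, Allow rules, and wildcards that this sketch ignores.

```r
# Hypothetical robots.txt content, invented for illustration only.
robots_txt <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /search",
  "Crawl-delay: 5"
)

# Collect the path prefixes listed in Disallow rules.
# Simplification: every rule is treated as applying to all user agents.
disallowed <- sub("^Disallow:[[:space:]]*", "",
                  grep("^Disallow:", robots_txt, value = TRUE))
disallowed <- disallowed[nzchar(disallowed)]  # an empty Disallow blocks nothing

# Read the suggested delay (in seconds) between requests, if present.
crawl_delay <- as.numeric(sub("^Crawl-delay:[[:space:]]*", "",
                              grep("^Crawl-delay:", robots_txt, value = TRUE)))

# A path is allowed when it starts with none of the disallowed prefixes.
path_allowed <- function(path, rules = disallowed) {
  !any(startsWith(path, rules))
}

path_allowed("/elections/2016/results/iowa")  # not under any Disallow prefix
path_allowed("/private/report.html")          # falls under /private/
```

In practice there is no need to write this by hand, since bow() performs the check; the sketch only shows the kind of rule a robots.txt file expresses. The scrape for question 3 follows.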
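library(polite)  # bow() and scrape(): robots.txt check and rate-limited requests
library(rvest)   # html_table(): parse HTML tables into data frames
library(purrr)   # map(): apply a function to every table
library(dplyr)   # glimpse() and the %>% pipe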

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
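# Introduce ourselves to the host and check robots.txt for this path
session <- bow("https://www.nytimes.com/elections/2016/results/iowa")
# Fetch the page through the polite session
html <- scrape(session)
html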
{html_document}
<html lang="en" itemscope="" xmlns:og="//opengraphprotocol.org/schema/" itemtype="//schema.org/NewsArticle">
[1] <head>\n<title>Iowa Election Results 2016 – The New York Times</title>\n< ...
[2] <body class="eln-general-state-results eln-state-iowa">\n    <div id="she ...
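The polite session spaces out requests so the host is not overloaded. As a rough illustration of that idea only, the base-R sketch below wraps a function so that successive calls are at least `delay` seconds apart. The `throttle()` and `fake_fetch()` names are hypothetical, no network access happens, and this is not the mechanism polite itself uses.

```r
# Wrap a function so that successive calls are at least `delay` seconds apart.
throttle <- function(f, delay = 5) {
  last_call <- NULL  # time of the previous call, kept in the closure
  function(...) {
    if (!is.null(last_call)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      if (elapsed < delay) Sys.sleep(delay - elapsed)
    }
    last_call <<- Sys.time()
    f(...)
  }
}

# Stand-in for a real request; nothing is downloaded here.
fake_fetch <- function(path) paste("fetched", path)
slow_fetch <- throttle(fake_fetch, delay = 1)

slow_fetch("/page-1")  # first call runs immediately
slow_fetch("/page-2")  # second call waits until about one second has passed
```

With polite, the same effect comes from the `delay` argument of bow(), so a wrapper like this is only needed when requests are made some other way.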
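# Parse every <table> on the page into a list of tibbles
tables <- html %>% html_table(fill = TRUE)
# glimpse() prints each table's structure; map() then returns the list, which is printed in full
tables %>% purrr::map(glimpse)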
Rows: 11
Columns: 7
$ Candidate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ Candidate <chr> "Trump\n            \n            \n              Donald J. …
$ Party     <chr> "Republican\n          Rep.", "Democrat\n          Dem.", "L…
$ Votes     <chr> "800,983", "653,669", "59,186", "19,992", "12,366", "11,479"…
$ Pct.      <chr> "51.1%", "41.7%", "3.8%", "1.3%", "0.8%", "0.7%", "0.3%", "0…
$ ``        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ E.V.      <chr> "6", "—", "—", "—", "—", "—", "—", "—", "—", "—", "—"
Rows: 99
Columns: 3
$ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson", "Black Hawk", "St…
$ Trump            <chr> "93,492", "48,390", "39,149", "21,044", "27,476", "19…
$ Clinton          <chr> "119,804", "58,935", "40,440", "50,200", "32,233", "2…
Rows: 6
Columns: 6
$ Candidate <lgl> NA, NA, NA, NA, NA, NA
$ Candidate <chr> "Grassley*\n            \n            \n              Charle…
$ Party     <chr> "Republican\n          Rep.", "Democrat\n          Dem.", "L…
$ Votes     <chr> "926,007", "549,460", "41,794", "17,649", "4,441", "22,090"
$ Pct.      <chr> "60.2%", "35.7%", "2.7%", "1.1%", "0.3%", "1.4%"
$ ``        <lgl> NA, NA, NA, NA, NA, NA
Rows: 99
Columns: 3
$ `Vote by county` <chr> "Polk", "Linn", "Scott", "Johnson", "Black Hawk", "St…
$ Grassley         <chr> "118,164", "62,737", "46,415", "28,914", "33,884", "2…
$ Judge            <chr> "100,317", "47,635", "34,503", "42,699", "27,245", "2…
Rows: 4
Columns: 5
$ `District\n          Dist.` <int> 1, 2, 3, 4
$ Leader                      <chr> "54%Blum*\n      Rep.", "54%Loebsack*\n   …
$ ``                          <chr> "46%Vernon\n      Dem.", "46%Peters\n     …
$ Rpt.                        <chr> "100%", "100%", "100%", "100%"
$ ``                          <lgl> NA, NA, NA, NA
Rows: 25
Columns: 5
$ `Seat\n          Seat` <int> 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26,…
$ Leader                 <chr> "0%Feenstra*\n      Rep.", "61%Guth*\n      Rep…
$ ``                     <chr> "Uncontested", "39%Bangert\n      Dem.", "17%Se…
$ Rpt.                   <chr> "", "100%", "100%", "100%", "100%", "100%", "10…
$ ``                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Rows: 100
Columns: 5
$ `District\n          Dist.` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
$ Leader                      <chr> "0%Wills*\n      Rep.", "0%Jones*\n      R…
$ ``                          <chr> "Uncontested", "Uncontested", "19%McCoy\n …
$ Rpt.                        <chr> "", "", "100%", "100%", "100%", "100%", "1…
$ ``                          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
Rows: 3
Columns: 5
$ Question <chr> "Retain Brent Appel", "Retain Daryl Hecht", "Retain Mark Cady"
$ Yes      <chr> "64%Yes", "64%Yes", "65%Yes"
$ No       <chr> "36%No", "36%No", "35%No"
$ Rpt.     <chr> "100%", "100%", "100%"
$ ``       <lgl> NA, NA, NA
[[1]]
# A tibble: 11 × 7
   Candidate Candidate                             Party Votes Pct.  ``    E.V. 
   <lgl>     <chr>                                 <chr> <chr> <chr> <lgl> <chr>
 1 NA        "Trump\n            \n            \n… "Rep… 800,… 51.1% NA    6    
 2 NA        "Clinton\n            \n            … "Dem… 653,… 41.7% NA    —    
 3 NA        "Johnson\n            \n            … "Lib… 59,1… 3.8%  NA    —    
 4 NA        "Others\n            \n            \… "Ind… 19,9… 1.3%  NA    —    
 5 NA        "McMullin\n            \n           … "Pet… 12,3… 0.8%  NA    —    
 6 NA        "Stein\n            \n            \n… "Gre… 11,4… 0.7%  NA    —    
 7 NA        "Castle\n            \n            \… "Con… 5,335 0.3%  NA    —    
 8 NA        "Kahn\n            \n            \n … "Ind… 2,247 0.1%  NA    —    
 9 NA        "De La Fuente\n            \n       … "Pet… 451   0.0%  NA    —    
10 NA        "La Riva\n            \n            … "P.S… 323   0.0%  NA    —    
11 NA        "Others\n            \n            \… ""    32,2… 2.1%  NA    —    

[[2]]
# A tibble: 99 × 3
   `Vote by county` Trump  Clinton
   <chr>            <chr>  <chr>  
 1 Polk             93,492 119,804
 2 Linn             48,390 58,935 
 3 Scott            39,149 40,440 
 4 Johnson          21,044 50,200 
 5 Black Hawk       27,476 32,233 
 6 Story            19,458 25,709 
 7 Dubuque          23,460 22,850 
 8 Woodbury         24,727 16,210 
 9 Pottawattamie    24,447 15,355 
10 Dallas           19,339 15,701 
# … with 89 more rows

[[3]]
# A tibble: 6 × 6
  Candidate Candidate                                    Party Votes Pct.  ``   
  <lgl>     <chr>                                        <chr> <chr> <chr> <lgl>
1 NA        "Grassley*\n            \n            \n   … "Rep… 926,… 60.2% NA   
2 NA        "Judge\n            \n            \n       … "Dem… 549,… 35.7% NA   
3 NA        "Aldrich\n            \n            \n     … "Lib… 41,7… 2.7%  NA   
4 NA        "Hennager\n            \n            \n    … "Ind… 17,6… 1.1%  NA   
5 NA        "Luick-Thrams\n            \n            \n… "Pet… 4,441 0.3%  NA   
6 NA        "Others\n            \n            \n      … ""    22,0… 1.4%  NA   

[[4]]
# A tibble: 99 × 3
   `Vote by county` Grassley Judge  
   <chr>            <chr>    <chr>  
 1 Polk             118,164  100,317
 2 Linn             62,737   47,635 
 3 Scott            46,415   34,503 
 4 Johnson          28,914   42,699 
 5 Black Hawk       33,884   27,245 
 6 Story            25,475   21,472 
 7 Dubuque          27,348   19,291 
 8 Woodbury         27,166   13,909 
 9 Pottawattamie    25,721   12,943 
10 Dallas           24,374   11,876 
# … with 89 more rows

[[5]]
# A tibble: 4 × 5
  `District\n          Dist.` Leader                     ``          Rpt.  ``   
                        <int> <chr>                      <chr>       <chr> <lgl>
1                           1 "54%Blum*\n      Rep."     "46%Vernon… 100%  NA   
2                           2 "54%Loebsack*\n      Dem." "46%Peters… 100%  NA   
3                           3 "54%Young*\n      Rep."    "40%Mowrer… 100%  NA   
4                           4 "61%King*\n      Rep."     "39%Weaver… 100%  NA   

[[6]]
# A tibble: 25 × 5
   `Seat\n          Seat` Leader                     ``              Rpt.  ``   
                    <int> <chr>                      <chr>           <chr> <lgl>
 1                      2 "0%Feenstra*\n      Rep."  "Uncontested"   ""    NA   
 2                      4 "61%Guth*\n      Rep."     "39%Bangert\n … "100… NA   
 3                      6 "83%Segebart*\n      Rep." "17%Serianni\n… "100… NA   
 4                      8 "54%Dawson\n      Rep."    "46%Gronstal*\… "100… NA   
 5                     10 "67%Chapman*\n      Rep."  "33%Paladino\n… "100… NA   
 6                     12 "78%Costello*\n      Rep." "22%Brantz\n  … "100… NA   
 7                     14 "74%Sinclair*\n      Rep." "26%Smith\n   … "100… NA   
 8                     16 "60%Boulton\n      Dem."   "35%Pryor\n   … "100… NA   
 9                     18 "0%Petersen*\n      Dem."  "Uncontested"   ""    NA   
10                     20 "60%Zaun*\n      Rep."     "41%Hikiji\n  … "100… NA   
# … with 15 more rows

[[7]]
# A tibble: 100 × 5
   `District\n          Dist.` Leader                    ``          Rpt.  ``   
                         <int> <chr>                     <chr>       <chr> <lgl>
 1                           1 "0%Wills*\n      Rep."    "Uncontest… ""    NA   
 2                           2 "0%Jones*\n      Rep."    "Uncontest… ""    NA   
 3                           3 "81%Huseman*\n      Rep." "19%McCoy\… "100… NA   
 4                           4 "63%Wheeler\n      Rep."  "37%VanDer… "100… NA   
 5                           5 "77%Holz*\n      Rep."    "23%Ritz\n… "100… NA   
 6                           6 "66%Carlin\n      Rep."   "35%Alarco… "100… NA   
 7                           7 "63%Gassman*\n      Rep." "37%Grussi… "100… NA   
 8                           8 "68%Baxter*\n      Rep."  "32%Paule … "100… NA   
 9                           9 "57%Miller*\n      Dem."  "43%Waecht… "100… NA   
10                          10 "0%Sexton*\n      Rep."   "Uncontest… ""    NA   
# … with 90 more rows

[[8]]
# A tibble: 3 × 5
  Question           Yes    No    Rpt.  ``   
  <chr>              <chr>  <chr> <chr> <lgl>
1 Retain Brent Appel 64%Yes 36%No 100%  NA   
2 Retain Daryl Hecht 64%Yes 36%No 100%  NA   
3 Retain Mark Cady   65%Yes 35%No 100%  NA   
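The scraped columns are all character strings, with thousands separators, percent signs, and embedded line breaks. A minimal sketch of how they might be cleaned before analysis is shown below. The two rows are typed in by hand from the first table in the output above, and the object names `raw` and `clean` are assumptions made for this example, not part of the original scrape.

```r
# Two rows copied by hand from the first table in the output above.
raw <- data.frame(
  Party = c("Republican\n          Rep.", "Democrat\n          Dem."),
  Votes = c("800,983", "653,669"),
  Pct.  = c("51.1%", "41.7%"),
  stringsAsFactors = FALSE
)

clean <- data.frame(
  # Keep only the text before the first line break
  party = sub("\n.*", "", raw$Party),
  # Drop thousands separators, then convert to integer
  votes = as.integer(gsub(",", "", raw$Votes)),
  # Drop the percent sign, then convert to numeric
  pct   = as.numeric(sub("%", "", raw$Pct., fixed = TRUE)),
  stringsAsFactors = FALSE
)

clean
```

The same transformations could be applied to `tables[[1]]` directly, after dropping the empty logical columns and resolving the duplicated `Candidate` column name.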

Instructions:

Submit to your repo. Make sure that all of the GitHub Actions pass (check the menu item Actions - all of the actions should have green checks).