Web scraping etiquette
Author

Ian Parzyszek

Published

April 6, 2023

Prompt:

With great power comes great responsibility: a large part of the web is built on data, and on services that scrape those data. Now that we are starting to apply scraping mechanisms ourselves, we need to think about how to use those skills without becoming a burden on the internet community.

Find sources on ethical web scraping. Some readings that might help you get started are:

After reading through some of the ethics essays, write a blog post addressing the following questions:

  1. What are your three main takeaways for ethical web scraping? Give examples, or cite your sources.

One of the first things I was unaware of is that web scraping can be a burden on the server. I did not think it was any more cumbersome for the server than simply visiting the website, but since that is not the case, good practice is to scrape during off-peak hours.
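
One way to lighten that load is the polite package's built-in rate limiting: bow() takes a delay argument giving the minimum number of seconds to wait between repeated requests to the same host. A minimal sketch, with a placeholder URL:

library(polite)

# Wait at least 10 seconds between requests to this host
# (polite's default is 5 seconds, and it honors a longer
# Crawl-delay if the site's robots.txt requests one)
session <- bow("https://www.example.com", delay = 10)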

Another good practice is to identify yourself. The website's owner may notice unusual activity, so it is a good idea to include an identifying string in your code, and perhaps also state your intentions.
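
With polite, that string goes in the user_agent argument of bow(), which is sent as the User-Agent header on every request. The name and email below are placeholders for illustration:

library(polite)

# Tell the site owner who is scraping and how to reach you
session <- bow(
  "https://www.example.com",
  user_agent = "Ian Parzyszek; class project; contact: ian@example.com"
)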

Lastly, you should give back to the website owner by giving them credit. If you use their data, cite their website or article. This also helps drive some extra traffic to their site.

I got these takeaways from JAMI @ EMPIRICAL.

  2. What is a robots.txt file? Identify one instance and explain what it allows/prevents.

A robots.txt file puts limitations on what crawlers can access on a website. For example, a site owner might use one to limit how much of the site a search engine can crawl and include in its results. Sometimes you may want Google to leave out your PDFs or pictures, and a robots.txt rule can keep them from being crawled.
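
As a hypothetical illustration, the robots.txt below blocks all crawlers from a /private/ directory and tells Google's crawler to skip PDFs (the * and $ wildcards are a Google extension rather than part of the original standard):

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /*.pdf$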

  3. Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.

Instructions:

Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; all of the actions should have green checks).

library(polite)
library(rvest)

# bow() introduces the scraper to the host and reads its robots.txt
session <- bow("https://www.avca.org/polls/diii-men/2023/03-28-23.html")

# scrape() fetches the page politely; html_table() parses every HTML table
scrape(session) %>%
  html_table()
No encoding supplied: defaulting to UTF-8.
[[1]]
# A tibble: 15 × 5
   Rank  `School (First-Place Votes Adjusted)` Total Points Adj…¹ Record Previ…²
   <chr> <chr>                                              <int> <chr>    <int>
 1 1     Stevens [20]                                         342 26-2         1
 2 2     Vassar [3]                                           325 17-1         2
 3 3     Juniata                                              294 22-2         3
 4 4     Springfield                                          276 20-2         4
 5 5     Messiah                                              251 20-2         5
 6 6     North Central (IL)                                   217 18-3         7
 7 T-7   NYU                                                  171 12-6         9
 8 T-7   SUNY New Paltz                                       171 18-8        11
 9 9     Carthage                                             168 14-5         6
10 10    Southern Virginia                                    167 13-2        10
11 11    St John Fisher                                       135 19-6         8
12 12    Wentworth                                             64 18-6        13
13 13    Nazareth                                              58 20-5        14
14 14    Marymount                                             56 17-5        12
15 15    Rutgers-Newark                                        31 11-5        15
# … with abbreviated variable names ¹ `Total Points Adjusted`, ² `Previous Rank`
Here are the Division III Men's Volleyball national rankings. Messiah University is ranked 5th! Go Falcons!