Web scraping etiquette …

Author: Who wrote this

Published: April 6, 2023

Prompt:

With great power comes great responsibility: a large part of the web is built on data and on services that scrape those data. Now that we are starting to apply scraping mechanisms, we need to think about how to use those skills without becoming a burden on the internet community.

Find sources on ethical web scraping - some readings that might help you get started with that are:

After reading through some of the ethics essays write a blog post addressing the following questions:

  1. What are your three main takeaways for ethical web scraping? Give examples, or cite your sources.

My main takeaways come from James Densmore and JAMI @ EMPIRICAL. In general, be kind to one another and don't harm the website you're scraping:

1) Respect the website's guidelines: use a public API instead if one is available, respect the robots.txt file if the website has one, and respect the site's terms and conditions.

2) Give credit where it's due: if you scrape data and write something about it, make sure to credit the source of the data.

3) Don't overload the site: space out requests and try to send them during off-peak hours (see the polite sketch below).
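To illustrate the third point, here is a minimal sketch using the polite package (the same package used for the scrape below). The user agent string and the 10-second delay are illustrative values I chose, not values from the readings.

library(polite)
# introduce ourselves and promise to wait at least 10 seconds between requests
# (user agent and delay are illustrative values, not recommendations)
session <- bow(
  "https://www.nytimes.com/elections/2016/results/massachusetts",
  user_agent = "student blog post (me@example.edu)",
  delay = 10
)
session  # printing the session shows the crawl delay and whether scraping is allowed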

  2. What is a robots.txt file? Identify one instance and explain what it allows/prevents.

A robots.txt file tells web-crawling software where it is allowed to go on a website (JAMI @ EMPIRICAL). For example, websites can use it to tell search engines like Google which pages should be crawled to understand the site and which search results it should appear in (Introduction to robots.txt). Note that while the robots.txt file tells web crawlers where they are allowed to go, it doesn't actually enforce those rules itself; following them is up to the crawler.
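One instance is the robots.txt published by the site scraped below, at https://www.nytimes.com/robots.txt. As a rough sketch, the rules it allows/prevents could be inspected from R with the robotstxt package (an extra package not used elsewhere in this post); the path checked here is just the election results page used later.

library(robotstxt)
# download and print the site's raw robots.txt rules
rt <- get_robotstxt("www.nytimes.com")
rt
# check whether a generic crawler is allowed to visit a particular path
paths_allowed(paths = "/elections/2016/results/massachusetts",
              domain = "www.nytimes.com", bot = "*")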

  3. Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.

I scraped from https://www.nytimes.com/elections/2016/results/massachusetts.

library(polite)
library(rvest)
# similar to example from class
url <- "https://www.nytimes.com/elections/2016/results/massachusetts"
session <- bow(url)              # introduce ourselves and consult the site's robots.txt
html <- scrape(session)          # politely fetch the page, respecting the crawl delay
tables <- html %>% html_table()  # pull every HTML table into a list of tibbles

# tables %>% purrr::map(.f = pillar::glimpse)

# the second table holds the town-level results; parse the vote counts as numbers
ma_results <- tables[[2]] %>% dplyr::mutate(
  Trump   = readr::parse_number(Trump),
  Clinton = readr::parse_number(Clinton)
)
ma_results
# A tibble: 351 × 3
   `Vote by town` Clinton Trump
   <chr>            <dbl> <dbl>
 1 Boston          221093 38087
 2 Worcester        43084 17732
 3 Springfield      40341 11231
 4 Cambridge        46563  3323
 5 Newton           36463  7764
 6 Quincy           25477 13321
 7 Somerville       33740  4128
 8 Lowell           23555 10584
 9 Brockton         25593  8801
10 Lynn             22164  9311
# … with 341 more rows
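As a quick usage sketch (not part of the original assignment), the scraped tibble could be summarised with dplyr to get statewide totals; the column names come from the output above.

library(dplyr)
# statewide totals across the 351 towns, as a rough sanity check on the scrape
ma_results %>%
  summarise(
    Clinton_total = sum(Clinton, na.rm = TRUE),
    Trump_total   = sum(Trump, na.rm = TRUE)
  )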

Instructions:

Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; all of the actions should have green checks).