# packages used throughout this post
library(polite)
library(rvest)
Prompt:
With great power comes great responsibility - a large part of the web is based on data, and on services that scrape those data. Now that we are starting to apply scraping mechanisms, we need to think about how to use those skills without becoming a burden to the internet community.
Find sources on ethical web scraping - some readings that might help you get started are:
- R package polite
After reading through some of the ethics essays, write a blog post addressing the following questions:
- What are your main three takeaways for ethical web scraping? Give examples or cite your sources.
My main sources are James Densmore and JAMI @ EMPIRICAL. In general, be kind to one another and don't harm the website you're scraping. 1) Respect the website's guidelines: use a public API instead if one is available, respect the robots.txt file if the website has one, and respect the site's terms and conditions. 2) Give credit where it's due: if you scrape data and write something about it, make sure to credit the source of the data. 3) Don't overload the site: space out your requests and try to send them during off-peak hours. The polite package builds the first and third of these points directly into its workflow, as the sketch below shows.
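Here is a minimal sketch of how polite handles points 1 and 3 for you; the user agent string is a placeholder of my own, which you would replace with your actual contact information.

# bow() reads the site's robots.txt and commits to a crawl delay
# between requests (5 seconds by default); the user agent identifies
# who is scraping and how to reach them.
polite_session <- bow(
  "https://www.nytimes.com/elections/2016/results/massachusetts",
  user_agent = "student blog post (me@example.com)",  # placeholder contact info
  delay = 5                                           # seconds between requests
)

# scrape() fetches the page only if robots.txt allows it, and
# automatically waits out the delay on repeated calls.
page <- scrape(polite_session)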
- What is a robots.txt file? Identify one instance and explain what it allows/prevents.
A robots.txt file tells web-crawling software where it is allowed to go on a website (JAMI @ EMPIRICAL). For example, a site can use it to tell search engines like Google which pages should be crawled to understand the site and which search results it should appear in (Google, "Introduction to robots.txt"). Note that while the robots.txt file tells web crawlers where they are allowed to go, it cannot actually enforce those rules itself; compliance is voluntary. A small illustrative example follows below.
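Here is a minimal, hypothetical robots.txt; the paths and delay are made up for illustration, not taken from any real site.

# rules below apply to every crawler
User-agent: *
Disallow: /search     # prevents crawling of search result pages
Disallow: /admin/     # prevents crawling of the admin area
Crawl-delay: 10       # asks crawlers to wait 10 seconds between fetches

Everything not listed under a Disallow rule stays allowed, so a crawler honoring this file may visit the rest of the site but should skip /search and /admin/ and pause between requests (Crawl-delay is a common extension, not part of the official standard). polite's bow() reads exactly this kind of file to decide whether scrape() may proceed.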
- Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.
I scraped from https://www.nytimes.com/elections/2016/results/massachusetts.
# similar to example from class
url <- "https://www.nytimes.com/elections/2016/results/massachusetts"
session <- bow(url)        # introduce ourselves and check robots.txt
html <- scrape(session)    # politely fetch the page
tables <- html %>% html_table()
# tables %>% purrr::map(.f = pillar::glimpse)
ma_results <- tables[[2]] %>% dplyr::mutate(
  Trump = readr::parse_number(Trump),
  Clinton = readr::parse_number(Clinton)
)
ma_results
# A tibble: 351 × 3
`Vote by town` Clinton Trump
<chr> <dbl> <dbl>
1 Boston 221093 38087
2 Worcester 43084 17732
3 Springfield 40341 11231
4 Cambridge 46563 3323
5 Newton 36463 7764
6 Quincy 25477 13321
7 Somerville 33740 4128
8 Lowell 23555 10584
9 Brockton 25593 8801
10 Lynn 22164 9311
# … with 341 more rows
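If I wanted more pages from the same host, polite's nod() re-targets an existing session to a new path while keeping the same robots.txt check and crawl delay; the Vermont results path below is an assumed example, not something I actually scraped.

# nod() points the existing session at a new path on the same host,
# re-checking permissions without starting a new introduction
vt_session <- nod(session, path = "elections/2016/results/vermont")  # assumed path
vt_html <- scrape(vt_session)   # still rate-limited by the original bow()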
Instructions:
Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; all of the actions should have green checks).