Web scraping etiquette

Author: Sabrena Rutledge

Published: April 6, 2023

Prompt:

With great power comes great responsibility: a large part of the web is built on data and on services that scrape those data. Now that we are starting to apply scraping mechanisms, we need to think about how to use those skills without becoming a burden to the internet community.

Find sources on ethical web scraping; a few readings to get you started are cited in the answers below.

After reading through some of the ethics essays, write a blog post addressing the following questions:

  1. What are your main three takeaways for ethical web scraping? Give examples, or cite your sources.

     - Be considerate to the page that you are scraping: cite the author(s) and do not claim their data as your own (https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01).
     - If the website offers an API for its data, use the API to retrieve the data instead of scraping (https://www.empiricaldata.org/dataladyblog/a-guide-to-ethical-web-scraping).
     - “The three pillars of a polite session are seeking permission, taking slowly and never asking twice” (https://github.com/dmi3kno/polite). The first sketch after the code in question 3 shows how the polite package encodes these pillars.

  2. What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents. A robots.txt file is a plain-text file at the root of a website (for example, https://www.cheese.com/robots.txt) that tells web-crawling software which parts of the site are allowed or not allowed (https://www.empiricaldata.org/dataladyblog/a-guide-to-ethical-web-scraping). It allows or prevents robots from accessing content based on the site owner’s specifications. Compliance is voluntary, so it is ultimately up to the website creator to state what is allowed (or, probably more specifically, what is NOT allowed) and up to the crawler to honor it. The second sketch after the code in question 3 shows one way to inspect a robots.txt file from R.

  3. Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package. Following the example on the polite package’s GitHub page, I scraped www.cheese.com: I installed polite successfully and ran the code below (the basic example from the README).

library(polite)
library(rvest)  # rvest also re-exports the %>% pipe used below

# "Bow" to the host first: bow() reads the site's robots.txt and sets up
# a rate-limited session (force = TRUE clears polite's cached robots.txt).
session <- bow("https://www.cheese.com/by_type", force = TRUE)

# scrape() politely fetches the page; the query list becomes
# ?t=semi-soft&per_page=100 in the request URL.
result <- scrape(session, query = list(t = "semi-soft", per_page = 100)) %>%
  html_node("#main-body") %>%   # the main content container
  html_nodes("h3") %>%          # one <h3> heading per cheese
  html_text()                   # extract the cheese names as text
head(result)
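
As promised in question 1, the three pillars map directly onto polite’s own arguments. The sketch below is a variation on the session above rather than code from the README; the user agent string and the 10-second delay are illustrative values I chose, not requirements of the site.

library(polite)

# Seeking permission: identify yourself to the host. bow() also checks
# robots.txt and will refuse paths the site disallows.
# The user agent below is a hypothetical contact string.
session <- bow("https://www.cheese.com/by_type",
               user_agent = "class project (me@example.com)",
               delay = 10)  # taking slowly: wait at least 10 seconds between requests

# Never asking twice: scrape() is memoised, so repeating the same call
# returns the cached response instead of hitting the server again.
page <- scrape(session, query = list(t = "soft", per_page = 100))
page_again <- scrape(session, query = list(t = "soft", per_page = 100))  # served from cache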
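
And to identify a concrete instance for question 2, here is a minimal sketch of how a robots.txt file can be inspected from R. It assumes the robotstxt package (which polite itself builds on) is installed; the exact contents of cheese.com’s robots.txt may of course change over time.

library(robotstxt)

# Download and print the raw robots.txt for the site scraped above.
get_robotstxt("cheese.com")

# Check whether a generic bot ("*") is allowed to fetch a given path.
paths_allowed("https://www.cheese.com/by_type")

If paths_allowed() returns FALSE, a polite scraper should simply not request that path; polite’s scrape() will refuse such paths on its own.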

Instructions:

Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; every action should have a green check).