Prompt:
With great power comes great responsibility - a large part of the web is built on data and on services that scrape those data. Now that we are starting to apply scraping mechanisms, we need to think about how to use those skills without becoming a burden to the internet community.
Find sources on ethical web scraping - some readings that might help you get started with that are:
R package polite
After reading through some of the ethics essays, write a blog post addressing the following questions:
- What are your main three takeaways for ethical web scraping? Give examples, or cite your sources.
Main three takeaways for ethical web scraping:
- The first takeaway I noticed when reading Jami @ Empirical was that the definition of "ethical" is somewhat subjective and may not be universally agreed upon. The reading discussed that even if everything seems to be in order and appropriate on paper, that doesn't mean you will be comfortable with it. You don't always know what you're signing up for, and it could be dangerous.
- Expanding on my first takeaway, the piece by James Densmore on the ethics of web scraping touched on the same point: one of the largest issues with the ethics of this practice is that not everyone agrees on the terms. People tend to be innately self-interested, so what one party (the scraper) considers ethical may differ from what the other party (the site owner) considers ethical.
- The third takeaway came from reviewing the R package [polite]. It brought everything together and showed that the process of scraping is not as scary as I originally thought. I am typically afraid of getting in trouble or doing something wrong, but this package provides a clear way of seeking permission, introducing yourself, and following the steps laid out in the previous two readings.
- What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents.
According to Jami @ Empirical, a ROBOTS.TXT file is “what indicates the web-crawling software where it is allowed (or not allowed) within the website. This is part of the Robots Exclusion Protocol (REP) which are a group of web standards created as a way to regulate how robots crawl the web.” This file typically lives at the root of a site as a plain text file that gives web robots, such as search engine crawlers, instructions about which locations within the website they are or are not allowed to crawl and index.
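To see this in practice, here is a minimal sketch (assuming the {polite} package is installed) of consulting a site's robots.txt before scraping. The URL and user-agent string are placeholders I chose for illustration, not part of the readings:

```r
library(polite)

# bow() fetches and parses the site's robots.txt, records the crawl delay it
# requests, and checks whether this path is allowed for our user agent.
session <- bow(
  url        = "https://en.wikipedia.org/wiki/Web_scraping",          # placeholder URL
  user_agent = "me@example.com; learning ethical web scraping"        # introduce yourself
)

# Printing the session shows the crawl delay and whether the path is scrapable.
session
```

If robots.txt disallowed that path, polite would refuse to fetch it in the next step, which is exactly the allow/prevent behavior described above.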
- Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.
To help work through some examples I utilized the following link: https://www.r-bloggers.com/2020/05/intro-to-polite-web-scraping-of-soccer-data-with-r/
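Following the bow/scrape workflow from that tutorial, below is a rough sketch of a polite scrape. The soccer-data site from the post is swapped out for a hypothetical Wikipedia table, so the URL and CSS selector are assumptions for illustration only:

```r
library(polite)
library(rvest)

# Step 1: introduce yourself to the host and check robots.txt
session <- bow(
  url        = "https://en.wikipedia.org/wiki/List_of_largest_cities",  # placeholder target
  user_agent = "me@example.com; class assignment"
)

# Step 2: politely fetch the page (respects the crawl delay from robots.txt)
page <- scrape(session)

# Step 3: parse the result with rvest as usual
cities <- page |>
  html_element("table.wikitable") |>   # assumed selector for the first wikitable
  html_table()

head(cities)
```

The main design point is that bow() is called once per host and scrape() reuses that session, so every request goes through the same permission check and rate limit.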
Instructions:
Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item - all of the actions should have green checks).