Web Scraping Blog

Web scraping etiquette
Author: AR

Published: April 6, 2023

Prompt:

With great power comes great responsibility: a large part of the web is built on data and on services that scrape those data. Now that we are starting to apply scraping mechanisms, we need to think about how to use those skills without becoming a burden to the internet community.

Find sources on ethical web scraping; readings that might help you get started include "Ethics in Web Scraping" and "A Guide to Ethical Web Scraping."

After reading through some of the ethics essays, write a blog post addressing the following questions:

  1. What are your main three takeaways for ethical web scraping? Give examples, or cite your sources. One point that comes up often is that if the site has an API, you should use it instead of scraping (Ethics in Web Scraping and A Guide to Ethical Web Scraping); a sketch of the API approach appears after this list. It is also important to request only the data you actually need and to scrape during less busy times. Finally, we want to follow the site's terms and conditions, and ask for permission if scraping is not addressed there (A Guide to Ethical Web Scraping).

  2. What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents. A robots.txt file is a plain-text file at the root of a website that tells web-crawling software which parts of the site it may and may not visit. For example, if a website hosted files from a study on particular people, its robots.txt could disallow the folder containing participant usernames, so that a crawler like Google's could not access it; a sketch of what such a file might look like appears after this list.

  3. Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package. I think it would be interesting to scrape Goodreads, a website that lists book titles, authors, and ratings; a sketch using the polite package follows below.
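For the first takeaway, here is a minimal sketch of preferring an API over scraping. Open Library's public search API stands in for a book-data API here (Goodreads no longer offers a public API); the URL, the limit parameter, and the response field names are assumptions based on Open Library's public documentation.

```r
# A sketch of takeaway 1: prefer an API over scraping when one exists.
# Open Library's search API is used as a stand-in for a book-data API;
# the URL, parameters, and field names are assumptions based on its docs.
library(jsonlite)

books <- fromJSON("https://openlibrary.org/search.json?q=the+hobbit&limit=5")

# The response is JSON with a `docs` table; keep a few columns of interest.
books$docs[, c("title", "author_name", "first_publish_year")]
```

An API call like this returns structured data directly, so the server does far less work than it would rendering full HTML pages for a scraper.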
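For the second question, here is a hypothetical robots.txt along the lines of the study-data example above; the paths and bot name are made up for illustration.

```
# Hypothetical robots.txt for the study-site example above.

User-agent: *            # rules that apply to every crawler
Disallow: /usernames/    # no crawler may fetch anything under /usernames/

User-agent: BadBot       # a specific crawler can also be singled out
Disallow: /              # and blocked from the entire site
```

The polite package's bow() function reads this file automatically, so a disallowed path simply will not be scraped.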
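Finally, a minimal sketch of a polite scrape of a Goodreads list. The list URL and the CSS selectors (.bookTitle, .authorName, .minirating) are assumptions about Goodreads' current markup and may need adjusting, and bow() will refuse any path that Goodreads' robots.txt disallows.

```r
# A sketch of scraping a Goodreads list with polite + rvest.
# The URL and CSS selectors are assumptions and may need adjusting;
# scrape() returns NULL with a warning if robots.txt forbids the path.
library(polite)
library(rvest)

# Introduce ourselves to the host: bow() reads robots.txt, enforces a
# crawl delay, and identifies us with a descriptive user agent.
session <- bow(
  "https://www.goodreads.com/list/show/1.Best_Books_Ever",
  user_agent = "student scraping exercise (your-email@example.com)",  # hypothetical contact
  delay = 5  # wait at least 5 seconds between requests
)

page <- scrape(session)

if (!is.null(page)) {
  # If the selectors return vectors of different lengths, Goodreads'
  # markup has changed and the selectors need revisiting.
  books <- data.frame(
    title  = page |> html_elements(".bookTitle")  |> html_text2(),
    author = page |> html_elements(".authorName") |> html_text2(),
    rating = page |> html_elements(".minirating") |> html_text2()
  )
  head(books)
}
```

Setting a descriptive user agent and a generous delay in bow() covers two of the takeaways from question 1: identifying yourself and not hammering the server.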

Instructions:

Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; all of the actions should have green checks).