With great power comes great responsibility: a large part of the web is built on data and on services that scrape those data. Now that we are starting to apply scraping techniques ourselves, we need to think about how to use those skills without becoming a burden to the internet community.
Find sources on ethical web scraping; some readings that might help you get started are:
After reading through some of the ethics essays, write a blog post addressing the following questions:
What are your three main takeaways for ethical web scraping? Give examples or cite your sources.
One of the first things I was unaware of was that web scraping can be a burden on the server. I did not think it was any more demanding for the server than simply visiting the website, but because that is not the case, good practice is to scrape during off-peak hours.
Another good practice is to identify yourself. The website's owner may notice some unusual activity, so it is a good idea to include a user-agent string in your requests that identifies you and, ideally, states your intentions.
Lastly, you should give back to the website owner by crediting them. If you use their data, cite their website or article; this helps send some traffic back their way.
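The first two takeaways can be put into practice directly in code. The sketch below is only an illustration: the URLs, user-agent text, and pause length are placeholders, and it assumes the {httr} package is available.

```r
# A rough sketch of pacing requests and identifying yourself (placeholder URLs)
library(httr)

urls <- c(
  "https://example.com/page1",
  "https://example.com/page2"
)

for (u in urls) {
  # Attach a user-agent string that says who you are and how to reach you
  resp <- GET(u, user_agent("your-name; course project; contact: your.email@example.com"))

  # ... process the response here ...

  # Pause between requests so the server is not hit in rapid succession
  Sys.sleep(10)
}
```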
What is a robots.txt file? Identify one instance and explain what it allows/prevents.
A robots.txt file is placed on a website to limit what web crawlers may access. For example, a site owner can use it to restrict how much of the site a search engine is allowed to crawl and include in its results. If you do not want Google to index your PDFs or images, you can add rules to robots.txt asking crawlers to skip those files.
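As a concrete illustration, the sketch below pulls down a real site's robots.txt and checks a couple of paths against it. It assumes the {robotstxt} package is installed; the domain and paths are just examples, and the exact rules may change over time.

```r
# A typical robots.txt contains directives along these lines:
#   User-agent: *
#   Disallow: /private/
#   Disallow: /*.pdf$   (wildcard syntax, honored by Google and most major crawlers)
# which ask crawlers to skip the /private/ section and PDF files.
library(robotstxt)

# Download and inspect the raw robots.txt for a site (Wikipedia as an example)
rt <- get_robotstxt(domain = "en.wikipedia.org")
cat(rt)

# Check whether specific paths may be crawled under those rules;
# article pages are generally allowed, while /w/ paths are generally not
paths_allowed(
  paths  = c("/wiki/Web_scraping", "/w/index.php"),
  domain = "en.wikipedia.org"
)
```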
Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.
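Here is a minimal sketch of what such a scrape could look like with the {polite} and {rvest} packages. The target page, user-agent text, and CSS selector are placeholders; substitute the site you actually chose.

```r
library(polite)
library(rvest)

# bow() introduces the scraper to the host, reads its robots.txt,
# and enforces a crawl delay between requests
session <- bow(
  url        = "https://en.wikipedia.org/wiki/Web_scraping",  # example page
  user_agent = "your-name; course project; your.email@example.com",
  delay      = 5  # seconds to wait between requests
)
session  # printing the session reports whether the path is scrapable

# scrape() only fetches the page if robots.txt allows it
page <- scrape(session)

# Extract content with the usual rvest selectors
headings <- page |>
  html_elements("h2") |>
  html_text2()

head(headings)
```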
Instructions:
Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; all of the actions should have green checks).