With great power comes great responsibility: a large part of the web is built on data and on services that scrape that data. Now that we are starting to apply scraping techniques, we need to think about how to use those skills without becoming a burden to the wider internet community.
Find sources on ethical web scraping - some readings that might help you get started with that are:
After reading through some of the ethics essays, write a blog post addressing the following questions:
What are your three main takeaways for ethical web scraping? Give examples or cite your sources.
Solution: The three main takeaways for ethical web scraping are:
Before any web scraping, check the website’s terms of use and its robots.txt file to make sure you are not violating any rules.
Web scraping should not collect sensitive or private information without the consent of the website owner or the individuals concerned. It should also respect data privacy rules and intellectual property rights, such as copyright.
Web scraping should not put undue load on the website’s server or disrupt its performance. An ethical scraper limits the rate at which it sends requests to the server.
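The rate-limiting takeaway can be illustrated with a small sketch. This is a minimal Python example, not the mechanism of any particular scraping library; the class name `Throttle`, its parameters, and the 5-second interval are assumptions made up for illustration.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative sketch)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds to leave between requests
        self._clock = clock               # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Pause until min_interval seconds have passed since the last call.

        Returns the number of seconds actually slept.
        """
        pause = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            pause = max(0.0, self.min_interval - elapsed)
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause


# Hypothetical usage: call wait() before every request to the same host.
throttle = Throttle(min_interval=5.0)
throttle.wait()  # first call returns immediately; later calls sleep as needed
```

Calling `wait()` before each request keeps the scraper at no more than one request per interval, regardless of how quickly the surrounding loop runs.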
What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents.
Solution: A robots.txt file is a plain text file placed at the root of a website’s server. It tells web crawlers which parts of the website they are allowed to access and which parts they are not.
The User-agent: * line indicates that the rules that follow apply to all web crawlers. Disallow lines specify directories or pages that bots should not access, while Allow lines explicitly permit access to the listed pages or directories.
Disallow rules help keep well-behaved bots from crawling pages with sensitive information, although compliance is voluntary on the crawler's side.
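To make the User-agent, Allow, and Disallow rules concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the bot name are all hypothetical and made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# Allow is listed before Disallow because Python's parser applies
# the first rule that matches the requested path.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
BOT = "example-class-bot"  # hypothetical user agent name

for path in ["/index.html", "/private/data.html", "/private/press-kit.html"]:
    url = "https://example.com" + path
    verdict = "allowed" if rules.can_fetch(BOT, url) else "blocked"
    print(f"{path}: {verdict}")

# Seconds the site asks crawlers to wait between requests, if stated.
print("crawl delay:", rules.crawl_delay(BOT))
```

In this sketch everything under `/private/` is blocked except the single page that is explicitly allowed, and paths that match no rule are allowed by default. Other crawlers may resolve conflicting rules differently (for example by preferring the most specific match), so rule order should not be relied on in general.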
Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.
Solution: We scrape the Active Civil Service List data, which lists all candidates who passed the exam. The output below shows the number of people (total_person) for each agency (list_agency_desc).
# A tibble: 9 × 2
  list_agency_desc                       total_person
  <chr>                                         <int>
1 ADMINISTRATION FOR CHILDREN'S SERVICES            2
2 DEPARTMENT OF CITY PLANNING                       1
3 DEPARTMENT OF EDUCATION                          25
4 HRA/DEPARTMENT OF SOCIAL SERVICES                 4
5 NYC EMPLOYEES' RETIREMENT SYSTEM                  5
6 OFFICE OF THE COMPTROLLER                        19
7 OPEN COMPETITIVE                                932
8 POLICE DEPARTMENT                                 9
9 TEACHERS' RETIREMENT SYSTEM                       3
Instructions:
Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; every action should show a green check).