Web scraping

Author

Kundan Kumar

Published

April 6, 2023

Prompt:

With great power comes great responsibility - a large part of the web is based on data, and on services that scrape those data. Now that we are starting to apply scraping mechanisms, we need to think about how to apply those skills without becoming a burden to the internet community.

Find sources on ethical web scraping - some readings that might help you get started are:

After reading through some of the ethics essays write a blog post addressing the following questions:

  1. What are your main three takeaways for ethical web scraping? - Give examples, or cite your sources.

Solution: The major takeaways for ethical web scraping:

  • Before any web scraping, check the website’s terms of use and robots.txt file to ensure you are not violating any rules.

  • Web scraping should not collect sensitive or private information without the consent of the website owner or the individuals concerned. It should respect data privacy and intellectual property rights, such as those covering copyrighted material.

  • Web scraping should not place undue load on the website’s server or disrupt its performance. Limit the rate at which requests are made to the server; a small sketch of both practices follows this list.
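
As a rough illustration of the first and third takeaways, the sketch below checks robots.txt before scraping and throttles a loop of requests. It assumes the robotstxt package is installed; the path, URLs, and 5-second delay are placeholders, not values from this post.

library(robotstxt)

# Ask whether any bot ("*") may crawl a given path before scraping it
paths_allowed(paths = "/resource/", domain = "data.cityofnewyork.us")

# Throttle a series of requests so the server is not overloaded
urls <- c("https://example.com/a", "https://example.com/b")  # placeholder URLs
for (u in urls) {
  Sys.sleep(5)             # wait 5 seconds between requests
  # res <- httr::GET(u)    # the actual request would go here
}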


  2. What is a robots.txt file? Identify one instance and explain what it allows/prevents.

Solution: A robots.txt file is placed on a website’s server to instruct web crawlers how to crawl and index its pages. It is a plain text file specifying which parts of the website web crawlers are or are not allowed to access.

An example of a robots.txt file looks like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Reference: the complete robots.txt can be found at https://kinsta.com/robots.txt

The User-agent: * line indicates that the rules apply to all web crawlers. The Disallow lines specify directories or pages the bots must not access, while the Allow lines explicitly permit access to particular pages or directories.

The Disallow rules help prevent bots from crawling pages with sensitive information.
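
These rules can also be checked programmatically. A minimal sketch, assuming the robotstxt package (which applies the User-agent, Disallow, and Allow logic for us):

library(robotstxt)

# Fetch the raw robots.txt shown above
rt <- get_robotstxt("kinsta.com")

paths_allowed("/wp-admin/", domain = "kinsta.com")                # expect FALSE
paths_allowed("/wp-admin/admin-ajax.php", domain = "kinsta.com")  # expect TRUE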

  3. Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.

Solution: We scrape the Active Civil Service List data, which consists of all candidates who passed the civil service exam.

https://data.cityofnewyork.us/City-Government/Civil-Service-List-Active-/vx8i-nprf

library(polite)
library(rvest)
library(httr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)

# Wrap httr::GET so every request respects robots.txt and the host's crawl delay
polite_GET <- politely(httr::GET, verbose = TRUE)
res <- polite_GET("https://data.cityofnewyork.us/resource/vx8i-nprf.json")
Fetching robots.txt
rt_robotstxt_http_getter: normal http get

New copy robots.txt was fetched from https://data.cityofnewyork.us/robots.txt
Total of 1 crawl delay rule(s) defined for this host.
Your rate will be set to 1 request every 5 second(s).
Pausing... 
Scraping: https://data.cityofnewyork.us/resource/vx8i-nprf.json
Setting useragent: polite R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu) bot
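
As an aside, the same request could be made with polite’s bow()/scrape() pair instead of wrapping httr::GET. The sketch below is illustrative only; the 5-second delay mirrors the crawl-delay rule reported above, and accept = "json" is an assumption about the endpoint.

# bow() introduces the bot to the host and reads robots.txt once per session
session <- bow("https://data.cityofnewyork.us/resource/vx8i-nprf.json", delay = 5)

# scrape() then performs the rate-limited request
result <- scrape(session, accept = "json")

We continue below with the res object returned by polite_GET().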
#res
# The response body is raw bytes: convert it to text, then parse the JSON
df <- jsonlite::fromJSON(rawToChar(res$content))
head(df)
  exam_no list_no first_name last_name adj_fa list_title_code
1    0162 875.000    JENELLE    FRASER  78.00           10001
2    0162 876.000        AML  METRYOSE  78.00           10001
3    0162 877.000    YAHAIRA   ALMONTE  78.00           10001
4    0162 878.000    CHI SUN      CHOW  77.00           10001
5    0162 879.000     RACHEL    CANCEL  77.00           10001
6    0162 880.000    FARHANA     AKTER  76.00           10001
            list_title_desc group_no list_agency_code list_agency_desc
1 ADMINISTRATIVE ACCOUNTANT      000              000 OPEN COMPETITIVE
2 ADMINISTRATIVE ACCOUNTANT      000              000 OPEN COMPETITIVE
3 ADMINISTRATIVE ACCOUNTANT      000              000 OPEN COMPETITIVE
4 ADMINISTRATIVE ACCOUNTANT      000              000 OPEN COMPETITIVE
5 ADMINISTRATIVE ACCOUNTANT      000              000 OPEN COMPETITIVE
6 ADMINISTRATIVE ACCOUNTANT      000              000 OPEN COMPETITIVE
           published_date        established_date        anniversary_date   mi
1 2021-05-26T00:00:00.000 2021-07-28T00:00:00.000 2025-07-28T00:00:00.000 <NA>
2 2021-05-26T00:00:00.000 2021-07-28T00:00:00.000 2025-07-28T00:00:00.000    T
3 2021-05-26T00:00:00.000 2021-07-28T00:00:00.000 2025-07-28T00:00:00.000 <NA>
4 2021-05-26T00:00:00.000 2021-07-28T00:00:00.000 2025-07-28T00:00:00.000 <NA>
5 2021-05-26T00:00:00.000 2021-07-28T00:00:00.000 2025-07-28T00:00:00.000 <NA>
6 2021-05-26T00:00:00.000 2021-07-28T00:00:00.000 2025-07-28T00:00:00.000 <NA>
  veteran_credit sibling_lgy_credit
1           <NA>               <NA>
2           <NA>               <NA>
3           <NA>               <NA>
4           <NA>               <NA>
5           <NA>               <NA>
6           <NA>               <NA>
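
One caveat: Socrata JSON endpoints like this one return a limited number of rows per request by default (typically 1,000), so the data frame above is only a slice of the full list. Paging parameters can request more; the sketch below reuses polite_GET() and assumes the standard SODA $limit/$offset query parameters.

# Request a larger page of results (the values here are illustrative)
res_page <- polite_GET(
  "https://data.cityofnewyork.us/resource/vx8i-nprf.json",
  query = list(`$limit` = 2000, `$offset` = 0)
)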

Total candidates who passed, by agency

# Count how many candidates appear on the list for each agency
df_grouped <- df %>%
  group_by(list_agency_desc) %>%
  summarize(total_person = n())

df_grouped
# A tibble: 9 × 2
  list_agency_desc                       total_person
  <chr>                                         <int>
1 ADMINISTRATION FOR CHILDREN'S SERVICES            2
2 DEPARTMENT OF CITY PLANNING                       1
3 DEPARTMENT OF EDUCATION                          25
4 HRA/DEPARTMENT OF SOCIAL SERVICES                 4
5 NYC EMPLOYEES' RETIREMENT SYSTEM                  5
6 OFFICE OF THE COMPTROLLER                        19
7 OPEN COMPETITIVE                                932
8 POLICE DEPARTMENT                                 9
9 TEACHERS' RETIREMENT SYSTEM                       3
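
Since ggplot2 is loaded above but not otherwise used, the grouped counts can also be visualized; a quick sketch (the aesthetic choices are assumptions, not part of the original analysis):

# Horizontal bar chart of candidates per agency; flipping keeps long names readable
ggplot(df_grouped, aes(x = reorder(list_agency_desc, total_person), y = total_person)) +
  geom_col() +
  coord_flip() +
  labs(x = "Agency", y = "Candidates on the list")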

Instructions:

Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; all of the actions should have green checks).