Web scraping

Web scraping etiquette …
Author

Charch

Published

April 6, 2023

The JSON schema for this blog differs between the announcement and the one on GitHub. Let me know if you want me to change it.

Prompt:

With great power comes great responsibility - a large part of the web is based on data and on services that scrape those data. Now that we are starting to apply scraping mechanisms, we need to think about how to use those skills without becoming a burden on the internet community.

Find sources on ethical web scraping - some readings that might help you get started with that are:

After reading through some of the ethics essays, write a blog post addressing the following questions:

  1. What are your main three takeaways for ethical web scraping? - Give examples, or cite your sources.

My main takeaway from the ethical web scraping readings is that it takes resources to build and maintain a website, so we should respect the robots.txt file and the site's terms of service.

Bow and scrape: basically, we should also obtain consent if we are going to use someone's content, and give them credit.

It would be nice to provide some value in return if we can. We should also respect data privacy and use only the content that is necessary for the work or hobby we are doing; anything else is best not stored on our computer.
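As a small illustration of how the polite package encodes these courtesies, here is a sketch that identifies the scraper with a descriptive user agent and spaces out requests. The user-agent string and the 5-second delay are made-up values for illustration, not something prescribed by the readings.

library(polite)

# Identify ourselves with a descriptive (made-up) user agent and ask for an
# explicit delay between requests so we do not burden the server.
session <- bow(
  "https://www.cheese.com/by_type",
  user_agent = "charch-scraping-exercise (learning blog)",
  delay = 5,
  force = TRUE
)

# Printing the session reports whether robots.txt allows scraping this path
session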

  2. What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents.

The robots.txt file is used to instruct web robots how to crawl (find content on) a site. It communicates which parts of the site are accessible and which actions are allowed. It usually lives at the root of a website.

Below is the example from Google Search Central:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml

The above file stops the agent named “Googlebot” from crawling any URL that starts with https://example.com/nogooglebot/, while every other crawler is allowed to crawl the whole site.

Robots.txt is only a guideline, however, and some crawlers do not seem to follow it in practice.
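If you would rather check rules like these from R than read the file by eye, the robotstxt package offers one way to do it (it is not used elsewhere in this post, so treat this as a sketch). The domain below is just the example.com placeholder from the snippet above, so the real answers depend on whatever robots.txt that site actually serves.

library(robotstxt)

# Ask whether a given bot may fetch a given path, based on the site's live
# robots.txt; example.com is only the placeholder domain from the snippet above.
paths_allowed(paths = "/nogooglebot/", domain = "example.com", bot = "Googlebot")
paths_allowed(paths = "/nogooglebot/", domain = "example.com", bot = "*")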

  3. Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.

The first example is taken from the second link in this .Rmd file, as a first try for this blog.

library(polite)
library(rvest)

# Introduce ourselves to the host; bow() checks the site's robots.txt first
session <- bow("https://www.cheese.com/by_type", force = TRUE)

# Politely fetch the semi-soft cheese listing and pull the names from the h3 headings
result <- scrape(session, query = list(t = "semi-soft", per_page = 100)) %>%
  html_node("#main-body") %>% 
  html_nodes("h3") %>% 
  html_text()
head(result)
[1] "3-Cheese Italian Blend"  "Abbaye de Citeaux"      
[3] "Abbaye du Mont des Cats" "Adelost"                
[5] "ADL Brick Cheese"        "Ailsa Craig"            
library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
session <- bow("https://en.wikipedia.org/wiki/List_of_national_independence_days", force = TRUE)
ind_html <-
  polite::scrape(session) %>%
  rvest::html_nodes("table.wikitable") %>% 
  rvest::html_table(fill = TRUE)

ind_tab <- 
  ind_html[[1]] %>% 
  clean_names()
ind_tab
# A tibble: 202 × 6
   country             name_of_holiday           date_…¹ year_…² indep…³ event…⁴
   <chr>               <chr>                     <chr>   <chr>   <chr>   <chr>  
 1 Afghanistan         Afghan Independence Day … 19 Aug… 1919    United… "Anglo…
 2 Albania             Flag Day (Dita e Flamuri… 28 Nov… 1912    Ottoma… "Alban…
 3 Algeria             Independence Day          5 July  1962    France  "Alger…
 4 Angola              Independence Day          11 Nov… 1975    Portug… "The A…
 5 Antigua and Barbuda Independence Day          1 Nove… 1981    United… "The e…
 6 Argentina           Independence Day          9 July  1816[8] Spanis… "Argen…
 7 Armenia             Republic Day              28 May  1918[9] Russia… "Decla…
 8 Armenia             Independence Day          21 Sep… 1991    Soviet… "1991 …
 9 Azerbaijan          Independence Day          28 May  1918    Russia… "Decla…
10 Azerbaijan          Independendence Restorat… 18 Oct… 1991[1… Soviet… "Adopt…
# … with 192 more rows, and abbreviated variable names ¹​date_of_holiday,
#   ²​year_of_event, ³​independence_from, ⁴​event_commemorated_and_notes
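One possible next step, sketched here rather than part of the assignment output: the year_of_event column still carries Wikipedia footnote markers such as “1816[8]”, so stripping them lets the year be treated as a number. The dplyr and stringr packages are extra assumptions here; they are not loaded elsewhere in this post.

library(dplyr)
library(stringr)

# Sketch only: drop footnote markers like "[8]" so the year parses as an integer
ind_tab %>%
  mutate(year_of_event = as.integer(str_remove(year_of_event, "\\[.*\\]"))) %>%
  select(country, name_of_holiday, year_of_event) %>%
  head()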

Instructions:

Submit to your repo. Make sure that all of the GitHub Actions pass (check the menu item Actions; all of the actions should have green checks).