Web scraping etiquette …

Errors and warnings in packages
Author

Muxin Hua

Published

April 6, 2023

Prompt:

With great power comes great responsibility - a large part of the web is based on data and services that scrape those data. Now that we start to apply scraping mechanisms, we need to think about how to apply those skills without becoming a burden to the internet society.

Find sources on ethical web scraping - some readings that might help you get started with that are:

After reading through some of the ethics essays write a blog post addressing the following questions:

  1. What are your main three takeaways for ethical web scraping? - Give examples, or cite your sources.

My takeaways come from both sides: be respectful and grateful when scraping, and be open to ethical scrapers when you own the site.

To me, scraping is like making a cold call: it takes two to make the deal. Visitors should leave the owner alone if there is a “No soliciting” sign. If there is no such sign, visitors need to knock on the door before going in. Once inside, visitors are responsible for identifying themselves, following the owner’s instructions, being polite, and saying thanks before leaving. For scrapers, these correspond to using an API if one exists, requesting data at a reasonable rate, respecting the site’s rules and data, and showing gratitude.

On the other hand, the owner can post rules or signs to avoid confusion. If the owner decides to start a conversation, they should respect visitors who follow the rules and, if a problem comes up, explain why the visitors need to leave. For site owners, these correspond to considering a public API, allowing ethical scrapers, and reaching out to scrapers before blocking them.
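The "request data at a reasonable rate" point can be made concrete by spacing requests out. The following is a minimal base R sketch, not taken from any package: `make_throttled()` and its `delay` argument are hypothetical names introduced here for illustration, and `identity()` stands in for a function that would actually fetch a page.

```r
# Wrap a function so that successive calls are at least `delay` seconds apart.
# make_throttled() is a hypothetical helper written for this example.
make_throttled <- function(f, delay = 5) {
  last_call <- -Inf  # time of the previous call, in seconds since the epoch
  function(...) {
    wait <- delay - (as.numeric(Sys.time()) - last_call)
    if (wait > 0) Sys.sleep(wait)  # pause until `delay` seconds have passed
    last_call <<- as.numeric(Sys.time())
    f(...)
  }
}

# Usage: identity() is a placeholder for a real request function.
slow_identity <- make_throttled(identity, delay = 0.2)
started <- Sys.time()
first <- slow_identity("page 1")   # runs immediately
second <- slow_identity("page 2")  # waits until 0.2 seconds have passed
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

The polite package used later in this post takes care of this itself through the `delay` argument of `bow()`, so a hand-written wrapper like this is only needed when working without it.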

  2. What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents.

    A robots.txt file tells scrapers and crawlers which URLs on a site they may access. Here is one rule from a robots.txt file:

    User-agent: Googlebot

    Disallow: /nogooglebot/

The user agent named Googlebot is not allowed to crawl any URL that starts with https://example.com/nogooglebot/.

User-agent: *

Allow: /

The lines above mean that all user agents are allowed to crawl the entire site.
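To see how the two rule groups shown above interact, here is a rough base R sketch of a robots.txt check. It is a deliberately simplified illustration, not a full implementation of the standard: `is_allowed()` is a hypothetical function name, user agents are compared by exact name (case-insensitive), and paths are matched by plain prefix only, with no support for wildcards or `$`.

```r
# Simplified robots.txt check (hypothetical helper, prefix matching only).
is_allowed <- function(robots_txt, user_agent, path) {
  lines <- trimws(sub("#.*$", "", robots_txt))        # drop comments
  lines <- lines[nzchar(lines)]                       # drop blank lines
  field <- tolower(trimws(sub(":.*$", "", lines)))    # e.g. "disallow"
  value <- trimws(sub("^[^:]*:", "", lines))          # e.g. "/nogooglebot/"

  # Attach every Allow/Disallow line to the most recent User-agent line.
  agents <- character(); types <- character(); prefixes <- character()
  current_agent <- NA_character_
  for (i in seq_along(lines)) {
    if (field[i] == "user-agent") {
      current_agent <- tolower(value[i])
    } else if (field[i] %in% c("allow", "disallow") && !is.na(current_agent)) {
      agents <- c(agents, current_agent)
      types <- c(types, field[i])
      prefixes <- c(prefixes, value[i])
    }
  }

  # Use the group that names this agent; otherwise fall back to "*".
  in_group <- agents == tolower(user_agent)
  if (!any(in_group)) in_group <- agents == "*"

  # Keep rules in that group whose non-empty prefix starts the path.
  hit <- in_group & nzchar(prefixes) & startsWith(path, prefixes)
  if (!any(hit)) return(TRUE)  # no rule applies, so the path is allowed

  # The longest matching prefix decides; Allow wins a tie.
  longest <- hit & nchar(prefixes) == max(nchar(prefixes[hit]))
  any(types[longest] == "allow")
}

robots <- c(
  "User-agent: Googlebot",
  "Disallow: /nogooglebot/",
  "",
  "User-agent: *",
  "Allow: /"
)

googlebot_blocked <- is_allowed(robots, "Googlebot", "/nogooglebot/page.html")
googlebot_other   <- is_allowed(robots, "Googlebot", "/index.html")
other_agent       <- is_allowed(robots, "hiiiua blog-10 assignment", "/nogooglebot/page.html")
```

With these rules, Googlebot is refused under `/nogooglebot/` but allowed elsewhere, while any other agent falls through to the `*` group and may fetch everything. In practice the polite package performs this check when `bow()` is called, so the sketch is only meant to show the logic.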

Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.

library(polite)
library(rvest)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
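# Introduce ourselves to the host; bow() also reads the site's robots.txt
url <- "https://en.wikipedia.org/wiki/AFC_Asian_Cup_records_and_statistics"
session <- bow(url, user_agent = "hiiiua blog-10 assignment")

# Fetch the page through the polite session and select the results table
section <- scrape(session) %>% html_nodes("#mw-content-text > div.mw-parser-output > table:nth-child(5)")

# Parse the selected HTML table into a tibble
section %>% html_table()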
[[1]]
# A tibble: 19 × 6
    Year `Host(s)`                           Winners     Winni…¹ Top s…² Best …³
   <int> <chr>                               <chr>       <chr>   <chr>   <chr>  
 1  1956 Hong Kong                           South Korea Kim Su… Nahum … —      
 2  1960 South Korea                         South Korea Kim Yo… Cho Yo… —      
 3  1964 Israel                              Israel      Yosef … Inder … —      
 4  1968 Iran                                Iran        Mahmou… Homayo… —      
 5  1972 Thailand                            Iran        Mohamm… Hossei… Ebrahi…
 6  1976 Iran                                Iran        Heshma… Gholam… Ali Pa…
 7  1980 Kuwait                              Kuwait      Carlos… Behtas… —      
 8  1984 Singapore                           Saudi Arab… Khalil… Jia Xi… Jia Xi…
 9  1988 Qatar                               Saudi Arab… Carlos… Lee Ta… Kim Jo…
10  1992 Japan                               Japan       Hans O… Fahad … Kazuyo…
11  1996 United Arab Emirates                Saudi Arab… Nelo V… Ali Da… Khodad…
12  2000 Lebanon                             Japan       Philip… Lee Do… Hirosh…
13  2004 China                               Japan       Zico    A'ala … Shunsu…
14  2007 Indonesia Malaysia Thailand Vietnam Iraq        Jorvan… Younis… Younis…
15  2011 Qatar                               Japan       Albert… Koo Ja… Keisuk…
16  2015 Australia                           Australia   Ange P… Ali Ma… Massim…
17  2019 United Arab Emirates                Qatar       Félix … Almoez… Almoez…
18  2023 Qatar                               TBD         TBD     TBD     TBD    
19  2027 Saudi Arabia                        TBD         TBD     TBD     TBD    
# … with abbreviated variable names ¹​`Winning coach`, ²​`Top scorer(s) (goals)`,
#   ³​`Best player award`

Instructions:

Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; all of the actions should have green checks).