library(polite)
library(rvest)
session <- bow("https://en.wikipedia.org/wiki/Table_tennis")
Prompt:
With great power comes great responsibility - a large part of the web is based on data and services that scrape those data. Now that we start to apply scraping mechanisms, we need to think about how to apply those skills without becoming a burden to the internet society.
Find sources on ethical web scraping - some readings that might help you get started with that are:
R package polite
After reading through some of the ethics essays, write a blog post addressing the following questions:
What are your main three takeaways for ethical web scraping? - Give examples, or cite your sources.
Prior to the reading, I was not aware of User Agent strings or robots.txt files. I think these two concepts, along with public APIs, are important to understand as an ethical web scraper. Despite using packages like rvest and Python's Beautiful Soup in the past, I had never encountered the need for a User Agent string or interacted with a robots.txt file. James Densmore did a great job outlining these topics in his Towards Data Science article. If everyone introduced themselves to the site owners via a User Agent string, checked what the house rules were via the robots.txt file, and utilized public APIs when available, site owners might be less likely to dissuade scraping.
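As a rough sketch of what that introduction looks like with polite's bow() (the bot name and contact address below are placeholders I made up, not anything from the article):
library(polite)
# Introduce yourself up front: who you are and how to reach you.
session <- bow(
  "https://en.wikipedia.org/wiki/Table_tennis",
  user_agent = "study-scraper (contact: me@example.com)",  # placeholder contact info
  delay = 5  # pause between requests so we are not a burden
)
session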
What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents.
A robots.txt file can prevent scrapers from accessing data, whether that is a web page, media file, or resource file. In an informative article by Google, the search giant explains how website owners use robots.txt to indicate which URLs they would like crawled by Google and to avoid too much traffic.
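To see one instance, you can read a site's robots.txt directly from R; here is a minimal sketch (the commented directives show the general format only and are not quoted from Wikipedia's actual file):
# Peek at the house rules before scraping.
rules <- readLines("https://en.wikipedia.org/robots.txt")
head(rules, 20)
# Directives generally take this shape:
# User-agent: *         which crawlers the rule applies to
# Disallow: /some/path  paths the owner does not want crawled
# Crawl-delay: 5        how long to wait between requests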
Identify a website that you would like to scrape (or use an example from class) and implement a scrape using the polite package.
I utilized the polite package to scrape information on notable table tennis players from Wikipedia.
First, I started by using bow() to introduce myself to the page. Next, I let the scrape() function know that I had properly introduced myself and passed the result to html_nodes() to get the table I was looking for by specifying the class. I passed the HTML to the html_table() function to nicely parse it into appealing R terms for me. Finally, I printed my object and reflected on how much I like Wikipedia.
result <- scrape(session) |>
  html_nodes(".wikitable") |>
  html_table()
result
[[1]]
# A tibble: 11 × 7
Name Gender Nationality `Times won` `Times won` Times…¹ ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Name Gender Nationality Olympics World Champi… World … ""
2 Jan-Ove Waldner Male Sweden 1 (1992) 2 (1989, 199… 1 (199… "[76…
3 Deng Yaping Female China 2 (1992, 1996) 3 (1991, 199… 1 (199… "[77…
4 Liu Guoliang Male China 1 (1996) 1 (1999) 1 (199… "[78…
5 Wang Nan Female China 1 (2000) 3 (1999, 200… 4 (199… "[79…
6 Kong Linghui Male China 1 (2000) 1 (1995) 1 (199… "[80…
7 Zhang Yining Female China 2 (2004, 2008) 2 (2005, 200… 4 (200… "[81…
8 Zhang Jike Male China 1 (2012) 2 (2011, 201… 2 (201… "[82…
9 Li Xiaoxia Female China 1 (2012) 1 (2013) 1 (200… "[83…
10 Ding Ning Female China 1 (2016) 3 (2011, 201… 2 (201… "[84…
11 Ma Long Male China 2 (2016, 2020) 3 (2015, 201… 2 (201… <NA>
# … with abbreviated variable name ¹`Times won`
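Since html_table() returns a list of tables, one small follow-up (a sketch based on the output above) is to pull out the single tibble and drop the first row, which just repeats the column labels from Wikipedia's multi-level header:
# Extract the tibble from the list and drop the duplicated header row.
players <- result[[1]]
players <- players[-1, ]
players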