With great power comes great responsibility: a large part of the web is built on data, and on services that scrape that data. Now that we are starting to apply scraping techniques, we need to think about how to use those skills without becoming a burden to the internet community.
Find sources on ethical web scraping - some readings that might help you get started with that are:
After reading through some of the ethics essays, write a blog post addressing the following questions:
What are your three main takeaways for ethical web scraping? Give examples, or cite your sources.
My takeaways come from both sides: being respectful and grateful when scraping, and being open to ethical scrapers as a site owner.
To me, scraping is like making a cold call: it takes two to make the deal. Visitors should leave the customer alone if there’s a “No soliciting” sign. If there is no such sign, visitors need to knock on the door before going in. Once inside, visitors are responsible for identifying themselves, following the owner’s instructions, being polite, and saying thanks before leaving. For a scraper, these correspond to using an API if one exists, requesting data at a reasonable rate, respecting the site’s rules and its data, and showing gratitude.
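As a rough illustration of "identifying yourself" and "requesting data reasonably", here is a minimal Python sketch. The class name `PoliteThrottle`, the helper `polite_headers`, the five-second delay, and the contact address are all hypothetical choices made for this example; they do not come from any of the readings.

```python
import time
from typing import Callable, Dict, Optional


class PoliteThrottle:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(
        self,
        min_delay_seconds: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_delay = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Pause if the previous request was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self._min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


def polite_headers(contact_email: str) -> Dict[str, str]:
    """Build request headers that tell the site owner who is knocking."""
    # The scraper name below is an assumption made for this example.
    return {"User-Agent": f"course-scraper (contact: {contact_email})"}


# Hypothetical setup: at most one request every five seconds.
throttle = PoliteThrottle(min_delay_seconds=5.0)
headers = polite_headers("student@example.com")
```

Calling `throttle.wait()` before each request, and sending `headers` along with it, is the code equivalent of knocking and introducing yourself. The clock and sleep functions are passed in as parameters so the pause logic can be checked without actually waiting.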
On the other hand, the owner can post rules or signs to avoid confusion. If the owner decides to open the door, the owner should respect visitors who follow the rules and, if a problem comes up, explain why the visitors need to leave. For a site owner, these correspond to considering public APIs, allowing ethical scrapers, and reaching out to scrapers before blocking them.
What is a ROBOTS.TXT file? Identify one instance and explain what it allows/prevents.
A robots.txt file tells scrapers which URLs on a site they may and may not access. Here’s a line of robots.txt.
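To show what such a file allows and prevents, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the user-agent name `my-course-scraper` are all made up for illustration and are not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
Crawl-delay: 5
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
AGENT = "my-course-scraper"  # hypothetical user-agent name

# Paths under /public/ are allowed for every user agent.
print(parser.can_fetch(AGENT, "https://example.com/public/page.html"))   # True
# Paths under /private/ are disallowed for every user agent.
print(parser.can_fetch(AGENT, "https://example.com/private/data.html"))  # False
# The site asks crawlers to wait this many seconds between requests.
print(parser.crawl_delay(AGENT))                                         # 5
```

For a real site, `RobotFileParser.set_url()` followed by `read()` would fetch the live file rather than parsing a string.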
[[1]]
# A tibble: 19 × 6
    Year `Host(s)`                           Winners     Winni…¹ Top s…² Best …³
   <int> <chr>                               <chr>       <chr>   <chr>   <chr>
 1  1956 Hong Kong                           South Korea Kim Su… Nahum … —
 2  1960 South Korea                         South Korea Kim Yo… Cho Yo… —
 3  1964 Israel                              Israel      Yosef … Inder … —
 4  1968 Iran                                Iran        Mahmou… Homayo… —
 5  1972 Thailand                            Iran        Mohamm… Hossei… Ebrahi…
 6  1976 Iran                                Iran        Heshma… Gholam… Ali Pa…
 7  1980 Kuwait                              Kuwait      Carlos… Behtas… —
 8  1984 Singapore                           Saudi Arab… Khalil… Jia Xi… Jia Xi…
 9  1988 Qatar                               Saudi Arab… Carlos… Lee Ta… Kim Jo…
10  1992 Japan                               Japan       Hans O… Fahad … Kazuyo…
11  1996 United Arab Emirates                Saudi Arab… Nelo V… Ali Da… Khodad…
12  2000 Lebanon                             Japan       Philip… Lee Do… Hirosh…
13  2004 China                               Japan       Zico    A'ala … Shunsu…
14  2007 Indonesia Malaysia Thailand Vietnam Iraq        Jorvan… Younis… Younis…
15  2011 Qatar                               Japan       Albert… Koo Ja… Keisuk…
16  2015 Australia                           Australia   Ange P… Ali Ma… Massim…
17  2019 United Arab Emirates                Qatar       Félix … Almoez… Almoez…
18  2023 Qatar                               TBD         TBD     TBD     TBD
19  2027 Saudi Arabia                        TBD         TBD     TBD     TBD
# … with abbreviated variable names ¹`Winning coach`, ²`Top scorer(s) (goals)`,
#   ³`Best player award`
Instructions:
Submit to your repo. Make sure that all of the GitHub Actions pass (check the Actions menu item; all of the actions should have green checks).