Reproducing Randomness

Ethics and Reproducibility…
Author

Maxwell Skinner

Published

February 23, 2023

Frontmatter check

Prompt:

In May 2015, Science retracted - without the consent of the lead author - a paper on how canvassers can sway people's opinions about gay marriage (see also: http://www.sciencemag.org/news/2015/05/science-retracts-gay-marriage-paper-without-agreement-lead-author-lacour). The Science Editor-in-Chief cited as reasons for the retraction that the original survey data was not made available for independent reproduction of results, that survey incentives were misrepresented, and that statements made about sponsorships turned out to be incorrect.
The investigation resulting in the retraction was triggered by two Berkeley grad students who attempted to replicate the study and discovered that the data must have been faked.

FiveThirtyEight has published an article with more details on the two Berkeley students’ work.

Malicious changes to the data such as in the LaCour case are hard to prevent, but more rigorous checks should be built into the scientific publishing system. All too often papers have to be retracted for unintended reasons. Retraction Watch is a database that keeps track of retracted papers (see the related Science magazine publication).

Read the paper Ten Simple Rules for Reproducible Computational Research by Sandve et al.

Write a blog post addressing the questions:

  1. Pick one of the papers from Retraction Watch that were retracted because of errors in the paper (you might want to pick a paper from the set of featured papers, because there are usually more details available). Describe what went wrong. Would any of the rules by Sandve et al. have helped in this situation?

Answer: From the Retraction Watch database, I found an article titled Facial expressions can detect Parkinson's disease: preliminary evidence from videos collected online (https://www.nature.com/articles/s41746-021-00502-8). It was retracted because the authors used a data pre-processing tool that leaked information from the training samples into the test samples, which produced incorrect classification metrics.

Following some of the rules by Sandve et al., I believe this mistake could have been avoided, or at least mitigated. Much of the data manipulation was done manually through individual lines of code rather than through a more robust method, such as the tools provided by the scikit-learn package in Python. Looking at the code for their clustering analysis, they split the training and test sets by hand rather than with scikit-learn's train_test_split function. This possibly allowed the training and test sets to bleed together, leading to inaccurate results (a safer pattern is sketched below).
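To make that concrete, here is a minimal sketch of the safer pattern in Python with scikit-learn. It is not the authors' actual code; the data and model are synthetic placeholders, and the point is only that the split happens before any preprocessing is fit, so nothing from the test set leaks into training.

# Illustrative sketch only (not the retracted paper's pipeline):
# split first, then fit all preprocessing on the training data alone.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data; the real study used facial-expression features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Splitting before any preprocessing keeps test samples out of the fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The pipeline fits the scaler on X_train only, then applies it to X_test.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))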

  2. After reading the paper by Sandve et al. describe which rule you are most likely to follow and why, and which rule you find the hardest to follow and will likely not (be able to) follow in your future projects.

One of the most useful rules that I will personally start following is rule 10. Before reading several data science papers for this exercise, I did not realize that the scripts behind a paper are often so readily available. I always knew GitHub could be used to store results and all of the methods used, but I didn't know it was becoming the standard. In future research, I will make sure to document my work cleanly and publish all of it in a GitHub repository so my results can be replicated as accurately as possible.

One of the rules that will be the hardest to follow is rule 6. In projects that involve randomness, reproducing the results requires recording the exact seed used in the experiment. One example that I think would make replication difficult is the train/test split: making sure it is exact and completely reproducible. With a proper statistical analysis, randomly splitting the training and test sets should yield similar results regardless of the split, but it is still difficult to make that step exact and reproducible, as the sketch below illustrates.
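As a small illustration of what rule 6 asks for, the following sketch (assuming a Python/scikit-learn workflow with synthetic placeholder data) records a single seed and passes it everywhere randomness enters, so the same split comes out on every run.

# Minimal sketch: record one seed and reuse it wherever randomness appears.
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # note this value alongside the results (Sandve et al., rule 6)

rng = np.random.default_rng(SEED)
X = rng.normal(size=(100, 5))     # placeholder features
y = rng.integers(0, 2, size=100)  # placeholder labels

# random_state makes the split deterministic across machines and runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)
print(X_train[0, :3])  # identical output every time the script is run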

Submission

  1. Push your changes to your repository.

  2. You are ready to call it good once all of your GitHub Actions pass without an error. You can check on that by selecting 'Actions' in the menu and ensuring that the last item has a green checkmark. The action for this repository checks the YAML front matter of your contribution for the existence of the author name, a title, a date, and categories. Don't forget the space after the colon! Once the action passes, the badge along the top will also change its color accordingly. As of right now, the status for the YAML front matter is:

Frontmatter check

---
author: "Your Name"
title: "Specify your title"
date: "2023-02-23"
categories: "Ethics and Reproducibility..."
---