Importance of Data Reproducibility

Ethics and Reproducibility…
Author

Logan Johnson

Published

February 23, 2023

Frontmatter check

Prompt:

In May 2015 Science retracted - without consent of the lead author - a paper on how canvassers can sway people’s opinions about gay marriage, see also: http://www.sciencemag.org/news/2015/05/science-retracts-gay-marriage-paper-without-agreement-lead-author-lacour The Science Editor-in-Chief cited as reasons for the retraction that the original survey data was not made available for independent reproduction of results, that survey incentives were misrepresented and that statements made about sponsorships turned out to be incorrect.
The investigation resulting in the retraction was triggered by two Berkeley grad students who attempted to replicate the study and discovered that the data must have been faked.

FiveThirtyEight has published an article with more details on the two Berkeley students’ work.

Malicious changes to the data such as in the LaCour case are hard to prevent, but more rigorous checks should be built into the scientific publishing system. All too often papers have to be retracted for unintended reasons. Retraction Watch is a data base that keeps track of retracted papers (see the related Science magazine publication).

Read the paper Ten Simple Rules for Reproducible Computational Research by Sandve et al.

Write a blog post addressing the questions:

  1. Pick one of the papers from Retraction Watch that were retracted because of errors in the paper (you might want to pick a paper from the set of featured papers, because there are usually more details available). Describe what went wrong. Would any of the rules by Sandve et al. have helped in this situation?

  2. After reading the paper by Sandve et al. describe which rule you are most likely to follow and why, and which rule you find the hardest to follow and will likely not (be able to) follow in your future projects.

Responses

  1. Attached is the link to the Article from Retraction Watch that I picked. This article appears to have gained attention when readers noticed the error bars on the graph from Figure 9 were simply the letter “T”. This is obviously a concern because it raises questions about the underlying data, the methods for collecting and analyzing the data, and the overall confidence in the data. After the journal was made aware of these concerns about the paper, they conducted an investigation on the paper which ultimately led to the paper’s retraction. The journal identified that the data used in the study was not generated. There was also incorrect notation in some of the equations used in the paper as well as references in the paper that were retracted prior to submission.
  • It is concerning that a paper with this many errors made it through the peer review process. This is obviously a separate concern, but I think there were some steps that the journal could have used to determine the quality of the data prior. I think Rules 1, 2, 7, and 10 all could have helped to avoid the publishing of the paper in the first place. Again, I think there were greater underlying problems with this paper apart from their ability to manage and report their data and also some concerns that this paper made it through the peer-review process in the first place. But I think if the authors applied some of these data management strategies outline by Sandve et al. that this retraction likely would not have happened.
  1. Through STAT 585 and the information that I have learned thus far, I will most definitely implement version control (Rule 4) in my research. I have already begun to transition existing projects onto my GitHub. I think this is a basic part that I have not done previously that will help me and future students as well. The challenge that I faced previously was a lack of understanding of Git and GitHub that hindered my ability to apply that version control to my research.
  • I think the hardest rule to follow might be Rule 2. I think with some of the data that I have on a small scale, it is easier to manipulate that manually versus have a code or workflow for that. With some of the larger projects and data that I have, it is definitely easier and more efficient to manipulate it with code. I just think there will be times that it is easier done by hand or manually than it would be to write it all out.

Submission

  1. Push your changes to your repository.

  2. You are ready to call it good, once all your github actions pass without an error. You can check on that by selecting ‘Actions’ on the menu and ensure that the last item has a green checkmark. The action for this repository checks the yaml of your contribution for the existence of the author name, a title, date and categories. Don’t forget the space after the colon! Once the action passes, the badge along the top will also change its color accordingly. As of right now, the status for the YAML front matter is:

Frontmatter check