library(tidyverse)
options(digits = 3)
#options(prompt = "R> ")
<- rgb(102, 204, 255, max = 255)
brightblue
# ----------------------------
# Base ball case study ============================================
<- subset(plyr::baseball, id == "ruthba01")
baberuth <- transform(baberuth, cyear = year - min(year) + 1)
baberuth
# ----------------------------
## The Original Code
library(plyr)
<- ddply(baseball, .(id), transform,
baseball.plyr cyear = year - min(year) + 1)
## Making this a Dplyr Code
<- baseball%>%group_by(id)%>%mutate(cyear = year - min(year) + 1) baseball.dplyr
Prompt:
The plyr
package has by now been replaced by other, even faster packages, but the idea of Split, apply, combine is as relevant as ever.
Read the paper The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham.
Write a blog post addressing the following questions:
The R code for the split-apply-combine paper is posted with the paper. Pick one of the examples demonstrating
plyr
functionality (such asdlply
orddply
, …) and rewrite the example using functionality from the packagedplyr
. Make sure that your example works and the results are identical.- I used the Baseball example with calculating the
cyear
.
- I used the Baseball example with calculating the
- Which (base R) functions do you know that support the split-apply-combine strategy? In your opinion, are these sufficient? State why or why not?
- All of the *apply() functions (like
lapply()
,mapply()
,sapply()
, etc) can help with the apply part of the split-apply-combine strategy. - We could also use concatination functions, like
cbind()
andrbind()
- We can finally use functions like
$
to separate in base R.
- All of the *apply() functions (like
- The ChatGPT response to the prompt “Describe the split-apply-combine paradigm” is given below. Generally, the chatbot’s answers are decent. ChatGPT does have problems getting details right. Which details would you change in the response?
The “split-apply-combine” paradigm is a data analysis strategy used in statistics and data science to analyze grouped data. It consists of three steps:
Split: The data is first split into groups based on one or more variables of interest.
Apply: A specific operation, function, or a group of functions is applied to each group of data, usually for the purpose of aggregating, summarizing, or transforming the data within each group.
Combine: The results of the operation applied to each group are then combined and returned as a single output.
The split-apply-combine paradigm is particularly useful in cases where the data cannot be analyzed as a whole and instead must be analyzed in smaller, more manageable chunks. The slipt-apply-combine paradigm is also useful when the data must have multiple functions applied to one another. This allows for more efficient data processing and improved scalability. The paradigm is commonly implemented using tools such as the “group_by” function in the R and Python programming languages, and the “dplyr” library in R.
You can write your answers directly the README.Rmd
file. Make sure that the file knits (you will need to activate any packages your use in your code chunks with a call to library(xxx)
, where xxx is the name of the package, such as plyr
). Commit your changes and push to your repo; add any files in the README_files
directory to your repository.