library(plyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:plyr':
#>
#> arrange, count, desc, failwith, id, mutate, rename, summarise,
#> summarize
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# using plyr: add each player's career year (1 = first season)
baseball <- ddply(baseball, .(id), transform,
                  cyear = year - min(year) + 1)
# using dplyr
baseball %>%
  group_by(id) %>%
  mutate(cyear = year - min(year) + 1)
#> # A tibble: 21,699 × 23
#> # Groups: id [1,228]
#> id year stint team lg g ab r h X2b X3b hr rbi
#> <chr> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 aaro… 1954 1 ML1 NL 122 468 58 131 27 6 13 69
#> 2 aaro… 1955 1 ML1 NL 153 602 105 189 37 9 27 106
#> 3 aaro… 1956 1 ML1 NL 153 609 106 200 34 14 26 92
#> 4 aaro… 1957 1 ML1 NL 151 615 118 198 27 6 44 132
#> 5 aaro… 1958 1 ML1 NL 153 601 109 196 34 4 30 95
#> 6 aaro… 1959 1 ML1 NL 154 629 116 223 46 7 39 123
#> 7 aaro… 1960 1 ML1 NL 153 590 102 172 20 11 40 126
#> 8 aaro… 1961 1 ML1 NL 155 603 115 197 39 10 34 120
#> 9 aaro… 1962 1 ML1 NL 156 592 127 191 28 6 45 128
#> 10 aaro… 1963 1 ML1 NL 161 631 121 201 29 4 44 130
#> # … with 21,689 more rows, and 10 more variables: sb <int>, cs <int>, bb <int>,
#> # so <int>, ibb <int>, hbp <int>, sh <int>, sf <int>, gidp <int>, cyear <dbl>

Prompt:
The plyr package has by now been replaced by other, even faster packages, but the idea of Split, apply, combine is as relevant as ever.
Read the paper The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham.
Write a blog post addressing the following questions:
- The R code for the split-apply-combine paper is posted with the paper. Pick one of the examples demonstrating plyr functionality (such as dlply or ddply, …) and rewrite the example using functionality from the package dplyr. Make sure that your example works and the results are identical.
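I chose to rewrite the ddply example that adds a career-year column (cyear) for each player. ddply splits a data frame by the given variables, applies a function to each piece, and combines the results back into a data frame. In dplyr, group_by does the splitting, and mutate creates new columns that are functions of existing variables, computed within each group.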
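To check that the two versions really agree, one option is to compare them directly. The following is a minimal sketch, not the only way to do it: it assumes the baseball data shipped with plyr, and sort_rows is a hypothetical helper introduced here. Sorting is needed because ddply returns the rows grouped by id, while mutate keeps the input row order.

```r
library(plyr)
library(dplyr)

# Start from a fresh copy of the dataset shipped with plyr
data(baseball, package = "plyr")

# plyr version
plyr_result <- ddply(baseball, .(id), transform,
                     cyear = year - min(year) + 1)

# dplyr version, converted back to a plain data frame for comparison
dplyr_result <- baseball %>%
  group_by(id) %>%
  mutate(cyear = year - min(year) + 1) %>%
  ungroup() %>%
  as.data.frame()

# Hypothetical helper: put both results into the same row order
sort_rows <- function(df) df[order(df$id, df$year, df$stint), ]

# Compare values only; row names and other attributes are ignored
same <- all.equal(sort_rows(plyr_result), sort_rows(dplyr_result),
                  check.attributes = FALSE)
print(same)
```

If the comparison prints TRUE, the two approaches produce the same values; anything else is a description of where they differ.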
- Which (base R) functions do you know that support the split-apply-combine strategy? In your opinion, are these sufficient? State why or why not?
- split()
- lapply()
- rbind()
are all functions in base R, and they map onto the three steps: split() splits, lapply() applies, and rbind() combines. I think they work well, but maybe not as efficiently as plyr. However, I do believe they are sufficient. I use rbind() and lapply() all the time.
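As a rough illustration of how these three functions fit together, here is the same career-year calculation in base R only. The careers data frame is a small made-up example, not the real baseball data, and the variable names are assumptions for this sketch.

```r
# Made-up toy data: two players with a few seasons each
careers <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  year = c(1954L, 1955L, 1956L, 1990L, 1992L)
)

# Split: one data frame per player
pieces <- split(careers, careers$id)

# Apply: add the career year within each piece
pieces <- lapply(pieces, function(df) {
  df$cyear <- df$year - min(df$year) + 1
  df
})

# Combine: stack the pieces back into a single data frame
base_result <- do.call(rbind, pieces)
rownames(base_result) <- NULL
print(base_result)
```

do.call is needed because rbind expects the data frames as separate arguments rather than as one list. This extra bookkeeping is the kind of thing plyr and dplyr handle for you.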
- The ChatGPT response to the prompt “Describe the split-apply-combine paradigm” is given below. Generally, the chatbot’s answers are decent. ChatGPT does have problems getting details right. Which details would you change in the response?
First, I wouldn’t say “analyze grouped data”. To me, it isn’t incorrect, just poorly worded. The response also missed that split-apply-combine can be done in base R and plyr. Finally, it missed that split-apply-combine is a paradigm and therefore has many applications beyond data cleaning and manipulation. As Hadley Wickham’s paper says, you can use it in modeling and visualization as well.
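To make the modeling point concrete, here is a small sketch that fits one linear model per group. It uses the built-in mtcars data rather than anything from the paper, and the choice of model (mpg against wt, split by cyl) is purely illustrative.

```r
# Split: one data frame per number of cylinders
by_cyl <- split(mtcars, mtcars$cyl)

# Apply: fit a separate linear model to each piece
models <- lapply(by_cyl, function(df) lm(mpg ~ wt, data = df))

# Combine: collect the coefficients into one matrix, one row per group
coefs <- do.call(rbind, lapply(models, coef))
print(coefs)
```

The same three steps appear here even though no data cleaning is involved, which is why calling it a paradigm seems more accurate than calling it a data manipulation technique.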