library(plyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:plyr':
#>
#> arrange, count, desc, failwith, id, mutate, rename, summarise,
#> summarize
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# using plyr
baseball <- ddply(baseball, .(id), transform,
  cyear = year - min(year) + 1)
# using dplyr
baseball %>%
  group_by(id) %>%
  mutate(cyear = year - min(year) + 1)
#> # A tibble: 21,699 × 23
#> # Groups: id [1,228]
#> id year stint team lg g ab r h X2b X3b hr rbi
#> <chr> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 aaro… 1954 1 ML1 NL 122 468 58 131 27 6 13 69
#> 2 aaro… 1955 1 ML1 NL 153 602 105 189 37 9 27 106
#> 3 aaro… 1956 1 ML1 NL 153 609 106 200 34 14 26 92
#> 4 aaro… 1957 1 ML1 NL 151 615 118 198 27 6 44 132
#> 5 aaro… 1958 1 ML1 NL 153 601 109 196 34 4 30 95
#> 6 aaro… 1959 1 ML1 NL 154 629 116 223 46 7 39 123
#> 7 aaro… 1960 1 ML1 NL 153 590 102 172 20 11 40 126
#> 8 aaro… 1961 1 ML1 NL 155 603 115 197 39 10 34 120
#> 9 aaro… 1962 1 ML1 NL 156 592 127 191 28 6 45 128
#> 10 aaro… 1963 1 ML1 NL 161 631 121 201 29 4 44 130
#> # … with 21,689 more rows, and 10 more variables: sb <int>, cs <int>, bb <int>,
#> # so <int>, ibb <int>, hbp <int>, sh <int>, sf <int>, gidp <int>, cyear <dbl>
Prompt:
The plyr package has by now been replaced by other, even faster packages, but the idea of split, apply, combine is as relevant as ever.
Read the paper The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham.
Write a blog post addressing the following questions:
- The R code for the split-apply-combine paper is posted with the paper. Pick one of the examples demonstrating plyr functionality (such as dlply or ddply, …) and rewrite the example using functionality from the package dplyr. Make sure that your example works and the results are identical.
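I chose to rewrite the ddply example. ddply splits a data frame by the given variables, applies a function to each piece, and combines the results into a data frame. In dplyr, the split is done with group_by(). mutate() then creates new columns that are functions of existing variables, computed within each group, so it takes the place of transform in the ddply call.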
- Which (base R) functions do you know that support the split-apply-combine strategy? In your opinion, are these sufficient? State why or why not?
- split()
- lapply()
- rbind()
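are all base R functions that support the strategy: split() does the splitting, lapply() the applying, and rbind() (usually through do.call()) the combining. I think they work well, but maybe not as efficiently as plyr. Still, I do believe they are sufficient. I use rbind() and lapply() all the time.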
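As a minimal sketch of how these three functions fit together, here is the same career-year calculation (cyear) done with base R only. The players data frame is made-up toy data, not the baseball dataset, and using do.call() to stack the pieces is my own choice for the combine step.

```r
# Toy data (hypothetical values, not taken from the baseball dataset)
players <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  year = c(1990, 1991, 1993, 2001, 2002)
)

# Split: one data frame per player id
pieces <- split(players, players$id)

# Apply: add the career year within each piece
pieces <- lapply(pieces, function(df) {
  df$cyear <- df$year - min(df$year) + 1
  df
})

# Combine: stack the pieces back into a single data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL  # drop the "a.1", "a.2", ... row names rbind creates

result
# cyear is 1, 2, 4 for player "a" and 1, 2 for player "b"
```

It takes more lines than the ddply or dplyr versions, but every step of split, apply, and combine is visible, which is part of why I think base R is sufficient.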
- The ChatGPT response to the prompt “Describe the split-apply-combine paradigm” is given below. Generally, the chatbot’s answers are decent. ChatGPT does have problems getting details right. Which details would you change in the response?
First, I wouldn’t say “analyze grouped data”. To me, it isn’t incorrect, just not great wording. The response also missed that split-apply-combine can be done in base R as well as in plyr. Finally, it missed that split-apply-combine is a paradigm and therefore has many applications beyond data cleaning and manipulation. As Hadley Wickham’s paper says, you can use it in modeling and visualization as well.