library(plyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:plyr':
#> 
#>     arrange, count, desc, failwith, id, mutate, rename, summarise,
#>     summarize
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
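# using plyr: split by player id, add the career-year column, recombine
baseball <- ddply(baseball, .(id), transform,
  cyear = year - min(year) + 1)
# using dplyr: group_by() does the split, mutate() adds the column
baseball %>%
  group_by(id) %>%
  mutate(cyear = year - min(year) + 1)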
#> # A tibble: 21,699 × 23
#> # Groups:   id [1,228]
#>    id     year stint team  lg        g    ab     r     h   X2b   X3b    hr   rbi
#>    <chr> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1 aaro…  1954     1 ML1   NL      122   468    58   131    27     6    13    69
#>  2 aaro…  1955     1 ML1   NL      153   602   105   189    37     9    27   106
#>  3 aaro…  1956     1 ML1   NL      153   609   106   200    34    14    26    92
#>  4 aaro…  1957     1 ML1   NL      151   615   118   198    27     6    44   132
#>  5 aaro…  1958     1 ML1   NL      153   601   109   196    34     4    30    95
#>  6 aaro…  1959     1 ML1   NL      154   629   116   223    46     7    39   123
#>  7 aaro…  1960     1 ML1   NL      153   590   102   172    20    11    40   126
#>  8 aaro…  1961     1 ML1   NL      155   603   115   197    39    10    34   120
#>  9 aaro…  1962     1 ML1   NL      156   592   127   191    28     6    45   128
#> 10 aaro…  1963     1 ML1   NL      161   631   121   201    29     4    44   130
#> # … with 21,689 more rows, and 10 more variables: sb <int>, cs <int>, bb <int>,
#> #   so <int>, ibb <int>, hbp <int>, sh <int>, sf <int>, gidp <int>, cyear <dbl>
Prompt:
The plyr package has by now been replaced by other, even faster packages, but the idea of Split, apply, combine is as relevant as ever.
Read the paper The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham.
Write a blog post addressing the following questions:
- The R code for the split-apply-combine paper is posted with the paper. Pick one of the examples demonstrating plyr functionality (such as dlply or ddply, …) and rewrite the example using functionality from the package dplyr. Make sure that your example works and the results are identical.
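I chose to replace the ddply() call that adds a career-year column (cyear) to the baseball data. ddply() splits a data frame by one or more variables, here id, which dplyr does with group_by(). The transform step is handled by mutate(), which creates new columns as functions of existing variables and, on grouped data, computes them within each group.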
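To check the "results are identical" part, here is a minimal sketch of one way to compare the two versions. It assumes plyr and dplyr are installed, starts from a fresh copy of the data (`plyr::baseball`), and assumes that id, year, and stint together identify a row. The helper `sort_rows()` and the names `bb`, `by_plyr`, and `by_dplyr` are mine, not from the paper.

```r
library(plyr)
library(dplyr)

# Fresh copy of the data shipped with plyr
bb <- plyr::baseball

# plyr version: split by id, add cyear, combine
by_plyr <- ddply(bb, .(id), transform, cyear = year - min(year) + 1)

# dplyr version: same computation, then drop the grouping and the tibble class
by_dplyr <- bb %>%
  group_by(id) %>%
  mutate(cyear = year - min(year) + 1) %>%
  ungroup() %>%
  as.data.frame()

# ddply() returns rows grouped by id, while mutate() keeps the original order,
# so put both results in the same row order before comparing
# (assumption: id, year, and stint together identify a row)
sort_rows <- function(df) {
  df <- df[order(df$id, df$year, df$stint), ]
  rownames(df) <- NULL
  df
}

# Compare the new column, then the full data frames
all.equal(sort_rows(by_plyr)$cyear, sort_rows(by_dplyr)$cyear)
all.equal(sort_rows(by_plyr), sort_rows(by_dplyr))
```

Sorting first matters because the two functions return the same rows in a different order, so a direct comparison could report differences that are only about ordering.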
- Which (base R) functions do you know that support the split-apply-combine strategy? In your opinion, are these sufficient? State why or why not?
- split()
- lapply()
- rbind()
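These are all base R functions: split() does the split, lapply() the apply, and rbind() (usually called through do.call()) the combine. I think they work well, but maybe not as efficiently as plyr. However, I do believe they are sufficient. I use rbind() and lapply() all the time.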
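As a rough illustration of how these three functions fit together, here is a small base R sketch of the same career-year calculation. The data frame `toy` is made up for the example and is not the baseball data.

```r
# Hypothetical toy data: two players with a few seasons each
toy <- data.frame(
  id   = c("a", "a", "b", "b", "b"),
  year = c(1990L, 1992L, 2001L, 2000L, 2003L),
  stringsAsFactors = FALSE
)

# Split: one data frame per player
pieces <- split(toy, toy$id)

# Apply: add the career-year column to each piece
pieces <- lapply(pieces, function(piece) {
  piece$cyear <- piece$year - min(piece$year) + 1
  piece
})

# Combine: stack the pieces back into a single data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Each step is explicit here, which is more typing than a single ddply() call but makes it easy to inspect the intermediate pieces. Base R also has helpers such as tapply(), aggregate(), and by() that wrap the three steps in one call.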
- The ChatGPT response to the prompt “Describe the split-apply-combine paradigm” is given below. Generally, the chatbot’s answers are decent. ChatGPT does have problems getting details right. Which details would you change in the response?
First, I wouldn’t say “analyze grouped data”. To me, it isn’t incorrect, just not great wording. The response also missed that split-apply-combine can be done in base R and plyr. Finally, it missed that split-apply-combine is a paradigm and therefore has many applications beyond data cleaning and manipulation. As Hadley Wickham’s paper says, you can use it in modeling and visualization as well.
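To make the modeling point concrete, here is a small sketch in base R that fits one linear model per group and collects the slopes. The data frame `toy` and its columns are invented for the illustration; this is not the example from the paper.

```r
# Hypothetical toy data: two groups, each with an exact linear trend
toy <- data.frame(
  group = rep(c("a", "b"), each = 4),
  x     = rep(1:4, times = 2),
  y     = c(2, 4, 6, 8, 1, 1.5, 2, 2.5)
)

# Split by group, then apply: fit a linear model to each piece
models <- lapply(split(toy, toy$group), function(piece) {
  lm(y ~ x, data = piece)
})

# Combine: pull the slope out of each model into one named vector
slopes <- vapply(models, function(m) unname(coef(m)["x"]), numeric(1))
slopes
```

The apply step returns a model object rather than a data frame, and the combine step reduces the list of models to a vector, so the same three-step pattern covers more than data manipulation.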