Split, apply, combine

Split-apply-combine
Author

Yingchao Zhou

Published

February 16, 2023

Prompt:

The plyr package has by now been replaced by other, even faster packages, but the idea of Split, apply, combine is as relevant as ever.

Read the paper The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham.

Write a blog post addressing the following questions:

  1. The R code for the split-apply-combine paper is posted with the paper. Pick one of the examples demonstrating plyr functionality (such as dlply or ddply, …) and rewrite the example using functionality from the package dplyr. Make sure that your example works and the results are identical.

  2. Which (base R) functions do you know that support the split-apply-combine strategy? In your opinion, are these sufficient? State why or why not?

  3. The ChatGPT response to the prompt “Describe the split-apply-combine paradigm” is given below. Generally, the chatbot’s answers are decent. ChatGPT does have problems getting details right. Which details would you change in the response?

The “split-apply-combine” paradigm is a data analysis strategy used in statistics and data science to analyze grouped data. It consists of three steps:

Split: The data is first split into groups based on one or more variables of interest.

Apply: A specific operation or function is applied to each group of data, usually for the purpose of aggregating, summarizing, or transforming the data within each group.

Combine: The results of the operation applied to each group are then combined and returned as a single output.

The split-apply-combine paradigm is particularly useful in cases where the data cannot be analyzed as a whole and instead must be analyzed in smaller, more manageable chunks. This allows for more efficient data processing and improved scalability. The paradigm is commonly implemented using tools such as the “groupby” function in the R and Python programming languages, and the “dplyr” library in R.

You can write your answers directly in the README.Rmd file. Make sure that the file knits (you will need to activate any packages you use in your code chunks with a call to library(xxx), where xxx is the name of the package, such as plyr). Commit your changes and push to your repo; add any files in the README_files directory to your repository.

1 Rewrite plyr example using dplyr

I chose ddply.

First, run the original ddply example:

library("ggplot2")
library("plyr")
options(digits = 3)

## check original baseball dataset first
head(baseball[,c("id", "year")])
#>            id year
#> 4   ansonca01 1871
#> 44  forceda01 1871
#> 68  mathebo01 1871
#> 99  startjo01 1871
#> 102 suttoez01 1871
#> 106 whitede01 1871
baseball1 <- ddply(baseball, .(id), transform, 
  cyear = year - min(year) + 1)
head(baseball1[,c("id", "year", "cyear")])
#>          id year cyear
#> 1 aaronha01 1954     1
#> 2 aaronha01 1955     2
#> 3 aaronha01 1956     3
#> 4 aaronha01 1957     4
#> 5 aaronha01 1958     5
#> 6 aaronha01 1959     6

This operation creates a new variable “cyear”: the year of a record minus the earliest year recorded for the same id, plus 1, so the first year for each id has cyear equal to 1. Note also that the output rows are now ordered by id, so all records for one id sit together.

We can take a closer look at id “aaronha01” to check this understanding:

baseball[baseball$id == "aaronha01",c("id", "year")]
#>              id year
#> 37157 aaronha01 1954
#> 37795 aaronha01 1955
#> 38408 aaronha01 1956
#> 39023 aaronha01 1957
#> 39645 aaronha01 1958
#> 40291 aaronha01 1959
#> 40921 aaronha01 1960
#> 41642 aaronha01 1961
#> 42323 aaronha01 1962
#> 43076 aaronha01 1963
#> 43824 aaronha01 1964
#> 44578 aaronha01 1965
#> 45338 aaronha01 1966
#> 46122 aaronha01 1967
#> 46880 aaronha01 1968
#> 47703 aaronha01 1969
#> 48618 aaronha01 1970
#> 49524 aaronha01 1971
#> 50416 aaronha01 1972
#> 51294 aaronha01 1973
#> 52200 aaronha01 1974
#> 52657 aaronha01 1975
#> 53564 aaronha01 1976
baseball1[baseball1$id == "aaronha01",c("id", "year", "cyear")]
#>           id year cyear
#> 1  aaronha01 1954     1
#> 2  aaronha01 1955     2
#> 3  aaronha01 1956     3
#> 4  aaronha01 1957     4
#> 5  aaronha01 1958     5
#> 6  aaronha01 1959     6
#> 7  aaronha01 1960     7
#> 8  aaronha01 1961     8
#> 9  aaronha01 1962     9
#> 10 aaronha01 1963    10
#> 11 aaronha01 1964    11
#> 12 aaronha01 1965    12
#> 13 aaronha01 1966    13
#> 14 aaronha01 1967    14
#> 15 aaronha01 1968    15
#> 16 aaronha01 1969    16
#> 17 aaronha01 1970    17
#> 18 aaronha01 1971    18
#> 19 aaronha01 1972    19
#> 20 aaronha01 1973    20
#> 21 aaronha01 1974    21
#> 22 aaronha01 1975    22
#> 23 aaronha01 1976    23
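
The per-id calculation can also be sketched in base R on a small example. This is only an illustration: the `toy` data frame and its values are made up, and `ave()` is swapped in for `ddply()` to compute the group minimum while keeping the original row order.

```r
# Hypothetical toy data standing in for the baseball data set
toy <- data.frame(
  id   = c("b", "a", "a", "b", "a"),
  year = c(2001, 1990, 1991, 2003, 1993)
)

# ave() applies FUN within each id group and returns one value per row,
# in the original row order, so it can be subtracted from year directly
toy$cyear <- toy$year - ave(toy$year, toy$id, FUN = min) + 1

toy
```

Unlike the ddply output, the rows here are not reordered by id; only the new column is added.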

Next we rewrite this example using dplyr:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:plyr':
#> 
#>     arrange, count, desc, failwith, id, mutate, rename, summarise,
#>     summarize
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
baseball %>% group_by(id) %>%
  arrange(id) %>%
  mutate(cyear = year - min(year) + 1) -> baseball2

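Check whether the two datasets are the same. Columns containing NA are dropped first, because comparing NA values returns NA and the sum would then be NA instead of a count of mismatched cells. A sum of 0 means no cell differs.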

baseball1 %>% select_if(~ !any(is.na(.))) -> baseball1
baseball2 %>% select_if(~ !any(is.na(.))) -> baseball2

sum(baseball1 != baseball2)
#> [1] 0
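
A possible alternative that does not require dropping columns is `all.equal()`, which treats NA values in matching positions as equal. The sketch below uses two made-up data frames, `a` and `b`, as stand-ins for the two results; it has not been run against the baseball data.

```r
# Hypothetical toy results standing in for baseball1 and baseball2
a <- data.frame(
  id   = c("x", "x", "y"),
  year = c(1990, 1991, 2000),
  rbi  = c(NA, 3, 5)
)
b <- a

# `!=` propagates NA, so the sum is NA even though a and b match
na_sum <- sum(a != b)

# all.equal() returns TRUE when the data frames match, NA cells included
same <- isTRUE(all.equal(a, b))

# After changing one value, all.equal() returns a description of the
# difference instead of TRUE, so isTRUE() gives FALSE
b$year[3] <- 2001
changed <- isTRUE(all.equal(a, b))
```

For the objects in this post, the dplyr result is a grouped tibble, so it would probably need `as.data.frame()` and possibly `check.attributes = FALSE` before such a comparison; that part is an assumption.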

2 Base R option

Some functions in the “apply” family, such as apply, lapply, and sapply, support split-apply-combine operations. They are useful and can get a lot of work done, but they are sometimes tricky to use: each expects a particular type of input, apply needs the dimension to be specified, and so on. For example, lapply takes a list or vector and always returns a list, which takes up a lot of space when the object is large.
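
To make the three steps concrete, here is a minimal base R sketch. It assumes a made-up `toy` data frame and adds `split()`, `do.call(rbind, ...)`, and `tapply()`, which are base R functions not mentioned above, so it shows one way of doing this and not the only one.

```r
# Hypothetical toy data standing in for the baseball data set
toy <- data.frame(
  id   = c("b", "a", "a", "b", "a"),
  year = c(2001, 1990, 1991, 2003, 1993)
)

# Split: one data frame per id (pieces come back sorted by id)
pieces <- split(toy, toy$id)

# Apply: add the career year within each piece
pieces <- lapply(pieces, function(d) {
  transform(d, cyear = year - min(year) + 1)
})

# Combine: stack the pieces back into a single data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result

# tapply() covers all three steps in one call when each group
# reduces to a single value
first_year <- tapply(toy$year, toy$id, min)
first_year
```

The explicit version needs three calls and some bookkeeping (resetting row names, choosing how to combine), which is the kind of detail that plyr and dplyr handle for you.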

3 Changes to the ChatGPT response

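There is no groupby function in R. The dplyr function is called group_by, while groupby is the name of the pandas method in Python. Outside R and Python, SQL has a GROUP BY clause and SAS has a BY statement.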