Split Apply Combine

Author

Muxin Hua

Published

February 16, 2022

Prompt:

The plyr package has by now been replaced by other, even faster packages, but the idea of Split, apply, combine is as relevant as ever.

Read the paper The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham.

Write a blog post addressing the following questions:

  1. The R code for the split-apply-combine paper is posted with the paper. Pick one of the examples demonstrating plyr functionality (such as dlply or ddply, …) and rewrite the example using functionality from the package dplyr. Make sure that your example works and the results are identical.
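library(plyr)
# Copy the baseball data set shipped with plyr into the workspace
baseball <- baseball
# Split by player id, then add cyear: the year of each player's career (1 = first season)
baseball1 <- ddply(baseball, .(id), transform, cyear = year - min(year) + 1)
head(baseball1, 10)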
#>           id year stint team lg   g  ab   r   h X2b X3b hr rbi sb cs bb so ibb
#> 1  aaronha01 1954     1  ML1 NL 122 468  58 131  27   6 13  69  2  2 28 39  NA
#> 2  aaronha01 1955     1  ML1 NL 153 602 105 189  37   9 27 106  3  1 49 61   5
#> 3  aaronha01 1956     1  ML1 NL 153 609 106 200  34  14 26  92  2  4 37 54   6
#> 4  aaronha01 1957     1  ML1 NL 151 615 118 198  27   6 44 132  1  1 57 58  15
#> 5  aaronha01 1958     1  ML1 NL 153 601 109 196  34   4 30  95  4  1 59 49  16
#> 6  aaronha01 1959     1  ML1 NL 154 629 116 223  46   7 39 123  8  0 51 54  17
#> 7  aaronha01 1960     1  ML1 NL 153 590 102 172  20  11 40 126 16  7 60 63  13
#> 8  aaronha01 1961     1  ML1 NL 155 603 115 197  39  10 34 120 21  9 56 64  20
#> 9  aaronha01 1962     1  ML1 NL 156 592 127 191  28   6 45 128 15  7 66 73  14
#> 10 aaronha01 1963     1  ML1 NL 161 631 121 201  29   4 44 130 31  5 78 94  18
#>    hbp sh sf gidp cyear
#> 1    3  6  4   13     1
#> 2    3  7  4   20     2
#> 3    2  5  7   21     3
#> 4    0  0  3   13     4
#> 5    1  0  3   21     5
#> 6    4  0  9   19     6
#> 7    2  0 12    8     7
#> 8    2  1  9   16     8
#> 9    3  0  6   14     9
#> 10   0  0  5   11    10
options(dplyr.width = Inf)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:plyr':
#> 
#>     arrange, count, desc, failwith, id, mutate, rename, summarise,
#>     summarize
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
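# Same computation with dplyr: group by player, add cyear, then sort by id to match ddply's row order
baseball2 <- baseball %>%
  group_by(id) %>%
  mutate(cyear = year - min(year) + 1) %>%
  arrange(id)
head(baseball2, 10)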
#> # A tibble: 10 × 23
#> # Groups:   id [1]
#>    id         year stint team  lg        g    ab     r     h   X2b   X3b    hr
#>    <chr>     <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
#>  1 aaronha01  1954     1 ML1   NL      122   468    58   131    27     6    13
#>  2 aaronha01  1955     1 ML1   NL      153   602   105   189    37     9    27
#>  3 aaronha01  1956     1 ML1   NL      153   609   106   200    34    14    26
#>  4 aaronha01  1957     1 ML1   NL      151   615   118   198    27     6    44
#>  5 aaronha01  1958     1 ML1   NL      153   601   109   196    34     4    30
#>  6 aaronha01  1959     1 ML1   NL      154   629   116   223    46     7    39
#>  7 aaronha01  1960     1 ML1   NL      153   590   102   172    20    11    40
#>  8 aaronha01  1961     1 ML1   NL      155   603   115   197    39    10    34
#>  9 aaronha01  1962     1 ML1   NL      156   592   127   191    28     6    45
#> 10 aaronha01  1963     1 ML1   NL      161   631   121   201    29     4    44
#>      rbi    sb    cs    bb    so   ibb   hbp    sh    sf  gidp cyear
#>    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
#>  1    69     2     2    28    39    NA     3     6     4    13     1
#>  2   106     3     1    49    61     5     3     7     4    20     2
#>  3    92     2     4    37    54     6     2     5     7    21     3
#>  4   132     1     1    57    58    15     0     0     3    13     4
#>  5    95     4     1    59    49    16     1     0     3    21     5
#>  6   123     8     0    51    54    17     4     0     9    19     6
#>  7   126    16     7    60    63    13     2     0    12     8     7
#>  8   120    21     9    56    64    20     2     1     9    16     8
#>  9   128    15     7    66    73    14     3     0     6    14     9
#> 10   130    31     5    78    94    18     0     0     5    11    10
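The prompt asks for identical results, so a direct comparison is useful. The sketch below is one way to check this on a small made-up data frame (`toy`, its values, and the helper `normalise` are hypothetical stand-ins, not part of the original example); it assumes both plyr and dplyr are installed. Both results are sorted the same way and converted to plain data frames before the comparison, because dplyr returns a grouped tibble while ddply returns a data frame.

```r
library(plyr)
library(dplyr)

# Hypothetical toy data standing in for the baseball data set
toy <- data.frame(
  id   = c("b", "a", "b", "a", "a"),
  year = c(2001L, 1990L, 2003L, 1992L, 1991L),
  stringsAsFactors = FALSE
)

# plyr version, as in the paper's example
res_plyr <- ddply(toy, .(id), transform, cyear = year - min(year) + 1)

# dplyr version, dropping the grouping and the tibble class at the end
res_dplyr <- toy %>%
  group_by(id) %>%
  mutate(cyear = year - min(year) + 1) %>%
  arrange(id) %>%
  ungroup() %>%
  as.data.frame()

# Put both results in the same row order and reset the row names
normalise <- function(d) {
  d <- d[order(d$id, d$year), ]
  rownames(d) <- NULL
  d
}

same <- isTRUE(all.equal(normalise(res_plyr), normalise(res_dplyr)))
same
```

On the full data, the same comparison could be applied to the ddply result and the dplyr result.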
  2. Which (base R) functions do you know that support the split-apply-combine strategy? In your opinion, are these sufficient? State why or why not.
    The first function that comes to mind is apply, because I can always write a custom function to meet the requirement. In that sense apply is sufficient, but the approach is time-consuming and inefficient: the function has to be applied many times, and the results have to be stored before they can be written back into the data frame. In addition, users have to write the function themselves, which takes more time and increases the risk of introducing bugs.

    The plyr functions come in variants for different input and output types (array, list, data frame, etc.), and dplyr provides grouped operations for data frames. These well-designed tools save both time and space, so the split-apply-combine strategy can be implemented efficiently.
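To make the manual base R route described above concrete, here is a minimal sketch using only base functions (`split()`, `lapply()`, `do.call(rbind, ...)`, and `ave()`). The data frame `toy` and its values are made up for illustration; this is one possible way to do it, not the approach from the paper.

```r
# Hypothetical toy data standing in for the baseball data set
toy <- data.frame(
  id   = c("b", "a", "b", "a", "a"),
  year = c(2001L, 1990L, 2003L, 1992L, 1991L),
  stringsAsFactors = FALSE
)

# Split: one data frame per player id
pieces <- split(toy, toy$id)

# Apply: add the career year to each piece
pieces <- lapply(pieces, function(d) {
  d$cyear <- d$year - min(d$year) + 1
  d
})

# Combine: stack the pieces back into one data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result

# A more compact alternative: ave() computes the group-wise minimum
# and returns it in the original row order
cyear_ave <- toy$year - ave(toy$year, toy$id, FUN = min) + 1
```

The explicit version needs three separate steps plus row-name cleanup, which illustrates the bookkeeping that plyr and dplyr take care of.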

  3. The ChatGPT response to the prompt “Describe the split-apply-combine paradigm” is given below. Generally, the chatbot’s answers are decent. ChatGPT does have problems getting details right. Which details would you change in the response?

The “split-apply-combine” paradigm is a data analysis strategy used in statistics and data science to analyze grouped data. It consists of three steps:

Split: The data is first split into groups based on one or more variables of interest.

Apply: A specific operation or function is applied to each group of data, usually for the purpose of aggregating, summarizing, or transforming the data within each group.

Combine: The results of the operation applied to each group are then combined and returned as a single output.

The split-apply-combine paradigm is particularly useful in cases where the data cannot be analyzed as a whole and instead must be analyzed in smaller, more manageable chunks. This allows for more efficient data processing and improved scalability. The paradigm is commonly implemented using tools such as the “groupby” function in the R and Python programming languages, and the “dplyr” library in R.

I think ChatGPT got most of it right, except for one detail: groupby is the Python function that does the grouping, while group_by is the R function (from dplyr) that does the same job.
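For reference, a short sketch of `group_by()` in R, this time followed by an aggregation rather than a transformation. The data frame `toy`, its values, and the output column names `first_year` and `seasons` are made up for illustration; the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: one row per player and season
toy <- data.frame(
  id   = c("b", "a", "b", "a", "a"),
  year = c(2001L, 1990L, 2003L, 1992L, 1991L),
  stringsAsFactors = FALSE
)

# Split by id, apply two summaries per group, combine into one row per player
career <- toy %>%
  group_by(id) %>%
  summarise(first_year = min(year), seasons = n())
career
```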

You can write your answers directly in the README.Rmd file. Make sure that the file knits (you will need to activate any packages you use in your code chunks with a call to library(xxx), where xxx is the name of the package, such as plyr). Commit your changes and push to your repo; add any files in the README_files directory to your repository.