Blog post 4 - Split-Apply-Combine

Author

Denise Bradford

Published

February 16, 2022

Prompt:

The plyr package has by now been replaced by other, even faster packages, but the idea of Split, apply, combine is as relevant as ever.

Read the paper The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham.

Write a blog post addressing the following questions:

  1. The R code for the split-apply-combine paper is posted with the paper. Pick one of the examples demonstrating plyr functionality (such as dlply or ddply, …) and rewrite the example using functionality from the package dplyr. Make sure that your example works and the results are identical.
data("baseball")
subset_baseball <- baseball %>% filter(year > 2000)

#plyr example
plyr::ddply(subset_baseball, .(team, lg), 
            summarise, hbp = mean(hbp, na.rm = TRUE), 
            year = max(year) ) %>% 
  head()
#>   team lg       hbp year
#> 1  ANA AL 0.1818182 2004
#> 2  ARI NL 1.7846154 2007
#> 3  ATL NL 1.4583333 2007
#> 4  BAL AL 1.9189189 2007
#> 5  BOS AL 1.2878788 2007
#> 6  CHA AL 2.3513514 2007

#dplyr example
subset_baseball %>% 
  group_by(team,lg) %>% 
 dplyr::do(mean = mean(.$hbp, na.rm = TRUE),
           year = max(.$year)) %>% 
  unnest() %>%
  head()
#> # A tibble: 6 × 4
#>   team  lg     mean  year
#>   <chr> <chr> <dbl> <int>
#> 1 ANA   AL    0.182  2004
#> 2 ARI   NL    1.78   2007
#> 3 ATL   NL    1.46   2007
#> 4 BAL   AL    1.92   2007
#> 5 BOS   AL    1.29   2007
#> 6 CHA   AL    2.35   2007
from pybaseball import batting_stats_range
import numpy as np

subset_baseball = batting_stats_range('2008-05-01', '2010-11-08')
subset_baseball.head()

subset_baseball.groupby(['Tm','Lev']).agg({'HBP': 'mean', 'OBP': 'max'}).head()
  1. Which (base R) functions do you know that support the split-apply-combine strategy? In your opinion, are these sufficient? State why or why not?

    Functions that support the split-apply-combine strategy in base R such as: by(), do.call(),reshape(),aggregate(),subset(). In my opinion, I believe that some of these functions are sufficient (like subset()) when looking at a larger dataset, it will be useful to see what the data look like without knowing what the actual table contents.

  2. The ChatGPT response to the prompt “Describe the split-apply-combine paradigm” is given below. Generally, the chatbot’s answers are decent. ChatGPT does have problems getting details right. Which details would you change in the response?

The “split-apply-combine” paradigm is a data analysis strategy used in statistics and data science to analyze grouped data. It consists of three steps:

Split: The data is first split into groups based on one or more variables of interest.

Apply: A specific operation or function is applied to each group of data, usually for the purpose of aggregating, summarizing, or transforming the data within each group.

Combine: The results of the operation applied to each group are then combined and returned as a single output.

The split-apply-combine paradigm is particularly useful in cases where the data cannot be analyzed as a whole and instead must be analyzed in smaller, more manageable chunks. This allows for more efficient data processing and improved scalability. The paradigm is commonly implemented using tools such as the “groupby” function in the R and Python programming languages, and the “dplyr” library in R.

“…cases where the data cannot be analyzed as a whole and instead must be analyzed in smaller, more manageable chunks.” The details that I would not completely agree with this statement for the fact that the split-apply-combine method can be perform other functional things that are useful. We can use the functionality to explore the data when we have a large dataset. This method could be used to combine data that are similar from different data sources, for example state related outcomes.

You can write your answers directly the README.Rmd file. Make sure that the file knits (you will need to activate any packages your use in your code chunks with a call to library(xxx), where xxx is the name of the package, such as plyr ). Commit your changes and push to your repo; add any files in the README_files directory to your repository.