class: center, middle, inverse, title-slide .title[ # Stat 585 - The
purrr
package: split-apply-combine with lists ] .author[ ### Heike Hofmann ] --- ## From Lab 2 Folder of csv files with data scraped from press logs: ```r csvs <- dir("data", pattern="presslog-", full.names = TRUE) csvs ``` ``` ## [1] "data/presslog-20210209.csv" "data/presslog-20210214.csv" ## [3] "data/presslog-20210425.csv" "data/presslog-20220122.csv" ## [5] "data/presslog-20220407.csv" "data/presslog-20220607.csv" ## [7] "data/presslog-20230213.csv" "data/presslog-20230216.csv" ## [9] "data/presslog-20230220.csv" "data/presslog-20230223.csv" ``` -- Make this vector a part of a data set: ```r results <- data.frame(origin = csvs) ``` -- Now we want to read the data inside the csv files. One file we would read as: `read_csv(origin[1])` --- ## List Variables `purrr::map` applies function `.f` to each element of a variable: ```r results <- results %>% mutate( data = origin %>% purrr::map(.f = readr::read_csv, show_col_types = FALSE) ) str(results) ``` ``` ## 'data.frame': 10 obs. of 2 variables: ## $ origin: chr "data/presslog-20210209.csv" "data/presslog-20210214.csv" "data/presslog-20210425.csv" "data/presslog-20220122.csv" ... ## $ data :List of 10 ## ..$ : spc_tbl_ [223 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "2/9/2021 02:30AM" "2/9/2021 01:52AM" "2/9/2021 12:05AM" "2/8/2021 11:55PM" ... ## .. ..$ Incident ID : num 2.1e+08 2.1e+08 2.1e+08 2.1e+08 2.1e+08 ... ## .. ..$ How Call was Rec'd : chr "PHONE" "PHONE" "PHONE" "PHONE" ... ## .. ..$ Nature Code Description : chr "DISTURBANCE & NOISE PARTY" "PRIVATE TOW" "DISTURBANCE & NOISE PARTY" "SCAM" ... ## .. ..$ Location Address : chr "4211 LINCOLN SWING" "1206 S 4TH ST" "624-202 S 17TH ST" "1118 S DUFF AVE" ... ## .. ..$ Report Number Assigned to Event : num NA NA NA 2.1e+07 2.1e+07 ... ## .. ..$ Closing Disposition or Cancel Code: chr "NAT" "PTOW" "NOR" "RPT" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [217 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "2/14/2021 04:10AM" "2/14/2021 02:39AM" "2/14/2021 02:31AM" "2/14/2021 02:24AM" ... ## .. ..$ Incident ID : num 2.1e+08 2.1e+08 2.1e+08 2.1e+08 2.1e+08 ... ## .. ..$ How Call was Rec'd : chr "PHONE" "PHONE" "PHONE" "PHONE" ... ## .. ..$ Nature Code Description : chr "GENERAL ALARM BANK / RESIDENTI" "PUBLIC INTOXICATION" "SUSPICIOUS PERSON/VEH/ACTIVITY" "DISORDERLY CONDUCT" ... ## .. ..$ Location Address : chr "710 S DUFF AVE" "100-BLK WELCH AVE" "1400-BLK IDAHO AVE" "400-BLK WELCH AVE" ... ## .. ..$ Report Number Assigned to Event : num NA 2.1e+07 NA NA NA ... ## .. ..$ Closing Disposition or Cancel Code: chr "BIL" "ARR" "UTL" "UTL" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [261 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "4/25/2021 05:19AM" "4/25/2021 04:25AM" "4/25/2021 04:20AM" "4/25/2021 04:04AM" ... ## .. ..$ Incident ID : num 2.1e+08 2.1e+08 2.1e+08 2.1e+08 2.1e+08 ... ## .. ..$ How Call was Rec'd : chr "SELF" "PHONE" "PHONE" "W911" ... ## .. ..$ Nature Code Description : chr "BURGLARY TO MOTOR VEHICLE" "SUSPICIOUS PERSON/VEH/ACTIVITY" "SUSPICIOUS PERSON/VEH/ACTIVITY" "MEDICAL ASSIST/AMBULANCE CALL" ... ## .. ..$ Location Address : chr "511 S 4TH ST" "312 HAYWARD AVE" "229 S KELLOGG AVE" "3115 GROVE AVE" ... ## .. ..$ Report Number Assigned to Event : num 2.1e+07 NA NA NA NA ... ## .. ..$ Closing Disposition or Cancel Code: chr "RPT" "NOR" "NOR" "NRSP" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [281 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "1/22/2022 04:35AM" "1/22/2022 04:10AM" "1/22/2022 04:05AM" "1/22/2022 03:59AM" ... ## .. ..$ Incident ID : num 2.2e+08 2.2e+08 2.2e+08 2.2e+08 2.2e+08 ... ## .. ..$ How Call was Rec'd : chr "PHONE" "PHONE" "PHONE" "PHONE" ... ## .. ..$ Nature Code Description : chr "MEDICAL ASSIST/AMBULANCE CALL" "MISSING PERSON / RUNAWAY" "LAW DEPARTMENT ASSIST" "LAW DEPARTMENT ASSIST" ... ## .. ..$ Location Address : chr "3305-109 STANGE RD" "3006 WOODLAND ST" "1605-103 S DAYTON PL" "US HIGHWAY 30/S 500TH AVE" ... ## .. ..$ Report Number Assigned to Event : num NA NA 2.2e+07 NA NA ... ## .. ..$ Closing Disposition or Cancel Code: chr "NRSP" "ASC" "ARR" "NOR" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [224 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "4/7/2022 05:46AM" "4/7/2022 02:31AM" "4/7/2022 02:15AM" "4/7/2022 01:58AM" ... ## .. ..$ Incident ID : num 2.2e+08 2.2e+08 2.2e+08 2.2e+08 2.2e+08 ... ## .. ..$ How Call was Rec'd : chr "E911" "PHONE" "PHONE" "E911" ... ## .. ..$ Nature Code Description : chr "DISORDERLY CONDUCT" "DISTURBANCE & NOISE PARTY" "DISORDERLY CONDUCT" "MEDICAL ASSIST/AMBULANCE CALL" ... ## .. ..$ Location Address : chr "133-13 BEEDLE DR" "4912-424 MORTENSEN RD" "119 MAIN ST" "3127 BURNHAM DR" ... ## .. ..$ Report Number Assigned to Event : num NA NA NA NA NA ... ## .. ..$ Closing Disposition or Cancel Code: chr "CBC" "ASC" "ASC" "ASC" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [224 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "6/7/2022 05:17AM" "6/7/2022 05:02AM" "6/7/2022 05:00AM" "6/7/2022 04:00AM" ... ## .. ..$ Incident ID : num 2.21e+08 2.21e+08 2.21e+08 2.21e+08 2.21e+08 ... ## .. ..$ How Call was Rec'd : chr "SELF" "PHONE" "SELF" "SELF" ... ## .. ..$ Nature Code Description : chr "TRAFFIC STOP / ENFORCEMENT" "GENERAL ALARM BANK / RESIDENTI" "TRAFFIC STOP / ENFORCEMENT" "TRAFFIC STOP / ENFORCEMENT" ... ## .. ..$ Location Address : chr "148 MM US HIGHWAY 30 E" "3619 STANGE RD" "148 MM US HIGHWAY 30 E" "S DUFF AVE/S 3RD ST" ... ## .. ..$ Report Number Assigned to Event : num NA NA NA NA NA ... ## .. ..$ Closing Disposition or Cancel Code: chr "WAR" "BIL" "WAR" "WAR" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [288 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "2/13/2023 05:52AM" "2/13/2023 03:12AM" "2/13/2023 01:22AM" "2/13/2023 12:37AM" ... ## .. ..$ Incident ID : num 2.3e+08 2.3e+08 2.3e+08 2.3e+08 2.3e+08 ... ## .. ..$ How Call was Rec'd : chr "SELF" "SELF" "W911" "PHONE" ... ## .. ..$ Nature Code Description : chr "TRAFFIC STOP / ENFORCEMENT" "TRESPASS" "MEDICAL ASSIST/AMBULANCE CALL" "EQUIPMENT/SIGN MALFUNCTION" ... ## .. ..$ Location Address : chr "DAYTON AVE/E LINCOLN WAY" "1605 6TH ST" "2143 FRILEY RD" "515 CLARK AVE" ... ## .. ..$ Report Number Assigned to Event : num NA NA NA NA NA NA NA NA NA NA ... ## .. ..$ Closing Disposition or Cancel Code: chr "CIT" "NOR" "NRSP" "INFO" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [229 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "2/16/2023 05:53AM" "2/16/2023 05:06AM" "2/16/2023 03:53AM" "2/16/2023 03:40AM" ... ## .. ..$ Incident ID : num 2.3e+08 2.3e+08 2.3e+08 2.3e+08 2.3e+08 ... ## .. ..$ How Call was Rec'd : chr "W911" "E911" "SELF" "PHONE" ... ## .. ..$ Nature Code Description : chr "MEDICAL ASSIST/AMBULANCE CALL" "MEDICAL ASSIST/AMBULANCE CALL" "TRAFFIC STOP / ENFORCEMENT" "PRIVATE TOW" ... ## .. ..$ Location Address : chr "3018 OAKLAND ST" "3218 ORION DR" "UNIVERSITY BLVD/LINCOLN WAY" "1206 S 4TH" ... ## .. ..$ Report Number Assigned to Event : num NA NA NA NA 2.3e+07 ... ## .. ..$ Closing Disposition or Cancel Code: chr "NRSP" "NRSP" "WAR" "PTOW" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [311 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "2/20/2023 05:15AM" "2/20/2023 02:25AM" "2/20/2023 01:46AM" "2/19/2023 11:30PM" ... ## .. ..$ Incident ID : num 2.3e+08 2.3e+08 2.3e+08 2.3e+08 2.3e+08 ... ## .. ..$ How Call was Rec'd : chr "W911" "SELF" "W911" "PHONE" ... ## .. ..$ Nature Code Description : chr "911 HANG UP / INCOMPLETE CALL" "SUSPICIOUS PERSON/VEH/ACTIVITY" "MEDICAL ASSIST/AMBULANCE CALL" "PRIVATE TOW" ... ## .. ..$ Location Address : chr "1407 FLORIDA AVE" "5205 GRAND AVE" "225 S KELLOGG AVE" "5310 MORTENSEN RD" ... ## .. ..$ Report Number Assigned to Event : num NA NA NA NA NA ... ## .. ..$ Closing Disposition or Cancel Code: chr "ASC" "NOR" "NAT" "PTOW" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ## ..$ : spc_tbl_ [265 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## .. ..$ Call Received Date/Time : chr "2/22/2023 05:37AM" "2/22/2023 05:15AM" "2/22/2023 04:58AM" "2/22/2023 12:30AM" ... ## .. ..$ Incident ID : num 2.3e+08 2.3e+08 2.3e+08 2.3e+08 2.3e+08 ... ## .. ..$ How Call was Rec'd : chr "W911" "PHONE" "W911" "SELF" ... ## .. ..$ Nature Code Description : chr "MEDICAL ASSIST/AMBULANCE CALL" "TRAFFIC HAZARD" "MEDICAL ASSIST/AMBULANCE CALL" "TRAFFIC STOP / ENFORCEMENT" ... ## .. ..$ Location Address : chr "1501 MAXWELL AVE" "LINCOLN WAY/N ELM AVE" "203-606 S 5TH ST" "100-BLK S 5TH ST" ... ## .. ..$ Report Number Assigned to Event : num NA NA NA NA NA ... ## .. ..$ Closing Disposition or Cancel Code: chr "NRSP" "ASC" "NRSP" "CIT" ... ## .. ..- attr(*, "spec")= ## .. .. .. cols( ## .. .. .. `Call Received Date/Time` = col_character(), ## .. .. .. `Incident ID` = col_double(), ## .. .. .. `How Call was Rec'd` = col_character(), ## .. .. .. `Nature Code Description` = col_character(), ## .. .. .. `Location Address` = col_character(), ## .. .. .. `Report Number Assigned to Event` = col_double(), ## .. .. .. `Closing Disposition or Cancel Code` = col_character() ## .. .. .. ) ## .. ..- attr(*, "problems")=<externalptr> ``` --- ## List Variables `results$data` is a list column (or list variable). List-columns are columns of a data frame where each element is a list or vector instead of an atomic value. -- Specifically, each element of `results$data` is a data set. This makes `results` a **nested data set**: ```r head(results$data[[1]]) ``` ``` ## # A tibble: 6 × 7 ## `Call Received Date/Time` Incident I…¹ How C…² Natur…³ Locat…⁴ Repor…⁵ Closi…⁶ ## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> ## 1 2/9/2021 02:30AM 210202114 PHONE DISTUR… 4211 L… NA NAT ## 2 2/9/2021 01:52AM 210202108 PHONE PRIVAT… 1206 S… NA PTOW ## 3 2/9/2021 12:05AM 210202101 PHONE DISTUR… 624-20… NA NOR ## 4 2/8/2021 11:55PM 210202100 PHONE SCAM 1118 S… 2.10e7 RPT ## 5 2/8/2021 11:19PM 210202095 W911 BURGLA… 2919 O… 2.10e7 RPT ## 6 2/8/2021 10:39PM 210202091 PHONE CIVIL … 219-11… NA NOR ## # … with abbreviated variable names ¹`Incident ID`, ²`How Call was Rec'd`, ## # ³`Nature Code Description`, ⁴`Location Address`, ## # ⁵`Report Number Assigned to Event`, ⁶`Closing Disposition or Cancel Code` ``` --- ## Mapping - `purrr` [cheat sheet](https://github.com/rstudio/cheatsheets/raw/main/purrr.pdf) take each element of a list (or vector) and apply function `.f` to it Then return a | Return | Function | |---|---| | list | map | | numeric vector | map_dbl | | logical vector | map_lgl | | data frame | map_df | -- `purrr` also implements very powerful ways of reshaping lists --- class: inverse ## Your turn We have been looking at the object `got_chars` in the `repurrrsive` package. - What happens, when you apply the function `purrr::transpose` on the list? - What does the length of the transposed object signify? - Can we turn the transposed object into a data frame? into a tibble? --- A tibble it is! How do we get the lists out of the data? - we unnest! ```r library(repurrrsive) transpose(got_chars) %>% names() ``` ``` ## [1] "url" "id" "name" "gender" "culture" ## [6] "born" "died" "alive" "titles" "aliases" ## [11] "father" "mother" "spouse" "allegiances" "books" ## [16] "povBooks" "tvSeries" "playedBy" ``` ```r got_chars_tbl <- as_tibble(transpose(got_chars)) got_chars_tbl ``` ``` ## # A tibble: 30 × 18 ## url id name gender culture born died alive titles aliases father ## <list> <list> <lis> <list> <list> <lis> <lis> <lis> <list> <list> <list> ## 1 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 2 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 3 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 4 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 5 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 6 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 7 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 8 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 9 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## 10 <chr [1]> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> ## # … with 20 more rows, and 7 more variables: mother <list>, spouse <list>, ## # allegiances <list>, books <list>, povBooks <list>, tvSeries <list>, ## # playedBy <list> ``` --- # unnesting `nest` and `unnest` are functions in the package `tidyr` to work with [nested data frames](https://tidyr.tidyverse.org/articles/nest.html) `tidyr` 1.3.0 has just been released (Jan 24 2023) https://tidyr.tidyverse.org/news/index.html which brings some syntactic changes with it: [new syntax](https://tidyr.tidyverse.org/reference/nest.html#new-syntax) ```r got_chars_tbl %>% unnest(cols=1:6) ``` ``` ## # A tibble: 30 × 18 ## url id name gender culture born died alive titles aliases father ## <chr> <int> <chr> <chr> <chr> <chr> <lis> <lis> <list> <list> <list> ## 1 https://w… 1022 Theo… Male "Ironb… "In … <chr> <lgl> <chr> <chr> <chr> ## 2 https://w… 1052 Tyri… Male "" "In … <chr> <lgl> <chr> <chr> <chr> ## 3 https://w… 1074 Vict… Male "Ironb… "In … <chr> <lgl> <chr> <chr> <chr> ## 4 https://w… 1109 Will Male "" "" <chr> <lgl> <chr> <chr> <chr> ## 5 https://w… 1166 Areo… Male "Norvo… "In … <chr> <lgl> <chr> <chr> <chr> ## 6 https://w… 1267 Chett Male "" "At … <chr> <lgl> <chr> <chr> <chr> ## 7 https://w… 1295 Cres… Male "" "In … <chr> <lgl> <chr> <chr> <chr> ## 8 https://w… 130 Aria… Female "Dorni… "In … <chr> <lgl> <chr> <chr> <chr> ## 9 https://w… 1303 Daen… Female "Valyr… "In … <chr> <lgl> <chr> <chr> <chr> ## 10 https://w… 1319 Davo… Male "Weste… "In … <chr> <lgl> <chr> <chr> <chr> ## # … with 20 more rows, and 7 more variables: mother <list>, spouse <list>, ## # allegiances <list>, books <list>, povBooks <list>, tvSeries <list>, ## # playedBy <list> ``` --- class: inverse ## Your Turn The tibble `got_chars_tbl` is created by transposing the `got_chars` object and casting it into a tibble: ```r got_chars_tbl <- as_tibble(transpose(got_chars)) ``` > The `unnest` function takes all elements out of a list object and inserts them into the data set: ```r got_chars_tbl %>% unnest("url") ``` Why does `got_chars_tbl %>% unnest("books")` not work? --- class: inverse ## Your Turn We fix the empty list in element 11 of `books` and change it to a list of an empty string instead: ```r got_chars_tbl$books[11] <- list("") ``` What does `got_chars_tbl %>% unnest("books")` do now? --- class: inverse ## Your Turn Write a function `chars_to_char` that takes a vector of characters and returns a single character .... does that sound like `paste`? Apply that function to the `books` variable of `got_chars_tbl` What does `got_chars_tbl %>% unnest("books")` do now? Can we apply this idea to other variables, such as `allegiances`? --- class: inverse ## Your Turn What is the difference between ``` got_chars_tbl %>% pluck("books") %>% reduce(.f = paste) ``` and ``` got_chars_tbl %>% pluck("books") %>% map(.f = function(x) reduce(x, .f = paste, sep=", ", .init="")) ``` and how can we get rid of the leading comma? --- # Back to Lab 2 The following code extracts the call codes from one of the press logs published by Ames PD: ```r library(tabulizer) codes <- locate_areas("presslogs/PressLog-2023-02-20.pdf", pages=1) #codes <- c(474.79543, 20.19263, 546.14273, 747.12743) #names(codes) <- c("top", "left", "bottom", "right") # codes <- list(codes) call_codes <- extract_tables("presslogs/PressLog-2023-02-20.pdf", pages=1, area = codes, guess = FALSE) # turn all the codes into a single string call_codes <- as.vector(paste(call_codes[[1]][,1], collapse = "")) call_codes ``` --- class: inverse ## Your Turn Take the `call_codes` object and create a data frame as shown below. Make sure to use functionality of the `purrr` package! ``` ## code description ## 1 ACC Accidently Chose New Event ## 2 ARR Arrest ## 3 ASC Assignment Complete ## 4 ASST Assist ## 5 BIL Billable Alarm ## 6 CALR Canceled Alarm Reset ``` --- ```r # split the string along each comma: codes_list <- str_split(call_codes, pattern=",")[[1]] # get rid of leading and trailing white spaces codes_list <- trimws(codes_list) # now split the string along each equal symbol: codes <- str_split(codes_list, pattern="=") codes_df <- codes %>% purrr::map(.f = function(x) { list(code = x[1], description = x[2]) }) %>% transpose() %>% as_tibble %>% unnest(cols = 1:2) write.csv(codes_df, "data/call_codes.csv", row.names = FALSE) ``` --- class: middle, center ## A bigger example --- ## Example: the gapminder project Part of a quantitative world is our responsibility to make data accessible - Hans Rosling (2017✝) was at the fore-front of providing World Health Statistics (gapminder) - in R through the package `gapminder` by Jenny Bryan - Hans Rosling's [TED talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en) -- Can we describe countries by the way that their life expectancy develops over time? --- ## First Look: gapminder ```r library(gapminder) gg <- gapminder %>% ggplot(aes(x = year, y = lifeExp)) + geom_line(aes(group= country)) + facet_wrap(~continent) plotly::ggplotly(gg) ```
-- Most countries show an increase in life expectancy over time, but not all --- ## Split-Apply-Combine for lists - `nest` creates list of data frames (split) - `map` applies function to list object (apply) - `unnest` moves list results back into a data frame (combine) -- Usually a good idea to develop code for one example, then generalize --- ## First Model: US only ```r gapminder %>% filter(country == "United States") %>% ggplot(aes(x = year, y =lifeExp)) + geom_line() + geom_point() + geom_smooth(method="lm", se=FALSE) ``` ``` ## `geom_smooth()` using formula = 'y ~ x' ``` ![](02b_purrr_files/figure-html/unnamed-chunk-14-1.png)<!-- --> --- ## Model of the US ```r lm(lifeExp~I(year-1950), data = filter(gapminder, country == "United States")) ``` ``` ## ## Call: ## lm(formula = lifeExp ~ I(year - 1950), data = filter(gapminder, ## country == "United States")) ## ## Coefficients: ## (Intercept) I(year - 1950) ## 68.0455 0.1842 ``` --- class: inverse ## Your turn: another country Pick another country in the gapminder data and derive estimates. How do the estimates compare to the US? --- # Now we want to apply this to all countries Nest by country: ```r country_nest <- gapminder %>% mutate(year = year - 1950) %>% group_by(continent, country) %>% nest() ``` Creates a `data` variable with a dataset for each country ```r country_nest$data[[1]] %>% head() ``` ``` ## # A tibble: 6 × 4 ## year lifeExp pop gdpPercap ## <dbl> <dbl> <int> <dbl> ## 1 2 28.8 8425333 779. ## 2 7 30.3 9240934 821. ## 3 12 32.0 10267083 853. ## 4 17 34.0 11537966 836. ## 5 22 36.1 13079460 740. ## 6 27 38.4 14880372 786. ``` --- # List variables: stepping through The `map` function of the `purrr` package allows us to apply a function to each element of a list. `map_dbl` returns a double value, `map_df` a data frame. `map` itself returns a list. ```r country_nest <- country_nest %>% mutate( model = purrr::map(data, function(d) lm(lifeExp~year, data = d)) ) ``` --- ```r country_nest <- country_nest %>% mutate( intercept = model %>% purrr::map_dbl(.f = function(m) m$coef[1]) ) country_nest %>% head() ``` ``` ## # A tibble: 6 × 5 ## # Groups: continent, country [6] ## country continent data model intercept ## <fct> <fct> <list> <list> <dbl> ## 1 Afghanistan Asia <tibble [12 × 4]> <lm> 29.4 ## 2 Albania Europe <tibble [12 × 4]> <lm> 58.6 ## 3 Algeria Africa <tibble [12 × 4]> <lm> 42.2 ## 4 Angola Africa <tibble [12 × 4]> <lm> 31.7 ## 5 Argentina Americas <tibble [12 × 4]> <lm> 62.2 ## 6 Australia Oceania <tibble [12 × 4]> <lm> 67.9 ``` --- ![](02b_purrr_files/figure-html/unnamed-chunk-20-1.png)<!-- --> --- class: inverse ## Your Turn - Work through the previous example and expand the object `country_nest` by adding the slope from each model as an additional variable - Plot a scatterplot of slope and intercept and compare. - Advanced: Jenny Bryan is using pretty color schemes for these data at https://github.com/jennybc/gapminder Can you get those to work for your example? Test with ggplotly. --- class: inverse ## Your Turn Find the data for Canada from `country_nest` Why does `country_nest$data[[country_nest$country=="Canada"]]` not work? --- ## Nested data frames (cont'd) We can use the list elements as data inputs: ```r lm(lifeExp~year, data=country_nest$data[[10]]) ``` ``` ## ## Call: ## lm(formula = lifeExp ~ year, data = country_nest$data[[10]]) ## ## Coefficients: ## (Intercept) year ## 67.4738 0.2091 ``` --- class: inverse ## Your Turn (8 mins) The goal of this your turn is for you to try out nesting operations on various datasets. - Nest the `ChickWeight` data in different ways: what are the results of ``` ChickWeight %>% nest(-Diet) ChickWeight %>% nest(weight) ChickWeight %>% nest(Time, weight) ``` - `chickwts` is yet another data set on feeding chicks. Nest it by different feeds. - `nest` respects grouping structure introduced by `group_by`. Use `group_by` to nest the `iris` dataset by species. --- # Extracting model estimates ```r country_nest %>% mutate( intercept = model %>% purrr::map_dbl(.f = function(m) coef(m)[1]), slope = model %>% purrr::map_dbl(.f = function(m) coef(m)[2]) ) %>% head(5) ``` ``` ## # A tibble: 5 × 6 ## # Groups: continent, country [5] ## country continent data model intercept slope ## <fct> <fct> <list> <list> <dbl> <dbl> ## 1 Afghanistan Asia <tibble [12 × 4]> <lm> 29.4 0.275 ## 2 Albania Europe <tibble [12 × 4]> <lm> 58.6 0.335 ## 3 Algeria Africa <tibble [12 × 4]> <lm> 42.2 0.569 ## 4 Angola Africa <tibble [12 × 4]> <lm> 31.7 0.209 ## 5 Argentina Americas <tibble [12 × 4]> <lm> 62.2 0.232 ``` --- ## The broom package Now we have all these models ... <img src="images/broom.jpeg" class="cover" width=1000> --- ## The `broom` package `broom` allows to extract values from models on three levels: - for each model: `broom::glance` - for each coefficient in the model: `broom::tidy` - for each value in the dataset: `broom::augment` -- ```r library(broom) broom::glance(country_nest$model[[1]]) ``` ``` ## # A tibble: 1 × 12 ## r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 0.948 0.942 1.22 181. 9.84e-8 1 -18.3 42.7 44.1 15.0 10 ## # … with 1 more variable: nobs <int>, and abbreviated variable names ## # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual ``` --- ```r broom::tidy(country_nest$model[[1]]) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 29.4 0.699 42.0 1.40e-12 ## 2 year 0.275 0.0205 13.5 9.84e- 8 ``` --- ```r broom::augment(country_nest$model[[1]]) ``` ``` ## # A tibble: 12 × 8 ## lifeExp year .fitted .resid .hat .sigma .cooksd .std.resid ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 28.8 2 29.9 -1.11 0.295 1.21 0.243 -1.08 ## 2 30.3 7 31.3 -0.952 0.225 1.24 0.113 -0.884 ## 3 32.0 12 32.7 -0.664 0.169 1.27 0.0360 -0.595 ## 4 34.0 17 34.0 -0.0172 0.127 1.29 0.0000165 -0.0151 ## 5 36.1 22 35.4 0.674 0.0991 1.27 0.0185 0.581 ## 6 38.4 27 36.8 1.65 0.0851 1.15 0.0923 1.41 ## 7 39.9 32 38.2 1.69 0.0851 1.15 0.0967 1.44 ## 8 40.8 37 39.5 1.28 0.0991 1.21 0.0667 1.10 ## 9 41.7 42 40.9 0.754 0.127 1.26 0.0317 0.660 ## 10 41.8 47 42.3 -0.534 0.169 1.27 0.0233 -0.479 ## 11 42.1 52 43.7 -1.54 0.225 1.15 0.299 -1.43 ## 12 43.8 57 45.1 -1.22 0.295 1.19 0.296 -1.19 ``` --- ##Extract values for each coefficient Extract coefficients for all countries automatically (hello `map` again!) ```r # works, but we lose the context coefs = country_nest$model %>% purrr::map(.f=broom::tidy) head(coefs) ``` ``` ## [[1]] ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 29.4 0.699 42.0 1.40e-12 ## 2 year 0.275 0.0205 13.5 9.84e- 8 ## ## [[2]] ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 58.6 1.13 51.7 1.79e-13 ## 2 year 0.335 0.0332 10.1 1.46e- 6 ## ## [[3]] ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 42.2 0.756 55.8 8.22e-14 ## 2 year 0.569 0.0221 25.7 1.81e-10 ## ## [[4]] ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 31.7 0.804 39.4 2.63e-12 ## 2 year 0.209 0.0235 8.90 4.59e- 6 ## ## [[5]] ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 62.2 0.167 372. 4.80e-22 ## 2 year 0.232 0.00489 47.4 4.22e-13 ## ## [[6]] ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 67.9 0.355 192. 3.70e-19 ## 2 year 0.228 0.0104 21.9 8.67e-10 ``` --- ##Extract values for each coefficient (cont'd) ```r # better, but everything is still stuffed into this strange format country_nest <- country_nest %>% mutate( coefs = model %>% purrr::map(.f=broom::tidy) ) country_nest %>% head() ``` ``` ## # A tibble: 6 × 7 ## # Groups: continent, country [6] ## country continent data model intercept slope coefs ## <fct> <fct> <list> <list> <dbl> <dbl> <list> ## 1 Afghanistan Asia <tibble [12 × 4]> <lm> 29.4 0.275 <tibble> ## 2 Albania Europe <tibble [12 × 4]> <lm> 58.6 0.335 <tibble> ## 3 Algeria Africa <tibble [12 × 4]> <lm> 42.2 0.569 <tibble> ## 4 Angola Africa <tibble [12 × 4]> <lm> 31.7 0.209 <tibble> ## 5 Argentina Americas <tibble [12 × 4]> <lm> 62.2 0.232 <tibble> ## 6 Australia Oceania <tibble [12 × 4]> <lm> 67.9 0.228 <tibble> ``` --- ## Ready to fly! The opposite of `nest` is `unnest`: ```r coefs <- country_nest %>% select(country, continent, coefs) %>% unnest(cols = coefs) coefs ``` ``` ## # A tibble: 284 × 7 ## # Groups: continent, country [142] ## country continent term estimate std.error statistic p.value ## <fct> <fct> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Afghanistan Asia (Intercept) 29.4 0.699 42.0 1.40e-12 ## 2 Afghanistan Asia year 0.275 0.0205 13.5 9.84e- 8 ## 3 Albania Europe (Intercept) 58.6 1.13 51.7 1.79e-13 ## 4 Albania Europe year 0.335 0.0332 10.1 1.46e- 6 ## 5 Algeria Africa (Intercept) 42.2 0.756 55.8 8.22e-14 ## 6 Algeria Africa year 0.569 0.0221 25.7 1.81e-10 ## 7 Angola Africa (Intercept) 31.7 0.804 39.4 2.63e-12 ## 8 Angola Africa year 0.209 0.0235 8.90 4.59e- 6 ## 9 Argentina Americas (Intercept) 62.2 0.167 372. 4.80e-22 ## 10 Argentina Americas year 0.232 0.00489 47.4 4.22e-13 ## # … with 274 more rows ``` --- ## Reshaping ```r coefsLong <- coefs %>% pivot_longer(names_to="Statistic", values_to = "Value", 4:7) coefsTerm <- coefsLong %>% pivot_wider(names_from="term", values_from="Value") coefsTerm %>% filter(Statistic == "estimate") %>% ggplot(aes(x = `(Intercept)`, y = year)) + geom_point(aes(colour = continent)) + ylab("Slope (year)") ``` ![](02b_purrr_files/figure-html/unnamed-chunk-29-1.png)<!-- --> --- class: inverse ## Your turn (10 mins) - Extract other model diagnostics: find the R square value for each model. Draw a visualization of R square by country - can you see a pattern? What does the pattern mean? - Extract residuals and fitted values for each of the models and store them in a dataset together with country and continent information. - Plot residuals across the years and fit a smooth. What does the pattern mean? - Include predicted Life Expectancy for 2017 into the data set. Then compare predicted life expectancy in 2017 with that in 1950.