class: center, middle, inverse, title-slide .title[ # Stat 585 - profiling code ] .author[ ### Heike Hofmann ] --- ## Beyond the: let's just get it to work Signs of possible inefficiencies - Violation of the language model - e.g. R is vector oriented language, i.e. ```for``` loops should be used sparingly, - e.g. use of ```rbind``` (it is *known* to be very slow) - Insufficient knowledge of the language. We're all guilty of that at different levels - Reading other people's code helps improve our language knowledge! --- ## What do others say? - **Ken Thompson:** <br> keep the number of indentations down, high number of local variables is suspicious - **Hadley Wickham:** <br> functions with more than ten lines are suspicious, as is highly repetitive code - **Generally Good Practices:** <br> structure code using indentation and spacing, write comments, use clear naming convention <br><br> - BUT: always test and don't assume --- ## Claim: loops are slow `x` is vector with numbers 1,2,3,4 Objective: re-code these into weather events: "Sun", "Rain", "Snow", "Hail" ```r x <- sample.int(4, size=1000, replace = TRUE) y <- vector("character", length=length(x)) for (i in 1:length(x)) { y[i] <- switch (x[i], "Sun", "Rain", "Snow", "Hail", default=NA) } table(y) ``` ``` ## y ## Hail Rain Snow Sun ## 249 235 228 288 ``` --- ## Removing for loops (cont'd) ```r # the switch statement on the previous slide is equivalent to # the following lines: for (i in 1:length(x)) { if (x[i] == 1) y[i] <- "Sun" if (x[i] == 2) y[i] <- "Rain" if (x[i] == 3) y[i] <- "Snow" if (x[i] == 4) y[i] <- "Hail" } # using R's vector system we can reduce this to a single line: y <- c("Sun", "Rain", "Snow", "Hail")[x] ``` --- ## Removing for loops (cont'd) All of the previous solutions doctor around symptoms We are still breaking the language model :) -- ```r y <- factor(x, labels = c("Sun", "Rain", "Snow", "Hail")) summary(y) ``` ``` ## Sun Rain Snow Hail ## 288 235 228 249 ``` But: this is the only variant that returns a factor variable -- ... now we want to compare these four options --- ## Profiling Code can be profiled at different levels: 1. two (or more) alternatives of code with the same objectives are timed to determined the one which is the fastest (`system.time()` and `proc.time()`) 2. one piece of code is analyzed to find weak/slow spots (e.g. `profvis`, `benchmarkr` or `microbenchmark` ) --- ## Profiling process - Collect information on code performance: - memory usage - frequency and duration of function calls - Goal: optimization of code - R has an in-built profiler: `Rprof` Output in `Rprof.out` --- ## `system.time()` or `proc.time()` - Probably simplest of all measures - Evaluates the time spent in the whole expression - doesn't give any clues about what is done at a lower level <br><br> --- ## Claim: 'Everybody knows that' `read_csv` is faster than `read.csv` ```r data(diamonds, package="ggplot2") tmpfile <- tempfile(fileext = "csv") write.csv(diamonds, file = tmpfile) ptm <- proc.time() t1 <- read.csv(tmpfile) proc.time() - ptm ``` ``` ## user system elapsed ## 0.108 0.004 0.113 ``` ```r ptm <- proc.time() t2 <- readr::read_csv(file=tmpfile, progress = FALSE) proc.time() - ptm ``` ``` ## user system elapsed ## 0.561 0.063 0.644 ``` --- ## Use `rbenchmark` ```r library(rbenchmark) benchmark( "read.csv" = { read.csv(tmpfile) }, "read_csv" = { readr::read_csv(file=tmpfile, progress = FALSE) } ) ``` ``` ## test replications elapsed relative user.self sys.self user.child ## 2 read_csv 100 7.675 1.000 19.136 1.085 0 ## 1 read.csv 100 12.707 1.656 12.322 0.332 0 ## sys.child ## 2 0 ## 1 0 ``` --- ## Everybody knows that `read_csv` is faster for large data? ```r data(diamonds, package="ggplot2") tmpfile <- tempfile(fileext = "csv") diamonds5 <- rbind(diamonds,diamonds,diamonds,diamonds,diamonds) write.csv(diamonds5, file = tmpfile, row.names=FALSE) ``` ```r benchmark( "read.csv" = { read.csv(tmpfile) }, "read_csv" = { readr::read_csv(file=tmpfile, progress = FALSE) } ) ``` ``` ## test replications elapsed relative user.self sys.self user.child ## 2 read_csv 100 15.378 1.000 65.362 2.333 0 ## 1 read.csv 100 57.557 3.743 54.848 2.545 0 ## sys.child ## 2 0 ## 1 0 ``` --- ## Reading binary files - fast! But introduces platform/version dependencies ```r tmpfile_rds <- gsub("csv", "rds", tmpfile) saveRDS(diamonds5, tmpfile_rds) tmpfile_rda <- gsub("csv", "rda", tmpfile) save(diamonds5, file=tmpfile_rda) ptm <- proc.time() t3 <- readRDS(tmpfile_rds) proc.time() - ptm ``` ``` ## user system elapsed ## 0.037 0.000 0.038 ``` ```r ptm <- proc.time() load(tmpfile_rda) proc.time() - ptm ``` ``` ## user system elapsed ## 0.037 0.001 0.038 ``` --- class: inverse ## Your turn: which option is fastest? ```r x <- sample.int(4, 1000, replace=TRUE) y <- vector("character", length=length(x)) # Option #1 for (i in 1:length(x)) { y[i] <- switch (x[i], "Sun", "Rain", "Snow", "Hail", default=NA) } # Option #2 for (i in 1:length(x)) { if (x[i] == 1) y[i] <- "Sun" if (x[i] == 2) y[i] <- "Rain" if (x[i] == 3) y[i] <- "Snow" if (x[i] == 4) y[i] <- "Hail" } # Option #3 y <- c("Sun", "Rain", "Snow", "Hail")[x] # Option #4 y <- factor(x) levels(y) <- c("Sun", "Rain", "Snow", "Hail") ``` --- ## Your Turn Result Based on 500 evaluations of samples of size 5000: ![](03b_profiling-code_files/figure-html/unnamed-chunk-10-1.png)<!-- --> --- ```r results %>% group_by(Option) %>% summarise(eval=mean(value)) ``` ``` ## # A tibble: 4 × 2 ## Option eval ## <int> <dbl> ## 1 1 0.00243 ## 2 2 0.00138 ## 3 3 0.0000620 ## 4 4 0.000286 ``` Option 3 is fastest on average. But why? - Here we can relatively easily answer the question. Generally: more in-depth profiling --- ## Profiling Code: frequency and duration of function calls - R has an in-built profiler: `Rprof` Output in `Rprof.out` ```r Rprof() # .... Rprof(NULL) ``` - call stack is written every 0.02 seconds (interval) into a temporary file - output file is processed to produce a summary of the usage - time measurements varies by platform: on Unix-alike time is CPU time of the R process (excludes time waiting for input) - Note: profiling takes time, too; once the timer goes off, the information is not recorded until the next timing click (probably in the range 1–10 msecs). Can’t go below that. --- ## `profvis` example `profvis` package provides a wrapper around `Rprof` ```r library(profvis) p <- profvis({ data(diamonds, package = "ggplot2") plot(price ~ carat, data = diamonds) m <- lm(price ~ carat, data = diamonds) abline(m, col = "red") }) ``` `profvis` creates interactive response with code and flame plot. Copy and paste the code into your console to run it. --- class: inverse ## Your Turn - Part 1 Write a function to find the factors of a positive integer `x` --- ## Discuss approaches --- class: inverse ## Your Turn - Part 2 Benchmark the approaches to identify the fastest function. Don't forget about readability and maintainability of code!