Stat 585 - profiling code

class: center, middle, inverse, title-slide

.title[
# Stat 585 - profiling code
]
.author[
### Heike Hofmann
]

---

## Beyond the: let's just get it to work

Signs of possible inefficiencies

- Violation of the language model
    - e.g. R is vector oriented language, i.e.  ```for``` loops should be used sparingly, 
    - e.g. use of ```rbind``` (it is *known* to be very slow)

- Insufficient knowledge of the language. We're all guilty of that at different levels

- Reading other people's code helps improve our language knowledge!

---

## What do others say?

- **Ken Thompson:** keep the number of indentations down, high number of local variables is suspicious

- **Hadley Wickham:** functions with more than ten lines are suspicious, as is highly repetitive code

- **Generally Good Practices:** structure code using indentation and spacing, write comments, use clear naming convention

- BUT: always test and don't assume
---

## Claim: loops are slow

`x` is vector with numbers 1,2,3,4

Objective: re-code these into weather events: "Sun", "Rain", "Snow", "Hail"

```r
x <- sample.int(4, size=1000, replace = TRUE)
y <- vector("character", length=length(x))
for (i in 1:length(x)) {
 y[i] <- switch (x[i], "Sun", "Rain", "Snow", "Hail", default=NA)
}

table(y)
```

```
## y
## Hail Rain Snow  Sun 
##  249  235  228  288
```

---

## Removing for loops (cont'd)

```r
# the switch statement on the previous slide is equivalent to 
# the following lines:
for (i in 1:length(x)) {
 if (x[i] == 1) y[i] <- "Sun"
 if (x[i] == 2) y[i] <- "Rain"
 if (x[i] == 3) y[i] <- "Snow"
 if (x[i] == 4) y[i] <- "Hail"
}

# using R's vector system we can reduce this to a single line:
y <- c("Sun", "Rain", "Snow", "Hail")[x]
```

---

## Removing for loops (cont'd)

All of the previous solutions doctor around symptoms

We are still breaking the language model :)

```r
y <- factor(x, labels = c("Sun", "Rain", "Snow", "Hail"))

summary(y)
```

```
##  Sun Rain Snow Hail 
##  288  235  228  249
```

But: this is the only variant that returns a factor variable

... now we want to compare these four options

---

## Profiling

Code can be profiled at different levels:

1. two (or more) alternatives of code with the same objectives are timed to determined the one which is the fastest (`system.time()` and `proc.time()`)

2. one piece of code is analyzed to find weak/slow spots (e.g. `profvis`, `benchmarkr` or `microbenchmark` )

---

## Profiling process

- Collect information on code performance:

- memory usage

- frequency and duration of function calls

- Goal: optimization of code

- R has an in-built profiler: `Rprof`  Output in `Rprof.out`

---

## `system.time()` or `proc.time()`

- Probably simplest of all measures

- Evaluates the time spent in the whole expression

- doesn't give any clues about what is done at a lower level

---

## Claim: 'Everybody knows that' `read_csv` is faster than `read.csv`

```r
data(diamonds, package="ggplot2")
tmpfile <- tempfile(fileext = "csv")
write.csv(diamonds, file = tmpfile)

ptm <- proc.time()
t1 <- read.csv(tmpfile)
proc.time() - ptm
```

```
##    user  system elapsed 
##   0.108   0.004   0.113
```

```r
ptm <- proc.time()
t2 <- readr::read_csv(file=tmpfile, progress = FALSE)
proc.time() - ptm
```

```
##    user  system elapsed 
##   0.561   0.063   0.644
```
---

## Use `rbenchmark`

```r
library(rbenchmark)

benchmark(
  "read.csv" = {
    read.csv(tmpfile)
    },
  "read_csv" = {
    readr::read_csv(file=tmpfile, progress = FALSE)
  }
)
```

```
##       test replications elapsed relative user.self sys.self user.child
## 2 read_csv          100   7.675    1.000    19.136    1.085          0
## 1 read.csv          100  12.707    1.656    12.322    0.332          0
##   sys.child
## 2         0
## 1         0
```
---

## Everybody knows that `read_csv` is faster for large data?

```r
data(diamonds, package="ggplot2")
tmpfile <- tempfile(fileext = "csv")
diamonds5 <- rbind(diamonds,diamonds,diamonds,diamonds,diamonds)
write.csv(diamonds5, file = tmpfile, row.names=FALSE)
```

```r
benchmark(
  "read.csv" = {
    read.csv(tmpfile)
    },
  "read_csv" = {
    readr::read_csv(file=tmpfile, progress = FALSE)
  }
)
```

```
##       test replications elapsed relative user.self sys.self user.child
## 2 read_csv          100  15.378    1.000    65.362    2.333          0
## 1 read.csv          100  57.557    3.743    54.848    2.545          0
##   sys.child
## 2         0
## 1         0
```

---

## Reading binary files - fast!

But introduces platform/version dependencies

```r
tmpfile_rds <- gsub("csv", "rds", tmpfile)
saveRDS(diamonds5, tmpfile_rds)
tmpfile_rda <- gsub("csv", "rda", tmpfile)
save(diamonds5, file=tmpfile_rda)

ptm <- proc.time()
t3 <- readRDS(tmpfile_rds)
proc.time() - ptm
```

```
##    user  system elapsed 
##   0.037   0.000   0.038
```

```r
ptm <- proc.time()
load(tmpfile_rda)
proc.time() - ptm
```

```
##    user  system elapsed 
##   0.037   0.001   0.038
```

---
class: inverse
## Your turn: which option is fastest?

```r
x <- sample.int(4, 1000, replace=TRUE)
y <- vector("character", length=length(x))

# Option #1
for (i in 1:length(x)) {
 y[i] <- switch (x[i], "Sun", "Rain", "Snow", "Hail", default=NA)
}

# Option #2
for (i in 1:length(x)) {
 if (x[i] == 1) y[i] <- "Sun"
 if (x[i] == 2) y[i] <- "Rain"
 if (x[i] == 3) y[i] <- "Snow"
 if (x[i] == 4) y[i] <- "Hail"
}

# Option #3
y <- c("Sun", "Rain", "Snow", "Hail")[x]

# Option #4
y <- factor(x)
levels(y) <- c("Sun", "Rain", "Snow", "Hail")
```

---

## Your Turn Result

Based on 500 evaluations of samples of size 5000:

![](03b_profiling-code_files/figure-html/unnamed-chunk-10-1.png)

---

```r
results %>% group_by(Option) %>% summarise(eval=mean(value))
```

```
## # A tibble: 4 × 2
## Option eval
## <int> <dbl>
## 1 1 0.00243 
## 2 2 0.00138 
## 3 3 0.0000620
## 4 4 0.000286
```

Option 3 is  fastest on average. But why? - Here we can relatively easily answer the question. Generally: more in-depth profiling

---

## Profiling Code: frequency and duration of function calls

- R has an in-built profiler: `Rprof`  Output in `Rprof.out`

```r
Rprof()

# ....

Rprof(NULL)
```

- call stack is written every 0.02 seconds (interval) into a temporary file

- output file is processed to produce a summary of the usage

- time measurements varies by platform: on Unix-alike time is CPU time of the R process (excludes time waiting for input)

- Note: profiling takes time, too; once the timer goes off, the information is not recorded until the next timing click (probably in the range 1–10 msecs). Can’t go below that.

---

## `profvis` example

`profvis` package provides a wrapper around `Rprof`

```r
library(profvis)

p <- profvis({			
 data(diamonds, package = "ggplot2")			
 plot(price ~ carat, data = diamonds)			
 m <- lm(price ~ carat, data = diamonds)			
 abline(m, col = "red")			
})
```

`profvis` creates interactive response with code and flame plot.

Copy and paste the code into your console to run it.

---
class: inverse
## Your Turn - Part 1

Write a function to find the factors of a positive integer `x`

---

## Discuss approaches

---
class: inverse
## Your Turn - Part 2

Benchmark the approaches to identify the fastest function.

Don't forget about readability and maintainability of code!