Using libkalibera for benchmark analysis
Tags: low-level
libkalibera
is an implementation of the technique described in
“Rigorous Benchmarking in Reasonable Time” by Kalibera and Jones.
Given the results of an initial benchmarking experiment, it tells you how many
benchmark iterations to run in order to make optimal use of your compute time.
It also provides facilities for computing confidence intervals, which you can
use to estimate benchmark improvements and show statistical significance.
It’s been used in a couple of papers, but as far as I can
tell there is no real documentation on how to use it. So this is an
(unofficial!) user guide about how to start using it yourself. In
particular, I’ll be showing off the Python version, pykalibera, but
similar logic applies to the Ruby version.
Installation
pykalibera isn’t in PyPI or anything, so just git clone
the repo:
$ git clone https://github.com/softdevteam/libkalibera.git
then to use it, you’ll need to cd
into the python
directory:
$ cd libkalibera/python/
$ python
>>> import pykalibera
If you are using Python 3, this won’t work, and you’ll need to apply the patch below.
Patch for Python 3
If you are using Python 3, you’ll need to apply the patch below. You can do this by saving the patch into a file and then doing:
$ cd libkalibera/
$ git apply {path to file}
Here is the patch:
diff --git a/python/pykalibera/data.py b/python/pykalibera/data.py
index a07ceed..ee1115b 100644
--- a/python/pykalibera/data.py
+++ b/python/pykalibera/data.py
@@ -1,10 +1,11 @@
import math, itertools, random
+import base64
import bz2
from functools import wraps
-constants = bz2.decompress("""\
+constants = bz2.decompress(base64.b64decode("""\
QlpoOTFBWSZTWbTS4VUAC9bYAEAQAAF/4GAOGZ3e40HH2YJERUKomGbCNMAAtMBaAkCOP9U0/R+q
qNCqfjAqVGOY3+qk96qmmIp+CCVNDD/1VGjfqkBJpIElG6uN92vE/PP+5IxhMIIgAbOxEMKLMVSq
VWtZmZaEklAAAttoAAAAAAAAAAAAEklAAEklABttkksklkkknVu2dX1vW9yWrkuXJJJJJJJJJJKK
@@ -47,7 +48,7 @@ rJBfbPMSKZc3wmij3ULrhE9nIwoDMp4WAK2GkIKIqrHAK0Bjvo7sA2VZ941ggrwIsfGLZTHvGSZR
HrTVYQJbJ1e3y6B7LoCh5qyXWO03X5WbxWT0UvY55cyRbhmB8ib6lkhRo5USRAoLFA4WELV93ZV/
DKh2MIhnIWCPBLEh3FUTBSxJC7h4Z15qTFPTRmpe1Ldj1rlkVnAKHDySryior3OheiTPKZY2GaQ6
N2YyvJh9wuO75VOarCWLEUdLavAs2RShYOntLrMVabUAyDnTJIQ4deJa92pAWd6KBz+F3JFOFCQt
-NLhVQA==""".decode("base64"))
+NLhVQA=="""))
constants = [float(x) for x in constants.split()]
@@ -99,7 +100,7 @@ def confidence_slice(means, confidence="0.95"):
def memoize(func):
""" The @memoize decorator """
- attr = "%s_%s" % (func.func_name, id(func))
+ attr = "%s_%s" % (func.__name__, id(func))
@wraps(func)
def memoized(self, *args, **kwargs):
d = self._memoization_values
What libkalibera does
The idea behind libkalibera is that you have a benchmark and you want to know
how many times to run it. There may also be multiple levels. For example,
consider the following benchmarking methodology with three levels (you can
have more or fewer).
Level 3: You compile the program you are going to benchmark 10 times. Because of non-determinism in your build system, each executable created is slightly different.
Level 2: You run the compiled program 20 times. Because of non-determinism of things like ASLR and LTO, each process will perform slightly differently. The process starts out by running the benchmark 10 times as a “warmup” step.
Level 1: After the “warmup” completes, you do a loop which executes your benchmark 30 times. Because of non-determinism in your program or external factors (caches, context switches, CPU throttling, etc.), each iteration performs slightly differently.
This would in total require 10 * 20 * (10 + 30) = 8000 executions of the actual benchmark, which is a lot. We’ll call the runtime costs associated with each level c_3, c_2, c_1 respectively. So for example:
- c_3: time to build the program
- c_2: time for the process start + finishing the warmup steps
- c_1: time to do one iteration of the benchmark
After doing this many executions, you may have confidence in your results. But
could you have gotten away with fewer executions? The idea behind libkalibera
is that you first do an initial experiment to determine the optimal number of
executions for later real experiments.
The first step of using libkalibera
is to get the data into a Data
class. Here’s an example of how you would do that:
import pykalibera.data

level3_iterations = 10
level2_iterations = 20
level1_iterations = 30

dictionary = dict()
for l3 in range(level3_iterations):
    for l2 in range(level2_iterations):
        # Set dictionary[(l3, l2)] to a list
        # of benchmark timings EXCLUDING the
        # warmup iterations
        non_warmup_steps = # ...
        assert len(non_warmup_steps) == level1_iterations
        dictionary[(l3, l2)] = non_warmup_steps

reps = (level3_iterations, level2_iterations, level1_iterations)
d = pykalibera.data.Data(dictionary, reps)
Before continuing, you should verify the assumption that the timings of each iteration of the benchmark are independent of each other and identically distributed. Of course, it’s impossible to prove that this is true, so instead we’ll try to find reasons it’s not true.
The first reason is that you may not have warmed up the benchmark enough, especially if you’re benchmarking a JIT-compiled language which requires a lot of warmup. You can usually see this via visual inspection of the plots – here’s a run-sequence plot of a benchmark which does not do enough warmup. The sudden drop after the first few iterations suggests this:

If this happens, you should probably include more warmup iterations. See Determining the number of warmup iterations needed for more information about how to determine the number of warmups.
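The plot above is just a run-sequence plot of the timings. If you want to produce one yourself, here’s a minimal sketch using matplotlib (my choice; any plotting library will do), assuming benchmark_timings holds the per-iteration times of one process execution, including the warmup iterations:
import matplotlib.pyplot as plt

# Plot iteration time against iteration index, including the warmup
# iterations, so that a warmup "drop" at the start is visible.
plt.plot(benchmark_timings)
plt.xlabel("iteration")
plt.ylabel("time")
plt.show()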
Another common reason is that the residuals are “autocorrelated” – i.e. one
iteration of the benchmark depends on a previous one. You can check if this is
the case via statsmodels:
import pandas as pd
import statsmodels.api as sm

s = pd.Series(benchmark_timings)
sm.graphics.tsa.plot_acf(s, lags=10)
For example, the following graph shows low autocorrelation, which eliminates one possible cause of unsuitability. It’s a good graph because all of the autocorrelations are below the threshold for statistical significance, suggesting that they’re likely to be noise:

How many times do I need to run my benchmarks?
First, we’ll figure out how many times you need to run each level!
Run d.Ti2(i)
where i
is the level. If it’s ≤ 0, then you only need to run this level once!
Otherwise, you’ll need to do more work to determine the correct number of
repetitions.
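For instance, here’s a quick sketch of that check for the three levels of our example. I’m assuming the levels are numbered 1 (innermost) through 3 (outermost), matching the example above:
# Check whether more than one repetition is needed at each level.
# The level numbering (1 = innermost, 3 = outermost) is an assumption
# matching the example earlier.
for i in (1, 2, 3):
    if d.Ti2(i) <= 0:
        print("level %d: a single repetition is enough" % i)
    else:
        print("level %d: more repetitions are needed" % i)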
d.optimalreps(i, costs)
will tell you how many times you should run
the i
th level. costs
should be a list of costs for each level, from
high to low. So for example, let’s say that it takes 600 seconds to
build your application (c_3), 30 seconds to start the process and warm up the
benchmark (c_2), and then 20 seconds to run one iteration of the actual
benchmark (c_1). Then
you should pass in costs = [600, 30, 20]
.
For example, d.optimalreps(1, [600, 30, 20]) = 25
would mean that you
should use 25 iterations at Level 1, not 30 like we did originally.
Note that d.optimalreps
will throw if you pass in the highest level
(in our case, i = 3
). This is because the optimal number of times to
run the highest level depends on the accuracy you want from your
benchmarking suite. You can always get more accurate results by
running the top level more times. Remember that “optimal” here refers to
the cost/accuracy tradeoff, so you can get more accurate results at a higher
cost.
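Putting that together, here’s a small sketch of how I’d query the lower two levels of our example (the top level is left out for the reason just described):
# Costs from the highest level to the lowest: build time,
# process start + warmup, and one benchmark iteration.
costs = [600, 30, 20]

# Optimal repetition counts for the lower levels; the exact numbers
# depend on the variability observed in your initial experiment.
level1_reps = d.optimalreps(1, costs)
level2_reps = d.optimalreps(2, costs)
print(level1_reps, level2_reps)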
Confidence Intervals
You can get a 95% confidence interval for the mean benchmark time using bootstrap:
>>> d.bootstrap_confidence_interval(confidence="0.95")
(387551372.5, 388219399.3325, 389102520.205)
The first number is the 2.5th percentile, the second is the median of the means of the bootstrap, and the third is the 97.5th percentile.
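For example, a small sketch that unpacks the interval for reporting (the units are whatever units your timings were recorded in):
# Unpack the (2.5th percentile, median, 97.5th percentile) tuple.
lower, median, upper = d.bootstrap_confidence_interval(confidence="0.95")
print("median of bootstrapped means: %s" % median)
print("95%% confidence interval: [%s, %s]" % (lower, upper))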
You can also use the parametric method confidence95 if you are confident that
your data satisfies the assumptions of the central limit theorem. It will
likely result in tighter confidence intervals. I personally do not recommend
this, since the assumptions are difficult to verify.
If you have another data set, you can compare the two via
bootstrap_quotient
. This returns a confidence interval for the
quotient of sample mean times between d
and other
.
>>> d.bootstrap_quotient(other)
(0.9970837820037386, 0.9999710061391922, 1.002867547532627)
In this case the two results are not statistically significantly different, since the confidence interval contains 1. This is a ratio of the runtimes, so a value less than 1 would indicate a speedup, and a value greater than 1 would indicate a slowdown.
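If you want to automate that interpretation, here’s a small sketch:
# A confidence interval that lies entirely below (or above) 1 indicates
# a statistically significant speedup (or slowdown).
lower, median, upper = d.bootstrap_quotient(other)
if upper < 1:
    print("statistically significant speedup")
elif lower > 1:
    print("statistically significant slowdown")
else:
    print("no statistically significant difference")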
See the paper for more information on these confidence intervals, along with details for the bootstrap method.
Determining the number of warmup iterations needed
The above discussion doesn’t tell you how to choose the correct number
of warmup iterations. pykalibera
does not provide functionality for this. I used
ruptures
for changepoint analysis to some effect. See Why Aren’t More Users
More Happy With Our VMs? Part 1 for more information about how
to use changepoint analysis to determine when the warmup period is over.
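For reference, here’s a minimal sketch of that kind of changepoint analysis with ruptures. The model and penalty are assumptions you’ll want to tune for your own data, and benchmark_timings again holds the per-iteration times of one process execution, including warmup:
import numpy as np
import ruptures as rpt

# Detect changepoints in the per-iteration timings; the first detected
# changepoint is a candidate for where the warmup period ends.
signal = np.array(benchmark_timings)
algo = rpt.Pelt(model="rbf").fit(signal)
changepoints = algo.predict(pen=10)
print(changepoints)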
On pykalibera.graphs
The functions in pykalibera.graphs
reproduce the graphs shown in
Figure 1 of the paper. Personally, I wouldn’t use these. The
functionality is pretty standard and can be provided by a lot of
different libraries. Here are the equivalents I use for a Pandas series
called s
:
- run_sequence_plot: seaborn’s lineplot(s) is equivalent.
- lag_plot: seaborn’s scatterplot(y=s, x=s.shift(1)) does the equivalent for a shift of 1.
- acr_plot: statsmodels’ graphics.tsa.plot_acf(s) is better, and also shows which of the correlations are statistically significant.
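Concretely, a sketch of those equivalents (assuming s is the pandas Series of timings from earlier):
import seaborn as sns
import statsmodels.api as sm

sns.lineplot(data=s)                # run-sequence plot
sns.scatterplot(y=s, x=s.shift(1))  # lag plot with a shift of 1
sm.graphics.tsa.plot_acf(s)         # autocorrelation plot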