Using libkalibera for benchmark analysis
Tags: low-level
libkalibera
is an implementation of the technique described in
“Rigorous Benchmarking in Reasonable Time” by Kalibera and Jones.
Given the results of an initial benchmarking experiment, it tells you how many
benchmark iterations to run in order to make optimal use of your compute time.
It also provides facilities for computing confidence intervals, which you can
use to estimate benchmark improvements and show statistical significance.
It’s been used in a couple of papers, but as far as I can
tell there is no real documentation on how to use it. So this is an
(unofficial!) user guide about how to start using it yourself. In
particular, I’ll be showing off the Python version, pykalibera, but
similar logic applies to the Ruby version.
Installation
pykalibera isn’t in PyPI or anything, so just git clone
the repo:
$ git clone https://github.com/softdevteam/libkalibera.git
then to use it, you’ll need to cd
into the python
directory:
$ cd libkalibera/python/
$ python
>>> import pykalibera
If you are using Python 3, this won’t work, and you’ll need to apply the patch below.
Patch for Python 3
If you are using Python 3, you’ll need to apply the patch below. You can do this by saving the patch into a file and then doing:
$ cd libkalibera/
$ git apply {path to file}
Here is the patch:
diff --git a/python/pykalibera/data.py b/python/pykalibera/data.py
index a07ceed..ee1115b 100644
--- a/python/pykalibera/data.py
+++ b/python/pykalibera/data.py
@@ -1,10 +1,11 @@
import math, itertools, random
+import base64
import bz2
from functools import wraps
-constants = bz2.decompress("""\
+constants = bz2.decompress(base64.b64decode("""\
QlpoOTFBWSZTWbTS4VUAC9bYAEAQAAF/4GAOGZ3e40HH2YJERUKomGbCNMAAtMBaAkCOP9U0/R+q
qNCqfjAqVGOY3+qk96qmmIp+CCVNDD/1VGjfqkBJpIElG6uN92vE/PP+5IxhMIIgAbOxEMKLMVSq
VWtZmZaEklAAAttoAAAAAAAAAAAAEklAAEklABttkksklkkknVu2dX1vW9yWrkuXJJJJJJJJJJKK
@@ -47,7 +48,7 @@ rJBfbPMSKZc3wmij3ULrhE9nIwoDMp4WAK2GkIKIqrHAK0Bjvo7sA2VZ941ggrwIsfGLZTHvGSZR
HrTVYQJbJ1e3y6B7LoCh5qyXWO03X5WbxWT0UvY55cyRbhmB8ib6lkhRo5USRAoLFA4WELV93ZV/
DKh2MIhnIWCPBLEh3FUTBSxJC7h4Z15qTFPTRmpe1Ldj1rlkVnAKHDySryior3OheiTPKZY2GaQ6
N2YyvJh9wuO75VOarCWLEUdLavAs2RShYOntLrMVabUAyDnTJIQ4deJa92pAWd6KBz+F3JFOFCQt
-NLhVQA==""".decode("base64"))
+NLhVQA=="""))
constants = [float(x) for x in constants.split()]
@@ -99,7 +100,7 @@ def confidence_slice(means, confidence="0.95"):
def memoize(func):
""" The @memoize decorator """
- attr = "%s_%s" % (func.func_name, id(func))
+ attr = "%s_%s" % (func.__name__, id(func))
@wraps(func)
def memoized(self, *args, **kwargs):
d = self._memoization_values
What libkalibera does
The idea behind libkalibera is that you have a benchmark and you want to know
how many times to run it. There may also be multiple levels. For example,
consider the following benchmarking methodology with three levels (you can
have more or fewer).
Level 3: You compile the program you are going to benchmark 10 times. Because of non-determinism in your build system, each executable created is slightly different.
Level 2: You run the compiled program 20 times. Because of non-determinism of things like ASLR and LTO, each process will perform slightly differently. The process starts out by running the benchmark 10 times as a “warmup” step.
Level 1: After the “warmup” completes, you do a loop which executes your benchmark 30 times. Because of non-determinism in your program or external factors (caches, context switches, CPU throttling, etc.), each iteration performs slightly differently.
This would in total require 10 * 20 * (10 + 30) = 8000 executions of the actual benchmark, which is a lot. We’ll call the runtime costs associated with each level c_3, c_2, c_1 respectively. So for example:
- c_3: time to build the program
- c_2: time for the process start + finishing the warmup steps
- c_1: time to do one iteration of the benchmark
After doing this many executions, you may have confidence in your results. But
could you have gotten away with fewer executions? The idea behind libkalibera
is that you first do an initial experiment to determine the optimal number of
executions for later real experiments.
The first step of using libkalibera
is to get the data into a Data
class. Here’s an example of how you would do that:
import pykalibera.data

level3_iterations = 10
level2_iterations = 20
level1_iterations = 30

dictionary = dict()
for l3 in range(level3_iterations):
    for l2 in range(level2_iterations):
        # Set dictionary[(l3, l2)] to a list
        # of benchmark timings EXCLUDING the
        # warmup iterations
        non_warmup_steps = # ...
        assert len(non_warmup_steps) == level1_iterations
        dictionary[(l3, l2)] = non_warmup_steps

reps = (level3_iterations, level2_iterations, level1_iterations)
d = pykalibera.data.Data(dictionary, reps)
Before continuing, you should verify the assumption that the timings of each iteration of the benchmark are independent of each other and identically distributed. Of course, it’s impossible to prove that this is true, so instead we’ll try to find reasons it’s not true.
The first reason is that you may not have warmed up the benchmark enough, especially if you’re benchmarking a JIT-compiled language which requires a lot of warmup. You can usually see this via visual inspection of the plots – here’s a run-sequence plot of a benchmark which does not do enough warmup. The sudden drop after the first few iterations suggests this:

If this happens, you should probably include more warmup iterations. See Determining the number of warmup iterations needed for more information about how to determine the number of warmups.
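The plot above is just a run-sequence plot of the timings. If you want to produce one yourself, here’s a minimal sketch using matplotlib (my choice; any plotting library will do), assuming benchmark_timings holds the per-iteration times of one process execution, including the warmup iterations:
import matplotlib.pyplot as plt

# Plot iteration time against iteration index, including the warmup
# iterations, so that a warmup "drop" at the start is visible.
plt.plot(benchmark_timings)
plt.xlabel("iteration")
plt.ylabel("time")
plt.show()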
Another common reason is that the residuals are “autocorrelated” – i.e. one
iteration of the benchmark depends on a previous one. You can check if this is
the case via statsmodels:
import pandas as pd
import statsmodels.api as sm

s = pd.Series(benchmark_timings)
sm.graphics.tsa.plot_acf(s, lags=10)
For example, the following graph shows low autocorrelation, which eliminates one possible cause of unsuitability. It’s a good graph because all of the autocorrelations are below the threshold for statistical significance, suggesting that they’re likely to be noise:

How many times do I need to run my benchmarks?
First, we’ll figure out how many times you need to run each level!
Run d.Ti2(i)
where i
is the level. If it’s ≤ 0, then you only need to run this level once!
Otherwise, you’ll need to do more work to determine the correct number of
repetitions.
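For instance, here’s a quick sketch of that check for the three levels of our example. I’m assuming the levels are numbered 1 (innermost) through 3 (outermost), matching the example above:
# Check whether more than one repetition is needed at each level.
# The level numbering (1 = innermost, 3 = outermost) is an assumption
# matching the example earlier.
for i in (1, 2, 3):
    if d.Ti2(i) <= 0:
        print("level %d: a single repetition is enough" % i)
    else:
        print("level %d: more repetitions are needed" % i)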
d.optimalreps(i, costs)
will tell you how many times you should run
the i
th level. costs
should be a list of costs for each level, from
high to low. So for example, let’s say that it takes 600 seconds to
build your application (c_3), 30 seconds to start the process and warm up the
benchmark (c_2), and then 20 seconds to run one iteration of the actual
benchmark (c_1). Then
you should pass in costs = [600, 30, 20]
.
For example, d.optimalreps(1, [600, 30, 20]) = 25
would mean that you
should use 25 iterations at Level 1, not 30 like we did originally.
Note that d.optimalreps
will throw if you pass in the highest level
(in our case, i = 3
). This is because the optimal number of times to
run the highest level depends on the accuracy you want from your
benchmarking suite. You can always get more accurate results by
running the top level more times. Remember that “optimal” here refers to
the cost/accuracy tradeoff, so you can get more accurate results at a higher
cost.
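Putting that together, here’s a small sketch of how I’d query the lower two levels of our example (the top level is left out for the reason just described):
# Costs from the highest level to the lowest: build time,
# process start + warmup, and one benchmark iteration.
costs = [600, 30, 20]

# Optimal repetition counts for the lower levels; the exact numbers
# depend on the variability observed in your initial experiment.
level1_reps = d.optimalreps(1, costs)
level2_reps = d.optimalreps(2, costs)
print(level1_reps, level2_reps)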
Confidence Intervals
You can get a 95% confidence interval for the mean benchmark time using bootstrap:
>>> d.bootstrap_confidence_interval(confidence="0.95")
(387551372.5, 388219399.3325, 389102520.205)
The first number is the 2.5th percentile, the second is the median of the means of the bootstrap, and the third is the 97.5th percentile.
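For example, a small sketch that unpacks the interval for reporting (the units are whatever units your timings were recorded in):
# Unpack the (2.5th percentile, median, 97.5th percentile) tuple.
lower, median, upper = d.bootstrap_confidence_interval(confidence="0.95")
print("median of bootstrapped means: %s" % median)
print("95%% confidence interval: [%s, %s]" % (lower, upper))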
You can also use the parametric method confidence95 if you are confident that
your data satisfies the assumptions of the central limit theorem. It will
likely result in tighter confidence intervals. I personally do not recommend
this, since the assumptions are difficult to verify.
If you have another data set, you can compare the two via
bootstrap_quotient
. This returns a confidence interval for the
quotient of sample mean times between d
and other
.
>>> d.bootstrap_quotient(other)
(0.9970837820037386, 0.9999710061391922, 1.002867547532627)
In this case the two results are not statistically significantly different, since the confidence interval contains 1. This is a ratio of the runtimes, so a value less than 1 would indicate a speedup, and a value greater than 1 would indicate a slowdown.
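If you want to automate that interpretation, here’s a small sketch:
# A confidence interval that lies entirely below (or above) 1 indicates
# a statistically significant speedup (or slowdown).
lower, median, upper = d.bootstrap_quotient(other)
if upper < 1:
    print("statistically significant speedup")
elif lower > 1:
    print("statistically significant slowdown")
else:
    print("no statistically significant difference")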
See the paper for more information on these confidence intervals, along with details for the bootstrap method.
Determining the number of warmup iterations needed
The above discussion doesn’t tell you how to choose the correct number
of warmup iterations. pykalibera
does not provide functionality for this. I used
ruptures
for changepoint analysis to some effect. See Why Aren’t More Users
More Happy With Our VMs? Part 1 for more information about how
to use changepoint analysis to determine when the warmup period is over.
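For reference, here’s a minimal sketch of that kind of changepoint analysis with ruptures. The model and penalty are assumptions you’ll want to tune for your own data, and benchmark_timings again holds the per-iteration times of one process execution, including warmup:
import numpy as np
import ruptures as rpt

# Detect changepoints in the per-iteration timings; the first detected
# changepoint is a candidate for where the warmup period ends.
signal = np.array(benchmark_timings)
algo = rpt.Pelt(model="rbf").fit(signal)
changepoints = algo.predict(pen=10)
print(changepoints)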
On pykalibera.graphs
The functions in pykalibera.graphs
reproduce the graphs shown in
Figure 1 of the paper. Personally, I wouldn’t use these. The
functionality is pretty standard and can be provided by a lot of
different libraries. Here are the equivalents I use for a Pandas series
called s
:
- run_sequence_plot: seaborn’s lineplot(s) is equivalent.
- lag_plot: seaborn’s scatterplot(y=s, x=s.shift(1)) does the equivalent for a shift of 1.
- acr_plot: statsmodels’ graphics.tsa.plot_acf(s) is better, and also shows which of the correlations are statistically significant.
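Concretely, a sketch of those equivalents (assuming s is the pandas Series of timings from earlier):
import seaborn as sns
import statsmodels.api as sm

sns.lineplot(data=s)                # run-sequence plot
sns.scatterplot(y=s, x=s.shift(1))  # lag plot with a shift of 1
sm.graphics.tsa.plot_acf(s)         # autocorrelation plot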