Node.js Startup: Profiling Startup
Tags: nodejs-startup, low-level
In the previous post in this series, we saw how to measure Node.js’s startup time using hyperfine. hyperfine is great for statistical analysis, but it doesn’t provide as much information as perf-stat(1):
$ perf stat --repeat=1000 ./node -e 0
Performance counter stats for './node -e 0' (1000 runs):
23.18 msec task-clock # 0.996 CPUs utilized (+- 0.20%)
19 context-switches # 818.731 /sec (+- 1.23%)
0 cpu-migrations # 0.000 /sec
3091 page-faults # 133.195 K/sec (+- 0.01%)
83289964 cycles # 3.589 GHz (+- 0.19%)
118025906 instructions # 1.41 insn per cycle (+- 0.01%)
19306506 branches # 831.938 M/sec (+- 0.02%)
448669 branch-misses # 2.33% of all branches (+- 0.06%)
0.0232809 +- 0.0000467 seconds time elapsed ( +- 0.20% )
Node.js is built on top of Google’s V8 JavaScript engine, which has its own bare-bones shell called “d8”. I wanted to see how fast d8 starts, since that gives a useful lower bound: d8 does far less than Node.js, so it’s unlikely we could beat it.
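In case you want to reproduce this, here’s roughly how a release build of d8 is produced. Treat it as a sketch rather than a recipe: it assumes depot_tools is installed and on your PATH, and your output directory may be named differently.
# Fetch V8 and its dependencies (requires depot_tools).
fetch v8
cd v8
# Configure a release build and build only the d8 shell.
gn gen out/Release --args='is_debug=false'
ninja -C out/Release d8
With that in place, the same perf stat measurement as before gives: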
$ perf stat -r 1000 out/Release/d8 -e 0
Performance counter stats for 'out/Release/d8 -e 0' (1000 runs):
7.84 msec task-clock # 1.321 CPUs utilized (+- 0.11%)
33 context-switches # 4.506 K/sec (+- 0.28%)
0 cpu-migrations # 0.000 /sec
843 page-faults # 115.105 K/sec (+- 0.02%)
24402305 cycles # 3.332 GHz (+- 0.10%)
26381058 instructions # 1.16 insn per cycle (+- 0.01%)
4358727 branches # 595.150 M/sec (+- 0.01%)
77103 branch-misses # 1.78% of all branches (+- 0.06%)
0.00593508 +- 0.00000559 seconds time elapsed ( +- 0.09% )
The salient points for me here were:
- Node.js’s startup time is already pretty small at 23 ms of elapsed time, so we’ll be shaving off milliseconds, or even microseconds, at a time. It also seems unlikely that we’ll get more than 66% faster, based on the d8 lower bound.
- The CPU is almost fully utilized. Node.js’s startup is almost entirely single-threaded, so “0.996 CPUs utilized” tells us the bottleneck is CPU speed rather than waiting on I/O or the scheduler.
- 133,000 page faults per second is a lot for a single-threaded workload; something in the hundreds per second would be more typical. It makes sense here, though: a lot of memory movement happens during program startup, such as loading the executable into memory, shared library resolution, and so on. (A quick way to break these faults down further is sketched just below.)
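If you want to see where those faults fall, perf can count minor faults (served from memory) and major faults (requiring disk I/O) separately. This isn’t part of the original measurement run, just a sketch using standard perf software events; the repeat count is arbitrary:
perf stat -e page-faults,minor-faults,major-faults --repeat=100 ./node -e 0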
Profiling with perf
My go-to profiler is perf(1). perf works by periodically interrupting the program and capturing a stack trace of the currently executing code. For more details about perf, I recommend Brendan Gregg’s writeup. Here’s the perf command I ran to profile Node.js’s startup:
LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu \
perf record -g -F 299 -- bash -c 'for i in {1..1000}; do ./node -e 0; done'
Profiling just one execution of node was not very useful: the runtime is so short that perf barely has time to sample the execution. Running node in a tight loop increased the number of samples. I also chose to use libc6-prof via LD_LIBRARY_PATH; it is a build of glibc with frame pointers enabled. Frame pointers are necessary so that perf record can correctly unwind stacks with -g; without them, time spent in the C standard library may not be correctly attributed to its callers. On Debian, you can install it via apt install libc6-prof.
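To sanity-check that node actually resolves libc from the frame-pointer build rather than the system one, you can ask the dynamic linker directly (the path here matches the one passed to perf record above):
LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu ldd ./node | grep libc.so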
For viewing the collected profiles, I rarely use the included perf-report interface.1 Using perf script, we can export the stack traces to a format that other tools can ingest. The output is a series of stack traces separated by blank lines. Each stack trace has a single header line containing information like the pid and sampling time, followed by the sampled backtrace. Here are the first couple of backtraces from our sample:
$ perf script -F comm,pid,tid,time,period,event,ip,sym,symoff,dso
perf-exec 46503/46503 7797.395284: 1 cycles:
ffffffff93296146 native_write_msr+0x6 ([kernel.kallsyms])
ffffffff9320f1a5 intel_pmu_enable_all+0x15 ([kernel.kallsyms])
ffffffff932092fb x86_pmu_enable+0x1ab ([kernel.kallsyms])
...
node 46505/46505 7797.403597: 27382873 cycles:
ffffffff938a2ea7 copy_page+0x7 ([kernel.kallsyms])
ffffffff934fa35e do_fault+0x24e ([kernel.kallsyms])
ffffffff934fa9cd handle_pte_fault+0x1cd ([kernel.kallsyms])
...
I wanted to post-process this output to focus only on Node processes. Also, since there were many Node processes, I wanted to collect all of their stack traces as if they came from one process (by rewriting all of their pids to the same value). Finally, I wanted to keep only the stack traces where the pid was equal to the tid: that indicates the main thread of Node.js, which is where almost all of the interesting work happens. I ended up crafting this Perl script:2
# A line starting at column 0 is the header of a new stack
# trace; by default, we don't print it.
$b = 0 if /^\w/;
# Start printing the stack trace if it matches the regex,
# indicating that this is a stack sampled from the main thread
# of a Node process.
#
# Replace the main thread with a fake aggregated thread of
# "node 1/1".
$b = 1 if s@^node ([0-9]+)/\1@node 1/1@;
print if $b
And as a one-liner:
perf script -F comm,pid,tid,time,period,event,ip,sym,symoff,dso \
| perl -n -e'$b=0 if/^\w/;$b=1 if s@^node ([0-9]+)/\1@node 1/1@;print if$b'
Aggregating across a bunch of processes means that a timeline view is useless, but flamegraph visualizations still work fine.
I happily loaded the output into Firefox Profiler. Unfortunately, more than half of the stacks looked like pure perf overhead, mentioning things like “intel_pmu” (PMU being the CPU’s Performance Monitoring Unit):
node 1/1 7804.131591: 1 cycles:
ffffffff93296146 native_write_msr+0x6 ([kernel.kallsyms])
ffffffff9320f1a5 intel_pmu_enable_all+0x15 ([kernel.kallsyms])
ffffffff932092fb x86_pmu_enable+0x1ab ([kernel.kallsyms])
ffffffff9349b09c __perf_event_task_sched_in+0x1ac ([kernel.kallsyms])
ffffffff932fbf47 finish_task_switch.isra.0+0x157 ([kernel.kallsyms])
ffffffff93304891 schedule_tail+0x11 ([kernel.kallsyms])
ffffffff93204c88 ret_from_fork+0x8 ([kernel.kallsyms])
7fa9c7a8256d send_vc+0x4bd (/usr/lib/libc6-prof/x86_64-linux-gnu/libc.so.6)
7fa9c79ec6d5 __strncat_ssse3+0x325 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc.so.6)
7fa9c79ed2bc __strncat_ssse3+0xf0c (/usr/lib/libc6-prof/x86_64-linux-gnu/libc.so.6)
fffffffeffffffff [unknown] ([unknown])
56367bb5ea70 [unknown] ([unknown])
48f0ffffffff0000 [unknown] ([unknown])
Firefox Profiler’s visualizer already provides functionality to drop stack frames containing certain functions. However, it is also possible to filter these overhead stacks out with some more Perl:
if ($_) {
    # If the line isn't empty, append it to the current
    # stack trace.
    $s = "$s\n$_";
} else {
    # If this is a blank line, decide whether or not to print
    # the stack trace.
    say $s if !$b;
    # Reset for the next block.
    $b = 0;
    $s = "";
}
# Don't print stack traces containing "send_vc" or "perf_event_".
$b = 1 if /send_vc|perf_event_/;
Or again as a bash one-liner:
perl -lanE 'if($_){$s="$s\n$_"}else{say $s if !$b;$b=0;$s=""};$b=1 if/send_vc|perf_event_/'
Here’s the script, all strung together.
#!/bin/bash
set -Eeuxo pipefail
LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu \
perf record -g -F 299 -- bash -c 'for i in {1..1000}; do ./node -e 0; done'
perf script -F comm,pid,tid,time,period,event,ip,sym,symoff,dso \
| perl -n -e'$b=0 if/^\w/;$b=1 if s@^node ([0-9]+)/\1@node 1/1@;print if$b' \
| perl -lanE 'if($_){$s="$s\n$_"}else{say $s if !$b;$b=0;$s=""};$b=1 if/send_vc|perf_event_/' \
| sed '1{/^$/d}' \
| gzip -c > stacks.gz
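As a quick spot-check of the pipeline (not part of the original script, just an illustrative command), you can count how many main-thread stacks survived the filtering, since every retained stack begins with the aggregated “node 1/1” header line:
gunzip -c stacks.gz | grep -c '^node 1/1 '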
I inserted some manual printf statements to measure time and checked that the results were similar to the CPU times I got from the profiles. This indicated to me that filtering out these overhead stacks was a reasonable decision.
We can generate an interactive SVG flamegraph using FlameGraph:
./flamegraph.pl --inverted --colors js \
--title 'Node.js startup graph' \
<(perl stackcollapse-perf.pl <(gunzip -c ~/node/stacks.gz)) \
> nodejs-startup-flamegraph.svg
I’ve included the flame graph inline below for your pleasure (full size version). Feel free to click around and see what you can find in advance of the next post in the series. If you prefer to view it in an external tool like Firefox Profiler or Speedscope, you can download the profile here.
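For example, Speedscope can import raw perf script output directly; assuming you have npm available, something along these lines should open the profile (the file names are just illustrative):
# Speedscope understands `perf script` output; decompress first, then open it.
gunzip -c stacks.gz > stacks.perf
npx speedscope stacks.perf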