Node.js Startup: V8 changes
Tags: nodejs-startup, low-level
In our previous post, we made a lot of progress on the Node.js side towards speeding up initialization. Now it’s time to dive deeper and see what changes we need to make to V8 itself.
Stop copying builtins
Continuing on the theme of the previous post, I noticed there was still one large memcpy call left unaddressed.
V8 includes “builtin” functions, which power common JavaScript features like sorting an array. V8 builtins are included once in the binary and shared across different V8 instances via mmap. For performance reasons, it’s important that builtin calls are “short jumps” away from the code generated by V8’s JIT: the builtin code should be near (in terms of virtual memory offsets) the compiled JavaScript code to allow for efficient jumps. However, it’s not always possible to ensure this, so V8’s solution is to copy the builtins to be near the generated JavaScript code when initializing the V8 isolate.
This copying process is expensive since the builtins are quite large (~1.5 MiB today). Rather than copying the builtins, V8 attempts to mmap the builtins into the correct place. It parses /proc/self/maps to find out where the binary is in memory and where the builtins are located in the binary, and then remaps the binary’s builtins to the correct spot in memory. This means that there are multiple virtual memory addresses in the process pointing to the same physical memory used by the builtins.
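To illustrate the mechanism, here is a minimal sketch (the function name, parameters, and structure are my own, not V8’s actual code): remapping boils down to re-opening the running binary and mmap-ing the builtins’ file range at the chosen address with MAP_FIXED.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical sketch: map a file-backed range of our own executable at a
// second virtual address, so both mappings share the same physical pages.
// `file_offset` (page-aligned) is where the builtins live inside the
// binary, derived from /proc/self/maps; `target` is an address near the
// JIT-compiled code.
void* RemapBuiltins(const char* binary_path, off_t file_offset, size_t size,
                    void* target) {
  int fd = open(binary_path, O_RDONLY);
  if (fd < 0) return nullptr;
  // MAP_FIXED places the mapping exactly at `target`, replacing whatever
  // mapping was reserved there.
  void* result = mmap(target, size, PROT_READ | PROT_EXEC,
                      MAP_PRIVATE | MAP_FIXED, fd, file_offset);
  close(fd);
  return result == MAP_FAILED ? nullptr : result;
}
```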
Doing this correctly is a little tricky:
Now we have a file descriptor to the same path the data we want to remap comes from. But… is it the same file? This is not guaranteed (e.g. in case of updates), so to avoid hard-to-track bugs, check that the underlying file is the same using the device number and the inode. Inodes are not unique across filesystems and can be reused, but the check still works here, because:
- Inode uniqueness: check device numbers.
- Inode reuse: the initial file is still open, since we are running code from it. So its inode cannot have been reused.
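The check itself is small: fstat the freshly opened file descriptor and compare against the (device, inode) pair parsed from /proc/self/maps. A sketch, assuming a hypothetical IsSameFile helper:

```cpp
#include <sys/stat.h>

// Sketch of the safety check: confirm the file we just opened is the very
// file the kernel mapped, by comparing the device number and inode from
// fstat against the pair parsed out of /proc/self/maps.
bool IsSameFile(int fd, dev_t expected_dev, ino_t expected_inode) {
  struct stat st;
  if (fstat(fd, &st) != 0) return false;
  // The inode alone is not enough (inodes are unique only per filesystem),
  // hence the device-number comparison as well.
  return st.st_dev == expected_dev && st.st_ino == expected_inode;
}
```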
However, the code to parse /proc/self/maps was broken in certain cases. This is what it looked like (more context):
```cpp
MemoryRegion region;
uint8_t dev_major = 0, dev_minor = 0;
uintptr_t inode = 0;
int path_index = 0;
uintptr_t offset = 0;
// The format is:
// address           perms offset   dev   inode  pathname
// 08048000-08056000 r-xp  00000000 03:0c 64593  /usr/sbin/gpm
if (sscanf(line,
           "%" V8PRIxPTR "-%" V8PRIxPTR " %4c %" V8PRIxPTR
           " %hhx:%hhx %" V8PRIdPTR " %n",
           &region.start, &region.end, region.permissions, &offset,
           &dev_major, &dev_minor, &inode, &path_index) < 7) {
  return base::nullopt;
}
// <snip>
region.dev = makedev(dev_major, dev_minor);
```
dev_major and dev_minor are actually not uint8_t: dev_major can use up to 12 bits and dev_minor can use up to 20 bits. This meant that, depending on what filesystem the executable was on, V8 would fall back to copying the builtins when it was not really necessary. Technically, using scanf to read an integer that is too big for its destination is undefined behavior, but in practice implementations just silently truncate the conversion.
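You can see the truncation with a tiny standalone program (my own demo, not code from V8):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  uint8_t dev_major = 0;
  // A device field in /proc/self/maps might read "103:05". %hhx stores
  // into 8 bits, so 0x103 is silently truncated.
  sscanf("103:05", "%hhx", &dev_major);
  printf("0x%x\n", dev_major);  // prints 0x3 on common implementations
  return 0;
}
```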
On my local machine, dev_major was small and the mmapping worked fine. But on the Amazon EC2 instance I was benchmarking on, the NVMe SSDs have a major device number of 0x103. That doesn’t fit in a uint8_t, so V8 got confused and fell back to copying the builtins. The fix was straightforward: I changed dev_major and dev_minor to unsigned int, speeding up initialization by ~500 microseconds and saving 1.6 MiB of memory.
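A standalone demonstration of the fix’s shape (again my own demo, not the actual V8 patch; presumably the real fix also updates the matching sscanf conversions from %hhx to %x):

```cpp
#include <cstdio>
#include <sys/sysmacros.h>  // makedev

int main() {
  // With unsigned int destinations and %x conversions, the NVMe major
  // number 0x103 survives parsing intact.
  unsigned int dev_major = 0, dev_minor = 0;
  sscanf("103:05", "%x:%x", &dev_major, &dev_minor);
  printf("major=0x%x minor=0x%x dev=0x%lx\n", dev_major, dev_minor,
         (unsigned long)makedev(dev_major, dev_minor));
  return 0;
}
```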
This is a great example of (1) why you should profile in an environment that closely matches your “production” environment and (2) why implicit integer casting in C/C++ is evil. In a language like Rust, where integer casting is explicit, you would need a cast to correctly type the arguments to makedev (since makedev takes two unsigned ints). Hopefully the need for a cast would tip off the developer that device numbers may actually be bigger than a uint8_t. (But maybe not! The cast is infallible, which makes it much more likely to be overlooked.)
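To make the contrast concrete, here is the silent conversion on the C/C++ side (my own illustration): the uint8_t arguments promote to unsigned int without any diagnostic, so nothing nudges you to question the types.

```cpp
#include <cstdint>
#include <sys/sysmacros.h>

int main() {
  // Suppose parsing already truncated the real major number 0x103 to 0x03.
  uint8_t dev_major = 0x03, dev_minor = 0x05;
  // uint8_t promotes to unsigned int implicitly: no cast, no warning, and
  // nothing prompts you to ask whether uint8_t was wide enough.
  auto dev = makedev(dev_major, dev_minor);
  (void)dev;  // silence unused-variable warnings
  return 0;
}
```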
Stop recompiling
The next thing I noticed from the flamegraph is that we were spending time compiling functions.
This was surprising to me. Last time we learned about the code cache. The code cache should have eagerly precompiled everything, so we shouldn’t be compiling anything.
Single-stepping through the debugger revealed that the cause was incredibly simple: the relevant line in the V8 source code was commented out!
```cpp
// flags.set_eager(compile_options == ScriptCompiler::kEagerCompile);
```
Node.js was passing kEagerCompile to ensure that all functions get compiled and put into the code cache, but the option was being ignored; by default, V8 uses heuristics to guess which functions are worth compiling. I uncommented the line, which fixed the issue. Unfortunately, this makes the code cache larger, which makes it take longer to deserialize and increases its memory footprint. For the latter issue, it’s a good thing that we removed the extra code cache copies in our previous post!
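For context, this is roughly how an embedder like Node.js produces and consumes a code cache through V8’s public ScriptCompiler API (a simplified sketch with error handling omitted, not Node.js’s actual code):

```cpp
#include "v8.h"

using v8::Isolate;
using v8::Local;
using v8::ScriptCompiler;
using v8::String;
using v8::UnboundScript;

// Producing: compile everything eagerly, then serialize the compiled code.
ScriptCompiler::CachedData* ProduceCache(Isolate* isolate,
                                         Local<String> source_text) {
  ScriptCompiler::Source source(source_text);
  Local<UnboundScript> script =
      ScriptCompiler::CompileUnboundScript(isolate, &source,
                                           ScriptCompiler::kEagerCompile)
          .ToLocalChecked();
  // With kEagerCompile honored, the cache covers every function, not just
  // the ones V8's heuristics would have chosen to compile.
  return ScriptCompiler::CreateCodeCache(script);
}

// Consuming: hand the cache back so startup doesn't recompile anything.
Local<UnboundScript> ConsumeCache(Isolate* isolate, Local<String> source_text,
                                  ScriptCompiler::CachedData* cache) {
  ScriptCompiler::Source source(source_text, cache);  // takes ownership
  return ScriptCompiler::CompileUnboundScript(
             isolate, &source, ScriptCompiler::kConsumeCodeCache)
      .ToLocalChecked();
}
```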
Stop snapshot checksumming
While profiling, I noticed from the flamegraph that we were spending a decent bit of time running a CRC checksum over the snapshot.
This checksumming was supposed to happen only in debug modes, but it wasn’t, because the snapshot mode Node.js used just didn’t have the flag check. I added the flag check, speeding up startup by another 500 microseconds.
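The shape of the fix is just a guard around the verification (a schematic sketch; the function names and the Crc32 helper are placeholders, not the actual V8 code):

```cpp
#include <cstddef>
#include <cstdint>

// Placeholder for V8's actual checksum routine.
uint32_t Crc32(const uint8_t* data, size_t size);

bool VerifySnapshotChecksum(const uint8_t* data, size_t size,
                            uint32_t expected) {
#ifdef DEBUG
  // Debug builds pay for the full CRC pass over the snapshot.
  return Crc32(data, size) == expected;
#else
  // Release builds trust the snapshot and skip the linear scan over
  // megabytes of data.
  return true;
#endif
}
```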