Node.js Startup: V8 changes

In our previous post, we made a lot of progress on the Node.js side towards speeding up initialization. Now it's time to dive deeper and see what changes we need to make to V8 itself.

Stop copying builtins

Continuing the theme of the previous post, I noticed there was still one more large, unaddressed memcpy call.

V8 includes “builtin” functions, which power common JavaScript features like sorting an array. The builtins are included once in the binary and shared across different V8 instances via mmap. For performance reasons, it’s important that calls to builtins are “short jumps” away from the code generated by V8’s JIT: the builtin code should be near (in terms of virtual memory addresses) the compiled JavaScript code to allow for efficient jumps. However, it’s not always possible to ensure this, so V8’s solution is to copy the builtins to be near the generated JavaScript code when initializing the V8 isolate.

This copying process is expensive since the builtins are quite large (~1.5 MiB today). Rather than copying the builtins, V8 first attempts to mmap them into the correct place. It parses /proc/self/maps to find out where the binary is in memory and where the builtins are located within the binary, and then remaps the binary’s builtins to the correct spot. This means that multiple virtual memory addresses in the process point to the same physical memory backing the builtins.
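
Here is a minimal sketch of the remap step, assuming a page-aligned target address and file offset. The names are illustrative, not V8’s actual API, and real code needs far more error handling:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map the builtins region of the binary at a chosen target address.
// MAP_FIXED replaces whatever was previously mapped at `target`, so the
// builtins become visible at two virtual addresses, both backed by the
// same physical pages.
static void* RemapBuiltins(const char* binary_path, off_t file_offset,
                           size_t size, void* target) {
  int fd = open(binary_path, O_RDONLY);
  if (fd < 0) return nullptr;
  void* mapped = mmap(target, size, PROT_READ | PROT_EXEC,
                      MAP_PRIVATE | MAP_FIXED, fd, file_offset);
  close(fd);
  return mapped == MAP_FAILED ? nullptr : mapped;
}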

Doing this correctly is a little tricky, as this comment from the V8 source explains:

Now we have a file descriptor to the same path the data we want to remap comes from. But… is it the same file? This is not guaranteed (e.g. in case of updates), so to avoid hard-to-track bugs, check that the underlying file is the same using the device number and the inode. Inodes are not unique across filesystems, and can be reused. The check still works here, though, because we can rule out both problems:

  • Inode uniqueness: check device numbers.
  • Inode reuse: the initial file is still open, since we are running code from it. So its inode cannot have been reused.
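
In code, that check looks roughly like the sketch below (illustrative names, not V8’s actual implementation): compare the device number and inode parsed from /proc/self/maps against an fstat of the freshly opened file descriptor.

#include <sys/stat.h>

// Does `fd` refer to the same file as the mapping we parsed out of
// /proc/self/maps?
static bool SameFile(int fd, dev_t maps_dev, ino_t maps_inode) {
  struct stat buf;
  if (fstat(fd, &buf) != 0) return false;
  // The device number disambiguates inodes across filesystems, and the
  // inode cannot have been reused while the original file is still mapped.
  return buf.st_dev == maps_dev && buf.st_ino == maps_inode;
}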

However, the code to parse /proc/self/maps was broken in certain cases. This is what it looked like (more context):

MemoryRegion region;
uint8_t dev_major = 0, dev_minor = 0;
uintptr_t inode = 0;
int path_index = 0;
uintptr_t offset = 0;
// The format is:
// address           perms offset  dev   inode   pathname
// 08048000-08056000 r-xp 00000000 03:0c 64593   /usr/sbin/gpm
if (sscanf(line,
           "%" V8PRIxPTR "-%" V8PRIxPTR " %4c %" V8PRIxPTR
           " %hhx:%hhx %" V8PRIdPTR " %n",
           &region.start, &region.end, region.permissions, &offset,
           &dev_major, &dev_minor, &inode, &path_index) < 7) {
  return base::nullopt;
}
// <snip>
region.dev = makedev(dev_major, dev_minor);

dev_major and dev_minor are actually not uint8_t: dev_major can use up to 12 bits and dev_minor can use up to 20 bits. This meant that, depending on which filesystem the executable was on, V8 would fall back to copying the builtins when it was not really necessary. Technically, using scanf to read an integer that is too big for its destination is undefined behavior, but in practice implementations will just silently truncate the conversion.
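
You can see the truncation with a small standalone program (this shows glibc’s behavior; strictly speaking, the standard leaves it undefined):

#include <stdio.h>

int main(void) {
  // 0x103 is a plausible major device number for an NVMe disk.
  unsigned char major8 = 0;
  unsigned int major32 = 0;
  sscanf("103", "%hhx", &major8);  // destination too narrow: stores 0x03
  sscanf("103", "%x", &major32);   // wide enough: stores 0x103
  printf("%%hhx -> 0x%x, %%x -> 0x%x\n", major8, major32);
  return 0;
}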

On my local machine, dev_major was small and the remapping worked fine. But on the Amazon EC2 instance I was benchmarking on, the NVMe SSDs have a major device number of 0x103. That doesn’t fit in a uint8_t, so V8 got confused and fell back to copying the builtins. The fix was straightforward: I changed dev_major and dev_minor to unsigned int, which sped up initialization by ~500 microseconds and saved 1.6 MiB of memory.
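
After the fix, the parsing code looks roughly like this: the destinations are widened, and the %hhx conversions become %x to match.

MemoryRegion region;
unsigned int dev_major = 0, dev_minor = 0;  // widened from uint8_t
uintptr_t inode = 0;
int path_index = 0;
uintptr_t offset = 0;
if (sscanf(line,
           "%" V8PRIxPTR "-%" V8PRIxPTR " %4c %" V8PRIxPTR
           " %x:%x %" V8PRIdPTR " %n",
           &region.start, &region.end, region.permissions, &offset,
           &dev_major, &dev_minor, &inode, &path_index) < 7) {
  return base::nullopt;
}
region.dev = makedev(dev_major, dev_minor);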

This is a great example of (1) why you should profile in an environment which closely matches your “production” environment and (2) why implicit integer conversion in C/C++ is evil. In a language like Rust, where integer casting is explicit, you would need a cast to correctly type the arguments to makedev (since makedev takes two unsigned ints). Hopefully the need for a cast would tip off the developer that device numbers may actually be bigger than a uint8_t. (But maybe not! The cast is infallible, which makes it much more likely to be overlooked.)

Stop recompiling

The next thing I noticed from the flamegraph was that we were spending time compiling functions.

This was surprising to me. Last time we learned about the code cache, which should have eagerly precompiled everything, so we shouldn’t have been compiling anything.

Single-stepping through the debugger revealed that the cause was incredibly simple: the relevant line in the V8 source code was commented out!

// flags.set_eager(compile_options == ScriptCompiler::kEagerCompile);

Node.js was passing kEagerCompile to ensure that all functions get compiled and put into the code cache, but the flag was being ignored: V8 fell back to its default heuristics for guessing which functions should be compiled. I uncommented the line, which fixed the issue. Unfortunately, eager compilation makes the code cache larger, which makes it take longer to deserialize and increases its memory footprint. For the latter issue, it’s a good thing that we removed the extra code cache copies in our previous post!
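
For reference, here is roughly how an embedder requests eager compilation and builds a code cache using the public v8::ScriptCompiler API (a sketch assuming an isolate and a source_string of type v8::Local<v8::String> are already in scope):

#include "v8.h"

v8::ScriptCompiler::Source source(source_string);
v8::Local<v8::UnboundScript> script =
    v8::ScriptCompiler::CompileUnboundScript(
        isolate, &source, v8::ScriptCompiler::kEagerCompile)
        .ToLocalChecked();
// With kEagerCompile honored, this cache covers every function in the
// script, not just the ones V8's heuristics would have compiled up front.
v8::ScriptCompiler::CachedData* cache =
    v8::ScriptCompiler::CreateCodeCache(script);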

Stop snapshot checksumming

While profiling, I noticed from the flamegraph that we were spending a decent bit of time running a CRC checksum on the snapshot.

This checksumming was supposed to happen only in debug builds, but it ran unconditionally because the snapshot mode that Node.js used just didn’t have the flag check. I added the flag check, speeding up startup by another 500 microseconds.
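
The shape of the fix is just a flag check around the verification; the names below are hypothetical, not V8’s actual identifiers:

// Skip the (expensive) CRC verification unless checksumming is enabled
// for this snapshot configuration.
if (ShouldVerifyChecksum(snapshot_blob)) {  // hypothetical predicate
  CHECK_EQ(ExpectedChecksum(snapshot_blob), ComputeChecksum(snapshot_blob));
}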

By Keyhan Vakil. Posted on 2023-12-07.