Greets! Today's article looks at browser WebAssembly implementations from a compiler throughput point of view. As I wrote in my article on Firefox's WebAssembly baseline compiler, web browsers have multiple wasm compilers: some that produce code fast, and some that produce fast code. Implementors are willing to pay the cost of having multiple compilers in order to satisfy these conflicting needs. So how well do they do their jobs? Why bother?
I start by benchmarking the various WebAssembly implementations, firstly in their standard configuration and then setting special run-time flags to measure the performance of the component compilers. I run these tests on the core-rich machine that I use for browser development (2 Xeon Silver 4114 CPUs for a total of 40 logical cores). The default-configuration numbers are therefore not indicative of performance on a low-end Android phone, but we can use them to extract aspects of the different implementations.
Since I'm interested in compiler throughput, I'm not particularly concerned about how well a compiler will use all 40 cores. Therefore when testing the specific compilers I will set implementation-specific flags to disable parallelism in the compiler and GC: --single-threaded on V8, --no-threads on SpiderMonkey, and --useConcurrentGC=false --useConcurrentJIT=false on JSC. To further restrict any threads that the implementation might decide to spawn, I'll bind these to a single core on my machine using taskset -c 4. Otherwise the machine is in its normal configuration (nothing else significant running, all cores available for scheduling, turbo boost enabled).
I'll express results in nanoseconds per WebAssembly code byte. Of the 40 megabytes or so in the Zen Garden demo, only 23 891 164 bytes are actually function code; the rest is mostly static data (textures and so on). So I'll divide the total time by this code byte count.
results & analysis
We can see some interesting results in this graph. Note that the Y axis is logarithmic. The "concurrent tiering" results in the graph correspond to the default configurations (no special flags, no taskset, all cores available).
The part of JSC that generates baseline interpreter code runs slower than SpiderMonkey's baseline compiler, so one is tempted to wonder why JSC bothers to go the interpreter route; but then we recall that on iOS, we can't generate machine code in some contexts, so the LLInt does appear to address a need.
One interesting feature of the LLInt is that it allows tier-up to the optimizing compiler directly from loops, which neither V8 nor SpiderMonkey support currently. Failure to tier up can be quite confusing for users, so good on JSC hackers for implementing this.
V8 and SpiderMonkey, on the other hand, are much of the same shape. Both implement a streaming baseline compiler and an optimizing compiler; for V8, we get these via --liftoff --no-wasm-tier-up or --no-liftoff, respectively, and for SpiderMonkey it's --wasm-compiler=baseline or --wasm-compiler=ion.
Finally, it would seem that the current cross-implementation benchmark for lowest-tier code generation throughput on a desktop machine would then be around 50 ns per WebAssembly code byte for a single core, which corresponds to receiving code over the wire at somewhere around 160 megabits per second (Mbps). If we add in concurrency and manage to farm out compilation tasks well, we can obviously double or triple that bitrate. Optimizing compilers run at least an order of magnitude slower. We can conclude that to the desktop end user, WebAssembly compilation time is indistinguishable from download time for the lowest tier. The optimizing tier is noticeably slower though, running more around 10-15 Mbps per core, so time-to-tier-up is still a concern for faster networks.
Going back to the question posed at the start of the article: yes, tiering shows a clear benefit in terms of WebAssembly compilation latency, letting users interact with web sites sooner. So that's that. Happy hacking and until next time!