wingolog: a mostly dorky weblog by Andy Wingo

understanding webassembly code generation throughput
Andy Wingo, 2020-04-14, https://wingolog.org/2020/04/14/understanding-webassembly-code-generation-throughput

Greets! Today's article looks at browser WebAssembly implementations from a compiler throughput point of view. As I wrote in my article on Firefox's WebAssembly baseline compiler, web browsers have multiple wasm compilers: some that produce code fast, and some that produce fast code. Implementors are willing to pay the cost of having multiple compilers in order to satisfy these conflicting needs. So how well do they do their jobs? Why bother?

In this article, I'm going to take the simple path and just look at code generation throughput on a single chosen WebAssembly module. Think of it as X-ray diffraction to expose aspects of the inner structure of the WebAssembly implementations in SpiderMonkey (Firefox), V8 (Chrome), and JavaScriptCore (Safari).

experimental setup

As a workload, I am going to use a version of the "Zen Garden" demo. This is a 40-megabyte game engine and rendering demo, originally released for other platforms, and compiled to WebAssembly a couple years later. Unfortunately the original URL for the demo was disabled at some point in late 2019, so it no longer has a home on the web. It's a bit of a weird situation, and I am not clear on the licensing either. In any case I have a version downloaded, and have hacked out a minimal set of "imports" that the WebAssembly module needs from the host to allow the module to compile and link when run from a JavaScript shell, without requiring WebGL and similar facilities. So the benchmark is just to instantiate a WebAssembly module from the 40-megabyte byte array and see how long it takes. It would be better if I had more test cases (and would be happy to add them to the comparison!) but this is a start.
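In case it helps to picture the harness, here is a sketch of that stubbed instantiation. All names here are invented stand-ins: readFileAsBytes represents whatever file-reading primitive the shell provides, and the real module needs many more stubs than shown.

const bytes = readFileAsBytes("zen-garden.wasm");
const stubs = {
  env: {
    // One no-op stub per import the module declares, e.g.:
    glTexImage2D: () => {},
  },
};
const module = new WebAssembly.Module(bytes);              // compile
const instance = new WebAssembly.Instance(module, stubs);  // link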

I start by benchmarking the various WebAssembly implementations, firstly in their standard configuration and then setting special run-time flags to measure the performance of the component compilers. I run these tests on the core-rich machine that I use for browser development (2 Xeon Silver 4114 CPUs for a total of 40 logical cores). The default-configuration numbers are therefore not indicative of performance on a low-end Android phone, but we can use them to extract aspects of the different implementations.

Since I'm interested in compiler throughput, I'm not particularly concerned about how well a compiler will use all 40 cores. Therefore when testing the specific compilers I will set implementation-specific flags to disable parallelism in the compiler and GC: --single-threaded on V8, --no-threads on SpiderMonkey, and --useConcurrentGC=false --useConcurrentJIT=false on JSC. To further restrict any threads that the implementation might decide to spawn, I'll bind these to a single core on my machine using taskset -c 4. Otherwise the machine is in its normal configuration (nothing else significant running, all cores available for scheduling, turbo boost enabled).

I'll express results in nanoseconds per WebAssembly code byte. Of the 40 megabytes or so in the Zen Garden demo, only 23 891 164 bytes are actually function code; the rest is mostly static data (textures and so on). So I'll divide the total time by this code byte count.

I tested V8 at git revision 0961376575206, SpiderMonkey at hg revision 8ec2329bef74, and JavaScriptCore at subversion revision 259633. The benchmarks can be run using just a shell; see the pull request. I timed how long it took to instantiate the Zen Garden demo, ensuring that a basic export was callable. I collected results from 20 separate runs, sleeping a second between them. The bars in the charts below show the median times, with a histogram overlay of all results.
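In shell terms the measurement loop is something like the following sketch; sleep() is shell-specific, and instantiateZenGarden() stands in for the stubbed instantiation shown earlier.

const times = [];
for (let i = 0; i < 20; i++) {
  const start = Date.now();
  instantiateZenGarden();            // compile + link the whole module
  times.push(Date.now() - start);
  sleep(1);                          // one second between runs
}
times.sort((a, b) => a - b);
const medianMs = times[times.length >> 1];        // the 20-run median, give or take
const nsPerCodeByte = medianMs * 1e6 / 23891164;  // divide by function code bytes only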

results & analysis

[graph: median instantiation time in nanoseconds per WebAssembly code byte, for each implementation and compiler configuration; log-scale Y axis]

We can see some interesting results in this graph. Note that the Y axis is logarithmic. The "concurrent tiering" results in the graph correspond to the default configurations (no special flags, no taskset, all cores available).

The first interesting conclusions that pop out for me concern JavaScriptCore, which is the only implementation to have a baseline interpreter (run using --useWasmLLInt=true --useBBQJIT=false --useOMGJIT=false). JSC's WebAssembly interpreter is actually structured as a compiler that generates custom WebAssembly-specific bytecode, which is then run by a custom interpreter built using the same infrastructure as JSC's JavaScript interpreter (the LLInt). Directly interpreting WebAssembly might be possible as a low-latency implementation technique, but since you need to validate the WebAssembly anyway and eventually tier up to an optimizing compiler, apparently it made sense to emit fresh bytecode.

The part of JSC that generates baseline interpreter code runs slower than SpiderMonkey's baseline compiler, so one is tempted to wonder why JSC bothers to go the interpreter route; but then we recall that on iOS, we can't generate machine code in some contexts, so the LLInt does appear to address a need.

One interesting feature of the LLInt is that it allows tier-up to the optimizing compiler directly from loops, which neither V8 nor SpiderMonkey support currently. Failure to tier up can be quite confusing for users, so good on JSC hackers for implementing this.

Finally, while baseline interpreter code generation throughput handily beats V8's baseline compiler, it would seem that something in JavaScriptCore is not adequately taking advantage of multiple cores; if one core compiles at 51ns/byte, why do 40 cores only do 41ns/byte? It could be my tests are misconfigured, or it could be that there's a nice speed boost to be found somewhere in JSC.

JavaScriptCore's baseline compiler (run using --useWasmLLInt=false --useBBQJIT=true --useOMGJIT=false) runs much more slowly than SpiderMonkey's or V8's baseline compiler, which I think can be attributed to the fact that it builds a graph of basic blocks instead of doing a one-pass compile. To me these results validate SpiderMonkey's and V8's choices, looking strictly from a latency perspective.

I don't have graphs for code generation throughput of JavaScriptCore's optimizing compiler (run using --useWasmLLInt=false --useBBQJIT=false --useOMGJIT=true); it turns out that JSC wants one of the lower tiers to be present, and will only tier up from the LLInt or from BBQ. Oh well!

V8 and SpiderMonkey, on the other hand, are much the same shape. Both implement a streaming baseline compiler and an optimizing compiler; for V8, we get these via --liftoff --no-wasm-tier-up or --no-liftoff, respectively, and for SpiderMonkey it's --wasm-compiler=baseline or --wasm-compiler=ion.

Here we should conclude directly that SpiderMonkey generates code around twice as fast as V8 does, in both tiers. SpiderMonkey can generate machine code faster even than JavaScriptCore can generate bytecode, and optimized machine code faster than JSC can make baseline machine code. It's a very impressive result!

Another conclusion concerns the efficacy of tiering: for both V8 and SpiderMonkey, the baseline compiler runs more than 10 times as fast as the optimizing compiler, and the same ratio holds between JavaScriptCore's baseline interpreter and compiler.

Finally, it would seem that the current cross-implementation benchmark for lowest-tier code generation throughput on a desktop machine would then be around 50 ns per WebAssembly code byte for a single core, which corresponds to receiving code over the wire at somewhere around 160 megabits per second (Mbps). If we add in concurrency and manage to farm out compilation tasks well, we can obviously double or triple that bitrate. Optimizing compilers run at least an order of magnitude slower. We can conclude that to the desktop end user, WebAssembly compilation time is indistinguishable from download time for the lowest tier. The optimizing tier is noticeably slower though, running at more like 10-15 Mbps per core, so time-to-tier-up is still a concern for faster networks.
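To make that unit conversion explicit (plain arithmetic, nothing engine-specific):

// 50 ns per code byte on one core:
const bytesPerSecond = 1e9 / 50;                     // 20,000,000 bytes/s
const megabitsPerSecond = bytesPerSecond * 8 / 1e6;  // 160 Mbps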

Going back to the question posed at the start of the article: yes, tiering shows a clear benefit in terms of WebAssembly compilation latency, letting users interact with web sites sooner. So that's that. Happy hacking and until next time!

multi-value webassembly in firefox: a binary interface
Andy Wingo, 2020-04-08, https://wingolog.org/2020/04/08/multi-value-webassembly-in-firefox-a-binary-interface

Hey hey hey! Hope everyone is staying safe at home in these weird times. Today I have a final dispatch on the implementation of the multi-value feature for WebAssembly in Firefox. Last week I wrote about multi-value in blocks; this week I cover function calls.

on the boundaries between things

In my article on Firefox's baseline compiler, I mentioned that all WebAssembly engines in web browsers treat the function as the unit of compilation. This facilitates streaming, parallel compilation of WebAssembly modules, by farming out compilation of individual functions to worker threads. It also allows for easy tier-up from quick-and-dirty code generated by the low-latency baseline compiler to the faster code produced by the optimizing compiler.

There are some interesting Conway's Law implications of this choice. One is that division of compilation tasks becomes an opportunity for division of human labor; there is a whole team working on the experimental Cranelift compiler that could replace the optimizing tier, and in my hackings on Firefox I have had minimal interaction with them. To my detriment, of course; they are fine people doing interesting things. But the code boundary means that we don't need to communicate as we work on different parts of the same system.

Boundaries are where places touch, and sometimes for fluid crossing we have to consider boundaries as places in their own right. Functions compiled with the baseline compiler, with Ion (the production optimizing compiler), and with Cranelift (the experimental optimizing compiler) are all able to call each other because they actively maintain a common boundary, a binary interface (ABI). (Incidentally the A originally stands for "application", essentially reflecting division of labor between groups of people making different components of a software system; Conway's Law again.) Let's look closer at this boundary-place, with an eye to how it changes with multi-value.

what's in an ABI?

Among other things, an ABI specifies a calling convention: which arguments go in registers, which on the stack, how the stack values are represented, how results are returned to the callers, which registers are preserved over calls, and so on. Intra-WebAssembly calls are a closed world, so we can design a custom ABI if we like; that's what V8 does. Sometimes WebAssembly may call functions from the run-time, though, and so it may be useful to be closer to the C++ ABI on that platform (the "native" ABI); that's what Firefox does. (Incidentally here I think Firefox is probably leaving a bit of performance on the table on Windows by using the inefficient native ABI that only allows four register parameters. I haven't measured though so perhaps it doesn't matter.) Using something closer to the native ABI makes debugging easier as well, as native debugger tools can apply more easily.

One thing that most native ABIs have in common is that they are really only optimized for a single result. This reflects their heritage as artifacts from a world built with C and C++ compilers, where there isn't a concept of a function with more than one result. If multiple results are required, they are represented instead as arguments, typically as pointers to memory somewhere. Consider the AMD64 SysV ABI, used on Unix-derived systems, which carefully specifies how to pass arbitrary numbers of arbitrary-sized data structures to a function (§3.2.3), while only specifying what to do for a single return value. If the return value is too big for registers, the ABI specifies that a pointer to result memory be passed as an argument instead.

So in a multi-result WebAssembly world, what are we to do? How should a function return multiple results to its caller? Let's assume that there are some finite number of general-purpose and floating-point registers devoted to return values, and that if the return values will fit into those registers, then that's where they go. The problem is then to determine which results will go there, and if there are remaining results that don't fit, then we have to put them in memory. The ABI should indicate how to address that memory.

When looking into a design, I considered three possibilities.

first thought: stack results precede stack arguments

When a function needs some of its arguments passed on the stack, it doesn't receive a pointer to those arguments; rather, the arguments are placed at a well-known offset to the stack pointer.

We could do the same thing with stack results, either reserving space deeper on the stack than stack arguments, or closer to the stack pointer. With the advent of tail calls, it would make more sense to place them deeper on the stack. Like this:

[diagram: stack results placed deeper on the stack (farther from the stack pointer) than stack arguments]

The diagram above shows the ordering of stack arguments as implemented by Firefox's WebAssembly compilers: later arguments are deeper (farther from the stack pointer). It's an arbitrary choice that happens to match up with what the native ABIs do, as it was easier to re-use bits of the already-existing optimizing compiler that way. (Native ABIs use this stack argument ordering because of sloppiness in a version of C from before I was born. If you were starting over from scratch, probably you wouldn't do things this way.)

Stack result order does matter to the baseline compiler, though. It's easier if the stack results are placed in the same order in which they would be pushed on the virtual stack, so that when the function completes, the results can just be memmove'd down into place (if needed). The same concern dictates another aspect of our ABI: unlike calls, registers are allocated to the last results rather than the first results. This is to make it easy to preserve stack invariant (1) from the previous article.

At first I thought this was the obvious option, but I ran into problems. It turns out that stack arguments are fundamentally unlike stack results in some important ways.

While a stack argument is logically consumed by a call, a stack result starts life with a call. As such, if you reserve space for stack results just by decrementing the stack pointer before a call, probably you will need to load the results eagerly into registers thereafter or shuffle them into other positions to be able to free the allocated stack space.

Eager shuffling is busy-work that should be avoided if possible. It's hard to avoid in the baseline compiler. For example, a call to a function with 10 arguments will consume 10 values from the temporary stack; any results will be pushed on after removing argument values from the stack. If there are any stack results, it's almost impossible to avoid a post-call memmove, to move stack results to where they should be before the 10 argument values were pushed on (and probably spilled). So the baseline compiler case is not optimal.

However, things get gnarlier with the Ion optimizing compiler. Like many other optimizing compilers, Ion is designed to compute the necessary stack frame size ahead of time, and to never move the stack pointer during an activation. The only exception is for pushing on any needed stack arguments for nested calls (which are popped directly after the nested call). So in that case, assuming there are a number of multi-value calls in a stack frame, we'll be shuffling in the optimizing compiler as well. Not great.

Besides the need to shuffle, stack arguments and stack results differ as regards ownership and garbage collection. A callee "owns" the memory for its stack arguments; it is responsible for them. The caller can't assume anything about the contents of that memory after a call, especially if the WebAssembly implementation supports tail calls (a whole 'nother blog post, that). If the values being passed are just bits, that's one thing, but with the reference types proposal, some result values may be managed by the garbage collector. The callee is responsible for making stack arguments visible to the garbage collector; the caller is responsible for the results. The caller will need to emit metadata to allow the garbage collector to see stack result references. For this reason, a stack result actually starts life just before a call, because it can become initialized at any point and thus needs to be traced during the entire callee activation. Not all callers can easily add garbage collection roots for writable stack slots, so the need to place stack results in a fixed position complicates calling multi-value WebAssembly functions in some cases (e.g. from C++).

second thought: pointers to individual stack results

Surely there are more well-trodden solutions to the multiple-result problem. If we encoded a multi-value return in C, how would we do it? Consider a function in C that has three 64-bit integer results. The idiomatic way to encode it would be to have one of the results be the return value of the function, and the two others to be passed "by reference":

#include <stdint.h>

int64_t foo(int64_t* a, int64_t* b) {
  *a = 1;
  *b = 2;
  return 3;
}
void call_foo(void) {
  int64_t a, b, c;
  c = foo(&a, &b);
}

This program shows us a possibility for encoding WebAssembly's multiple return values: pass an additional argument for each stack result, pointing to the location to which to write the stack result. Like this:

[diagram: a call with one result pointer argument per stack result, each pointing at the slot where that result will be written]

The result pointers are normal arguments, subject to normal argument allocation. In the above example, given that there are already stack arguments, they will probably be passed on the stack, but in many cases the stack result pointers may be passed in registers.

The result locations themselves don't even need to be on the stack, though they certainly will be in intra-WebAssembly calls. However the ability to write to any memory is a useful form of flexibility when e.g. calling into WebAssembly from C++.

The advantage of this approach is that we eliminate post-call shuffles, at least in optimizing compilers. But, having to make an argument for each stack result, each of which might itself become a stack argument, seems a bit offensive. I thought we might be able to do a little better.

third thought: stack result area, passed as pointer

Given that stack results are going to be written to memory, it doesn't really matter where they will be written, from the perspective of the optimizing compiler at least. What if we allocated them all in a block and just passed one pointer to the block? Like this:

[diagram: a call with a single stack result area argument, pointing to a contiguous block that holds all of the stack results]

Here there's just one additional argument, no matter how many stack results. While we're at it, we can specify that the layout of the stack results should be the same as how they would be written to the baseline stack, to make the baseline compiler's job easier.

As I started implementation with the baseline compiler, I chose this third approach, essentially because I was already allocating space for the results in a block in this way by bumping the stack pointer.

When I got to the optimizing compiler, however, it was quite difficult to convince Ion to allocate an area on the stack of the right shape.

Looking back on it now, I am not sure that I made the right choice. The thing is, the IonMonkey compiler started life as an optimizing compiler for JavaScript. It can represent unboxed values, which is how it came to be used as a compiler for asm.js and later WebAssembly, and it does a good job on them. However it has never had to represent aggregate data structures like a C++ class, so it didn't have support for spilling arbitrary-sized data to the stack. It took a while staring at the register allocator to convince it to allocate arbitrary-sized stack regions, and then to allocate component scalar values out of those regions. If I had just asked the register allocator to give me one appropriate-sized stack slot for each scalar, and hacked out the ability to pass separate pointers to the stack slots to WebAssembly calls with stack results, then I would have had an easier time of it, and perhaps stack slot allocation could be more dense because multiple results wouldn't need to be allocated contiguously.

As it is, I did manage to hack it in, and I think in a way that doesn't regress. I added a layer over an argument type vector that adds a synthetic stack results pointer argument, if the function returns stack results; iterating over this type with ABIArgIter will allocate a stack result area pointer, either as a register argument or a stack argument. In the optimizing compiler, I added a kind of value allocation corresponding to a variable-sized stack area (using pointer tagging again!), and extended the register allocator to allocate LStackArea, and the component stack results. Interestingly, I had to add a kind of definition that starts life on the stack; previously all Ion results started life in registers and were only spilled if needed.

In the end, a function will capture the incoming stack result area argument, either as a normal SSA value (for Ion) or stored to a stack slot (baseline), and when returning will write stack results to that pointer as appropriate. Passing in a pointer as an argument did make it relatively easy to implement calls from WebAssembly to and from C++; getting the variable-shape result area known to the garbage collector for C++-to-WebAssembly calls was simple in the end, but took me a while to figure out.

Finally I was a bit exhausted from multi-value work and ready to walk away from the "JS API", the bit that allows multi-value WebAssembly functions to be called from JavaScript (they return an array) or for a JavaScript function to return multiple values to WebAssembly (via an iterable) -- but then when I got to thinking about this blog post I preferred to implement the feature rather than document its lack. Avoidance-of-document-driven development: it's a thing!
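Since it's now implemented, here is a sketch of what that JS API looks like. The module and function names are invented, but the array-out/iterable-in shape is what the proposal specifies: multi-value results arrive in JavaScript as an array, and a JavaScript function returns multiple values to WebAssembly via an iterable.

// Calling a multi-value export from JS; assume divmod has
// type (i32, i32) -> (i32, i32):
const [quotient, remainder] = instance.exports.divmod(7, 2);

// A JS import whose WebAssembly type has two results returns an iterable:
const imports = {
  math: { divmod: (a, b) => [(a / b) | 0, a % b] },
};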

towards deployment

As I said in the last article, the multi-value feature is about improved code generation and also making a more capable base for expressing further developments in the WebAssembly language.

As far as code generation goes, things are progressing but it is still early days. Thomas Lively has implemented support in LLVM for emitting return of C++ aggregates via multiple results, which is enabled via the -experimental-multivalue-abi cc1 flag. Thomas has also been implementing multi-value support in the binaryen WebAssembly toolchain component, used by the emscripten C++-to-WebAssembly toolchain. I think it will be a few months though before everything lands in a way that end users can take advantage of.

On the specification side, the multi-value feature has been at phase 4 since January, which basically means things are all done there.

Implementation-wise, V8 has had experimental support since 2017 or so, and the feature was staged last fall, although V8 doesn't yet support multi-value in their baseline compiler. WebKit also landed support last fall.

Unlike V8 and SpiderMonkey, JavaScriptCore (the JS and wasm engine in WebKit) actually implements a WebAssembly interpreter as their solution to the one-pass streaming compilation problem. Then on the compiler side, there are two tiers that both operate on basic block graphs (OMG and BBQ; I just puked a little in my mouth typing that). This strategy makes the compiler implementation quite straightforward. It's also an interesting design point because JavaScriptCore's garbage collector scans the stack conservatively; there's no need for the compiler to do bookkeeping on the GC's behalf, which I'm sure was a relief to the hacker. Anyway, multi-value in WebKit is done too.

The new thing of course is that finally, in Firefox, the feature is now fully implemented (woo) and enabled by default on Nightly builds (woo!). I did that! It took me a while! Perhaps too long? Anyway it's done. Thanks again to Bloomberg for supporting this work; large ups to y'all for helping the web move forward.

See you next time with a more general article rounding up compile-time benchmarks on a variety of WebAssembly implementations. Until then, happy hacking!

multi-value webassembly in firefox: from 1 to n
Andy Wingo, 2020-04-03, https://wingolog.org/2020/04/03/multi-value-webassembly-in-firefox-from-1-to-n

Greetings, hackers! Today I'd like to write about something I worked on recently: implementation of the multi-value future feature of WebAssembly in Firefox, as sponsored by Bloomberg.

In the "minimum viable product" version of WebAssembly published in 2018, there were a few artificial restrictions placed on the language. Functions could only return a single value; if a function would naturally return two values, it would have to return at least one of them by writing to memory. Loops couldn't take parameters; any loop state variables had to be stored to and loaded from indexed local variables at each iteration. Similarly, any block that would naturally return more than one result would also have to do so via locals.

This restriction is lifted with the multi-value proposal. Function types now map from result type to result type, where a result type is a sequence of value types. That is to say, just as functions can take multiple arguments, they can return multiple results. Similarly, with the multi-value proposal, block types are now the same as function types: loops and blocks can take arguments and return any number of results. This change improves the expressiveness of WebAssembly as a compilation target; a C++ program compiled to multi-value WebAssembly can be encoded in fewer bytes than before. Multi-value also establishes a base for other language extensions. For example, the exception handling proposal builds on multi-value to pass multiple values to catch blocks.

So, that's multi-value. You would think that relaxing a restriction would be easy, but you'd be wrong! This task took me 5 months and had a number of interesting gnarly bits. This article is part one of two about interesting aspects of implementing multi-value in Firefox, specifically focussing on blocks. We'll talk about multi-value function calls next week.

multi-value in blocks

In the last article, I presented the basic structure of Firefox's WebAssembly support: there is a baseline compiler optimized for low latency and an optimizing compiler optimized for throughput. (There is also Cranelift, a new experimental compiler that may replace the current implementation of the optimizing compiler; but that doesn't affect the basic structure.)

The optimizing compiler applies traditional compiler techniques: SSA graph construction, where values flow into and out of graphs using the usual defs-dominate-uses relationship. The only control-flow joins are loop entry and (possibly) block exit, so with multi-value, the addition of loop parameters means some new phi variables at loop entry, and the expansion of the block result count from [0,1] to [0,n] means that you may have more phi variables at block exits. But these compilers are built to handle these situations; you just build the SSA and let the optimizing compiler go to town.

The problem comes in the baseline compiler.

from 1 to n

Recall that the baseline compiler is optimized for compiler speed, not compiled speed. If there are only ever going to be 0 or 1 result from a block, for example, the baseline compiler's internal data structures will use something like a Maybe<ValType> to represent that block result.

If you then need to expand this to hold a vector of values, the naïve approach of using a Vector<ValType> would mean heap allocation and indirection, and thus would regress the baseline compiler.

In this case, and in many other similar cases, the solution is to use value tagging to represent 0 or 1 value type directly in a word, and the general case by linking out to an external vector. As block types are function types, they actually appear as function types in the WebAssembly type section, so they are already parsed; the BlockType in that case can just refer out to already-allocated memory.

In fact this value-tagging pattern applies all over the place. (The jit/ links above are for the optimizing compiler, but they relate to function calls; I'll write about that next week.) I have a bit of pause about value tagging, in that it's gnarly complexity and I didn't measure the speed of alternative implementations, but it was a useful migration strategy: value tagging minimizes performance risk to existing specialized use cases while adding support for new general cases. Gnarly it is, then.

control-flow joins

I didn't mention it in the last article, but there are two important invariants regarding stack discipline in the baseline compiler. Recall that there's a virtual stack, and that some elements of the virtual stack might be present on the machine stack. There are four kinds of virtual stack entry: register, constant, local, and spilled. Locals indicate local variable reads and are mostly like registers in practice; when registers spill to the stack, locals do too. (Why spill to the temporary stack instead of leaving the value in the local variable slot? Because locals are mutable. A local.get captures a local variable value at its point of execution. If future code changes the local variable value, you wouldn't want the captured value to change.)

Digressing, the stack invariants:

  1. Spilled values precede registers and locals on the virtual stack. If u and v are virtual stack entries and u is older than v, then if u is in a register or is a local, then v is not spilled.

  2. Older values precede newer values on the machine stack. Again for u and v, if they are both spilled, then u will be farther from the stack pointer than v.

There are five fundamental stack operations in the baseline compiler; let's examine them to see how the invariants are guaranteed. Recall that before multi-value, targets of non-local exits (e.g. of the br instruction) could only receive 0 or 1 value; if there is a value, it's passed in a well-known register (e.g. %rax or %xmm0). (On 32-bit machines, 64-bit values use a well-known pair of registers.)

push(v)
Results of WebAssembly operations never push spilled values, onto either the virtual stack or the machine stack. v is either a register, a constant, or a reference to a local. Thus we guarantee both (1) and (2).
pop() -> v
Doesn't affect older stack entries, so (1) is preserved. If the newest stack entry is spilled, you know that it is closest to the stack pointer, so you can pop it by first loading it to a register and then incrementing the stack pointer; this preserves (2). Therefore if it is later pushed on the stack again, it will not be as a spilled value, preserving (1).
spill()
When spilling the virtual stack to the machine stack, you first traverse stack entries from new to old to see how far you need to spill. Once you get to a virtual stack entry that's already on the stack, you know that everything older has already been spilled, because of (1), so you switch to iterating back towards the new end of the stack, pushing registers and locals onto the machine stack and updating their virtual stack entries to be spilled along the way. This iteration order preserves (2). Note that because known constants never need to be on the machine stack, they can be interspersed with any other value on the virtual stack.
return(height, v)
This is the stack operation corresponding to a block exit (local or nonlocal). We drop items from the virtual and machine stack until the stack height is height. In WebAssembly 1.0, if the target continuation takes a value, then the jump passes a value also; in that case, before popping the stack, v is placed in a well-known register appropriate to the value type. Note however that v is not pushed on the virtual stack at the return point. Popping the virtual stack preserves (1), because a stack and its prefix have the same invariants; popping the machine stack also preserves (2).
capture(t)
Whereas return operations happen at block exits, capture operations happen at the target of block exits (the continuation). If no value is passed to the continuation, a capture is a no-op. If a value is passed, it's in a register, so we just push that register onto the virtual stack. Both invariants are obviously preserved.

Note that a value passed to a continuation via return() has a brief instant in which it has no name -- it's not on the virtual stack -- but only a location -- it's in a well-known place. capture() then gives that floating value a name.

Relatedly, there is another invariant, that the allocation of old values on block entry is the same as their allocation on block exit, so that all predecessors of the block exit flow all values via the same places. This is preserved by spilling on block entry. It's a big hammer, but effective.
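To make the invariants a bit more concrete, here is a little sketch of spill() over a virtual stack, in JavaScript rather than the compiler's C++; the entry kinds and the masm object are invented stand-ins for illustration.

// Entry kinds: 'reg', 'const', 'local', 'spilled'.
function spill(vstack, masm) {
  // Walk newest to oldest until hitting an already-spilled entry; by
  // invariant (1), everything older than that is spilled too.
  let i = vstack.length;
  while (i > 0 && vstack[i - 1].kind !== 'spilled') i--;
  // Walk back toward the top, pushing registers and locals in age order,
  // which preserves invariant (2).
  for (; i < vstack.length; i++) {
    const kind = vstack[i].kind;
    if (kind === 'reg' || kind === 'local') {
      masm.push(vstack[i]);                 // emit a machine-stack push
      vstack[i] = { kind: 'spilled' };
    }
    // 'const' entries never touch the machine stack; they reload as
    // immediates, so they may stay interspersed among spilled values.
  }
}

const masm = { stack: [], push(e) { this.stack.push(e); } };
const vstack = [{ kind: 'spilled' }, { kind: 'const' }, { kind: 'reg' }];
spill(vstack, masm);  // pushes the register; the constant stays virtual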

So, given all this, how do we pass multiple values via return()? We don't have unlimited registers, so the %rax strategy isn't going to work.

The answer for the baseline compiler is informed by our lean into the stack machine principle. Multi-value returns are allocated in such a way that a capture() can push them onto the virtual stack. Because spilled values must precede registers, we therefore allocate older results on the stack, and put the last result in a register (or register pair for i64 on 32-bit platforms). Note that it's possible in theory to allocate multiple results to registers; we'll touch on this next week.

Therefore the implementation of return(height, v1..vn) is straightforward: we first pop register results, then spill the remaining virtual stack items, then shuffle stack results down towards height. This should result in a memmove of contiguous stack results towards the frame pointer. However because const values aren't present on the machine stack, depending on the stack height difference, it may mean a split between moving some values toward the frame pointer and some towards the stack pointer, then filling in by spilling constants. It's gnarly, but it is what it is. Note that the links to the return and capture implementations above are to the post-multi-value world, so you can see all the details there.

that's it!

In summary, the hard part of multi-value blocks was reworking internal compiler data structures to be able to represent multi-value block types, and then figuring out the low-level stack manipulations in the baseline compiler. The optimizing compiler on the other hand was pretty easy.

When it comes to calls though, that's another story. We'll get to that one next week. Thanks again to Bloomberg for supporting this work; I'm really delighted that Igalia and Bloomberg have been working together for a long time (coming on 10 years now!) to push the web platform forward. A special thanks also to Mozilla's Lars Hansen for his patience reviewing these patches. Until next week, then, stay at home & happy hacking!

firefox's low-latency webassembly compiler
Andy Wingo, 2020-03-25, https://wingolog.org/2020/03/25/firefoxs-low-latency-webassembly-compiler

Good day!

Today I'd like to write a bit about the WebAssembly baseline compiler in Firefox.

background: throughput and latency

WebAssembly, as you know, is a virtual machine that is present in web browsers like Firefox. An important initial goal for WebAssembly was to be a good target for compiling programs written in C or C++. You can visit a web page that includes a program written in C++ and compiled to WebAssembly, and that WebAssembly module will be downloaded onto your computer and run by the web browser.

A good virtual machine for C and C++ has to be fast. The throughput of a program compiled to WebAssembly (the amount of work it can get done per unit time) should be approximately the same as its throughput when compiled to "native" code (x86-64, ARMv7, etc.). WebAssembly meets this goal by defining an instruction set that consists of similar operations to those directly supported by CPUs; WebAssembly implementations use optimizing compilers to translate this portable instruction set into native code.

There is another dimension of fast, though: not just work per unit time, but also time until first work is produced. If you want to go play Doom 3 on the web, you care about frames per second but also time to first frame. Therefore, WebAssembly was designed not just for high throughput but also for low latency. This focus on low-latency compilation expresses itself in two ways: binary size and binary layout.

On the size front, WebAssembly is optimized to encode small files, reducing download time. One way in which this happens is to use a variable-length encoding anywhere an instruction needs to specify an integer. In the usual case where, for example, there are fewer than 128 local variables, this means that a local.get instruction can refer to a local variable using just one byte. Another strategy is that WebAssembly programs target a stack machine, reducing the need for the instruction stream to explicitly load operands or store results. Note that size optimization only goes so far: it's assumed that the bytes of the encoded module will be compressed by gzip or some other algorithm, so sub-byte entropy coding is out of scope.
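For instance, the unsigned flavor of that variable-length encoding (LEB128) can be sketched in a few lines; WebAssembly also uses a signed variant, not shown here.

// Unsigned LEB128: 7 bits of payload per byte, high bit set if more follow.
function encodeULEB128(n) {
  const bytes = [];
  do {
    let b = n & 0x7f;
    n >>>= 7;
    if (n !== 0) b |= 0x80;  // continuation bit
    bytes.push(b);
  } while (n !== 0);
  return bytes;
}
encodeULEB128(10);   // [0x0a]: local indices below 128 fit in one byte
encodeULEB128(300);  // [0xac, 0x02]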

On the layout side, the WebAssembly binary encoding is sorted by design: definitions come before uses. For example, there is a section of type definitions that occurs early in a WebAssembly module. Any use of a declared type can only come after the definition. In the case of functions, which can of course be mutually recursive, function type declarations come before the actual definitions. In theory this allows web browsers to take a one-pass, streaming approach to compilation, starting to compile as functions arrive and before download is complete.

implementation strategies

The goals of high throughput and low latency conflict with each other. To get best throughput, a compiler needs to spend time on code motion, register allocation, and instruction selection; to get low latency, that's exactly what a compiler should not do. Web browsers therefore take a two-pronged approach: they have a compiler optimized for throughput, and a compiler optimized for latency. As a WebAssembly file is being downloaded, it is first compiled by the quick-and-dirty low-latency compiler, with the goal of producing machine code as soon as possible. After that "baseline" compiler has run, the "optimizing" compiler works in the background to produce high-throughput code. The optimizing compiler can take more time because it runs on a separate thread. When the optimizing compiler is done, it replaces the baseline code. (The actual heuristics about whether to do baseline + optimizing ("tiering") or just to go straight to the optimizing compiler are a bit hairy, but this is a summary.)

This article is about the WebAssembly baseline compiler in Firefox. It's a surprising bit of code and I learned a few things from it.

design questions

Knowing what you know about the goals and design of WebAssembly, how would you implement a low-latency compiler?

It's a question worth thinking about so I will give you a bit of space in which to do so.

.

.

.

After spending a lot of time in Firefox's WebAssembly baseline compiler, I have extracted the following principles:

  1. The function is the unit of compilation

  2. One pass, and one pass only

  3. Lean into the stack machine

  4. No noodling!

In the remainder of this article we'll look into these individual points. Note, although I have done a good bit of hacking on this compiler, its design and original implementation comes mainly from Mozilla hacker Lars Hansen, who also currently maintains it. All errors of exegesis are mine, of course!

the function is the unit of compilation

As we mentioned, in the binary encoding of a WebAssembly module, all definitions needed by any function come before all function definitions. This naturally leads to a partition between two phases of bytestream parsing: an initial serial phase that collects the set of global type definitions, annotations as to which functions are imported and exported, and so on, and a subsequent phase that compiles individual functions in an essentially independent manner.

The advantage of this approach is that compiling functions is a natural task unit of parallelism. If the user has a machine with 8 virtual cores, the web browser can keep one or two cores for the browser itself and farm out WebAssembly compilation tasks to the rest. The result is that the compiled code is available sooner.

Taking functions to be the unit of compilation also allows for an easy "tier-up" mechanism: after the baseline compiler is done, the optimizing compiler can take more time to produce better code, and when it is done, it can swap out the results on a per-function level. All function calls from the baseline compiler go through a jump table indirection, to allow for tier-up. In SpiderMonkey there is no mechanism currently to tier down; if you need to debug WebAssembly code, you need to refresh the page, causing the wasm code to be compiled in debugging mode. For the record, SpiderMonkey can only tier up at function calls (it doesn't do OSR).

This simple approach does have some down-sides, in that it leaves interprocedural optimizations on the table (inlining, contification, custom calling conventions, speculative optimizations). This is mitigated in two ways, the most obvious being that LLVM or whatever produced the WebAssembly has ideally already done whatever inlining might be fruitful. The second is that WebAssembly is designed for predictable performance. In JavaScript, an implementation needs to do run-time type feedback and speculative optimizations to get good performance, but the result is that it can be hard to understand why a program is fast or slow. The designers and implementers of WebAssembly in browsers all had first-hand experience with JavaScript virtual machines, and actively wanted to avoid unpredictable performance in WebAssembly. Therefore there is currently a kind of détente among the various browser vendors, that everyone has agreed that they won't do speculative inlining -- yet, anyway. Who knows what will happen in the future, though.

Digressing, the summary here is that the baseline compiler receives an individual function body as input, and generates code just for that function.

one pass, and one pass only

The WebAssembly baseline compiler makes one pass through the bytecode of a function. Nowhere in all of this are we going to build an abstract syntax tree or a graph of basic blocks. Let's follow through how that works.

Firstly, emitFunction simply emits a prologue, then the body, then an epilogue. emitBody is basically a big loop that consumes opcodes from the instruction stream, dispatching to opcode-specific code emitters (e.g. emitAddI32).

The opcode-specific code emitters are also responsible for validating their arguments; for example, emitAddI32 is wrapped in an assertion that there are two i32 values on the stack. This validation logic is shared by a templatized codestream iterator so that it can be re-used by the optimizing compiler, as well as by the publicly-exposed WebAssembly.validate function.

A corollary of this approach is that machine code is emitted in bytestream order; if the WebAssembly instruction stream has an i32.add followed by an i32.sub, then the machine code will have an addl followed by a subl.

WebAssembly has a syntactically limited form of non-local control flow; it's not goto. Instead, instructions are contained in a tree of nested control blocks, and control can only exit nonlocally to a containing control block. There are three kinds of control blocks: jumping to a block or an if will continue at the end of the block, whereas jumping to a loop will continue at its beginning. In either case, as the compiler keeps a stack of nested control blocks, it has the set of valid jump targets and can use the usual assembler logic to patch forward jump addresses when the compiler gets to the block exit.

lean into the stack machine

This is the interesting bit! So, WebAssembly instructions target a stack machine. That is to say, there's an abstract stack onto which evaluating i32.const 32 pushes a value, and if followed by i32.const 10 there would then be i32(32) | i32(10) on the stack (where new elements are added on the right). A subsequent i32.add would pop the two values off, and push on the result, leaving the stack as i32(42). There is also a fixed set of local variables, declared at the beginning of the function.

The easiest thing that a compiler can do, then, when faced with a stack machine, is to emit code for a stack machine: as values are pushed on the abstract stack, emit code that pushes them on the machine stack.

The downside of this approach is that you emit a fair amount of code to read and write values from the stack. Machine instructions generally take arguments from registers and write results to registers; going to memory is a bit superfluous. We're willing to accept suboptimal code generation for this quick-and-dirty compiler, but isn't there something smarter we can do for ephemeral intermediate values?

Turns out -- yes! The baseline compiler keeps an abstract value stack as it compiles. For example, compiling i32.const 32 pushes nothing on the machine stack: it just adds a ConstI32 node to the value stack. When an instruction needs an operand that turns out to be a ConstI32, it can either encode the operand as an immediate argument or load it into a register.

Say we are evaluating the i32.add discussed above. After the add, where does the result go? For the baseline compiler, the answer is always "in a register" via pushing a new RegisterI32 entry on the value stack. The baseline compiler includes a stupid register allocator that spills the value stack to the machine stack if no register is available, updating value stack entries from e.g. RegisterI32 to MemI32. Note, a ConstI32 never needs to be spilled: its value can always be reloaded as an immediate.

The end result is that the baseline compiler avoids lots of stack store and load code generation, which speeds up the compiler, and happens to make faster code as well.
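As a cartoon of the technique (all names invented; the real implementation lives in the baseline compiler's C++), the abstract value stack might look like this:

// Constants never touch the machine stack, and results live in registers.
const masm = { code: [], emit(insn) { this.code.push(insn); } };
let nextReg = 0;
const valueStack = [];

function toRegister(entry) {
  if (entry.kind === 'reg') return entry.reg;
  const r = 'r' + nextReg++;                   // a very stupid register allocator
  masm.emit('mov ' + entry.value + ', ' + r);  // materialize on demand
  return r;
}
function emitConstI32(v) {
  valueStack.push({ kind: 'const', value: v });  // no machine code emitted!
}
function emitAddI32() {
  const rhs = valueStack.pop(), lhs = valueStack.pop();
  const r = toRegister(lhs);
  if (rhs.kind === 'const') masm.emit('add $' + rhs.value + ', ' + r);  // immediate form
  else masm.emit('add ' + toRegister(rhs) + ', ' + r);
  valueStack.push({ kind: 'reg', reg: r });
}

emitConstI32(32); emitConstI32(10); emitAddI32();
// masm.code is ["mov 32, r0", "add $10, r0"]: no stack traffic at all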

Note that there is one limitation, currently: control-flow joins can have multiple predecessors and can pass a value (in the current WebAssembly specification), so the allocation of that value needs to be agreed-upon by all predecessors. As in this code:

(func $f (param $arg i32) (result i32)
  (block $b (result i32)
    (i32.const 0)
    (local.get $arg)
    (i32.eqz)
    (br_if $b) ;; return 0 from $b if $arg is zero
    (drop)
    (i32.const 1))) ;; otherwise return 1
;; result of block implicitly returned

When the br_if branches to the block end, where should it put the result value? The baseline compiler effectively punts on this question and just puts it in a well-known register (e.g., %rax on x86-64). Results for block exits are the only place where WebAssembly has "phi" variables, and the baseline compiler allocates all integer phi variables to the same register. A hack, but there we are.

no noodling!

When I started to hack on the baseline compiler, I did a lot of code reading, and eventually came on code like this:

void BaseCompiler::emitAddI32() {
  int32_t c;
  if (popConstI32(&c)) {
    // The value on top of the stack is a constant: fold it into an
    // add-immediate instruction.
    RegI32 r = popI32();
    masm.add32(Imm32(c), r);
    pushI32(r);
  } else {
    // General case: pop both operands into registers.
    RegI32 r, rs;
    pop2xI32(&r, &rs);
    masm.add32(rs, r);
    freeI32(rs);
    pushI32(r);
  }
}

I said to myself, this is silly, why are we only emitting the add-immediate code if the constant is on top of the stack? What if instead the constant was the deeper of the two operands, why do we then load the constant into a register? I asked on the chat channel if it would be OK if I improved codegen here and got a response I was not expecting: no noodling!

The reason is, performance of baseline-compiled code essentially doesn't matter. Obviously let's not pessimize things but the reason there's a baseline compiler is to emit code quickly. If we start to add more code to the baseline compiler, the compiler itself will slow down.

For that reason, changes are only accepted to the baseline compiler if they are necessary for some reason, or if they improve latency as measured using some real-world benchmark (time-to-first-frame on Doom 3, for example).

This to me was a real eye-opener: a compiler optimized not for the quality of the code that it generates, but rather for how fast it can produce the code. I had seen this in action before but this example really brought it home to me.

The focus on compiler throughput rather than compiled-code throughput makes it pretty gnarly to hack on the baseline compiler -- care has to be taken when adding new features not to significantly regress the old. It is much more like hacking on a production JavaScript parser than your traditional SSA-based compiler.

that's a wrap!

So that's the WebAssembly baseline compiler in SpiderMonkey / Firefox. Until the next time, happy hacking!

bigint shipping in firefox!
Andy Wingo, 2019-05-23, https://wingolog.org/2019/05/23/bigint-shipping-in-firefox

I am delighted to share with folks the results of a project I have been helping out on for the last few months: implementation of "BigInt" in Firefox, which is finally shipping in Firefox 68 (beta).

what's a bigint?

BigInts are a new kind of JavaScript primitive value, like numbers or strings. A BigInt is a true integer: it can take on the value of any finite integer (subject to some arbitrarily large implementation-defined limits, such as the amount of memory in your machine). This contrasts with JavaScript number values, which have the well-known property of only being able to precisely represent integers between -2^53 and 2^53.
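A quick illustration of that limit:

// Above 2^53, adjacent integers collapse to the same double:
2 ** 53 === 2 ** 53 + 1;        // true (!)
2n ** 53n === 2n ** 53n + 1n;   // false: BigInts are exact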

BigInts are written like "normal" integers, but with an n suffix:

var a = 1n;
var b = a + 42n;
b << 64n
// result: 793209995169510719488n

With the bigint proposal, the usual mathematical operations (+, -, *, /, %, <<, >>, **, and the comparison operators) are extended to operate on bigint values. As a new kind of primitive value, bigint values have their own typeof:

typeof 1n
// result: 'bigint'

Besides allowing for more kinds of math to be easily and efficiently expressed, BigInt also allows for better interoperability with systems that use 64-bit numbers, such as "inodes" in file systems, WebAssembly i64 values, high-precision timers, and so on.

You can read more about the BigInt feature over on MDN, as usual. You might also like this short article on BigInt basics that V8 engineer Mathias Bynens wrote when Chrome shipped support for BigInt last year. There is an accompanying language implementation article as well, for those of y'all that enjoy the nitties and the gritties.

can i ship it?

To try out BigInt in Firefox, simply download a copy of Firefox Beta. This version of Firefox will be fully released to the public in a few weeks, on July 9th. If you're reading this in the future, I'm talking about Firefox 68.

BigInt is also shipping already in V8 and Chrome, and my colleague Caio Lima has a project in progress to implement it in JavaScriptCore / WebKit / Safari. Depending on your target audience, BigInt might be deployable already!

thanks

I must mention that my role in the BigInt work was relatively small; my Igalia colleague Robin Templeton did the bulk of the BigInt implementation work in Firefox, so large ups to them. Hearty thanks also to Mozilla's Jan de Mooij and Jeff Walden for their patient and detailed code reviews.

Thanks as well to the V8 engineers for their open source implementation of BigInt fundamental algorithms, as we used many of them in Firefox.

Finally, I need to make one big thank-you, and I hope that you will join me in expressing it. The road to ship anything in a web browser is long; besides the "simple matter of programming" that it is to implement a feature, you need a specification with buy-in from implementors and web standards people, you need a good working relationship with a browser vendor, you need willing technical reviewers, you need to follow up on the inevitable security bugs that any browser change causes, and all of this takes time. It's all predicated on having the backing of an organization that's foresighted enough to invest in this kind of long-term, high-reward platform engineering.

In that regard I think all people that work on the web platform should send a big shout-out to Tech at Bloomberg for making BigInt possible by underwriting all of Igalia's work in this area. Thank you, Bloomberg, and happy hacking!

arrow functions coming to chrome 45!
Andy Wingo, 2015-06-18, https://wingolog.org/2015/06/18/arrow-functions-coming-to-chrome-45

It's been a long time coming, but I just flipped the bit in V8 that will ship arrow functions in Chrome 45! Woo hoo!

You probably know, but arrow functions are a new way to write functions in JavaScript. They look like this:

// Two arguments, body implicitly returned.
(x, y) => x + y

// With just one argument, no parentheses needed.
x => x * 2

// Body can have braces too; in that case use "return".
x => { return x * 2 }

Relative to the other kind of function that is written like function (x) { return x * 2 }, arrow functions don't define this or arguments in their bodies, instead capturing these values from the environment. There are a couple of other minor differences, too, but instead of writing about them here I'll just point to the great article by Jason Orendorff of the SpiderMonkey team.
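The canonical case where this capture matters is installing a callback from inside a constructor or method; a sketch:

function Timer() {
  this.ticks = 0;
  // The arrow function captures the Timer's "this" from its environment:
  setInterval(() => { this.ticks++; }, 1000);
  // With function () { this.ticks++; } instead, "this" would be the
  // global object (sloppy mode) or undefined (strict mode), not the Timer.
}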

Arrow functions are part of the JavaScript language standard that was called "ECMAScript 6" or ES6, and I guess you could still call it that. It seems like a silly thing for the committee to do to throw away all their branding like that but they decided to rename it ECMAScript 2015, which I'm sure is a link that the pedants are glad I have included. The upshot is that the standard is now final, gold master, etched in stone, which from an implementor's perspective is a relief. You can practically feel the anxiety ebbing away by the happy rate at which commits bubble out of source repositories and into shipping browsers, free from the fear that some spec change will force the hack-stream to change course.

From the V8 side, our arrow function implementation has also been a long time coming. My colleague Adrián Pérez did the first half of the work, and I picked up on the back end of things. It seems like such a small feature and in many ways it is, but still it took a long time. Now I know that my readers are a bunch of nerds and many of you like implementing languages, so you might appreciate these nargish points.

One of the first bits is that arrow functions are hard to parse. Consider this valid JavaScript expression:

(x,y)

It's a "comma expression" that will evaluate x then y and its result will be the result of evaluating y. But add an arrow on after the end and you get not an expression but a formal parameter list:

(x,y)=>x+y

Now you might think, well OK, when you see an arrow, rewind the input stream and parse in "arrow function mode". Indeed that would be fine, but not in combination with some additional ES6 features, optional and destructuring arguments. Optional arguments look like this:

(x=42)=>x

The =42 part is the expression that will be evaluated to give x a value, if the function is called with no arguments. Note that this bit is still under implementation in V8 so you can't try it in your browser. An optional argument initializer is an expression and not a value, so you can also have:

(x=(x)=>42)=>x

Combined, this makes rewinding the token stream a proposition of exponential complexity, which is a no-go for a production JavaScript parser. Parsers are on the hot path for page-load times and no browser vendor wants to introduce a pathological case into their page load.

Instead, V8 does something I hadn't seen before. It keeps an open mind about whether something is a comma expression or a formal parameter list of an arrow function, and only makes a decision when it sees the => (or not). As it parses, V8 records places that it would signal an error for either a parameter list or for an expression, and then when that superimposed wave function collapses it checks that the production is valid, signalling the appropriate error if not. I thought this was a really neat trick, so if you're into that kind of thing, see the expression classifier for the details.

The other thing that's tricky about arrow functions is the this binding. In JavaScript, this is basically a hidden parameter passed to a function when it is called. Calling a function like o.f() passes the value of o to f as its this parameter. If instead f() is called directly, that is, with no dot before the call, then undefined is passed as this. Also, for sloppy-mode functions, if the passed this value is null or undefined, the global object is assigned to this instead (other primitive values get boxed to objects). Finally, outside a function, this is bound to the global object.
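
In code:

var o = {
  f: function () { return this; }
};

o.f();        // "this" is o
var f = o.f;
f();          // strict mode: undefined; sloppy mode: the global object
this;         // at top level: the global object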

OK, I know all of you know these things. Thing is, you always have a this, and although it's like a variable it's not a valid variable name, and before ES6 nothing could capture its value, because each function has its own this value. Perhaps you see where I'm going with this (ahem) now. Arrow functions introduce a function scope that doesn't have a this value, and that indeed might capture some other scope's this value, forcing it to be context-allocated. Other parts of ES6 can actually force assignment to this, like a super call, and that assignment can actually come from within an arrow function. Zounds! A simple concept, but there was a lot of incidental complexity in V8 around the implementation. Between Adrián and myself it took like three months to fix this usage in V8 to always just go through the (possibly context-allocated) variable, and there are still probably some devtools bugs to find in the upcoming weeks.

Performance-wise, arrow functions are just like functions. They should be just as fast as if you wrote them with function. So use them with joy, use them with abandon, use them judiciously -- however you decide to use them, don't let perf influence your decision one way or the other.

That's about it! Like all of my JS engine work over the past couple years, this hacking was sponsored by fabulous folks over at Bloomberg, so big ups to them. From me and Adrián at Igalia, until next time! We leave you to puzzle out what this bit of JavaScript evaluates to:

(({},{},({},{})=>({},{}))=>(({},{})=>({},{}),{},{}))({},{})

Happy hacking!

Andy Wingohttps://wingolog.org/generators in firefox now twenty-two times fasterhttps://wingolog.org/2014/11/14/generators-in-firefox-now-twenty-two-times-faster2014-11-14T08:41:08Z2014-11-14T08:41:08Z

It's with great pleasure that I can announce that, thanks to Mozilla's Jan de Mooij, the new ES6 generator functions are twenty-two times faster in Firefox!

Some back-story, for the unawares. There's a new version of JavaScript coming, ECMAScript 6 (ES6). Among the new features that ES6 brings are generator functions: functions that can suspend. Firefox's JavaScript engine, SpiderMonkey, has had support for generators for many years, long before other engines. This support was upgraded to the new ES6 standard last year, thanks to sponsorship from Bloomberg, and was shipped out to users in Firefox 26.

The generators implementation in Firefox 26 was quite basic. As you probably know, modern JavaScript implementations have a number of tiered engines. In the case of SpiderMonkey there are three tiers: the interpreter, the baseline compiler, and the optimizing compiler. Code begins execution in the interpreter, which is the quickest engine to start. If a piece of code is hot -- meaning that lots of time is being spent there -- then it will "tier up" to the next level, where it is analyzed, possibly optimized, and then compiled to machine code.

Unfortunately, generators in SpiderMonkey have always been stuck at the lowest tier, the interpreter. This is because of SpiderMonkey's choice of implementation strategy for generators. Generators were implemented as "floating interpreter stack frames": heap-allocated objects whose shape was exactly the same as a stack frame in the interpreter. This had the advantage of being fairly cheap to implement in the beginning, but ultimately it made them unable to take advantage of JIT compilation, as JIT code runs on its own stack which has a different layout. The previous strategy also relied on trampolining through a helper written in C++ to resume generators, which killed optimization opportunities.

The solution was to represent suspended generator objects as snapshots of the state of a stack frame, instead of as stack frames themselves. In order for this to be efficient, last year we did a number of block scope optimizations to try and reduce the amount of state that a generator frame would have to restore. Finally, around March of this year we were at the point where we could refactor the interpreter to implement generators on the normal interpreter stack, with normal interpreter bytecodes, with the vision of being able to JIT-compile those bytecodes.
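
If it helps to visualize, here is a purely illustrative model, not SpiderMonkey's actual representation, of the difference between the two strategies:

// Illustrative only, not SpiderMonkey's actual data structures.
// Old strategy: a heap object shaped like a whole interpreter frame,
// which JIT code (with its different frame layout) cannot resume.
// New strategy: a compact snapshot of just the live state, which the
// normal frame machinery -- interpreter or JIT -- can restore from:
var suspendedGenerator = {
  callee: g,             // the generator function to resume
  resumeIndex: 0,        // which yield point we are suspended at
  savedLocals: [5, 10]   // values of the live locals, e.g. i and n
};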

I ran out of time before I could land that patchset; although the patches were where we wanted to go, they actually caused generators to be even slower and so they languished in Bugzilla for a few more months. Sad monkey. It was with delight, then, that a month or so ago I saw that SpiderMonkey JIT maintainer Jan de Mooij was interested in picking up the patches. Since then he has been hacking off and on at getting my old patches into shape, and ended up applying them all.

He went further, optimizing stack frames to not reserve space for "aliased" locals (locals allocated on the scope chain), speeding up object literal creation in the baseline compiler, and finally implementing baseline JIT compilation for generators.

So, after all of that perf nargery, what's the upshot? Twenty-two times faster! In this microbenchmark:

function *g(n) {
    for (var i=0; i<n; i++)
        yield i;
}
function f() {
    var t = new Date();
    var it = g(1000000);
    for (var i=0; i<1000000; i++)
        it.next();
    print(new Date() - t);
}
f();

Before, it took SpiderMonkey 980 milliseconds to complete on Jan's machine. After? Only 43! It's actually marginally faster than V8 at this point, which has (temporarily, I think) regressed to 45 milliseconds on this test. Anyway. Competition is great and as a committer to both projects I find it very satisfactory to have good implementations on both sides.

As in V8, in SpiderMonkey generators cannot yet reach the highest tier of optimization. I'm somewhat skeptical that it's necessary, too, as you expect generators to suspend fairly frequently. That said, a yield point in a generator is, from the perspective of the optimizing compiler, not much different from a call site, in that it causes all locals to be saved. The difference is that locals may have unboxed representations, so we would have to box those values when saving the generator state, and unbox on restore.

Thanks to Bloomberg for funding the initial work, and big, big thanks to Mozilla's Jan de Mooij for picking up where we left off. Happy hacking with generators!

Andy Wingohttps://wingolog.org/es6 generator and array comprehensions in spidermonkeyhttps://wingolog.org/2014/03/07/es6-generator-and-array-comprehensions-in-spidermonkey2014-03-07T21:11:55Z2014-03-07T21:11:55Z

Good news, everyone: ES6 generator and array comprehensions just landed in SpiderMonkey!

Let's take a quick look at what comprehensions are, then talk about what just landed in SpiderMonkey and when you'll see it in a Firefox release. Thanks to Bloomberg for sponsoring this work.

comprendes, mendes

Comprehensions are syntactic sugar for iteration. Unlike for-of, which processes its body for side effects, an array comprehension processes its body for its values, collecting them into a new array. Like this:

// Before (by hand)
var foo = (function(){
             var result = [];
             for (var x of y)
               result.push(x*x);
             return result;
           })();

// Before (assuming y has a map() method)
var foo = y.map(function(x) { return x*x });

// After
var foo = [for (x of y) x*x];

As you can see, array comprehensions are quite handy. They're expressions, not statements, and so their result can be passed directly to whatever code needs it. This can make your program more clear, because you aren't forced to give names to intermediate values, like result. At the same time, they work on any iterable, so you can use them on more kinds of data than just arrays. Because array comprehensions don't make a new closure, you can access arguments and this and even yield from within the comprehension tail.

Generator comprehensions are also nifty, but for a different reason. Let's look at an example first:

// Before
var bar = (function*(){ for (var x of y) yield x })();

// After
var bar = (for (x of y) x);

As you can see the syntactic win here isn't that much, compared to just writing out the function* and invoking it. The real advantage of generator comprehensions is their similarity to array comprehensions, and that often you can replace an array comprehension by a generator comprehension. That way you never need to build the complete list of values in memory -- you get laziness for free, just by swapping out those square brackets for the comforting warmth of parentheses.

Both kinds of comprehension can contain multiple levels of iteration, with embedded conditionals as you like. You can do [for (x of y) for (z of x) if (z % 2) z + 1] and all kinds of related niftiness. Comprehensions are almost always more concise than map and filter, with the added advantage that they are usually more efficient.
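
For comparison, here is that nested example written without comprehensions, assuming y is an array of arrays:

// [for (x of y) for (z of x) if (z % 2) z + 1], by hand:
var result = [];
for (var x of y)
  for (var z of x)
    if (z % 2)
      result.push(z + 1);

// ...or with methods, at the cost of intermediate arrays and closures:
var result2 = [].concat.apply([], y)
                .filter(function (z) { return z % 2; })
                .map(function (z) { return z + 1; });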

what happened

SpiderMonkey has had comprehensions for a while now, but only as a non-standard language extension you have to opt into. Now that comprehensions are in the draft ES6 specification, we can expose them to the web as a whole, by default.

Of course, the comprehensions that ES6 specified aren't quite the same as the ones that were in SM. The obvious difference is that SM's legacy comprehensions were written the other way around: [x for (x of y)] instead of the new [for (x of y) x]. There were also a number of other minor differences, which I'll list here for posterity:

  • ES6 comprehensions create one scope per "for" node -- not one for the comprehension as a whole.

  • ES6 comprehensions can have multiple "if" components, which may be followed by other "for" or "if" components.

  • ES6 comprehensions should make a fresh binding on each iteration of a "for", although Firefox currently doesn't do this (bug 449811); see the example after this list. Incidentally, for-of in Firefox has this same problem.

  • ES6 comprehensions only do for-of iteration, not for-in iteration.

  • ES6 generator comprehensions always need parentheses around them. (The parentheses were optional in some cases for SM's old generator comprehensions.)

  • ES6 generator comprehensions are ES6 generators (returning {value, done} objects), not legacy generators (StopIteration).
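
Here is the example promised above for the fresh-binding difference; under the ES6 semantics each closure captures its own "x", while the Firefox of the day (bug 449811) shares a single binding:

var fns = [for (x of [0, 1, 2]) function () { return x; }];
fns.map(function (f) { return f(); });
// Per ES6: [0, 1, 2] -- a fresh "x" per iteration.
// With the bug: [2, 2, 2] -- one "x" shared by all three closures.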

I should note in particular that the harmony wiki is out of date, as the feature has moved into the spec proper: array comprehensions, generator comprehensions.

For another fine article on ES6 comprehensions, check out Ariya Hidayat's piece from about a year ago.

So, ES6 comprehensions just landed in SpiderMonkey today, which means it should be part of Firefox 30, which should reach "beta" in April and become a stable release in June. You can try it out tomorrow if you use a nightly build, provided it doesn't cause some crazy breakage tonight. As of this writing, Firefox will be the first browser to ship ES6 array and generator comprehensions.

colophon

I had a Monday of despair: hacking at random on something that didn't solve my problem. But I had a Tuesday morning of pleasure, when I realized that my Monday's flounderings could be cooked into a delicious mid-week bisque; the hack was obvious and small and would benefit the web as a whole. (Wednesday was for polish and Thursday on another bug, and Friday on a wild parser-to-OSR-to-assembly-and-back nailbiter; but in the end all is swell.)

Thanks again to Bloomberg for this opportunity to build out the web platform, and to Mozilla for their quality browser wares (and even better community of hackers).

This has been an Igalia joint. Until next time!

Andy Wingohttps://wingolog.org/optimizing let in spidermonkeyhttps://wingolog.org/2013/12/18/optimizing-let-in-spidermonkey2013-12-18T20:00:23Z2013-12-18T20:00:23Z

Peoples! Firefox now optimizes let-bound variables!

What does this mean, you ask? Well, as you nerdy wingolog readers probably know, the new ECMAScript 6 standard is coming soon. ES6 has new facilities that make JavaScript more expressive. At the same time, though, many ES6 features stress parts of JavaScript engines that are currently ignored.

One of these areas is lexical scope. Until now, with two exceptions, all local variables in JavaScript have been "hoisted" to the function level. (The two exceptions are the exception in a catch clause, and the name of a named function expression.) Therefore JS engines have rightly focused on compiling methods at a time, falling back to slower execution strategies when lexical scopes are present.
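
Both behaviors in one example (print, as in a JS shell, stands in for your output function of choice):

function f() {
  // "var" hoists to the function level...
  if (false) { var x = 1; }
  print(x);         // undefined, not a ReferenceError

  // ...but a catch-bound exception is scoped to its block.
  try { throw 2; } catch (e) { print(e); }  // 2
  print(typeof e);  // "undefined": "e" is not visible here
}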

The presence of let in ES6 changes this. It was a hip saying a couple years ago that "let is the new var", but in reality no one uses let -- not only because engines didn't yet implement let without feature flags, if they implemented it at all, but because let was slow. Because let-bound variables were on the fallback path, they were unoptimized.

But that is changing. Firefox now optimizes many kinds of let-bound variables, making "let is the new var" much closer to being an actual thing. V8 has made some excellent steps in similar directions as well. I'll focus on the Firefox bit here as that's what I've been working on, then go on to mention the work that the excellent V8 hackers have done.

implementing scope in javascript

JavaScript the language is defined in terms of a "scope chain", with respect to which local variables are looked up. Each link in the chain binds some set of variables. Lookup of a name traverses the chain from tip to tail looking for a binding.

Usually it's possible to know at compile-time where a binding will be in the chain. For example, you can know that the "x" reference in the returned function here:

function () {
  var x = 3;
  return function () { return x; };
}

will be one link up in the scope chain, and will be stored in the first named slot in that link. This is a classic implementation of lexical scope, and it plays very well with the semantics of JavaScript as specified. All engines do something like this.

Another thing to note in this case is that we don't need to store the name for "x" anywhere, as we can see through all of its uses at compile-time. In practice however, as JavaScript functions are typically compiled lazily on first call, we do store the association "x -> slot 0" in the outer function's environment.

(Could you just propagate the constant, you ask? Of course you could. But since JS compilation typically occurs one function at a time, lazily, no engine does this initially. Of course later when the optimizing compiler kicks in, it does get propagated, but that usually doesn't avoid the scope chain node allocation.)

Let's take another case.

function foo() { var x = 3; return x; }

In this case we know that the "x" reference can be found in the top link of the chain, in the first slot. Semantically speaking, that is. No JS engine implements things this way. Instead we take advantage of noticing that "x" cannot be captured by any other function. Therefore we are free to assign "x" to a slot in a stack frame, and refer to it by index directly -- without allocating a scope chain node on the heap. All engines do this.

And of course later if this function is hot, the constant exists only in a register, probably inlined to its caller. For this reason scope hasn't had a lot of work put into it. The unit of optimization -- the function -- is the same as the unit of scope. If the implementation details of scope get costly, we take advantage of run-time optimization to speculatively remove allocations, loads, and stores.

OK. What if you have some variables captured, and some not captured?

function foo() { var x = 3; var y = 4; return function () { return x; } }

Here "x" is captured, but "y" is not. Usually engines will allocate "x" on the scope chain, and "y" as a local. (JavaScriptCore is an interesting exception; it uses a strategy I haven't seen elsewhere called "lazy tear-off". Basically all variables are on the stack for the dynamic extent of the scope, whether they are captured or not. If needed, potentially lazily, an object is pushed on the scope chain that aliases the stack slots. If a scope is pushed on the scope chain, when the scope ends the current values are "torn off" the stack and copied to the heap, and the slots pointer of the scope object updated to point to the heap.)

I digress. "x" on the scope chain, "y" not. (I regress: why allocate "y" at all, isn't it dead? Yes it is dead. Optimizing compilers kill it. Baseline compilers don't bother.) So typically access to "x" goes through the scope chain, and access to "y" does not.

Calculating the set of variables that can be captured is one of the few static analyses done by JS baseline compilers. The other ones are whether we are in strict mode or not, whether nested scopes contain "with" or "eval" (thus forcing all locals to be on the scope chain), and whether the "arguments" object is used or not. There may be more, but that is the lowest common denominator of efficiency.

implementing let

let is part of ES6, and all browsers will eventually implement it.

Let me give you an insight as to the mindset of a browser maker. What a browser maker does when they see a feature is first to try to ignore it -- all features have a cost and not all of them pay for themselves. Next you try to do the minimum correct thing -- the thing that passes all the test suites, but imposes a minimal burden on the rest of the system. Usually this means that the feature probably works, except for the corner cases which users will file bugs for, but it is slow. Finally when there is either an internal use the browser maker cares about (Google, Mozilla to an extent) or you want to heat up a benchmark war (everyone else), you start to optimize.

The state of let was until recently between "ignore" and "slow". "Slow" means different things to different browsers.

The engines that have started to implement let are V8 (Chrome) and SpiderMonkey (Firefox). In Chrome, until recently using let in a function was an easy way of preventing that function from ever being optimized by Crankshaft. Good times? When writing this article I was going to claim this was still the case, but I see that in fact this is happily not true. You can Crankshaft a function that has let bindings. I believe, with a cursory glance, that locals that are not captured are still allocated on a scope chain -- but perhaps I am misreading things.

Firefox, on the other hand, would not Ion-compile any function with a let. You would think that this would not be the case, given that Firefox has had let for many years. If you thought this, you misunderstand how browser development works. Basically let me lay it out for you:

class struggle : marxism :: benchmarks : browsers

Get it? Benchmarks are the motor of browser history. If this bloviation has any effect, please let it be to inspire you to go and make benchmarks!

Anyway, Firefox. No Ion for let. The reason why you would avoid optimizing these things is clear -- breaking the "optimization unit == scope unit" equivalence has a cost. However the time has come.

What I have done is to recognize let scopes that have no captured locals. This class of let scope will now be optimized by Ion. So for example in this case:

function sumto(n) {
  let sum = 0;
  for (let i=0; i<n; i++)
    sum += i;
  return sum;
}

Previously this would allocate a block scope around the "for". (Incidentally, Firefox borks block scoping in this case; each iteration should produce a fresh binding. But this is historical, will be fixed, and I digress.) The operations that created and pushed block scopes were not optimized. Avoiding pushing and popping scopes from the scope chain avoids this limitation, and thus we have awesome speed!

The effect of simply turning on Ion cannot be denied. In this case, the runtime of a sum up to 1e9 goes from 8.8 seconds (!) to 0.8 seconds. Obviously it's a little micro-benchmark but that's how I'm rolling today. The biggest optimization is to stop deoptimization. And here's where history cranks up: V8 takes 7.7 seconds on this loop. Gentlefolk, start your engines!

At the same time, in Firefox we are still able to reify a parallel chain of "debug scopes" as needed that corresponds to the scope chain as the specification would see it. This was the most challenging part of optimizing block scope in Firefox -- not the actual optimization, which was trivial, but optimization while retaining introspection and debuggability.

future work

Unhappily, scopes with let-bound variables that are captured by nested scopes ("aliased", in spidermonkey parlance) are not yet optimized. So not only do they cause scope chain allocation, but they also don't benefit from Ion. Boo. Bug 942810.

I should also mention that the let supported by Firefox is not the let specified in ES6. In ES6 there is a thing called a "temporal dead zone" whereby it is invalid to access the value of a let-bound variable before its initialization. This is like Scheme's "letrec restriction", and has to be enforced dynamically for similar reasons. V8 does a great job in actually implementing this, and Firefox should do it soon.
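
Concretely:

{
  print(x);    // ES6: ReferenceError -- "x" is in the temporal dead zone
  let x = 1;   // ...until this initialization is evaluated
}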

Of course, it's not clear to me how let can actually be deployed without breaking the tubes. I think it probably can somehow but I haven't checked the latest TC39 cogitations in that regard.

twisted paths

It's been a super-strange end-of-year. I was sure I would be optimizing SpiderMonkey generators and moving on to other things, but I just got caught up with stuff -- the for-of bits took approximately forever and then the amount of state carried in SpiderMonkey stack frames filled me with despair. So it was that this hack, while assisting me in that goal, was not actually a planned thing.

See, SpiderMonkey used to actually reserve a stack slot for the "block chain". No, they aren't using your browser to mine for Bitcoins, though that would be a neat hack. The "block chain" was a simulation of the spec-mandated scope chain. But of course this might not reflect the actual implemented, optimized behavior -- but again one might want to map them to each other for debugging purposes. It was a mess.

The resulting changeset to fix this ended up so large that it took weeks to land. Well, live and learn, right? I remember Michael Starzinger telling me the same thing about V8 -- you just have to keep your patches small, as small as possible, and always working. Words to the wise indeed.

happy days

But in the end we at least have some juice from this citric fruit. This has been an Igalia joint. Thanks very much to Mozilla's Luke Wagner for suffering through the reviews.

Thanks especially to Bloomberg for making this work possible; you folks are swell. Forward ES6!