wingolog

cps in hoot

2024-05-27T12:36:44Z

Good morning good morning! Today I have another article on the Hoot Scheme-to-Wasm compiler, this time on Hoot’s use of the continuation-passing-style (CPS) transformation.

calls calls calls

So, just a bit of context to start out: Hoot is a Guile, Guile is a Scheme, Scheme is a Lisp, one with “proper tail calls”: function calls are either in tail position, syntactically, in which case they are tail calls, or they are not in tail position, in which they are non-tail calls. A non-tail call suspends the calling function, putting the rest of it (the continuation) on some sort of stack, and will resume when the callee returns. Because non-tail calls push their continuation on a stack, we can call them push calls.

(define (f)
  ;; A push call to g, binding its first return value.
  (define x (g))
  ;; A tail call to h.
  (h x))

Usually the problem in implementing Scheme on other language run-times comes in tail calls, but WebAssembly supports them natively (except on JSC / Safari; should be coming at some point though). Hoot’s problem is the reverse: how to implement push calls?

The issue might seem trivial but it is not. Let me illustrate briefly by describing what Guile does natively (not compiled to WebAssembly). Firstly, note that I am discussing residual push calls, by which I mean to say that the optimizer might remove a push call in the source program via inlining: we are looking at those push calls that survive the optimizer. Secondly, note that native Guile manages its own stack instead of using the stack given to it by the OS; this allows for push-call recursion without arbitrary limits. It also lets Guile capture stack slices and rewind them, which is the fundamental building block we use to implement exception handling, Fibers and other forms of lightweight concurrency.

The straightforward function call will have an artificially limited total recursion depth in most WebAssembly implementations, meaning that many idiomatic uses of Guile will throw exceptions. Unpleasant, but perhaps we could stomach this tradeoff. The greater challenge is how to slice the stack. That I am aware of, there are three possible implementation strategies.

generic slicing

One possibility is that the platform provides a generic, powerful stack-capture primitive, which is what Guile does. The good news is that one day, the WebAssembly stack-switching proposal should provide this too. And in the meantime, the so-called JS Promise Integration (JSPI) proposal gets close: if you enter Wasm from JS via a function marked as async, and you call out to JavaScript to a function marked as async (i.e. returning a promise), then on that nested Wasm-to-JS call, the engine will suspend the continuation and resume it only when the returned promise settles (i.e. completes with a value or an exception). Each entry from JS to Wasm via an async function allocates a fresh stack, so I understand you can have multiple pending promises, and thus multiple wasm coroutines in progress. It gets a little gnarly if you want to control when you wait, for example if you might want to wait on multiple promises; in that case you might not actually mark promise-returning functions as async, and instead import an async-marked async function waitFor(p) { return await p} or so, allowing you to use Promise.race and friends. The main problem though is that JSPI is only for JavaScript. Also, its stack sizes are even smaller than the the default stack size.

instrumented slicing

So much for generic solutions. There is another option, to still use push calls from the target machine (WebAssembly), but to transform each function to allow it to suspend and resume. This is what I think of as Joe Marshall’s stack trick (also see §4.2 of the associated paper). The idea is that although there is no primitive to read the whole stack, each frame can access its own state. If you insert a try/catch around each push call, the catch handler can access local state for activations of that function. You can slice a stack by throwing a SaveContinuation exception, in which each frame’s catch handler saves its state and re-throws. And if we want to avoid exceptions, we can use checked returns as Asyncify does.

I never understood, though, how you resume a frame. The Generalized Stack Inspection paper would seem to indicate that you need the transformation to introduce a function to run “the rest of the frame” at each push call, which becomes the Invoke virtual method on the reified frame object. To avoid code duplication you would have to make normal execution flow run these Invoke snippets as well, and that might undo much of the advantages. I understand the implementation that Joe Marshall was working on was an interpreter, though, which bounds the number of sites needing such a transformation.

cps transformation

The third option is a continuation-passing-style transformation. A CPS transform results in a program whose procedures “return” by tail-calling their “continuations”, which themselves are procedures. Taking our previous example, a naïve CPS transformation would reify the following program:

(define (f' k)
  (g' (lambda (x) (h' k x))))

Here f' (“f-prime”) receives its continuation as an argument. We call g', for whose continuation argument we pass a closure. That closure is the return continuation of g, binding a name to its result, and then tail-calls h with respect to f. We know their continuations are the same because it is the same binding, k.

Unfortunately we can’t really slice abitrary ranges of a stack with the naïve CPS transformation: we can only capture the entire continuation, and can’t really inspect its structure. There is also no way to compose a captured continuation with the current continuation. And, in a naïve transformation, we would be constantly creating lots of heap allocation for these continuation closures; a push call effectively pushes a frame onto the heap as a closure, as we did above for g'.

There is also the question of when to perform the CPS transform; most optimizing compilers would like a large first-order graph to work on, which is out of step with the way CPS transformation breaks functions into many parts. Still, there is a nugget of wisdom here. What if we preserve the conventional compiler IR for most of the pipeline, and only perform the CPS transformation at the end? In that way we can have nice SSA-style optimizations. And, for return continuations of push calls, what if instead of allocating a closure, we save the continuation data on an explicit stack. As Andrew Kennedy notes, closures introduced by the CPS transform follow a stack discipline, so this seems promising; we would have:

(define (f'' k)
  (push! k)
  (push! h'')
  (g'' (lambda (x)
         (define h'' (pop!))
         (define k (pop!))
         (h'' k x))))

The explicit stack allows for generic slicing, which makes it a win for implementing delimited continuations.

hoot and cps

Hoot takes the CPS transformation approach with stack-allocated return closures. In fact, Hoot goes a little farther, too far probably:

(define (f''')
  (push! k)
  (push! h''')
  (push! (lambda (x)
           (define h'' (pop!))
           (define k (pop!))
           (h'' k x)))
  (g'''))

Here instead of passing the continuation as an argument, we pass it on the stack of saved values. Returning pops off from that stack; for example, (lambda () 42) would transform as (lambda () ((pop!) 42)). But some day I should go back and fix it to pass the continuation as an argument, to avoid excess stack traffic for leaf function calls.

There are some gnarly details though, which I know you are here for!

splits

For our function f, we had to break it into two pieces: the part before the push-call to g and the part after. If we had two successive push-calls, we would instead split into three parts. In general, each push-call introduces a split; let us use the term tails for the components produced by a split. (You could also call them continuations.) How many tails will a function have? Well, one for the entry, one for each push call, and one any time control-flow merges between two tails. This is a fixpoint problem, given that the input IR is a graph. (There is also some special logic for call-with-prompt but that is too much detail for even this post.)

where to save the variables

Guile is a dynamically-typed language, having a uniform SCM representation for every value. However in the compiler and run-time we can often unbox some values, generally as u64/s64/f64 values, but also raw pointers of some specific types, some GC-managed and some not. In native Guile, we can just splat all of these data members into 64-bit stack slots and rely on the compiler to emit stack maps to determine whether a given slot is a double or a tagged heap object reference or what. In WebAssembly though there is no sum type, and no place we can put either a u64 or a (ref eq) value. So we have not one stack but three (!) stacks: one for numeric values, implemented using a Wasm memory; one for (ref eq) values, using a table; and one for return continuations, because the func type hierarchy is disjoin from eq. It’s.... it’s gross? It’s gross.

what variables to save

Before a push-call, you save any local variables that will be live after the call. This is also a flow analysis problem. You can leave off constants, and instead reify them anew in the tail continuation.

I realized, though, that we have some pessimality related to stacked continuations. Consider:

(define (q x)
  (define y (f))
  (define z (f))
  (+ x y z))

Hoot’s CPS transform produces something like:

(define (q0 x)
  (save! x)
  (save! q1)
  (f))

(define (q1 y)
  (restore! x)
  (save! x)
  (save! y)
  (save! q2)
  (f))

(define (q2 z)
  (restore! x)
  (restore! y)
  ((pop!) (+ x y z)))

So q0 saved x, fine, indeed we need it later. But q1 didn’t need to restore x uselessly, only to save it again on q2‘s behalf. Really we should be applying a stack discipline for saved data within a function. Given that the source IR is a graph, this means another flow analysis problem, one that I haven’t thought about how to solve yet. I am not even sure if there is a solution in the literature, given that the SSA-like flow graphs plus tail calls / CPS is a somewhat niche combination.

calling conventions

The continuations introduced by CPS transformation have associated calling conventions: return continuations may have the generic varargs type, or the compiler may have concluded they have a fixed arity that doesn’t need checking. In any case, for a return, you call the return continuation with the returned values, and the return point then restores any live-in variables that were previously saved. But for a merge between tails, you can arrange to take the live-in variables directly as parameters; it is a direct call to a known continuation, rather than an indirect call to an unknown call site.

cps soup?

Guile’s intermediate representation is called CPS soup, and you might wonder what relationship that CPS has to this CPS. The answer is not much. The continuations in CPS soup are first-order; a term in one function cannot continue to a continuation in another function. (Inlining and contification can merge graphs from different functions, but the principle is the same.)

It might help to explain that it is the same relationship as it would be if Guile represented programs using SSA: the Hoot CPS transform runs at the back-end of Guile’s compilation pipeline, where closures representations have already been made explicit. The IR is still direct-style, just that syntactically speaking, every call in a transformed program is a tail call. We had to introduce save and restore primitives to implement the saved variable stack, and some other tweaks, but generally speaking, the Hoot CPS transform ensures the run-time all-tail-calls property rather than altering the compile-time language; a transformed program is still CPS soup.

fin

Did we actually make the right call in going for a CPS transformation?

I don’t have good performance numbers at the moment, but from what I can see, the overhead introduced by CPS transformation can impose some penalties, even 10x penalties in some cases. But some results are quite good, improving over native Guile, so I can’t be categorical.

But really the question is, is the performance acceptable for the functionality, and there I think the answer is more clear: we have a port of Fibers that I am sure Spritely colleagues will be writing more about soon, we have good integration with JavaScript promises while not relying on JSPI or Asyncify or anything else, and we haven’t had to compromise in significant ways regarding the source language. So, for now, I am satisfied, and looking forward to experimenting with the stack slicing proposal as it becomes available.

Until next time, happy hooting!

type folding in guile

2015-10-29T22:13:27Z

A-hey hey hey, my peeps! Today's missive is about another optimization pass in Guile that we call "type folding". There's probably a more proper name for this, but for the moment we go with "type folding" as it's shorter than "abstract constant propagation, constant folding, and branch folding based on flow-sensitive type and range analysis".

on types

A word of warning to the type-system enthusiasts among my readers: here I'm using "type" in the dynamic-languages sense, to mean "a property about a value". For example, whether a value is a vector or a pair is a property of that value. I know that y'all use that word for other purposes, but there are other uses that do not falute so highly, and it's in the more pedestrian sense that I'm interested here.

To back up a bit: what are the sources of type information in dynamic languages? In Guile, there are three ways the compiler can learn about a value's type.

One source of type information is the compiler's knowledge of the result types of expressions in the language, especially constants and calls to the language's primitives. For example, in the Scheme definition (define y (vector-length z)), we know that y is a non-negative integer, and we probably also know a maximum value for z too, given that vectors have a maximum size.

Conditional branches with type predicates also provide type information. For example, in consider this Scheme expression:

(lambda (x)
  (if (pair? x)
      (car x)
      (error "not a pair" x)))

Here we can say that at the point of the (car x) expression, x is definitely a pair. Conditional branches are interesting because they add a second dimension to type analysis. The question is no longer "what is the type of all variables", but "what is the type of all variables at all points in the program".

Finally, we have the effect of argument type checks in function calls. For example in the (define y (vector-length z)) definition, after (vector-length z) has been evaluated, we know that z is indeed a vector, because if it weren't, then the primitive call would raise an exception.

In summary, the information that we would like to have is what type each variable has at each program point (label). This information comes from where the variables are defined (the first source of type information), conditional branches and control-flow joins (the second source), and variable use sites that imply type checks (the third). It's a little gnarly but in essence it's a classic flow analysis. We treat the "type" of a variable as a set of possible types. A solution to the flow equations results in a set of types for each variable at each label. We use the intmap data structures to share space between the solution at different program points, resulting in an O(n log n) space complexity.

In Guile we also solve for the range of values a variable may take, at the same time as solving for type. I started doing this as part of the Compost hack a couple years ago, where I needed to be able to prove that the operand to sqrt was non-negative in order to avoid sqrt producing complex numbers. Associating range with type turns out to generalize nicely to other data types which may be thought of as having a "magnitude" -- for example a successful (vector-ref v 3) implies that v is at least 4 elements long. Guile can propagate this information down the flow graph, or propagate it in the other way: if we know the vector was constructed as being 10 elements long, then a successful (vector-ref v n) can only mean that n is between 0 and 9.

what for the typing of the things

Guile's compiler uses type analysis in a few ways at a few different stages. One basic use is in dead code elimination (DCE). An expression can be eliminated from a program if its value is never used and if it causes no side effects. Guile models side effects (and memory dependencies between expressions) with effects analysis. I quote:

We model four kinds of effects: type checks (T), allocations (A), reads (R), and writes (W). Each of these effects is allocated to a bit. An expression can have any or none of these effects.

In an expression like (vector-ref v n), type analysis may compute that in fact v is indeed a vector and n is an integer, and also that n is within the range of valid indexes of v. In that case we can remove the type check (T) bit from the expression's effects, opening up the expression for DCE.

Getting back to the topic of this article, Guile's "type folding" pass uses type inference in three ways.

The first use of type information is if we determine that, at a given use site, a variable has precisely one type and one value. In that case we can do constant folding over that expression, replacing its uses with its value. For example, let's say we have the expression (define len (vector-length v)). If we know that v is a vector of length length 5, we can replace any use of len with the constant, 5. As an implementation detail we actually keep the definition of len in place and let DCE remove it later. We can consider this to be abstract constant propagation: abstract in the sense that it folds over abstract values, represented just as type sets and ranges, and which materializes a concrete value only if it is able to do so. Since ranges propagate through operators as well, it can also be considered as abstract constant folding; the type inference operators act as constant folders.

Another use of type information is in branches. If Guile sees (if (< n (vector-length v)) 1 2) and n and v have the right types and disjoint ranges, then we can fold the test and choose 1 or 2 depending on how the test folds.

Finally type information can enable strength reduction. For example it's a common compiler trick to want to reduce (* n 16) to (ash n 4), but if n isn't an integer this isn't going to work. Likewise, (* n 0) can be 0, 0.0, 0.0+0.0i, something else, or an error, depending on the type of n and whether the * operator has been extended to apply over non-number types. Type folding uses type information to reduce the strength of operations like these, but only where it can prove that the transformation is valid.

So that's type folding! It's a pretty neat pass that does a few things as once. Code here, and code for the type inference itself here.

type-driven unboxing

Guile uses type information in one other way currently, and that is to determine when to unbox floating-point numbers. The current metric is that whenever an arithmetic operation will produce a floating-point number -- in Scheme parlance, an inexact real -- then that operation should be unboxed, if it has an unboxed counterpart. Unboxed operations on floating-point numbers are advantageous because they don't have to allocate space on the garbage-collected heap for their result. Since an unboxed operation like the fadd floating-point addition operator takes raw floating-point numbers as operands, it also will never cause a type check, unlike the polymorphic add instruction. Knowing that fadd has no effects lets the compiler do a better job at common subexpression elimination (CSE), dead code elimination, loop-invariant code motion, and so on.

To unbox an operation, its operands are unboxed, the operation itself is replaced with its unboxed counterpart, and the result is then boxed. This turns something like:

(+ a b)

into:

(f64->scm (fl+ (scm->f64 a) (scm->f64 b)))

You wouldn't think this would be an optimization, except that the CSE pass can eliminate many of these conversion pairs using its scalar elimination via fabricated expressions pass.

A proper flow-sensitive type analysis is what enables sound, effective unboxing. After arithmetic operations have been unboxed, Guile then goes through and tries to unbox loop variables and other variables with more than one definition ("phi' variables, for the elect). It mostly succeeds at this. The results are great: summing a packed vector of 10 million 32-bit floating-point values goes down from 500ms to 130ms, on my machine, with no time spent in the garbage collector. Once we start doing native compilation we should be up to about 5e8 or 10e8 floats per second in this microbenchmark, which is totally respectable and about what gcc's -O0 performance gets.

weaknesses

This kind of type inference works great in tight loops, and since that's how many programs spend most of their time, that's great. Of course, this situation is also a product of programmers knowing that tight loops are how computers go the fastest, or at least of how compilers do the best on their code.

Where this approach to type inference breaks down is at function boundaries. There are no higher-order types and no higher-order reasoning, and indeed no function types at all! This is partially mitigated by earlier partial evaluation and contification passes, which open up space in which the type inferrer can work. Method JIT compilers share this weakness with Guile; tracing JIT runtimes like LuaJIT and PyPy largely do not.

up summing

So that's the thing! I was finally motivated to dust off this draft given the recent work on unboxing in a development branch of Guile. Happy hacking and we promise we'll actually make releases so that you can use it soon soon :)

loop optimizations in guile

2015-07-28T08:10:36Z

Sup peeps. So, after the slog to update Guile's intermediate language, I wanted to land some new optimizations before moving on to the next thing. For years I've been meaning to do some loop optimizations, and I was finally able to land a few of them.

loop peeling

For a long time I have wanted to do "loop peeling". Loop peeling means peeling off the first iteration of a loop. If you have a source program that looks like this:

while foo:
  bar()
  baz()

Loop peeling turns it into this:

if foo:
  bar()
  baz()
  while foo:
    bar()
    baz()

You wouldn't think that this is actually an optimization, would you? Well on its own, it's not. But if you combine it with common subexpression elimination, then it means that the loop body is now dominated by all effects and all loop-invariant expressions that must be evaluated for the expression to loop.

In dynamic languages, this is most useful when one source expression expands to a number of low-level steps. So for example if your language runtime implements top-level variable references in three parts, one where it gets a reference to a mutable box, then it checks if the box has a value, and and the third where it unboxes it, then we would have:

if foo:
  bar_location = lookup("bar")
  bar_value = dereference(bar_location)
  if bar_value is null: throw NotFound("bar")
  call(bar_value)

  baz_location = lookup("baz")
  baz_value = dereference(baz_location)
  if baz_value is null: throw NotFound("baz")
  call(baz_value)

  while foo:
    bar_value = dereference(bar_location)
    call(bar_value)

    baz_value = dereference(baz_location)
    call(baz_value)

The result is that we have hoisted the lookups and null checks out of the loop (if a box can never transition from full back to empty). It's a really powerful transformation that can even hoist things that traditional loop-invariant code motion can't, but more on that later.

Now, the problem with loop peeling is that usually values will escape your loop. For example:

while foo:
  x = qux()
  if x then return x
...

In this little example, there is a value x, and the return x statement is actually not in the loop. It's syntactically in the loop, but the underlying representation that the compiler uses looks more like this:

function qux(k):
  label loop_header():
    fetch(foo) -gt; loop_test
  label loop_test(foo_value):
    if foo_value then -> exit else -> body
  label body():
    fetch(x) -gt; have_x
  label have_x(x_value):
    if x_value then -> return_x else -> loop_header
  label return_x():
    values(x) -> k
  label exit():
    ...

This is the "CPS soup" I described in my last post. Only the bold parts are in the loop; notably, the return is outside the loop. Point being, if we peel off the first iteration, then there are two possible values for x that we would return:

if foo:
  x1 = qux()
  if x1 then return x1
  while foo:
    x2 = qux()
    if x2 then return x2
  ...

I have them marked as x1 and x2. But I've also duplicated the return x terms, which is not what we want. We want to peel off the first iteration, which will cause code growth equal to the size of the loop body, but we don't want to have to duplicate everything that's after the loop. What we have to do is re-introduce a join point that defines x:

if foo:
  x1 = qux()
  if x1 then join(x1)
  while foo:
    x2 = qux()
    if x2 then join(x2)
  ...
label join(x)
  return x

Here I'm playing fast and loose with notation because the real terms are too gnarly. What I'm trying to get across is that for each value that flows out of a loop, you need a join point. That's fine, it's a bit more involved, but what if your loop exits to two different points, but one value is live in both of them? A value can only be defined in one place, in CPS or SSA. You could re-place a whole tree of phi variables, in SSA parlance, with join blocks and such, but it's just too hard.

However we can still get the benefits of peeling in most cases if we restrict ourselves to loops that exit to only one continuation. In that case the live variable set is the intersection of all variables defined in the loop that are live at the exit points. Easy enough, and that's what we have in Guile now. Peeling causes some code growth but the loops are smaller so it should still be a win. Check out the source, if that's your thing.

loop-invariant code motion

Usually when people are interested in moving code out of loops they talk about loop-invariant code motion, or LICM. Contrary to what you might think, LICM is complementary to peeling: some things that peeling+CSE can hoist are not hoistable by LICM, and vice versa.

Unlike peeling, LICM does not cause code growth. Instead, for each expression in a loop, LICM tries to hoist it out of the loop if it can. An expression can be hoisted if all of these conditions are true:

It doesn't cause the creation of an observably new object. In Scheme, the definition of "observable" is quite subtle, so in practice in Guile we don't hoist expressions that can cause any allocation. We could use alias analysis to improve this.
The expression cannot throw an exception, or the expression is always evaluated for every loop iteration.
The expression makes no writes to memory, or if it writes to memory, other expressions in the loop cannot possibly read from that memory. We use effects analysis for this.
The expression makes no reads from memory, or if it reads from memory, no other expression in the loop can clobber those reads. Again, effects analysis.
The expression uses only loop-invariant variables.

This definition is inductive, so once an expression is hoisted, the values it defines are then considered loop-invariant, so you might be able to hoist a whole chain of values.

Compared to loop peeling, this has the gnarly aspect of having to explicitly reason about loop invariance and manually move code, which is a pain. (Really LICM would be better named "artisanal code motion".) However it causes no code growth, which is a plus, though like peeling it can increase register pressure. But the big difference is that LICM can hoist effect-free expressions that aren't always executed. Consider:

while foo:
  x = qux() ? "hi" : "ho"

Here for some reason it could be faster to cache "hi" or "ho" in registers, which is what LICM allows:

hi, ho = "hi", "ho"
while foo:
  x = qux() ? hi : ho

On the other hand, LICM alone can't hoist the if baz is null checks in this example from above:

while foo:
  bar()
  baz()

The issue is that the call to bar() might not return, so the error that might be thrown if baz is null shouldn't be observed until bar is called. In general we can't hoist anything that might throw an exception past some non-hoisted code that might throw an exception. This specific situation happens in Guile but there are similar ones in any language, I think.

More formally, LICM will hoist effectful but loop-invariant expressions that postdominate the loop header, whereas peeling hoists those expressions that dominate all back-edges. I think? We'll go with that. Again, the source.

loop inversion

Loop inversion is a little hack to improve code generation, and again it's a little counterintuitive. If you have this loop:

while n < x:
  n++

Loop inversion turns it into:

if n < x:
  do
    n++
  while n < x

The goal is that instead of generating code that looks like this:

header:
  test n, x;
  branch-if-greater-than-or-equal done;
  x = x + 1
  goto header
done:

You make something that looks like this:

  test n, x;
  branch-if-greater-than-or-equal done;
header:
  x = x + 1
  test n, x;
  branch-if-less-than header;
done:

The upshot is that the loop body now contains one branch instead of two. It's mostly helpful for tight loops.

It turns out that you can express this transformation on CPS (or SSA, or whatever), but that like loop peeling the extra branch introduces an extra join point in your program. If your loop exits to more than one label, then we have the same problems as loop peeling. For this reason Guile restricts loop inversion (which it calls "loop rotation" at the moment; I should probably fix that) to loops with only one exit continuation.

Loop inversion has some other caveats, but probably the biggest one is that in Guile it doesn't actually guarantee that each back-edge is a conditional branch. The reason is that usually a loop has some associated loop variables, and it could be that you need to reshuffle those variables when you jump back to the top. Mostly Guile's compiler manages to avoid shuffling, allowing inversion to have the right effect, but it's not guaranteed. Fixing this is not straightforward, since the shuffling of values is associated with the predecessor of the loop header and not the loop header itself. If instead we reshuffled before the header, that might work, but each back-edge might have a different shuffling to make... anyway. In practice inversion seems to work out fine; I haven't yet seen a case where it doesn't work. Source code here.

loop identification

One final note: what is a loop anyway? Turns out this is a somewhat hard problem, especially once you start trying to identify nested loops. Guile currently does the simple thing and just computes strongly-connected components in a function's flow-graph, and says that a loop is a non-trivial SCC with a single predecessor. That won't tease apart loop nests but oh wells! I spent a lot of time last year or maybe two years ago with that "Loop identification via D-J graphs" paper but in the end simple is best, at least for making incremental steps.

Okeysmokes, until next time, loop on!

cps soup

2015-07-27T14:43:00Z

Hello internets! This blog goes out to my long-time readers who have followed my saga hacking on Guile's compiler. For the rest of you, a little history, then the new thing.

In the olden days, Guile had no compiler, just an interpreter written in C. Around 8 years ago now, we ported Guile to compile to bytecode. That bytecode is what is currently deployed as Guile 2.0. For many reasons we wanted to upgrade our compiler and virtual machine for Guile 2.2, and the result of that was a new continuation-passing-style compiler for Guile. Check that link for all the backstory.

So, I was going to finish documenting this intermediate language about 5 months ago, in preparation for making the first Guile 2.2 prereleases. But something about it made me really unhappy. You can catch some foreshadowing of this in my article from last August on common subexpression elimination; I'll just quote a paragraph here:

In essence, the scope tree doesn't necessarily reflect the dominator tree, so not all transformations you might like to make are syntactically valid. In Guile 2.2's CSE pass, we work around the issue by concurrently rewriting the scope tree to reflect the dominator tree. It's something I am seeing more and more and it gives me some pause as to the suitability of CPS as an intermediate language.

This is exactly the same concern that Matthew Fluet and Stephen Weeks had back in 2003:

Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.
As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y (i.e. without putting y in the scope of x)
[MLton-devel] CPS vs SSA

To be honest I think all this talk about dominators is distracting. Dominators are but a lightweight flow analysis, and I usually find myself using full-on flow analysis to compute the set of optimizations that I can do on a piece of code. In fact the only use I had for dominators in the nested CPS language was to rewrite scope trees! The salient part of Weeks' observation is that nested scope trees are the problem, not that dominators are the solution.

So, after literally years of hemming and hawing about this, I finally decided to remove nested scope trees from Guile's CPS intermediate language. Instead, a function is now a collection of labelled continuations, with one distinguished entry continuation. There is no more $letk term to nest continuations in each other. A program is now represented as a "soup" -- basically a map from labels to continuation bodies, again with a distinguished entry. As an example, consider this expression:

function(x):
  return add(x, 1)

If we rewrote it in continuation-passing style, we'd give the function a name for its "tail continuation", ktail, and annotate each expression with its continuation:

function(ktail, x):
  add(x, 1) -> ktail

Here the -> ktail means that the add expression passes its values to the continuation ktail.

With nested CPS, it could look like:

function(ktail, x):
  letk have_one(one): add(x, one) -> ktail
    load_constant(1) -> have_one

Here the label have_one is in a scope, as is the value one. With "CPS soup", though, it looks more like this:

function(ktail, x):
  label have_one(one): add(x, one) -> ktail
  label main(x): load_constant(1) -> have_one

It's a subtle change, but it took a few months to make so it's worth pointing out what's going on. The difference is that there is no scope tree for labels or variables any more. A variable can be used at a label if it flows to the label, in a flow analysis sense. Indeed, determining the set of variables that can be used at a label requires flow analysis; that's what Weeks was getting at in his 2003 mail about the advantages of SSA, which are really the advantages of an intermediate language without nested scope trees.

The question arises, though, now that we've decided on CPS soup, how should we represent a program as a value? We've gone from a nested term to a graph term, and we need to find a way to represent it somehow that facilitates looking up labels by name, and facilitates tree rewrites.

In Guile's IR, labels and variables are both integers, so happily enough, we have such a data structure: Clojure-style maps specialized for integer keys.

Friends, if there has been one realization or revolution for me in the last year, it has been Clojure-style data structures. Here's why. In compilers, I often have to build up some kind of analysis, then use that analysis to transform data. Often I need to keep the old term around while I build a new one, but it would be nice to share state between old and new terms. With a nested tree, if a leaf changed you'd have to rebuild all surrounding terms, which is gnarly. But with Clojure-style data structures, more and more I find myself computing in terms of values: build up this value, transform this map to that set, fold over this map -- and yes, you can fold over Guile's intmaps -- and so on. By providing an expressive data structure for which I can control performance characteristics by using transients if needed, these data structures make my programs more about data and less about gnarly machinery.

As a concrete example, the old contification pass in Guile, I didn't have the mental capacity to understand all the moving parts in such a way that I could compute an optimal contification from the beginning; instead I had to iterate to a fixed point, as Kennedy did in his "Compiling with Continuations, Continued" paper. With the new CPS soup language and with Clojure-style data structures, I could actually fit more of the algorithm into my head, with the result that Guile now contifies optimally while avoiding the fixed-point transformation. Also, the old pass used hash tables to represent the analysis, which I found incredibly confusing to reason about -- I totally buy Rich Hickey's argument that place-oriented programming is the source of many evils in programs, and hash tables are nothing if not a place party. Using functional maps let me solve harder problems because they are easier for me to reason about.

Contification isn't an isolated case, either. For example, we are able to do the complete set of optimizations from the "Optimizing closures in O(0) time" paper, including closure sharing, which I think makes Guile unique besides Chez Scheme. I wasn't capable of doing it on the old representation because it was just too hard for me to think about, because my data structures weren't right.

This new "CPS soup" language is still a first-order CPS language in that each term specifies its continuation, and that variable names appear in the continuation of a definition, not the definition itself. This effectively makes every variable a phi variable, in the sense of SSA, and you have to do some work to get to a variable's definition. It could be that still this isn't the right number of names; consider this function:

function foo(k, x):
  label have_y(y) bar(y) -> k
  label y_is_two() load_constant(2) -> have_y
  label y_is_one() load_constant(1) -> have_y
  label main(x) if x -> y_is_one else -> y_is_two

Here there is no distinguished name for the value load_constant(1) versus load_constant(2): both are possible values for y. If we ended up giving them names, we'd have to reintroduce actual phi variables for the joins, which would basically complete the transformation to SSA. Until now though I haven't wanted those names, so perhaps I can put this off. On the other hand, every term has a label, which simplifies many things compared to having to contain terms in basic blocks, as is usually done in SSA. Yet another chapter in CPS is SSA is CPS is SSA, it seems.

Welp, that's all the nerdery for right now. Talk at yall later!

revisiting common subexpression elimination in guile

2014-08-25T09:48:48Z

A couple years ago I wrote about a common subexpression pass that I implemented in Guile 2.0.

To recap, Guile 2.0 has a global, interprocedural common subexpression elimination (CSE) pass.

In the context of compiler optimizations, "global" means that it works across basic block boundaries. Basic blocks are simple, linear segments of code without control-flow joins or branches. Working only within basic blocks is called "local". Working across basic blocks requires some form of understanding of how values can flow within the blocks, for example flow analysis.

"Interprocedural" means that Guile 2.0's CSE operates across closure boundaries. Guile 2.0's CSE is "context-insensitive", in the sense that any possible effect of a function is considered to occur at all call sites; there are newer CSE passes in the literature that separate effects of different call sites ("context-sensitive"), but that's not a Guile 2.0 thing. Being interprocedural was necessary for Guile 2.0, as its intermediate language could not represent (e.g.) loops directly.

The conclusion of my previous article was that although CSE could do cool things, in Guile 2.0 it was ultimately limited by the language that it operated on. Because the Tree-IL direct-style intermediate language didn't define order of evaluation, didn't give names to intermediate values, didn't have a way of explicitly representing loops and other kinds of first-order control flow, and couldn't precisely specify effects, the results, well, could have been better.

I know you all have been waiting for the last 27 months for an update, probably forgoing meaningful social interaction in the meantime because what if I posted a followup while you were gone? Be at ease, fictitious readers, because that day has finally come.

CSE over CPS

The upcoming Guile 2.2 has a more expressive language for the optimizer to work on, called continuation-passing style (CPS). CPS explicitly names all intermediate values and control-flow points, and can integrate nested functions into first-order control-flow via "contification". At the same time, the Guile 2.2 virtual machine no longer penalizes named values, which was another weak point of CSE in Guile 2.0. Additionally, the CPS intermediate language enables more fined-grained effects analysis.

All of these points mean that CSE has the possibility to work better in Guile 2.2 than in Guile 2.0, and indeed it does. The shape of the algorithm is a bit different, though, and I thought some compiler nerds might be interested in the details. I'll follow up in the next section with some things that new CSE pass can do that the old one couldn't.

So, by way of comparison, the old CSE pass was a once-through depth-first visit of the nested expression tree. As the visit proceeded, the pass built up an "environment" of available expressions -- for example, that (car a) was evaluated and bound to b, and so on. This environment could be consulted to see if a expression was already present in the environment. If so, the environment would be traversed from most-recently-added to the found expression, to see if any intervening expression invalidated the result. Control-flow joins would cause recomputation of the environment, so that it only held valid values.

This simple strategy works for nested expressions without complex control-flow. CPS, on the other hand, can have loops and other control flow that Tree-IL cannot express, so for it to build up a set of "available expressions" requires a full-on flow analysis. So that's what the pass does: a flow analysis over the labelled expressions in a function to compute the set of "available expressions" for each label. A labelled expression a is available at label b if a dominates b, and no intervening expression could have invalidated the results. An expression invalidates a result if it may write to a memory location that the result may have read. The code, such as it is, may be found here.

Once you have the set of available expressions for a function, you can proceed to the elimination phase. First, you start by creating an "eliminated variable" map, which initially maps each variable to itself, and an "equivalent expressions" table, which maps "keys" to a set of labels and bound variables. Then you visit each expression in a function, again in topologically sorted order. For each expression, you compute a "key", which is some unique representation of an expression that can be compared by structural equality. Keys that compare as equal are equivalent, and are subject to elimination.

For example, consider a call to the add primitive with variables labelled b and c as arguments. Imagine that b maps to a in the eliminated variable table. The expression as a whole would then have a key representation as the list (primcall add a c). If this key is present in the equivalent expression table, you check to see if any of the equivalent labels is available at the current label. If so, hurrah! You mark the outputs of the current label as being replaced by the outputs of the equivalent label. Otherwise you add the key to the equivalent table, associated with the current label.

This simple algorithm is enough to recursively eliminate common subexpressions. Sometimes the recursive aspect (i.e. noticing that b should be replaced by a), along with the creation of a common key, causes the technique to be called global value numbering (GVN), but CSE seems a better name to me.

The algorithm as outlined above eliminates expressions that bind values. However not all expressions do that; some are used as control-flow branches. For this reason, Guile also computes a "truthy table" with another flow analysis pass. This table computes a set of which branches have been taken to get to each program point. In the elimination phase, if a branch is reached that is equivalent to a previously taken branch, we consult the truthy table to see which continuation the previous branch may have taken. If it can be proven to have taken just one of the legs, the test is elided and replaced with a direct jump.

A few things to note before moving on. First, the "compute an analysis, then transform the function" sequence is quite common in this sort of problem. It leads to some challenges regarding space for the analysis; my last article deals with these in more detail.

Secondly, the rewriting phase assumes that a value that is available may be substituted, and that the result would be a proper CPS term. This isn't always the case; see the discussion at the end of the article on CSE in Guile 2.0 about CPS, SSA, dominators, and scope. In essence, the scope tree doesn't necessarily reflect the dominator tree, so not all transformations you might like to make are syntactically valid. In Guile 2.2's CSE pass, we work around the issue by concurrently rewriting the scope tree to reflect the dominator tree. It's something I am seeing more and more and it gives me some pause as to the suitability of CPS as an intermediate language.

Also, consider the clobbering part of analysis, where e.g. an expression that writes a value to memory has to invalidate previously read values. Currently this is implemented by traversing all available expressions. This is suboptimal and could be quadratic in the end. A better solution is to compute a dependency graph for expressions, which links together operations on the same regions of memory; see LLVM's memory dependency analysis for an idea of how to do this.

Finally, note that this algorithm is global but intraprocedural, meaning that it doesn't propagate values across closure boundaries. It's possible to extend it to be interprocedural, though it's less necessary in the presence of contification.

scalar replacement via fabricated expressions

Let's say you get to an expression at label L, (cons a b). It binds a result c. You determine you haven't seen it before, so you add (primcall cons a b) → L, c to your equivalent expressions set. Cool. We won't be able to replace a future instance of (cons a b) with c, because that doesn't preserve object identity of the newly allocated memory, but it's definitely a cool fact, yo.

What if we add an additional mapping to the table, (car c) → L, a? That way any expression at which L is available would replace (car c) with a, which would be pretty neat. To do so, you would have to add the &read effect to the cons call's effects analysis, but since the cons wasn't really up for elimination anyway it's all good.

Similarly, for (set-car! c d) we can add a mapping of (car c) → d. Again we have to add the &read effect to the set-car, but that's OK too because the write invalidated previous reads anyway.

The same sort of transformation holds for other kinds of memory that Guile knows how to allocate and mutate. Taken together, they form a sort of store-to-load forwarding and scalar replacement that can entirely eliminate certain allocations, and many accesses as well. To actually eliminate the allocations requires a bit more work, but that will be the subject of the next article.

future work

So, that's CSE in Guile 2.0. It works pretty well. In the future I think it's probably worth considering an abstract heap-style analysis of effects; in the end, the precision of CSE is limited to how precisely we can model the effects of expressions.

The trick of using CSE to implement scalar replacement is something I haven't seen elsewhere, though I doubt that it is novel. To fully remove the intermediate allocations needs a couple more tricks, which I will write about in my next nargy dispatch. Until then, happy hacking!

effects analysis in guile

2014-05-18T19:19:29Z

OK kids, so I had a bit of time recently. I've been hacking on Guile's new CPS-based compiler, which should appear in a stable release in a few months. I have a few things to write about, but today's article is on effects analysis.

what it is, yo

The job of the optimization phase of a compiler is to remove, replace and reorder the sub-expressions of a program. The optimizer recognizes ways in which the program can be better, and then needs to check if those transformations are valid. A transformation is valid if a program exhibits the same behavior both before and after the transformation.

Effects analysis is a proof technique that builds a conservative model of how expressions can affect each other. It can be used to prove the validity of some transformations. For example, if we determine that an expression A reads the first field in a vector, and expression B sets the second field in a vector, then we can use effects analysis to show that the expressions don't affect each others' value and can be freely reordered, for example to hoist either one out of a loop.

In effects analysis, we model the program state coarsely and conservatively with a limited set of categories. The precise set of categories depends on the domain. In graphics, for example, you might have a state bit for the current coordinate system, the current color, the current fill mode, etc. In general-purpose compilation, the effects are mostly about memory and (sometimes) exceptions.

In Guile we model four kinds of effects: type checks (T), allocations (A), reads (R), and writes (W). Each of these effects is allocated to a bit. An expression can have any or none of these effects.

For the purposes of Guile, a "type check" is the possibility that this expression will throw an exception if its arguments are somehow out of range. For example, cons will never throw an exception, except in out-of-memory situations, but + may throw an exception if its arguments aren't numbers. However if an expression is dominated by a copy of itself -- that is, if we see that (+ a b) (which may throw an exception) is dominated by (+ a b), perhaps after CSE -- then we can determine that only the first will exhibit the type-check effects, and that the second will not.

Allocation indicates that an expression allocates a fresh object. In Scheme (and many other languages), each allocated object has its own identity: (eq? (cons 1 2) (cons 1 2)) must be false. Note that this restriction does not apply to constant literals, in Scheme; (eq? '(1 . 2) '(1 . 2)) may or may not be true. In Guile both results are possible. They are the same object when compiled (and thus deduplicated), but different when interpreted; the objects are just the ones returned from `read' and this are different. Anyway we use this allocation bit to indicate that an expression allocates a fresh object with a fresh identity.

The remaining effect kinds are "read" and "write", which indicate reads or writes to memory. So there are just 4 kinds of effects.

Allocation, read, and write effects are associated at run-time with particular memory locations. At compile-time we cannot in general know which of these locations are distinct, and which are actually the same. To simplify the problem, we simply record the type of the object, knowing that (say) a pair and a vector will never be at the same location. We also record the field in the object in the case of objects with multiple fields. There are special catch-all values to indicate "all memory kinds", when we really don't know what an expression will do (which is the case for all expression kinds without specific support in the effect analyzer), and for "all fields" when we don't know which field we are accessing.

One example of the use of this analysis is in common subexpression elimination (CSE). If at a program point P we have a set of available expressions A, then to determine which members of A are still available after the expression at P, you subtract members of A that are clobbered by P. Computation of A at each P plus value numbering is most of CSE; but more on that in some later dispatch. Anyway here's the definition of effects-clobber?.

(define (effect-clobbers? a b)
  "Return true if A clobbers B.  This is the case
if A might be a write, and B might be a read or a
write to the same location as A."
  (define (locations-same?)
    (let ((a (ash a (- &effect-kind-bits)))
          (b (ash b (- &effect-kind-bits))))
      (or (eqv? &unknown-memory-kinds
                (logand a &memory-kind-mask))
          (eqv? &unknown-memory-kinds
               (logand b &memory-kind-mask))
          (and (eqv? (logand a &memory-kind-mask)
                     (logand b &memory-kind-mask))
               ;; A negative field indicates "the
               ;; whole object".  Non-negative fields
               ;; indicate only part of the object.
               (or (< a 0) (< b 0) (= a b))))))
  (and (not (zero? (logand a &write)))
       (not (zero? (logand b (logior &read &write))))
       (locations-same?)))

This analysis is simple, small, and fast. It's also coarse and imprecise -- if you are reading from and writing to two vectors at once, you're almost sure to miss some optimization opportunities as accesses to all vectors are conflated into one bit. Oh well. If you get into this situation, you'll know it, and be able to invest a bit more time into alias analysis; there's lots of literature out there. A simple extension would be to have alias analysis create another mapping from expression to equivalence class, and to use those equivalence classes in the same-location? check above.

Of course this assumes that expressions just access one object. This is the case for Guile's CPS intermediate language, because in CPS, as in SSA or ANF, expressions don't have subexpressions.

This contrasts with direct-style intermediate languages, in which expressions may have nested subexpressions. If you are doing effects analysis on such a language, it's probably more appropriate to allocate a bit to each kind of effect on each kind of object, so that you can usefully union effects for a tree of expressions. Since you don't have to do this for CPS, we can allocate a fixed bit-budget towards more precision as to which field of an object is being accessed. The inability to be precise as to which field was being accessed due to the direct-style IL was one of the problems in Guile's old CSE pass.

Finally, a note about type checks. Guile includes type checks as part of the effects analysis for two reasons. The first is the obvious case of asking whether an expression is effect-free or not, which can lead to some optimizations in other parts of the compiler. The other is to express the potential for elision of duplicate expressions, if one dominates the other. But it's also possible to remove type checks in more cases than that: one can run a type inference pass to remove type-check effects if we can prove that the arguments to an expression are in range. Obviously this is more profitable for dynamically-typed languages, but the same considerations apply to any language with sum types.

Guile's effects analysis pass is here. V8 seems to have two effects analysis passes; one is in effects.h and typing.cc, and operates over the direct-style AST, and the other is in the value numbering pass (hydrogen-instructions.h and hydrogen-gvn.h; search for GVNFlag).

More on how effects analysis is used in Guile in a future missive. Until then, happy hacking.

a continuation-passing style intermediate language for guile

2014-01-12T21:58:00Z

Happy new year's, hackfolk!

A few weeks ago I wrote about the upcoming Guile 2.2 release, and specifically about its new register virtual machine. Today I'd like to burn some electrons on another new part in Guile 2.2, its intermediate language.

To recap, we switched from a stack machine to a register machine because, among other reasons, register machines can consume and produce named intermediate results in fewer instructions than stack machines, and that makes things faster.

To take full advantage of this new capability, it is appropriate to switch at the same time from the direct-style intermediate language (IL) that we had to an IL that names all intermediate values. This lets us effectively reason about each subexpression that goes into a computation, for example in common subexpression elimination.

As far as intermediate languages go, basically there are two choices these days: something SSA-based, or something CPS-based. I wrote an article on SSA, ANF and CPS a few years ago; if you aren't familiar with these or are feeling a little rusty, I suggest you go and take a look.

In Guile I chose a continuation-passing style language. I still don't know if I made the right choice. I'll go ahead and describe Guile's IL and then follow up with some reflections. The description below is abbreviated from a more complete version in Guile's manual.

guile's cps language

Guile's CPS language is composed of terms, expressions, and continuations. It was heavily inspired by Andrew Kennedy's Compiling with Continuations, Continued paper.

A term can either evaluate an expression and pass the resulting values to some continuation, or it can declare local continuations and contain a sub-term in the scope of those continuations.

$continue k src exp: Evaluate the expression exp and pass the resulting values (if any) to the continuation labelled k.
$letk conts body: Bind conts in the scope of the sub-term body. The continuations are mutually recursive.

Additionally, the early stages of CPS allow for a set of mutually recursive functions to be declared as a term via a $letrec term. A later pass will attempt to transform the functions declared in a $letrec into local continuations. Any remaining functions are later lowered to $fun expressions. More on "contification" later.

Here is an inventory of the kinds of expressions in Guile's CPS language. Recall that all expressions are wrapped in a $continue term which specifies their continuation.

$void: Continue with the unspecified value.
$const val: Continue with the constant value val.
$prim name: Continue with the procedure that implements the primitive operation named by name.
$fun src meta free body: Continue with a procedure. body is the $kentry $cont of the procedure entry. free is a list of free variables accessed by the procedure. Early CPS uses an empty list for free; only after closure conversion is it correctly populated.
$call proc args: Call proc with the arguments args, and pass all values to the continuation. proc and the elements of the args list should all be variable names. The continuation identified by the term's k should be a $kreceive or a $ktail instance.
$primcall name args: Perform the primitive operation identified by name, a well-known symbol, passing it the arguments args, and pass all resulting values to the continuation.
$values args: Pass the values named by the list args to the continuation.
$prompt escape? tag handler: Push a prompt on the stack identified by the variable name tag and continue with zero values. If the body aborts to this prompt, control will proceed at the continuation labelled handler, which should be a $kreceive continuation. Prompts are later popped by pop-prompt primcalls.

The remaining element of the CPS language in Guile is the continuation. In CPS, all continuations have unique labels. Since this aspect is common to all continuation types, all continuations are contained in a $cont instance:

$cont k cont: Declare a continuation labelled k. All references to the continuation will use this label.

The most common kind of continuation binds some number of values, and then evaluates a sub-term. $kargs is this kind of simple lambda.

$kargs names syms body: Bind the incoming values to the variables syms, with original names names, and then evaluate the sub-term body.

Variables (e.g., the syms of a $kargs) should be globally unique. To bind the result of an expression a variable and then use that variable, you would continue from the expression to a $kargs that declares one variable. The bound value would then be available for use within the body of the $kargs.

$kif kt kf: Receive one value. If it is a true value, branch to the continuation labelled kt, passing no values; otherwise, branch to kf.

Non-tail function calls should continue to a $kreceive continuation in order to adapt the returned values to their uses in the calling function, if any.

$kreceive arity k: Receive values from a function return. Parse them according to arity, and then proceed with the parsed values to the $kargs continuation labelled k.

$arity is a helper data structure used by $kreceive and also by $kclause, described below.

$arity req opt rest kw allow-other-keys?: A data type declaring an arity. See Guile's manual for details.

Additionally, there are three specific kinds of continuations that can only be declared at function entries.

$kentry self tail clauses: Declare a function entry. self is a variable bound to the procedure being called, and which may be used for self-references. tail declares the $cont wrapping the $ktail for this function, corresponding to the function's tail continuation. clauses is a list of $kclause $cont instances.

$ktail: A tail continuation.

$kclause arity cont: A clause of a function with a given arity. Applications of a function with a compatible set of actual arguments will continue to cont, a $kargs $cont instance representing the clause body.

reflections

Before starting Guile's compiler rewrite, I had no real-world experience with CPS-based systems. I had worked with a few SSA-based systems, and a few more direct-style systems. I had most experience with the previous direct-style system that Guile had, but never had to seriously design another kind of IL, so basically I was ignorant. It shows, I think; but time will tell if it came out OK anyway. At this point I am cautiously optimistic.

As far as fitness for purpose goes, the CPS IL works in the sense that it is part of a self-hosting compiler. I'll say no more on that point other than to mention that it has explicit support for a number of Guile semantic features: multiple-value returns; optional, rest, and keyword arguments; cheap delimited continuations; Guile-native constant literals.

Why not ANF instead? If you recall from my SSA and CPS article, I mentioned that ANF is basically CPS with fewer labels. It tries to eliminate "administrative" continuations, whereas Guile's CPS labels everything. There is no short-hand let form.

ANF proponents tout its parsimony as a strength, but I do not understand this argument. I like having labels for everything. In CPS, I have as many labels as there are expressions, plus a few for continuations that don't contain terms. I use them directly in the bytecode compiler; the compiler never has to generate a fresh label, as they are part of the CPS itself.

More importantly, labelling every control-flow point allows me to reason precisely about control flow. For example, if a function is always called with the same continuation, it can be incorporated in the flow graph of the calling function. This is called "contification". It is not the same thing as inlining, as it works for sets of recursive functions as well, and never increases code size. This is a crucial transformation for a Scheme compiler to make, as it turns function calls into gotos, and self-function calls into loop back-edges.

Guile's previous compiler did a weak form of contification, but because we didn't have names for all control points it was gnarly and I was afraid to make it any better. Now its contifier is optimal. See Fluet and Weeks' Contification using Dominators and Kennedy's CWCC, for more on contification.

One more point in favor of labelling all continuations. Many tranformations can be best cast as a two-phase process, in which you first compute a set of transformations to perform, and then you apply them. Dead-code elimination works this way; first you find the least fixed-point of live expressions, and then you residualize only those expressions. Without names, how are you going to refer to an expression in the first phase? The ubiquitous, thorough labelling that CPS provides does not have this problem.

So I am happy with CPS, relative to ANF. But what about SSA? In my previous article, I asked SSA proponents to imagine returning a block from a block. Of course it doesn't make any sense; SSA is a first-order language. But modern CPS is also first-order, is the thing! Modern CPS distinguishes "continuations" syntactically from functions, which is exactly the same as SSA's distinction between basic blocks and functions. CPS and SSA really are the same on this level.

The fundamental CPS versus SSA difference is, as Stephen Weeks noted a decade ago, one of data structures: do you group your expressions into basic blocks stored in a vector (SSA), or do you nest them into a scope tree (CPS)? It's not clear that I made the correct choice.

In practice with Guile's CPS you end up building graphs on the side that describe some aspect of your term. For example you can build a reverse-post-ordered control flow analysis that linearizes your continuations, and assigns them numbers. Then you can compute a bitvector for each continuation representing each one's reachable continuations. Then you can use this reachability analysis to determine the extent of a prompt's body, for example.

But this analysis is all on the side and not really facilitated by the structure of CPS itself; the CPS facilities that it uses are the globally unique continuation and value names of the CPS, and the control-flow links. Once the transformation is made, all of the analysis is thrown away.

Although this seems wasteful, the SSA approach of values having "implicit" names by their positions in a vector (instead of explicit ephemeral name-to-index mappings) is terrifying to me. Any graph transformation could renumber things, or leave holes, or cause vectors to expand... dunno. Perhaps I am too shy of the mutation foot-gun. I find comfort in CPS's minimalism.

One surprise I have found is that I haven't needed to do any dominator-based analysis in any of the paltry CPS optimizations I have made so far. I expect to do so once I start optimizing loops; here we see the cultural difference with SSA I guess, loops being the dominant object of study there. On the other hand I have had to solve flow equations on a few occasions, which was somewhat surprising, though enjoyable.

The optimizations I have currently implemented for CPS are fairly basic. Contification was tricky. One thing I did recently was to make all non-tail $call nodes require $kreceive continuations; if, as in the common case, extra values were unused, that was reflected in an unused rest argument. This required a number of optimizations to clean up and remove the extra rest arguments for other kinds of source expressions: dead-code elimination, the typical beta/eta reduction, and some code generation changes. It was worth it though, and now with the optimization passes things are faster than they were before.

Well, I find that I am rambling now. I know this is a lot of detail, but I hope that it helps some future compiler hacker understand more about intermediate language tradeoffs. I have been happy with CPS, but I'll report back if anything changes :) And if you are actually hacking on Guile, check the in-progress manual for all the specifics.

Happy hacking to all, and to all a good hack!

doing it wrong: cse in guile

2012-05-14T17:07:14Z

Greetings, readers! It's been a little while, but not because I haven't been doing anything, or nothing worth writing about. No, the reason I haven't written recently is because the perceived range of my ignorance has been growing faster than the domain of my expertise.

Knowledge may well be something one can dominate, but ignorance must forever be a range, stretching off to a hazy horizon. Climbing the hill of JavaScript implementations has let me see farther out on that plain. I think it's only the existence of the unknown unknowns that can let one muster up the arrogance to write at all.

But back to domains and dominators later. To begin, I note that there is very little in the way of correct, current, and universal folk wisdom as to how to implement a programming language. Compiler hackers are priests of their languages, yes, but their temples are in reality more or less isolated cults, in which the values of one's Gods may be unknown or abhorrent to those of others. Witness the attention paid to loop optimizations in some languages, compared to garbage collection in others, or closures in still others.

In my ecumenical capacity as abbot of Guile and adjunct deacon of JavaScriptCore, sometimes I'm tempted to mix rites: to sprinkle the holy water of lexical scope optimizations on JS, and, apropos of this article, to exorcise common subexpressions in Scheme.

As one might well imagine, the rites of one cult must be adapted to circumstances. I implemented CSE for Guile, but I don't know if it was actually a win. In this article I'll go into what CSE is, how it works in Guile, why it probably won't survive in its present form.

cse: common subexpression elimination

I implemented a source-to-source optimization pass in Guile that eliminates common subexpressions. It actually does both more and less than that: it propagates predicates and eliminates effect-free statements as well, and these latter optimizations are why I implemented the pass in the first place.

Let me give an example. Let's imagine we implement a binary tree in Guile, using the records facility.

(use-modules (srfi srfi-9))

(define-record-type btree
  (make-btree elt left right)
  btree?
  (elt btree-elt)
  (left btree-left)
  (right btree-right))

(define *btree-null* #f)

(define (btree-cons head tail)
  (if (btree? tail)
      (let ((elt (btree-elt tail)))
        (if (< elt head)
            (make-btree elt
                        (btree-left tail)
                        (btree-cons head (btree-right tail)))
            (make-btree elt
                        (btree-cons head (btree-left tail))
                        (btree-right tail))))
      (make-btree head
                  *btree-null*
                  *btree-null*)))

That's enough to illustrate my point, I think. We have the data type, the base case, and a constructor. Of course in Haskell or something with better data types it would be much cleaner, but hey, let's roll with this.

If you look at btree-cons, it doesn't seem to be amenable in its current form to classic common subexpression elimination. People don't tend to write duplicated code. You see that I bound the temporary elt instead of typing (btree-elt btree) each time, and that was partly because of typing, and partly out of some inner efficiency puritan, telling me I shouldn't write duplicate expressions. (Cult behavior, again!)

But, note that all of these record abstractions will probably inline, instead of calling out to procedures. (They are implemented as inlinable procedures. Yes, it would be better to have cross-module inlining, but we don't, so this is what we do.) In general, syntactic abstraction in Scheme can lead to duplicate code. Apologies in advance for this eyeball-rending torrent, but here's a listing of what btree-cons reduces to, after expansion and partial evaluation:

(define (btree-cons head tail)
  (if (and (struct? tail)
           (eq? (struct-vtable tail) btree))
      (let ((elt (if (eq? (struct-vtable tail) btree)
                     (struct-ref tail 0)
                     (throw 'wrong-type-arg
                            'btree-elt
                            "Wrong type argument: ~S"
                            (list tail)
                            (list tail)))))
        (if (< elt head)
            (let ((left (if (eq? (struct-vtable tail) btree)
                            (struct-ref tail 1)
                            (throw 'wrong-type-arg
                                   'btree-left
                                   "Wrong type argument: ~S"
                                   (list tail)
                                   (list tail))))
                  (right (btree-cons
                           head
                           (if (eq? (struct-vtable tail) btree)
                               (struct-ref tail 2)
                               (throw 'wrong-type-arg
                                      'btree-right
                                      "Wrong type argument: ~S"
                                      (list tail)
                                      (list tail))))))
              (make-struct/no-tail btree elt left right))
            (let ((left (btree-cons
                          head
                          (if (eq? (struct-vtable tail) btree)
                              (struct-ref tail 1)
                              (throw 'wrong-type-arg
                                     'btree-left
                                     "Wrong type argument: ~S"
                                     (list tail)
                                     (list tail)))))
                  (right (if (eq? (struct-vtable tail) btree)
                             (struct-ref tail 2)
                             (throw 'wrong-type-arg
                                    'btree-right
                                    "Wrong type argument: ~S"
                                    (list tail)
                                    (list tail)))))
              (make-struct/no-tail btree elt left right))))
      (let ((left *btree-null*) (right *btree-null*))
        (make-struct/no-tail btree head left right))))

Again, I'm really sorry about that, and it's not just for your eyes: it's also because that's a crapload of code for what should be a simple operation. It's also redundant! There are 6 checks that btree is in fact a btree, when only one is needed semantically. (Note that the null case is not a btree, of course.)

Furthermore, all of the checks in the first arm of the if are redundant. The code above is what the optimizer produces -- which is, you know, turrible.

So, I thought, we could run a pass over the source that tries to propagate predicates, and then tries to fold predicates whose boolean value we already know.

And that's what I did. Here's what Guile's optimizer does with the function, including the CSE pass:

(define (btree-cons head tail)
  (if (and (struct? tail)
           (eq? (struct-vtable tail) btree))
      (let ((elt (struct-ref tail 0)))
        (if (< elt head)
            (let ((left (struct-ref tail 1))
                  (right (btree-cons head (struct-ref tail 2))))
              (make-struct/no-tail btree elt left right))
            (let ((left (btree-cons head (struct-ref tail 1)))
                  (right (struct-ref tail 2)))
              (make-struct/no-tail btree elt left right))))
      (let ((left *btree-null*) (right *btree-null*))
        (make-struct/no-tail btree head left right))))

This is much better. It's quite close to the source program, except the symbolic references like btree-elt have been replaced with indexed references. The type check in the predicate of the if expression propagated to all the other type checks, causing those nested if expressions to fold.

Of course, CSE can also propagate bound lexicals:

(let ((y (car x)))
  (car x))
=> (let ((y (car x)))
     y)

This is the classic definition of CSE.

but is it a win?

I should be quite pleased with the results, except that CSE makes Guile's compiler approximately twice as slow. Granted, in the long run, this should be acceptable: code is usually run many more times than it is compiled. But this is a fairly expensive pass, and yet at the same time it's not as good as it could be.

In order to get to the heart of the matter, I need to explain a little about the implementation. CSE is a post-pass, that runs after partial evaluation (peval). I tried to make it a part of peval, as the two optimizations are synergistic -- oh yes, let's revel in that word -- are you feeling it? -- but it was too complicated in the end. The reason is that in functions like this:

(define (explode btree)
  (unless (btree? btree)
    (error "not a btree" btree))
  (values (btree-head btree)
          (btree-left btree)
          (btree-right btree)))

Here we have a sequence of two expressions. Since the first one bails out if the predicate is false, we should propagate a true predicate past the first expression. This means that running CSE on an expression returns two values: the rewritten expression, given the predicates already seen; and a new set of predicates that the expression asserts. We should be able to use these new assertions to elide the type checks in the second expression. And indeed, Guile can do this.

Perhaps you can see the size of the problem now. CSE is a pass that runs over all code, building up an ordered set of expressions that were evaluated, and in what context. When it sees a new expression in a test context -- as the predicate in an if -- it checks to see if the set contains that expression (or its negation) already, in test context, and if so tries to fold the expression to true or false. Already doing this set lookup and traversal is expensive -- at least N log N in the total size of the program, with some quadratic components in the case an expression is found, and also with high constant factors due to the need for custom hash and equality predicates.

The quadratic factor comes in when walking the set to see if the elimination is valid. Consider:

(if (car x)
    (if (car x) 10 20)
    30)

Here, we should be able to eliminate the second (car x), folding to (if (car x) 10 30). However, in this one:

(if (car x)
    (begin
      (y)
      (if (car x) 10 20))
    30)

If we don't know what (y) does, then we can't fold the second test, because perhaps (y) will change the contents of the pair, x. The information that allows us to make these decisions is effects analysis. For the purposes of Guile's optimizer, (car x) has two dependencies and can cause two effects: it depends on the contents of a mutable value, and on the value of a toplevel (x), and can cause the effect of an unbound variable error when resolving the toplevel, or a type error when accessing its car. Two expressions commute if neither depends on effects that the other causes.

I stole the idea of doing a coarse effects analysis, and representing it as bits in a small integer, from V8. Guile's version is here: effects.scm. The ordered set is a form of global value numbering. See the CSE pass here: cse.scm.

The commute test is fairly cheap, but the set traversal is currently a bit expensive.

and for what?

As I have indicated, the pass does do something useful on real programs, as in the binary tree example. But it does not do all it could, and it's difficult to fix that, for a few reasons.

Unlike traditional CSE, Guile's version of it is interprocedural. Instead of operating on just one basic block or one function, it operates across nested functions as well. However, only some dependencies can commute across a function boundary. For example:

(lambda (x)
  (if (pair? x)
      (let ((y (car x)))
        (lambda ()
          (and (pair? x) (car x))))))

Can the first pair? test propagate to the second expression? It can, because pair? does not depend on the values of mutable data, or indeed on any effect. If it's true once, it will always be true.

But can we replace the second (car x) with y? No, because (car x) has a dependency on mutable data, and because we don't do escape analysis on the closure, we don't let those dependencies commute across a procedure boundary. (In this case, even if we did escape analysis, we'd have the same conclusion.)

However, not all lambda abstractions are closures. Some of them might end up being compiled to labels in the function. Scheme uses syntactically recursive procedures to implement loops, after all. But Guile's CSE does poorly for lambda expressions that are actually labels. The reason is that lexical scope is not a dominator tree.

MLton hacker Stephen Weeks says it better than I do:

Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.
As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y. (i.e. without putting y in the scope of x)
[MLton-devel] CPS vs SSA

See my article on SSA and CPS for more context.

So that's one large source of lost CSE opportunities, especially in loops.

Another large source of lost opportunities is that the Tree-IL language, which is basically a macro-expanded Scheme, has the same property that Scheme does, that the order of evaluation of operands is unspecified.

Consider the base-case clause of my btree-cons above:

(let ((left *btree-null*) (right *btree-null*))
  (make-struct/no-tail btree head left right))

Now, *btree-null* is a toplevel lookup, that might have an unbound-variable effect. We should be able to eliminate one of them, though. Why doesn't the CSE pass do it? Because in Tree-IL, the order of evaluation of the right-hand-sides of left and right is unspecified. This gives Guile some useful freedoms, but little information for CSE.

This is an instance of a more general problem, that Tree-IL might be too high-level to be useful for CSE. For example, at runtime, not all lexical references are the same -- some are local, and some reference free variables. For mutated free variables, the variable is itself in a box, so to reference it you would load the box into a local and then dereference the box. CSE should allow you to eliminate duplicate loads of the box, even in the case that it can't eliminate duplicate references into the box.

conclusion

It is nice to be able to eliminate the duplicate checks, but not at any price. Currently the bootstrapping time cost is a bit irritating. I have other ideas on how to fix that, but ultimately we probably need to re-implement CSE at some lower level. More on that in a future post. Until then, happy hacking.

from ssa to native code: v8's lithium language

2011-09-05T11:25:05Z

So, friends, about a month ago I posted a look at the Hydrogen intermediate language, as part of an ongoing series on the internals of the V8 JavaScript implementation. Hydrogen is the high-level part of V8's Crankshaft optimizing compiler, where the optimizations happen. This article tells the story of Lithium, Crankshaft's low-level, machine-dependent intermediate language.

totally metal

So here's the deal: you parsed your JavaScript functions to AST objects (parse all the functions!). You emitted some simple native code, quickly but not optimized, and you sat back and waited. After a little while, your profiling told you that some function was hot, so you re-parsed it, generated a graph of Hydrogen instructions and optimized that graph. Now you need to generate code. And that's where we are! Generate all the code!

But that's not what V8 does here. Instead of going directly to native code, it translates the HGraph to a chunk of Lithium code, then emits machine code from those Lithium instructions. Unlike Hydrogen, which is an SSA form in which instructions implicitly define values, the Lithium form is closer to three-address code, with labels and gotos, and explicitly names its operands (if any) and results (if any).

These temporary values are initially declared with a number of constraints, such as "must be in a double-precision register". Later, a register allocator considers the constraints and liveness ranges, allocating each value to a specific register. Code generation happens after register allocation.

an example

Let us take a look at an example to see how these things interact. Given that Lithium is target-specific, we'll just choose one of the targets; ia32 and x64 are the most advanced, so we'll just take x64. (That's what they call it.)

LInstruction* LChunkBuilder::DoAdd(HAdd* instr) {
  if (instr->representation().IsInteger32()) {
    ASSERT(instr->left()->representation().IsInteger32());
    ASSERT(instr->right()->representation().IsInteger32());
    LOperand* left =
        UseRegisterAtStart(instr->LeastConstantOperand());
    LOperand* right =
        UseOrConstantAtStart(instr->MostConstantOperand());
    LAddI* add = new LAddI(left, right);
    LInstruction* result = DefineSameAsFirst(add);
    if (instr->CheckFlag(HValue::kCanOverflow)) {
      result = AssignEnvironment(result);
    }
    return result;
  } else if (instr->representation().IsDouble()) {
    return DoArithmeticD(Token::ADD, instr);
  } else {
    ASSERT(instr->representation().IsTagged());
    return DoArithmeticT(Token::ADD, instr);
  }
  return NULL;
}

As an aside, this is fairly typical for Crankshaft code: tight, lots of asserts, custom type predicates instead of RTTI, self-explanatory enumerated values, and much more oriented towards function calls than in-place mutations. The new objects are created in a "zone", under the covers, so that once the compiler is done, it can free the whole zone at once. I reflowed the left and right lines; otherwise it really is a paragraph of text.

Unfortunately there are not a lot of big-picture comments, hence this blog series.

Anyway, here we have three cases. The first two are for when the HValue has been has been given an unboxed representation, as an int32 or as a double. The last is for tagged values, which can be small integers (Smi values) or heap-allocated doubles. In these latter cases, we dispatch to the DoArithmeticD and DoArithmeticT helpers, which create generic LArithmeticD and LArithmeticT instructions. But we see that in the first case, there is a special LAddI instruction.

When there is a special case in Crankshaft, it either means that we are working around some strange corner of JavaScript, or that something is being optimized. In this case, it is an optimization, reflecting the importance of integer addition. The concrete optimizations are:

Immediate operands. If one of the addends is a constant at compile-time, it can be encoded into the instruction stream directly.
Overflow check removal. If the range analysis (see my previous article) showed that the value would not overflow, then no overflow check is emitted. (Otherwise, an overflow check will be inserted, branching to a deoptimization bailout. The bailout will need to know what variables are live, in that case, hence the AssignEnvironment call; more on that later.)

The UseRegisterAtStart helper call indicates that the operand should be in a register, and that it will be live at the start of the instruction but not the end. This latter constraint is needed because x86 instructions typically clobber the first operand. UseRegisterOrConstant only allocates a register if the operand is not constant and so cannot be immediate. Finally DefineSameAsFirst constrains the output register to be the same as the first operand. All of these calls additionally inform the register allocator of the uses and definitions, in SSA parlance, so that the allocator can build accurate liveness ranges.

Just to be complete, the double arithmetic case (not shown) is bit more straightforward, in that there are no overflow checks or immediates. Tagged arithmetic, on the other hand, always dispatches to an inline cache. (I find this to be fascinating. Typically language implementations inline fixnum operations, but V8 does not because their inline caches also perform type feedback, allowing unboxing optimizations where possible. In V8, tagged integer arithmetic is never inlined, by default.)

allocate all the registers?

Why bother with this Lithium thing? Why not just generate code directly? The answer is that having a target-specific Lithium language allows target-specific code generation and temporary allocation, while permitting the register allocator itself to be generic. As an example, see the ARM DoAdd implementation.

Most everyone knows, but the idea of register allocation is to store your temporaries in machine registers instead of in memory, so that things run fast. It's tricky though, because nested procedure calls can trash your registers, so you need to "spill" them to memory in certain cases. You also need to avoid thrashing values between registers and memory. V8 also tries to use registers when calling internal runtime helpers, which can result in a fair amount of register shuffling.

That's all traditional compiler stuff. But in a garbage-collected language, we also have the complication of knowing which registers hold garbage-collected values and which don't. For every point in the code in which GC could run, the collector will need to mark those pointers, and possibly relocate them. This complicates the allocator.

Another complication is that besides the normal entry point, optimized functions often have an additional entry point, due to on-stack replacement. That entry point has to take values from memory and pack them into registers in a way that lets the computation proceed.

Finally, the allocator plays a role in permitting so-called "lazy deoptimization". V8 hacker Søren Gjesse writes, in a rare but pleasantly exegetic message to the list:

Lazy deoptimization is what happens when all code in a context is forced to be deoptimized either due to starting debugging or due to a global assumption on the optimized code that no longer holds. The reason for the term "lazy" is that actual deoptimization will not happen until control returns to the optimized function. The way we do this lazy deoptimization is somewhat fragile and we have had quite a few bugs in that code. It is based on destructively patching the optimized code placing a call just after each call forcing deoptimization. And actually the patching is not right after the return point but after the gap code inserted at the return point. For the deferred stack check the code up to restoring the registers (popad in IA-32) is considered the gap code.

gaps?

The gap code that Søren referred to is the set of moves that shuffles registers back into place after a call. Translation of Hydrogen to Lithium inserts LGap instructions between each LInstruction. When the allocator decides that it needs to spill, restore, or shuffle some registers, it records these moves on an LGap.

A later pass serializes these parallel moves into a specific order, trying to use the minimum number of temporaries. I link to Xavier Leroy's paper above, not because V8 uses Coq, but because of the nice introduction to the problem.

Anyway, the actual algorithm used by the Lithium register allocator appears to be a version of linear-scan, with live range splitting. This part of V8 actually is well-commented, so if you can understand the Sarkar paper, then head on over to ye olde google code and check it out.

generate all the code!

The result of register allocation is that all LOperands have been assigned specific machine registers, and some moves have been recorded in the LGap instructions. Now, finally, we generate code: each Lithium instruction is traversed in order, and they all write assembly into a buffer. See the LAddI code generator, as an example.

All of the JavaScript engines generate code using assemblers written in C++. Some people still assume that compilers must generate assembly, in text, but it's much easier and faster just to call methods on an object that writes bytes into a buffer. This approach also makes it easy to add macro instructions, just by subclassing the assembler. Finally, since modern JS implementations have to muck around with the bits all the time (disassembly, source debugging, backtraces, garbage collection (tracing and moving), deoptimization, OSR, inline caches, type feedback), it's easiest to maintain your invariants if you control the bit generation.

V8's assembler originally came from the Strongtalk project, where some of the V8 hackers came from, long ago. This assembler writes code into memory, going forward, and then when it's finished, it appends "relocation information" to the code, written backwards. Relocation information lets V8 mark interesting places in the generated code: pointers that might need to be relocated (after garbage collection, for example), correspondences between the machine program counter and source locations, etc.

As anyone who's hacked on debuggers knows, the problem with debug information is that it is large. Without clever encodings, the amount of debugging data generated by a compiler can kill runtime performance, simply because of the amount of data that needs to be paged into memory. Of course this concern also applies at compile-time, and especially so for JIT compilers. The Strongtalk assembler's main contribution to V8 (I think) is a complicated bit-encoding that tries to compress this data as far as possible. The assembler also provides iterators for traversing this data at runtime.

finish line

And now, dear readers, I believe that we have come full circle on this look at V8. We started low, looking at the assembly generated by V8 for a simple loop. We elaborated this low-level view with a look at on-stack replacement, though without understanding how such information was available. So we jumped back to the top, taking a high-level look at V8's compilers. We drilled down to Hydrogen, and here we covered Lithium, register allocation, and code generation.

I hope it's been an entertaining series to read; it has certainly been enlightening to write. Thanks very much to my employer Igalia for supporting my work on this. Comments and corrections are welcome, as always, and happy hacking.

a closer look at crankshaft, v8's optimizing compiler

2011-08-02T21:21:01Z

Continuing in my series of articles on V8, Google's JavaScript engine, in this article I'd like to take a closer look at V8's optimizing compiler, with a focus on the Hydrogen intermediate language.

brief summary

As my readers might recall, Crankshaft is the marketing name of V8's optimizing compiler. It complements the "full" compiler by optimizing hot functions, while ignoring uncommon cases. It translates the JavaScript abstract syntax tree to a high-level static single-assignment (SSA) representation called Hydrogen, and tries to optimize that Hydrogen graph. Then the Hydrogen is translated to the machine-specific Lithium low-level language, which facilitates register allocation and, finally, code generation.

The point of the Hydrogen translation is threefold:

To permit inlining. Inlining is truly the mother of all optimizations, in that it permits more optimizations to occur.
To permit temporary values to be represented as untagged integer or double values. In JavaScript, you need type feedback to do this with any degree of success. Unboxing reduces memory pressure and permits Crankshaft to use efficient native machine instructions. But as you would expect, this unboxing needs to be done carefully.
To facilitate loop-invariant code motion and common subexpression elimination.

The Hydrogen SSA representation facilitates all of these operations. This article takes a closer look at how it does so.

static single assignment

(click for accessible SVG)

Hydrogen consists of values--instructions and phis--contained in basic blocks, which themselves are contained in a graph.

The standard SSA notation associates a named value with every expression:

t1 = x + y;
t2 = 3;
t3 = t1 + t2;
...

It's useful to write it out this way, but these temporary names are not strictly necessary: since each instruction defines a temporary, we can use the identity (address) of each instruction as its name, for the purpose of SSA.

A later pass will allocate storage for all of these temporaries -- on the stack, on the heap, or unboxed in general-purpose or SSE/Neon registers. At first I was quite confused about the term "instruction", as we functional programmers tend to use "expression", but it's something you get used to after a while, and it's useful when you use it to refer to the named definition in addition to the expression that produces it.

An instruction is a kind of value -- an entity that defines a name. There other kind of value is the phi. Phis occur at control-flow joins, and are logically associated with the beginning of a block. My previous article discusses phis in detail.

All values have a pointer to their containing block and to their uses. A use is when a later instruction uses the value computed (defined) by an earlier instruction. Values also have a pointer to the next instruction in their block. It is cheaper to traverse instructions forward, in the direction of control and data flow, than it is to go backward. Values also have fields for their type, their representation, a flags bitfield, and a pointer to a range structure. We will look at these in more detail later.

The last instruction in a block is a control instruction. No one uses the value defined by a control expression -- their purpose is to transfer control to another basic block. The set of possible blocks that may follow a block in control-flow order are the successors of a block. Preceding blocks are predecessors. Every block has an array of pointers to its predecessors and successors, and also to its dominator block, and to all the blocks that it dominates.

Now, friends, when I first saw this I was disgusted. What a rat's nest of pointers! But then I read Kennedy's great Compiling With Continuations, Continued paper and it became clear that it is very difficult to have an efficient functional-style intermediate representation. The diagrams above are from a part of his paper in which he suggests a mutable IR. In my opinion, if he's gone that far, he might as well finish the job and eliminate temporary identifiers entirely, as Hydrogen does.

electrolysis

Crankshaft has a funny job to do. Not only does it have to optimize JavaScript, which is hard enough; it also has to do so in a way that permits on-stack replacement to optimize a long-running loop, without exiting the loop.

In practice what this means is that Crankshaft needs to walk the abstract syntax tree (AST) to emit Hydrogen code in the same way as the full compiler. It needs to simulate what the full compiler is doing: what is on the stack, which variables are heap-allocated and which are not, and, for named local variables, what HValue is bound to that variable. This simulation is kept in an environment.

This AST-to-Hydrogen translation process also handles phi insertion. So unlike the textbook dominance-frontier algorithm for optimal phi insertion, the HGraphBuilder simply adds phi values whenever a block has more than one predecessor, for every variable in the environment that is not always bound to the same value. For loops, a phi is added for every variable live in the environment at that time. A later pass attempts to prune the number of phis, where it can.

type feedback

The AST that was given to the crankshaft compiler has integer identifiers assigned to each node. These identifiers are assigned predictably, so they are the same identifiers that the "full-codegen" compiler saw, and they are embedded in the inline caches used by the full-codegen code. This correspondence allows crankshaft to know the types of the values that have been seen at any given call site, property access, etc.

This information is used by the graph builder to initialize the type field of each HValue, to help in type inference. Additionally, property accesses and function calls can record the precise type of object being accessed, or the function that was called, so that Crankshaft has an opportunity to inline these function and method calls.

inlining

When the graph builder reaches a call, it can try to inline it. Note that this happens very early, before any optimizations are made on the source code. This allows the optimizations to have more visibility, and allows for more code motion. (In general, V8 will not move instructions across a call boundary.)

There are lots of heuristics and conditions that affect inlining, and these are in flux. Some conditions that must currently be met for inlining to proceed are:

The function to be inlined is less than 600 characters long.
Neither the inner nor the outer functions may contain heap-allocated variables. (In practice, this means that neither function may contain lexical closures.)
Inlining of for-in, with, and some other expression types is not currently supported.
There is a limit on the maximum inlining depth.
A function's call to itself will not be inlined.
There is a limit to the number of nodes added to the AST.

source-to-source optimization passes

As parsing proceeds, Crankshaft generates specific HInstruction nodes and packs them into HBasicBlocks. There is a large set of about 120 instructions, which allows Crankshaft to reason precisely about the various operations and their representations.

At this point, inlining is complete, so Crankshaft moves to focus on representation of temporary values. The goal is to unbox as many values as possible, allowing the use of native machine instructions, and decreasing the allocation rate of tagged doubles. The Hydrogen SSA language directly facilitates this control-flow and data-flow analysis.

The following sections discuss the various source-to-source passes that are made on the Hydrogen IR. It's a lot of detail, but I'm sure that someone will find it useful.

block reordering

Crankshaft reorders the array of blocks in the HGraph to appear in reverse post-order, equivalent to a topological sort. In this way, iterating the array of blocks from beginning to end traverses them in data-flow order.

dominator assignment

The reverse post-order sort from the previous step facilitates calculation of the dominator for each block. The entry block dominates its successors. Other blocks are iterated over, in order, assigning them the dominator of their predecessors, or the common dominator of their predecessors if there is more than one.

mark dead subgraphs

It can often be the case that a function being optimized does not have full type information. Some branches in the original function may never have been executed, and so code on those branches does not have any type feedback. If the parser reaches such a branch -- and it will know at parse-time, because that's when it has the type-feedback information -- then it inserts an unconditional soft deoptimization, to force the collection of more type information, should that branch be reached in optimized code.

However, instead of causing a branch to a terminal deoptimization block, the parse continues, presumably because Crankshaft needs to proceed with its abstract interpretation of the AST. For that reason, the soft deoptimization appears in the instruction stream, not as the last instruction in the block.

To avoid the compiler wasting its time trying to optimize loops and move code in parts of the graph that have never been reached, this dead-subgraph pass puts a mark on all blocks dominated by a block containing an unconditional soft deoptimization. As you can see, this analysis is facilitated by the SSA dominator relation.

redundant phi elimination

The naïve phi placement algorithm detailed above can put phis where they are not needed. This pass eliminates phi values whose input is always the same, replacing the uses with the use of the value.

Like a number of phi-centered optimizations, elimination of one phi can expose other phis for elimination. For this reason, this and many other algorithms use a worklist.

In this case, all phis are placed on a work list. Then, while there are any phis on the worklist, a phi is taken off the list, and if it is eliminatable, it is replaced. Now here's the trick: if the phi was in turn used by any other phi, the phi uses are pushed onto the list, if they weren't there already. Iteration proceeds until the worklist is empty.

To keep the big-O order of worklist operations down, you need a flag for each element indicating whether the element is already in the list. The bit should either be in the objects being traversed, or in a separate bitvector. In this particular case -- and in this case only, as far as I can tell -- there is no such bit, so the algorithm can take too much time, in certain cases.

dead phi elimination, and phi collection

Some phi values are dead, having no real uses, neither direct nor indirect (through some other phi). This pass eliminates any phi that doesn't have a real use. It's somewhat like a garbage collection algorithm, and also uses a worklist.

Once this pass has run, we have minimized the set of phis, so we can collect them all into an array and assign them indices. These indices can be used in the future bitvectors, to contain worklist-driven algorithms.

As the phis are collected, if any undefined value is found to reach a phi, or arguments can reach a phi only via some inputs, then optimization is aborted entirely.

representation inference

Every HValue has a type field, representing a coarse approximation of the types seen at a particular AST node, and a representation field, representing a specific choice of how to represent a value. The goal, as I have said, is to represent temporaries as untagged int32 or double values. Type and representation are related, but they are treated separately in Crankshaft. This pass is about representation inference.

Firstly, Crankshaft looks at all of the phis, and sees which ones are connected. This is an N² fixed-point iteration. Then, for each phi, if any operand of the phi or any connected phi may not be converted to an int32, Crankshaft pessimistically marks the phi as not convertible to an int32. This pass tries to prevent too many representation changes for phi values.

Finally, for all values with flexible representation--phis, Math.abs, and binary bitwise and arithmetic operations--Crankshaft determines the representation to use for that value. If the representation can be inferred directly from the inputs, then that representation is used. Otherwise, some heuristics are run on the representations needed by the use sites of the value. If all uses are untagged, then the representation will be untagged, as an int32 if possible, unless any use treats the value as a double.

The end result is that, hopefully, many values have been unboxed to int32 or double values.

range analysis

This pass looks at each value, and tries to determine its range, if Crankshaft was actually able to allocate a specialized representation for it. In practice this is mainly used for values represented as integers. It allows HInstructions to assert various properties about the instruction, such as lack of overflow, or lack of negative zero values. These properties can influence code generation.

The algorithm propagates constraints forward on all control paths, so that control instructions can produce different ranges for the same value on different control paths. Ranges reaching a phi node are unioned together.

type inference and canonicalization

I mentioned that representation was related to type. Here I should be more specific. In Crankshaft, types are logically associated with objects. One can treat any object as an array, for the purposes of a calculation, but that treatment does not coerce the object itself to another type. Representation is about how to use an object; type is about the object itself.

Type inference can help eliminate runtime checks. If we can infer that a certain value is not only a tagged value, but is not a smi (small integer--a fixnum), then we can access its map directly without having to do the smi check. If we are storing an element in an array, and we know it is a smi, we don't need to emit a write barrier. So, type inference is another forward control-flow propagation algorithm, with a fixed-point iteration for phis in loop headers.

After type inference is run, each instruction in the whole graph is asked to canonicalize itself, which basically eliminates useless instructions: HToInt32 of a value with int32 representation, HCheckNonSmi of a value with non-smi type, etc.

deoptimization marks

When can you coerce undefined to an unboxed double? When you need a double for math, but you aren't going to eventually compare it to another double. The problem is that undefined coerces to NaN, but undefined == undefined and NaN != NaN. So, crankshaft marks any value used by a numeric compare as deoptimize-on-undefined. If the value is a phi, all connected phis must check for undefined, which can be a pretty big lose. But hey, that's JavaScript!

representation changes

Up to this point we've been dealing with values on a fairly high level, assuming that all of the definitions will be compatible with the uses. But as we chose to represent some temporaries as untagged values, there will need to be explicit conversions between representations -- hopefully not too many, but they will be needed. This pass inserts these representation changes.

This pass uses the truncating-to-int32 and deoptimize-on-undefined flags calculated in previous phases, to tell the changes what conditions they need to watch out for. Representation changes for phis happen just before branching to the block with the phi -- effectively, just before calling the block.

minus zero checks

The previous pass probably inserted a number of HChange instructions. For any HChange from int32 (which must be to tagged or double), Crankshaft traces dataflow backwards to mark the HChange(s) that produced the int32 as deoptimizing on negative zero. This allows parts of the graph to use fast int32 operations, while preserving the semantics of -0 should it appear, albeit more slowly (via a deoptimization).

stack check elimination

Loops need to be interruptable, and V8 does so by placing a stack check at the beginning of each loop iteration. If the runtime wants to interrupt a loop, it resets the stack limit of the process, and waits for the process's next stack check.

However if a call dominates all backward branches of a loop, then the loop can be interrupted by the stack check in the callee's prologue, so the stack check in the loop itself is unneeded and can be removed.

global value numbering: licm

I've been trying to get here for all this long page. Thanks to readers that have kept with it!

At this point, Hydrogen has served two of its three purposes: inlining and unboxing. The third is loop-invariant code motion (LICM) and common subexpression elimination (CSE). These important optimizations are performed by the perhaps misnamed global value numbering (GVN) pass.

LICM is about code motion: moving code out of hot loops. So what prevents an HInstruction from being moved? Two things: data dependencies, and side effects.

The LICM algorithm takes care of both of these. For every instruction in a loop, if it only depends on values defined outside the loop, and the side effects of the loop as a whole do not prevent the move, then LICM hoists the instruction to the loop header. If you do this in reverse post-order, you can hoist whole chains of expressions out of loops.

Every hydrogen instruction has a bitfield indicating the side effects that it causes, and a corresponding bitfield for the side effects that can affect it. In addition there is a flag indicating whether code motion is allowed for that instruction or not. These flags constitute a simple effects analysis -- if instruction B directly follows A, and A does not cause any side effects that B depends on, and B does not use the value from A, then B may be moved up before A.

Since the effects fields are simple bitfields, one can easily compute the side effects of a whole block by iterating over the instructions, doing a bitwise OR at each step. This is what LICM does -- it computes the side effects for the entire loop, and moves any instructions that it can to the loop header.

global value numbering: cse

Global value numbering, as implemented in V8, uses this technique in the opposite direction: instead of trying to hoist expressions up, it propagates them down the control-flow graph, trying to kill common subexpressions. A common subexpression is a Hydrogen value B that is "equal" in some sense to a previous instruction A, and whose value will not be affected by any side effects that occur on the control-flow paths between A and B. In this context, "value numbering" essentially means "define appropriate hash and equality functions and stick all the instructions you've seen in a hash table."

As GVN goes through the instructions of each block, in reverse-post-order, it puts them into this hash table (HValueMap). When GVN reaches an instruction that causes side effects, it flushes out the entries that depend on those side effects from the table. In that way, if ever it sees an instruction B that is equal to an instuction A in the table, GVN can remove B from its block, and replace every use of B with a use of A.

In standard SSA terminology you can consider instructions to be "definitions", because each instruction defines a value. Global value numbering works across basic blocks, hence the "global" in its name. Specifically, it eliminates equivalent definitions. One of the invariants of SSA is that "definitions dominate uses". So the part of the graph that should be searched for equivalent definitions is the part that is dominated by the original definition.

checked value replacement

In optimized code, accessing an array out of its bounds will cause a deoptimization. So the AST-to-Hydrogen translation will not emit a use of an index directly -- it inserts a bounds check, and has the array access use the result of the bounds check. But the purpose of this instruction is just to cause the side effect of deoptimization, and the resulting value is just a copy of the incoming index, which increases register allocation pressure for no reason. So now that code motion is complete, Crankshaft replaces uses of HBoundsCheck with the instruction that defines the index value, relying on the HBoundsCheck's position in the instruction stream to cause deopt as needed.

conclusions

In the end, the practical goals of Crankshaft are simple, and the optimizations are too, for the most part. Java people like to trot out all their gerundic compiler passes (see slide 7) and declare victory, and it is true that V8 and other modern JavaScript implementations will be beaten by the HotSpot JVM -- once it is allowed to warm up, of course. (V8's 5ms startup is particularly impressive, here.) But Crankshaft definitely has the basics, and throws in some range analysis to squeeze out a few more percent on top.

Finally, if you've gotten here, thanks again for reading this far. Comments and corrections are most welcome.

I have enjoyed being able to do this serious code reading, and some small bit of V8 hacking, as part of my work at Igalia. I think I'll have one more article about the platform-specific Lithium language, and its register allocator, and then see where I should move on from there. Happy hacking to all, and to all a good hack!