17 January 2018 10:30 AM (guile | compilers | instruction explosion | loop optimizations | cps)
Greetings, fellow Schemers and compiler nerds: I bring fresh nargery!
A couple years ago I made a list of compiler tasks for Guile. Most of these are still open, but I've been chipping away at the one labeled "instruction explosion":
Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do this, which would be a lose overall because you have more instruction dispatch.
Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.
Now that I'm getting close to finished I wanted to share some thoughts. Previous progress reports on the mailing list.
a simple loop
As an example, consider this loop that sums the 32-bit floats in a bytevector. I've annotated the code with lines and columns so that you can correspond different pieces to the assembly.
0 8 12 19 +-v-------v---v------v- | 1| (use-modules (rnrs bytevectors)) 2| (define (f32v-sum bv) 3| (let lp ((n 0) (sum 0.0)) 4| (if (< n (bytevector-length bv)) 5| (lp (+ n 4) 6| (+ sum (bytevector-ieee-single-native-ref bv n))) 7| sum)))
The assembly for the loop before instruction explosion went like this:
L1: 17 (handle-interrupts) at (unknown file):5:12 18 (uadd/immediate 0 1 4) 19 (bv-f32-ref 1 3 1) at (unknown file):6:19 20 (fadd 2 2 1) at (unknown file):6:12 21 (s64<? 0 4) at (unknown file):4:8 22 (jnl 8) ;; -> L4 23 (mov 1 0) at (unknown file):5:8 24 (j -7) ;; -> L1
So, already Guile's compiler has hoisted the (bytevector-length bv) and unboxed the loop index n and accumulator sum. This work aims to simplify further by exploding bv-f32-ref.
exploding the loop
In practice, instruction explosion happens in CPS conversion, as we are converting the Scheme-like Tree-IL language down to the CPS soup language. When we see a Tree-Il primcall (a call to a known primitive), instead of lowering it to a corresponding CPS primcall, we inline a whole blob of code.
In the concrete case of bv-f32-ref, we'd inline it with something like the following:
(unless (and (heap-object? bv) (eq? (heap-type-tag bv) %bytevector-tag)) (error "not a bytevector" bv)) (define len (word-ref bv 1)) (define ptr (word-ref bv 2)) (unless (and (<= 4 len) (<= idx (- len 4))) (error "out of range" idx)) (f32-ref ptr len)
As you can see, there are four branches hidden in the bv-f32-ref: two to check that the object is a bytevector, and two to check that the index is within range. In this explanation we assume that the offset idx is already unboxed, but actually unboxing the index ends up being part of this work as well.
One of the goals of instruction explosion was that by breaking the operation into a number of smaller, more orthogonal parts, native code generation would be easier, because the compiler would only have to know about those small bits. However without an optimizing compiler, it would be better to reify a call out to a specialized bv-f32-ref runtime routine instead of inlining all of this code -- probably whatever language you write your runtime routine in (C, rust, whatever) will do a better job optimizing than your compiler will.
But with an optimizing compiler, there is the possibility of removing possibly everything but the f32-ref. Guile doesn't quite get there, but almost; here's the post-explosion optimized assembly of the inner loop of f32v-sum:
L1: 27 (handle-interrupts) 28 (tag-fixnum 1 2) 29 (s64<? 2 4) at (unknown file):4:8 30 (jnl 15) ;; -> L5 31 (uadd/immediate 0 2 4) at (unknown file):5:12 32 (u64<? 2 7) at (unknown file):6:19 33 (jnl 5) ;; -> L2 34 (f32-ref 2 5 2) 35 (fadd 3 3 2) at (unknown file):6:12 36 (mov 2 0) at (unknown file):5:8 37 (j -10) ;; -> L1
The first thing to note is that unlike the "before" code, there's no instruction in this loop that can throw an exception. Neat.
Next, note that there's no type check on the bytevector; the peeled iteration preceding the loop already proved that the bytevector is a bytevector.
And indeed there's no reference to the bytevector at all in the loop! The value being dereferenced in (f32-ref 2 5 2) is a raw pointer. (Read this instruction as, "sp = *(float*)((byte*)sp + (uptrdiff_t)sp)".) The compiler does something interesting; the f32-ref CPS primcall actually takes three arguments: the garbage-collected object protecting the pointer, the pointer itself, and the offset. The object itself doesn't appear in the residual code, but including it in the f32-ref primcall's inputs keeps it alive as long as the f32-ref itself is alive.
Then there are the limitations. Firstly, instruction 28 tags the u64 loop index as a fixnum, but never uses the result. Why is this here? Sadly it's because the value is used in the bailout at L2. Recall this pseudocode:
(unless (and (<= 4 len) (<= idx (- len 4))) (error "out of range" idx))
Here the error ends up lowering to a throw CPS term that the compiler recognizes as a bailout and renders out-of-line; cool. But it uses idx as an argument, as a tagged SCM value. The compiler untags the loop index, but has to keep a tagged version around for the error cases.
The right fix is probably some kind of allocation sinking pass that sinks the tag-fixnum to the bailouts. Oh well.
Additionally, there are two tests in the loop. Are both necessary? Turns out, yes :( Imagine you have a bytevector of length 1025. The loop continues until the last ref at offset 1024, which is within bounds of the bytevector but there's one one byte available at that point, so we need to throw an exception at this point. The compiler did as good a job as we could expect it to do.
is is worth it? where to now?
On the one hand, instruction explosion is a step sideways. The code is more optimal, but it's more instructions. Because Guile currently has a bytecode VM, that means more total interpreter overhead. Testing on a 40-megabyte bytevector of 32-bit floats, the exploded f32v-sum completes in 115 milliseconds compared to around 97 for the earlier version.
On the other hand, it is very easy to imagine how to compile these instructions to native code, either ahead-of-time or via a simple template JIT. You practically just have to look up the instructions in the corresponding ISA reference, is all. The result should perform quite well.
I will probably take a whack at a simple template JIT first that does no register allocation, then ahead-of-time compilation with register allocation. Getting the AOT-compiled artifacts to dynamically link with runtime routines is a sufficient pain in my mind that I will put it off a bit until later. I also need to figure out a good strategy for truly polymorphic operations like general integer addition; probably involving inline caches.
So that's where we're at :) Thanks for reading, and happy hacking in Guile in 2018!