whiffle, a purpose-built scheme

Yesterday I promised an apology but didn’t actually get past the admission of guilt. Today the defendant takes the stand, in the hope that an awkward cross-examination will persuade the jury to take pity on a poor misguided soul.

Which is to say, let’s talk about Whiffle: what it actually is, what it is doing for me, and why on earth it is that [I tell myself that] writing a new programming language implementation is somehow preferable to re-using an existing one.

garbage collection is my passion

Whiffle is purpose-built to test the Whippet garbage collection library.

Whiffle lets me create Whippet test cases in C, without actually writing C. C is fine and all, but the problem with it and garbage collection is that you have to track all stack roots manually, and this is an error-prone process. Generating C means that I can more easily ensure that each stack root is visitable by the GC, which lets me make test cases with more confidence; if there is a bug, it is probably not because of an untraced root.

Also, Whippet is mostly meant for programming language runtimes, not for direct use by application authors. In that use case, you can probably use less “active” mechanisms for ensuring root traceability: instead of eagerly recording live values in some kind of handle scope, you can keep a side table that is only consulted as needed during garbage collection pauses. In particular, since Scheme uses the stack as a data structure, I was worried that using handle scopes would somehow distort the performance characteristics of the benchmarks.

Whiffle is not, however, a high-performance Scheme compiler. It is not for number-crunching, for example: garbage collectors don’t care about that, so let’s not. Also, Whiffle should not go to any effort to remove allocations (sroa / gvn / cse); creating nodes in the heap is the purpose of the test case, and eliding them via compiler heroics doesn’t help us test the GC.

I settled on a baseline-style compiler, in which I re-use the Scheme front-end from Guile to expand macros and create an abstract syntax tree. I do run some optimizations on that AST; in the spirit of the macro writer’s bill of rights, it does make sense to provide some basic reductions. (These reductions can be surprising, but I am used to Guile’s flavor of cp0 (peval), and this project is mostly for me, so I thought it was risk-free; I was almost right!).
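To give a flavor of the kind of reduction I mean, here is a made-up example, not Whiffle’s actual output:

(define (f)
  (let ((add (lambda (a b) (+ a b))))
    (add 1 2)))

;; after inlining add and folding the constants, the body reduces to:
(define (f) 3)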

Anyway, the result is that Whiffle uses an explicit stack. A safepoint for a thread simply records its stack pointer: everything between the stack base and the stack pointer is live. I do have a lingering doubt about the representativeness of this compilation strategy; would a conclusion drawn from Whiffle apply to Guile, which uses a different stack allocation strategy? I think probably so, but it’s an unknown.

what’s not to love

Whiffle also has a number of design goals that are better formulated in the negative. I mentioned compiler heroics as one thing to avoid, and in general the desire for a well-understood correspondence between source code and run-time behavior has a number of other corollaries: Whiffle is a pure ahead-of-time (AOT) compiler, as just-in-time (JIT) compilation adds noise. Worse, speculative JIT would add unpredictability, which, while good on the whole, would be anathema to understanding an isolated piece of a system like the GC.

Whiffle also should produce stand-alone C files, without a thick run-time. I need to be able to understand and reason about the residual C programs, and depending on third-party libraries would hinder this goal.

Oddly enough, users are also an anti-goal: as a compiler that only exists to test a specific GC library, there is no sense in spending too much time making Whiffle nicer for other humans, humans whose goal is surely not just to test Whippet. Whiffle is an interesting object, but is not meant for actual use or users.

corners: cut

Another anti-goal is completeness with regards to any specific language standard: the point is to test a GC, not to make a useful Scheme. Therefore Whiffle gets by just fine without flonums, fractions, continuations (delimited or otherwise), multiple return values, ports, or indeed any library support at all. All of that just doesn’t matter for testing a GC.

That said, it has been useful to be able to import standard Scheme garbage collection benchmarks, such as earley or nboyer. These have required very few modifications to run in Whiffle, mostly related to Whiffle’s test harness, which wants to spawn multiple threads.

and so?

I think this evening we have elaborated a bit more on the “how”, complementing yesterday’s note about the “what”. Tomorrow (?) I’ll see if I can dig in more to the “why”: what questions does Whiffle let me ask of Whippet, and how good of a job does it do at getting those answers? Until then, may all your roots be traced, and happy hacking.

cross-module inlining in guile

Greetings, hackers of spaceship Earth! Today's missive is about cross-module inlining in Guile.

a bit of history

Back in the day... what am I saying? I always start these posts with loads of context. Probably you know it all already. 10 years ago, Guile's partial evaluation pass extended the macro-writer's bill of rights to Schemers of the Guile persuasion. This pass makes local function definitions free in many cases: if they should be inlined and constant-folded, you are confident that they will be. peval lets you write clear programs with well-factored code and still have good optimization.
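For example, consider a local helper like this:

(define (area r)
  (define (square x) (* x x))
  (* 314 (square r)))

;; peval inlines square, leaving something like:
;; (define (area r) (* 314 (* r r)))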

The peval pass did have a limitation, though, which wasn't its fault. In Guile, modules have historically been a first-order concept: modules are a kind of object with a hash table inside, which you build by mutating. I speak crassly but that's how it is. In such a world, it's hard to reason about top-level bindings: what module do they belong to? Could they ever be overridden? When you have a free reference to a, and there's a top-level definition of a in the current compilation unit, is that the a that's being referenced, or could it be something else? Could the binding be mutated in the future?

During the Guile 2.0 and 2.2 cycles, we punted on this question. But for 3.0, we added the notion of declarative modules. For these modules, bindings which are defined once in a module and which are not mutated in the compilation unit are declarative bindings, which can be reasoned about lexically. We actually translate them to a form of letrec*, which then enables inlining via peval, contification, and closure optimization -- in descending order of preference.
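Concretely, you can picture a declarative module as if its definitions were spliced into a letrec*. A sketch, with a hypothetical module name:

(define-module (math) #:export (double quadruple))

(define (double x) (* x 2))
(define (quadruple x) (double (double x)))

;; conceptually, the declarative bindings behave as if they were:
;;
;; (letrec* ((double (lambda (x) (* x 2)))
;;           (quadruple (lambda (x) (double (double x)))))
;;   ...)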

The upshot is that with Guile 3.0, top-level bindings are no longer optimization barriers, in the case of declarative modules, which are compatible enough with historic semantics and usage that they are on by default.

However, module boundaries have still been an optimization barrier. Take (srfi srfi-1), a little utility library on lists. One definition in the library is xcons, which is cons with arguments reversed. It's literally (lambda (cdr car) (cons car cdr)). But does the compiler know that? Would it know that (car (xcons x y)) is the same as y? Until now, no, because no part of the optimizer will look into bindings from outside the compilation unit.
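To make that concrete, here is a hypothetical use site:

(use-modules (srfi srfi-1))

;; xcons is literally (lambda (cdr car) (cons car cdr)), so this
;; procedure always returns y -- but without looking across the module
;; boundary, the optimizer can't know that.
(define (second-of x y)
  (car (xcons x y)))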

mr compiler, tear down this wall

But no longer! Guile can now inline across module boundaries -- in some circumstances. This feature will be part of a future Guile 3.0.8.

There are actually two parts of this. One is that the compiler can identify a set of "inlinable" values from a declarative module. An inlinable value is a small copyable expression. A copyable expression has no identity (it isn't a fresh allocation), and doesn't reference any module-private binding. Notably, lambda expressions can be copyable, depending on what they reference. The compiler then extends the module definition that's residualized in the compiled file to include a little procedure that, when passed a name, will return the Tree-IL representation of that binding. The design of that was a little tricky; we want to avoid overhead when using the module outside of the compiler, even relocations. See compute-encoding in that module for details.

With all of that, we can call ((module-inlinable-exports (resolve-interface '(srfi srfi-1))) 'xcons) and get back the Tree-IL equivalent of (lambda (cdr car) (cons car cdr)). Neat!

The other half of the facility is the actual inlining. Here we lean on peval again, causing <module-ref> forms to trigger an attempt to copy the term from the imported module to the residual expression, limited by the same effort counter as the rest of peval.

The end result is that we can be absolutely sure that constants in imported declarative modules will inline into their uses, and fairly sure that "small" procedures will inline too.

caveat: compiled imported modules

There are a few caveats about this facility, and they are sufficiently sharp that I should probably fix them some day. The first one is that for an imported module to expose inlinable definitions, the imported module needs to have been compiled already, not loaded from source. When you load a module from source using the interpreter instead of compiling it first, the pipeline is optimized for minimizing the latency between when you ask for the module and when it is available. There's no time to analyze the module to determine which exports are inlinable and so the module exposes no inlinable exports.

This caveat is mitigated by automatic compilation, enabled by default, which will compile imported modules as needed.

It could also be fixed by topologically ordering the module compilation sequence; this would allow some parallelism in the build, though less than before, and for module graphs with cycles (these exist!) you'd still have some weirdness.

caveat: abi fragility

Before Guile supported cross-module inlining, the only inlining across modules was explicit, facilitated by macros. If you write a module that has a define-inlinable export and you think about its ABI, then you know to consider any definition referenced by the inlinable export, and you know by experience that its code may be copied into other compilation units. Guile doesn't automatically recompile a dependent module when a macro that it uses changes, currently anyway. Admittedly this situation leans more on culture than on tools, which could be improved.
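For reference, explicit inlining with define-inlinable looks something like this (a hypothetical module):

(define-module (my list-utils)
  #:export (first-or-false))

;; define-inlinable is a macro that arranges for the body to be copied
;; into call sites, even in other compilation units.
(define-inlinable (first-or-false lst)
  (if (pair? lst) (car lst) #f))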

However, with automatically inlinable exports, this changes. Any definition in a module could be inlined into its uses in other modules. This may alter the ABI of a module in unexpected ways: you think that module C depends on module B, but after inlining it may depend on module A as well. Updating module B might not update the inlined copies of values from B into C -- as in the case of define-inlinable, but less lexically apparent.

At higher optimization levels, even private definitions in a module can be inlinable; these may be referenced if an exported macro from the module expands to a term that references a module-private variable, or if an inlinable exported binding references the private binding. But these optimization levels are off by default precisely because I fear the bugs.

Probably what this cries out for is some more sensible dependency tracking in build systems, but that is a topic for another day.

caveat: reproducibility

When you make a fresh checkout of Guile from git and build it, the build proceeds in the following way.

Firstly, we build libguile, the run-time library implemented in C.

Then we compile a "core" subset of Scheme files at optimization level -O1. This subset should include the evaluator, reader, macro expander, basic run-time, and compilers. (There is a bootstrap evaluator, reader, and macro expander in C, to start this process.) Say we have source files S0, S1, S2 and so on; generally speaking, these files correspond to Guile modules M0, M1, M2 etc. This first build produces compiled files C0, C1, C2, and so on. When compiling a file S2 implementing module M2, which happens to import M1 and M0, it may be that M1 and M0 are provided by compiled files C1 and C0; or possibly they are loaded from the source files S1 and S0; or from C1 and S0; or from S1 and C0.

The bootstrap build uses make for parallelism, with each compile process starting afresh: importing all the modules that comprise the compiler and then using them to compile the target file. As the build proceeds, more and more module imports can be "serviced" by compiled files instead of source files, making the build go faster and faster. However, this introduces system-specific nondeterminism as to the set of compiled files available when compiling any other file. This strategy works because it doesn't really matter whether module M1 is provided by compiled file C1 or source file S1; the compiler and the interpreter implement the same language.

Once the compiler is compiled at optimization level -O1, Guile then uses that freshly built compiler to build everything at -O2. We do it in this way because building some files at -O1 then all files at -O2 takes less time than going straight to -O2. If this sounds weird, that's because it is.

The resulting build is reproducible... mostly. There is a bug in which some unique identifiers generated as part of the implementation of macros can be non-reproducible in some cases; disabling parallel builds seems to solve the problem. The issue is that gensym (or equivalent) might be called a different number of times depending on whether you are loading a compiled module, or whether you need to read and macro-expand it. The resulting compiled files are equivalent under alpha-renaming but not bit-identical. This is a bug to fix.

Anyway, at optimization level -O1, Guile will record inlinable definitions. At -O2, Guile will actually try to do cross-module inlining. We run into two issues when compiling Guile; one is if we are in the -O2 phase, and we compile a module M which uses module N, and N is not in the set of "core" modules. In that case depending on parallelism and compile order, N may be loaded from source, in which case it has no inlinable exports, or from a compiled file, in which case it does. This is not a great situation for the reliability of this optimization. I think probably in Guile we will switch so that all modules are compiled at -O1 before compiling at -O2.

The second issue is more subtle: inlinable bindings are recorded after optimization of the Tree-IL. This is better than recording inlinable bindings before optimization, as a term that is not inlinable due to its size in its initial form may become small enough to inline after optimization. However, at -O2, optimization includes cross-module inlining! A term that is inlinable at -O1 may become not inlinable at -O2 because it gets slightly larger, or vice-versa: terms that are too large at -O1 could shrink at -O2. We don't even have a guarantee of reaching a fixed point if we repeatedly recompile all inputs at -O2, because we allow non-shrinking optimizations.

I think this probably calls for a topological ordering of module compilation inside Guile, and perhaps in other projects as well. That would at least give us reproducibility, provided we avoid the feedback loop of keeping around -O2 files compiled from a previous round, even if they are "up to date" (their corresponding source file didn't change).

and for what?

People who have worked on inliners will know what I mean when I say that a good inliner is like a combine harvester: ruthlessly efficient, a qualitative improvement compared to not having one, but there is a pointy end with lots of whirling blades and it's important to stop at the end of the row. You do develop a sense of what will and won't inline, and I think Dybvig's "Macro writer's bill of rights" encompasses this sense. Luckily people don't lose fingers or limbs to inliners, but inliners can maim expectations, and cross-module inlining more so.

Still, what it buys us is the freedom to be abstract. I can define a module like:

(define-module (elf)
  #:export (ET_NONE ET_REL ET_EXEC ET_DYN ET_CORE))

(define ET_NONE		0)		; No file type
(define ET_REL		1)		; Relocatable file
(define ET_EXEC		2)		; Executable file
(define ET_DYN		3)		; Shared object file
(define ET_CORE		4)		; Core file

And if a module uses my (elf) module and references ET_DYN, I know that the module boundary doesn't prevent the value from being inlined as a constant (and possibly unboxed, etc).

I took a look, and on our usual microbenchmark suite cross-module inlining doesn't make a difference. But that's both a historical oddity and a bug: firstly, the benchmark suite comes from an old Scheme world that didn't have modules, and so won't benefit from cross-module inlining. Secondly, Scheme definitions from the "default" environment that aren't explicitly recognized as primitives aren't inlined, as the (guile) module isn't declarative. (Probably we should fix the latter at some point.)

But still, I'm really excited about this change! Guile developers use modules heavily and have been stepping around this optimization boundary for years. I count 100 direct uses of define-inlinable in Guile, a number of them inside macros, and many of these are to explicitly hack around the optimization barrier. I really look forward to seeing if we can remove some of these over time, to go back to plain old define and just trust the compiler to do what's needed.

by the numbers

I ran a quick analysis of the modules included in Guile to see what the impact was. Of the 321 files that define modules, 318 of them are declarative, and 88 contain inlinable exports (27% of the total). Of the 6519 total bindings exported by declarative modules, 566 are inlinable (8.7%). Of the inlinable exports, 388 (69%) are functions (lambda expressions), 156 (28%) are constants, and 22 (4%) are "primitives" referenced by value and not by name, meaning definitions like (define head car) (instead of re-exporting car as head).

On the use side, 90 declarative modules import inlinable bindings (29%), resulting in about 1178 total attempts to copy inlinable bindings. 902 of those attempts are to copy lambda expressions in operator position, which means that peval will attempt to inline their code. 46 of these attempts fail, perhaps due to size or effort constraints. 191 other attempts end up inlining constant values. 20 inlining attempts fail, perhaps because a lambda is used as a value. Somehow, 19 copied inlinable values get elided because they are evaluated only for their side effects, probably to clean up let-bound values that become unused due to copy propagation.

All in all, an interesting endeavor, and one to improve on going forward. Thanks for reading, and catch you next time!

doing it wrong: cse in guile

Greetings, readers! It's been a little while, but not because I haven't been doing anything, nor because there was nothing worth writing about. No, the reason I haven't written recently is that the perceived range of my ignorance has been growing faster than the domain of my expertise.

Knowledge may well be something one can dominate, but ignorance must forever be a range, stretching off to a hazy horizon. Climbing the hill of JavaScript implementations has let me see farther out on that plain. I think it's only the existence of the unknown unknowns that can let one muster up the arrogance to write at all.

But back to domains and dominators later. To begin, I note that there is very little in the way of correct, current, and universal folk wisdom as to how to implement a programming language. Compiler hackers are priests of their languages, yes, but their temples are in reality more or less isolated cults, in which the values of one's Gods may be unknown or abhorrent to those of others. Witness the attention paid to loop optimizations in some languages, compared to garbage collection in others, or closures in still others.

In my ecumenical capacity as abbot of Guile and adjunct deacon of JavaScriptCore, sometimes I'm tempted to mix rites: to sprinkle the holy water of lexical scope optimizations on JS, and, apropos of this article, to exorcise common subexpressions in Scheme.

As one might well imagine, the rites of one cult must be adapted to circumstances. I implemented CSE for Guile, but I don't know if it was actually a win. In this article I'll go into what CSE is, how it works in Guile, and why it probably won't survive in its present form.

cse: common subexpression elimination

I implemented a source-to-source optimization pass in Guile that eliminates common subexpressions. It actually does both more and less than that: it propagates predicates and eliminates effect-free statements as well, and these latter optimizations are why I implemented the pass in the first place.

Let me give an example. Let's imagine we implement a binary tree in Guile, using the records facility.

(use-modules (srfi srfi-9))

(define-record-type btree
  (make-btree elt left right)
  btree?
  (elt btree-elt)
  (left btree-left)
  (right btree-right))

(define *btree-null* #f)

(define (btree-cons head tail)
  (if (btree? tail)
      (let ((elt (btree-elt tail)))
        (if (< elt head)
            (make-btree elt
                        (btree-left tail)
                        (btree-cons head (btree-right tail)))
            (make-btree elt
                        (btree-cons head (btree-left tail))
                        (btree-right tail))))
      (make-btree head
                  *btree-null*
                  *btree-null*)))

That's enough to illustrate my point, I think. We have the data type, the base case, and a constructor. Of course in Haskell or something with better data types it would be much cleaner, but hey, let's roll with this.

If you look at btree-cons, it doesn't seem to be amenable in its current form to classic common subexpression elimination. People don't tend to write duplicated code. You see that I bound the temporary elt instead of typing (btree-elt tail) each time, and that was partly because of typing, and partly out of some inner efficiency puritan, telling me I shouldn't write duplicate expressions. (Cult behavior, again!)

But, note that all of these record abstractions will probably inline, instead of calling out to procedures. (They are implemented as inlinable procedures. Yes, it would be better to have cross-module inlining, but we don't, so this is what we do.) In general, syntactic abstraction in Scheme can lead to duplicate code. Apologies in advance for this eyeball-rending torrent, but here's a listing of what btree-cons reduces to, after expansion and partial evaluation:

(define (btree-cons head tail)
  (if (and (struct? tail)
           (eq? (struct-vtable tail) btree))
      (let ((elt (if (eq? (struct-vtable tail) btree)
                     (struct-ref tail 0)
                     (throw 'wrong-type-arg
                            'btree-elt
                            "Wrong type argument: ~S"
                            (list tail)
                            (list tail)))))
        (if (< elt head)
            (let ((left (if (eq? (struct-vtable tail) btree)
                            (struct-ref tail 1)
                            (throw 'wrong-type-arg
                                   'btree-left
                                   "Wrong type argument: ~S"
                                   (list tail)
                                   (list tail))))
                  (right (btree-cons
                           head
                           (if (eq? (struct-vtable tail) btree)
                               (struct-ref tail 2)
                               (throw 'wrong-type-arg
                                      'btree-right
                                      "Wrong type argument: ~S"
                                      (list tail)
                                      (list tail))))))
              (make-struct/no-tail btree elt left right))
            (let ((left (btree-cons
                          head
                          (if (eq? (struct-vtable tail) btree)
                              (struct-ref tail 1)
                              (throw 'wrong-type-arg
                                     'btree-left
                                     "Wrong type argument: ~S"
                                     (list tail)
                                     (list tail)))))
                  (right (if (eq? (struct-vtable tail) btree)
                             (struct-ref tail 2)
                             (throw 'wrong-type-arg
                                    'btree-right
                                    "Wrong type argument: ~S"
                                    (list tail)
                                    (list tail)))))
              (make-struct/no-tail btree elt left right))))
      (let ((left *btree-null*) (right *btree-null*))
        (make-struct/no-tail btree head left right))))

Again, I'm really sorry about that, and it's not just for your eyes: it's also because that's a crapload of code for what should be a simple operation. It's also redundant! There are 6 checks that tail is in fact a btree, when only one is needed semantically. (Note that the null case is not a btree, of course.)

Furthermore, all of the checks in the first arm of the if are redundant. The code above is what the optimizer produces -- which is, you know, turrible.

So, I thought, we could run a pass over the source that tries to propagate predicates, and then tries to fold predicates whose boolean value we already know.

And that's what I did. Here's what Guile's optimizer does with the function, including the CSE pass:

(define (btree-cons head tail)
  (if (and (struct? tail)
           (eq? (struct-vtable tail) btree))
      (let ((elt (struct-ref tail 0)))
        (if (< elt head)
            (let ((left (struct-ref tail 1))
                  (right (btree-cons head (struct-ref tail 2))))
              (make-struct/no-tail btree elt left right))
            (let ((left (btree-cons head (struct-ref tail 1)))
                  (right (struct-ref tail 2)))
              (make-struct/no-tail btree elt left right))))
      (let ((left *btree-null*) (right *btree-null*))
        (make-struct/no-tail btree head left right))))

This is much better. It's quite close to the source program, except the symbolic references like btree-elt have been replaced with indexed references. The type check in the predicate of the if expression propagated to all the other type checks, causing those nested if expressions to fold.

Of course, CSE can also propagate bound lexicals:

(let ((y (car x)))
  (car x))
=> (let ((y (car x)))
     y)

This is the classic definition of CSE.

but is it a win?

I should be quite pleased with the results, except that CSE makes Guile's compiler approximately twice as slow. Granted, in the long run, this should be acceptable: code is usually run many more times than it is compiled. But this is a fairly expensive pass, and yet at the same time it's not as good as it could be.

In order to get to the heart of the matter, I need to explain a little about the implementation. CSE is a post-pass that runs after partial evaluation (peval). I tried to make it a part of peval, as the two optimizations are synergistic -- oh yes, let's revel in that word -- are you feeling it? -- but it was too complicated in the end. The reason can be seen in functions like this:

(define (explode btree)
  (unless (btree? btree)
    (error "not a btree" btree))
  (values (btree-elt btree)
          (btree-left btree)
          (btree-right btree)))

Here we have a sequence of two expressions. Since the first one bails out if the predicate is false, we should propagate a true predicate past the first expression. This means that running CSE on an expression returns two values: the rewritten expression, given the predicates already seen; and a new set of predicates that the expression asserts. We should be able to use these new assertions to elide the type checks in the second expression. And indeed, Guile can do this.
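Running the optimizer over explode, the residual code looks something like this (a sketch, not verbatim output):

(define (explode btree)
  (unless (btree? btree)
    (error "not a btree" btree))
  ;; the btree? check above asserts the type, so the accessors' type
  ;; checks fold away, leaving bare field references:
  (values (struct-ref btree 0)
          (struct-ref btree 1)
          (struct-ref btree 2)))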

Perhaps you can see the size of the problem now. CSE is a pass that runs over all code, building up an ordered set of expressions that were evaluated, and in what context. When it sees a new expression in a test context -- as the predicate in an if -- it checks to see if the set contains that expression (or its negation) already, in test context, and if so tries to fold the expression to true or false. Already doing this set lookup and traversal is expensive -- at least N log N in the total size of the program, with some quadratic components in the case an expression is found, and also with high constant factors due to the need for custom hash and equality predicates.

The quadratic factor comes in when walking the set to see if the elimination is valid. Consider:

(if (car x)
    (if (car x) 10 20)
    30)

Here, we should be able to eliminate the second (car x), folding to (if (car x) 10 30). However, in this one:

(if (car x)
    (begin
      (y)
      (if (car x) 10 20))
    30)

If we don't know what (y) does, then we can't fold the second test, because perhaps (y) will change the contents of the pair, x. The information that allows us to make these decisions is effects analysis. For the purposes of Guile's optimizer, (car x) has two dependencies and can cause two effects: it depends on the contents of a mutable value, and on the value of a toplevel (x), and can cause the effect of an unbound variable error when resolving the toplevel, or a type error when accessing its car. Two expressions commute if neither depends on effects that the other causes.
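A sketch of the idea, with hypothetical flag names, much coarser than the real thing:

;; one bit per kind of effect or dependency
(define &mutable-data   1)  ;; contents of mutable pairs, vectors, structs
(define &toplevel       2)  ;; value of a toplevel binding
(define &possible-throw 4)  ;; can signal an error

;; two expressions commute if neither depends on what the other causes
(define (commute? a-depends a-causes b-depends b-causes)
  (and (zero? (logand a-depends b-causes))
       (zero? (logand b-depends a-causes))))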

I stole the idea of doing a coarse effects analysis, and representing it as bits in a small integer, from V8. Guile's version is here: effects.scm. The ordered set is a form of global value numbering. See the CSE pass here: cse.scm.

The commute test is fairly cheap, but the set traversal is currently a bit expensive.

and for what?

As I have indicated, the pass does do something useful on real programs, as in the binary tree example. But it does not do all it could, and it's difficult to fix that, for a few reasons.

Unlike traditional CSE, Guile's version of it is interprocedural. Instead of operating on just one basic block or one function, it operates across nested functions as well. However, only some dependencies can commute across a function boundary. For example:

(lambda (x)
  (if (pair? x)
      (let ((y (car x)))
        (lambda ()
          (and (pair? x) (car x))))))

Can the first pair? test propagate to the second expression? It can, because pair? does not depend on the values of mutable data, or indeed on any effect. If it's true once, it will always be true.

But can we replace the second (car x) with y? No, because (car x) has a dependency on mutable data, and because we don't do escape analysis on the closure, we don't let those dependencies commute across a procedure boundary. (In this case, even if we did escape analysis, we'd have the same conclusion.)

However, not all lambda abstractions are closures. Some of them might end up being compiled to labels in the function. Scheme uses syntactically recursive procedures to implement loops, after all. But Guile's CSE does poorly for lambda expressions that are actually labels. The reason is that lexical scope is not a dominator tree.

MLton hacker Stephen Weeks says it better than I do:

Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.

As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y. (i.e. without putting y in the scope of x)

[MLton-devel] CPS vs SSA

See my article on SSA and CPS for more context.

So that's one large source of lost CSE opportunities, especially in loops.

Another large source of lost opportunities is that the Tree-IL language, which is basically a macro-expanded Scheme, has the same property that Scheme does, that the order of evaluation of operands is unspecified.

Consider the base-case clause of my btree-cons above:

(let ((left *btree-null*) (right *btree-null*))
  (make-struct/no-tail btree head left right))

Now, *btree-null* is a toplevel lookup that might have an unbound-variable effect. We should be able to eliminate one of the two lookups, though. Why doesn't the CSE pass do it? Because in Tree-IL, the order of evaluation of the right-hand sides of left and right is unspecified. This gives Guile some useful freedoms, but little information for CSE.

This is an instance of a more general problem, that Tree-IL might be too high-level to be useful for CSE. For example, at runtime, not all lexical references are the same -- some are local, and some reference free variables. For mutated free variables, the variable is itself in a box, so to reference it you would load the box into a local and then dereference the box. CSE should allow you to eliminate duplicate loads of the box, even in the case that it can't eliminate duplicate references into the box.
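For example, in a sketch like this:

(define (counter)
  (let ((n 0))
    (lambda ()
      ;; n is a mutated free variable, so it lives in a box; the set!
      ;; and the following reference each load the box from the closure
      ;; before accessing it. A lower-level CSE could share that load.
      (set! n (+ n 1))
      n)))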

conclusion

It is nice to be able to eliminate the duplicate checks, but not at any price. Currently the bootstrapping time cost is a bit irritating. I have other ideas on how to fix that, but ultimately we probably need to re-implement CSE at some lower level. More on that in a future post. Until then, happy hacking.