wingolog: a mostly dorky weblog by Andy Wingo

inline cache applications in scheme
Andy Wingo, 29 May 2012
https://wingolog.org/2012/05/29/inline-cache-applications-in-scheme

The inline cache is a dynamic language implementation technique that originated in Smalltalk-80 and Self, and was made well known by JavaScript implementations. It is fundamental for getting good JavaScript performance.

a cure for acute dynamic dispatch

A short summary of the way inline caches work is that when you see an operation, like x + y, you don't compile in a procedure call to a generic addition subroutine. Instead, you compile a call to a procedure stub: the inline cache (IC). When the IC is first called, it will generate a new procedure specialized to the particular types that flow through that particular call site. On the next call, if the types are the same, control flows directly to the previously computed implementation. Otherwise the process repeats, potentially resulting in a polymorphic inline cache (one with entries for more than one set of types).
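The mechanism above can be sketched in a few lines of Python. This is an illustrative simulation, not Guile's or any JavaScript engine's implementation; the names (`InlineCache`, `generic_add`) are made up for the example. The cache starts empty, generates a specialization on the first call, and grows into a polymorphic inline cache as new type combinations flow through the site:

```python
def generic_add(x, y):
    # Slow path: stands in for a generic addition subroutine that
    # handles every numeric type.
    return x + y

class InlineCache:
    def __init__(self):
        self.entries = []  # list of ((type_x, type_y), specialized_fn)

    def __call__(self, x, y):
        key = (type(x), type(y))
        for cached_key, fn in self.entries:
            if cached_key == key:
                return fn(x, y)        # hit: go straight to the specialization
        fn = self.specialize(key)      # miss: generate a new specialization
        self.entries.append((key, fn))
        return fn(x, y)

    def specialize(self, key):
        if key == (int, int):
            return lambda x, y: x + y  # stands in for the fast fixnum path
        if key == (float, float):
            return lambda x, y: x + y  # stands in for unboxed float addition
        return generic_add             # anything else: fall back to generic

add_site = InlineCache()  # one cache per call site, not per operation
add_site(1, 2)            # first call: miss, generates the int/int entry
add_site(3, 4)            # second call: monomorphic hit
add_site(1.5, 2.5)        # new types: the cache is now polymorphic
```

Note that each call site gets its own `InlineCache` instance; two different `x + y` expressions in a program would each accumulate their own set of entries.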

An inline cache is called "inline" because it is specific to a particular call site, not to the operation. Also, adaptive optimization can later inline the stub in place of the call site, if that is considered worthwhile.

Inline caches are a win wherever you have dynamic dispatch: named field access in JavaScript, virtual method dispatch in Java, or generic arithmetic -- and here we get to Scheme.

the skeptical schemer

What is the applicability of inline caches to Scheme? The only places you have dynamic dispatch in Scheme are in arithmetic and in ports.

Let's take arithmetic first. Arithmetic operations in Scheme can operate on numbers of a wide array of types: fixnums, bignums, single-, double-, or multi-precision floating point numbers, complex numbers, rational numbers, etc. Scheme systems are typically compiled ahead-of-time, so in the absence of type information, you always want to inline the fixnum case and call out [of line] for other cases. (Which line is this? The line of flow control: the path traced by a program counter.) But if you end up doing a lot of floating-point math, this decision can cost you. So inline caches can be useful here.

Similarly, port operations like read-char and write can operate on any kind of port. If you are always writing UTF-8 data to a file port, you might want to be able to inline write for UTF-8 strings and file ports, possibly inlining directly to a syscall. It's probably a very small win in most cases, but a win nonetheless.

These little wins did not convince me that it was worthwhile to use ICs in a Scheme implementation, though. In the context of Guile, they're even less applicable than usual, because Guile is a bytecode-interpreted implementation with a self-hosted compiler. ICs work best when implemented as runtime-generated native code. Guile doesn't generate native code yet, although it probably will by the end of the year. So I was skeptical.

occam's elf

Somehow, through all of this JavaScript implementation work, I managed to forget the biggest use of inline caches in GNU systems. Can you guess?

The PLT!

You may have heard how this works, but if you haven't, you're in for a treat. When you compile a shared library that has a reference to printf, from the C library, the compiler doesn't know where printf will be at runtime. So even in C, that most static of languages, we have a form of dynamic dispatch: a call to an unknown callee.

When the dynamic linker loads a library at runtime, it could resolve all the dynamic references, but instead of doing that, it does something more clever: it doesn't. Instead, the compiler and linker collude to make the call to printf call a stub -- an inline cache. The first time that stub is called, it will resolve the dynamic reference to printf, and replace the stub with an indirect call to the procedure. In this way we trade faster load times for dynamic libraries against the cost of one indirection per call site: the inline cache. This stub, this inline cache, is sometimes called the PLT entry. You might have seen it in a debugger or a disassembler or something.
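Lazy binding can be simulated in a few lines. This is a hedged sketch of the idea, not how a real dynamic linker works (real lazy binding happens in native code, and the linker patches a GOT slot rather than a Python dict); the names (`resolve_symbol`, `got`, `library`) are illustrative:

```python
# A stand-in for the C library's symbol table.
library = {"printf": lambda fmt, *args: fmt % args}

def resolve_symbol(name):
    # Stands in for the dynamic linker's symbol lookup.
    return library[name]

def make_plt_entry(name, got):
    def stub(*args):
        # First call: resolve the dynamic reference and patch the GOT
        # slot, so later calls go straight to the real procedure.
        got[name] = resolve_symbol(name)
        return got[name](*args)
    return stub

got = {}
got["printf"] = make_plt_entry("printf", got)  # initially points at the stub

got["printf"]("hello, %s", "world")  # cold: resolves, patches, then calls
got["printf"]("hi, %s", "again")     # hot: one indirection, no lookup
```

After the first call, the stub has written the resolved procedure over its own slot, so every subsequent call pays only the single indirection through `got`.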

I found this when I was writing an ELF linker for Guile's new virtual machine. More on that at some point in the future. ELF is interesting: I find that if I can't generate good code in the ELF format, I'm generating the wrong kind of code. Its idiosyncrasies remind me of what happens at runtime.

lambda: the ultimate inline cache

So, back to Scheme. Good Scheme implementations are careful to have only one way of calling a procedure. Since the only kind of callable object in the Scheme language is generated by the lambda abstraction, Scheme implementations typically produce uniform code for procedure application: load the procedure, prepare the arguments, and go to the procedure's entry point.

However, if you're already eating the cost of dynamic linking -- perhaps via separately compiled Scheme modules -- you might as well join the operations of "load a dynamically-linked procedure" and "go to the procedure's entry point" into a call to an inline cache, as in C shared libraries. In the cold case, the inline cache resolves the dynamic reference, updates the cache, and proceeds with the call. In the hot case, the cache directly dispatches to the call.

One benefit of this approach is that it now becomes cheap to support other kinds of applicable objects. One can make hash tables applicable, if that makes sense. (Clojure folk seem to think that it does.) Another example would be to more efficiently support dynamic programming idioms, like generic functions. Inline caches in Scheme would allow generic functions to have per-call-site caches instead of per-operation caches, which could be a big win.
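The difference between per-operation and per-call-site caching can be made concrete with a sketch. This is a hypothetical model, not GOOPS: the generic function keeps one global method table (the per-operation cache), while each call site gets its own tiny inline cache that memoizes the methods it has actually seen:

```python
class GenericFunction:
    def __init__(self):
        self.methods = {}  # class -> method: the per-operation table

    def add_method(self, cls, fn):
        self.methods[cls] = fn

    def call_site(self):
        # Return a fresh inline cache bound to one particular call site.
        cache = {}
        def dispatch(obj, *args):
            fn = cache.get(type(obj))
            if fn is None:
                fn = self.methods[type(obj)]  # slow path: global lookup
                cache[type(obj)] = fn         # memoize at this call site
            return fn(obj, *args)
        return dispatch

square = GenericFunction()
square.add_method(int, lambda n: n * n)
square.add_method(float, lambda f: f * f)

site = square.call_site()  # one inline cache for this call site
site(3)    # miss, then cached
site(4)    # monomorphic hit: no global table lookup
```

A call site that only ever sees one class stays monomorphic and never touches the global table again, even if the generic function as a whole is wildly polymorphic across the program.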

It seems to me that this dynamic language implementation technique could allow Guile programmers to write different kinds of programs. The code to generate an inline cache could even itself be controlled by a meta-object protocol, so that the user could precisely control application of her objects. The mind boggles, but pleasantly so!

Thanks to Erik Corry for provoking this thought, via a conversation at JSConf EU last year. All blame to me, of course.

as PLT_HULK would say

NOW THAT'S AN APPLICATION OF AN INLINE CACHE! HA! HA HA!

dynamic dispatch: a followup
Andy Wingo, 19 October 2008
https://wingolog.org/2008/10/19/dynamic-dispatch-a-followup

It seems that the 8-hash technique for dynamic dispatch that I mentioned in my last essay actually has a longer pedigree. At least 10 years before GOOPS' implementation, the always-excellent Gregor Kiczales wrote, with Luis H Rodriguez Jr.:

If we increase the size of class wrappers slightly, we can add more hash seeds to each wrapper. If n is the number of hash seeds stored in each wrapper, we can think of each generic function selecting some number x less than n and using the xth hash seed from each wrapper. Currently we store 8 hash seeds in each wrapper, resulting in very low average probe depths.

The additional hash seeds increase the probability that a generic function will be able to have a low average probe depth in its memoization table. If one set of seeds doesn't produce a good distribution, the generic function can select one of the other sets instead. In effect, we are increasing the size of class wrappers in order to decrease the size of generic function memoization tables. This tradeoff is attractive since typical systems seem to have between three and five times as many generic functions as classes.

Efficient method dispatch in PCL

So Mikael Djurfeldt, the GOOPS implementor, appears to have known about CLOS implementation strategies. But it's interesting how this knowledge percolates out -- it's not part of the computer science canon. When you read these papers, it's always "Personal communication from Dave Moon this" and "I know about this Kiczales paper that". (Now you do too.)

Also interesting about the Kiczales paper is the focus on the user, the programmer, in the face of redefinitions -- truly a different culture than the one that is dominant now.

polymorphic inline caches buzz buzz buzz

This reference comes indirectly via Keith Rarick, who writes to mention a beautiful paper by Hölzle, Chambers, and Ungar, introducing polymorphic inline caches, a mechanism to dispatch based on runtime types, as GOOPS does.

PICs take dispatch one step further: instead of indirect table lookups as GOOPS does, a PIC is a runtime-generated procedure that performs the lookups directly in code. This difference between data-driven processing and direct execution is the essence of compilation -- compilation pushes all of the caching and branching logic as close to the metal as possible.
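The "lookups directly in code" point can be illustrated with a toy code generator. This is a sketch of the paper's idea in Python rather than native code: `compile_pic` is a hypothetical name, and the generated function is a chain of inline type tests instead of a table lookup:

```python
def compile_pic(cases, fallback):
    # cases: list of (type, method). Generate source for a function
    # that tests receiver types inline, in the order observed.
    lines = ["def pic(obj, *args):"]
    env = {"fallback": fallback}
    for i, (cls, fn) in enumerate(cases):
        env["t%d" % i], env["m%d" % i] = cls, fn
        lines.append("    if type(obj) is t%d: return m%d(obj, *args)" % (i, i))
    lines.append("    return fallback(obj, *args)")
    exec("\n".join(lines), env)  # "runtime code generation", Python-style
    return env["pic"]

pic = compile_pic(
    [(int, lambda n: "fixnum"), (float, lambda f: "flonum")],
    lambda o: "generic",
)
pic(1)    # handled by the first inline type test
pic(1.0)  # second test
pic("x")  # falls through to the generic routine
```

Note also that the `cases` list is exactly the "wealth of type information" the quote below describes: the PIC is simultaneously a dispatch mechanism and a record of the receiver types seen at this site.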

Furthermore, PICs can be a source of data as well as a dispatch mechanism:

The presence of PIC-based type information fundamentally alters the nature of optimization of dynamically-typed object-oriented languages. In “traditional” systems such as the current SELF compiler, type information is scarce, and consequently the compiler is designed to make the best possible use of the type information. This effort is expensive both in terms of compile time and compiled code space, since the heuristics in the compiler are tuned to spend time and space if it helps extract or preserve type information. In contrast, a PIC-based recompiling system has a veritable wealth of type information: every message has a set of likely receiver types associated with it derived from the previously compiled version’s PICs. The compiler’s heuristics and perhaps even its fundamental design should be reconsidered once the information in PICs becomes available [...].

Optimizing Dynamically-Typed Object-Oriented Programming Languages with Polymorphic Inline Caches

The salient point is that in latent-typed languages, all of the static type analysis techniques that we know are insufficient. Only runtime analysis and runtime recompilation can capture the necessary information for efficient compilation.

Read both of these articles! But if you just read one, make it the Ungar/Chambers/Hölzle -- it is well-paced, clearly-written, and illuminating.

Happy hacking!