wingolog -- a mostly dorky weblog by Andy Wingo
https://wingolog.org/

inline cache applications in scheme
by Andy Wingo, 2012-05-29
https://wingolog.org/2012/05/29/inline-cache-applications-in-scheme

The inline cache is a dynamic language implementation technique that originated in Smalltalk-80 and Self, and was made well-known by JavaScript implementations. It is fundamental for getting good JavaScript performance.

a cure for acute dynamic dispatch

A short summary of the way inline caches work is that when you see an operation, like x + y, you don't compile in a procedure call to a generic addition subroutine. Instead, you compile a call to a procedure stub: the inline cache (IC). When the IC is first called, it will generate a new procedure specialized to the particular types that flow through that particular call site. On the next call, if the types are the same, control flows directly to the previously computed implementation. Otherwise the process repeats, potentially resulting in a polymorphic inline cache (one with entries for more than one set of types).

An inline cache is called "inline" because it is specific to a particular call site, not to the operation. Also, adaptive optimization can later inline the stub in place of the call site, if that is considered worthwhile.

Inline caches are a win wherever you have dynamic dispatch: named field access in JavaScript, virtual method dispatch in Java, or generic arithmetic -- and here we get to Scheme.

the skeptical schemer

What is the applicability of inline caches to Scheme? The only places you have dynamic dispatch in Scheme are in arithmetic and in ports.

Let's take arithmetic first. Arithmetic operations in Scheme can operate on numbers of a wide array of types: fixnums, bignums, single-, double-, or multi-precision floating point numbers, complex numbers, rational numbers, etc. Scheme systems are typically compiled ahead-of-time, so in the absence of type information, you always want to inline the fixnum case and call out [of line] for other cases. (Which line is this? The line of flow control: the path traced by a program counter.) But if you end up doing a lot of floating-point math, this decision can cost you. So inline caches can be useful here.

Similarly, port operations like read-char and write can operate on any kind of port. If you are always writing UTF-8 data to a file port, you might want to be able to inline write for UTF-8 strings and file ports, possibly inlining directly to a syscall. It's probably a very small win in most cases, but a win nonetheless.

These little wins did not convince me that it was worthwhile to use ICs in a Scheme implementation, though. In the context of Guile, they're even less applicable than usual, because Guile is a bytecode-interpreted implementation with a self-hosted compiler. ICs work best when implemented as runtime-generated native code. Although it probably will by the end of the year, Guile doesn't generate native code yet. So I was skeptical.

occam's elf

Somehow, through all of this JavaScript implementation work, I managed to forget the biggest use of inline caches in GNU systems. Can you guess?

The PLT!

You may have heard how this works, but if you haven't, you're in for a treat. When you compile a shared library that has a reference to printf, from the C library, the compiler doesn't know where printf will be at runtime. So even in C, that most static of languages, we have a form of dynamic dispatch: a call to an unknown callee.

When the dynamic linker loads a library at runtime, it could resolve all the dynamic references, but instead of doing that, it does something more clever: it doesn't. Instead, the compiler and linker collude to make the call to printf call a stub -- an inline cache. The first time that stub is called, it will resolve the dynamic reference to printf, and replace the stub with an indirect call to the procedure. In this way we trade one indirection per call site -- the inline cache -- for faster load times for dynamic libraries. This stub, this inline cache, is sometimes called the PLT entry. You might have seen it in a debugger or a disassembler or something.

I found this when I was writing an ELF linker for Guile's new virtual machine. More on that at some point in the future. ELF is interesting: I find that if I can't generate good code in the ELF format, I'm generating the wrong kind of code. Its idiosyncrasies remind me of what happens at runtime.

lambda: the ultimate inline cache

So, back to Scheme. Good Scheme implementations are careful to have only one way of calling a procedure. Since the only kind of callable object in the Scheme language is generated by the lambda abstraction, Scheme implementations typically produce uniform code for procedure application: load the procedure, prepare the arguments, and go to the procedure's entry point.

However, if you're already eating the cost of dynamic linking -- perhaps via separately compiled Scheme modules -- you might as well join the operations of "load a dynamically-linked procedure" and "go to the procedure's entry point" into a call to an inline cache, as in C shared libraries. In the cold case, the inline cache resolves the dynamic reference, updates the cache, and proceeds with the call. In the hot case, the cache directly dispatches to the call.

One benefit of this approach is that it now becomes cheap to support other kinds of applicable objects. One can make hash tables applicable, if that makes sense. (Clojure folk seem to think that it does.) Another example would be to more efficiently support dynamic programming idioms, like generic functions. Inline caches in Scheme would allow generic functions to have per-call-site caches instead of per-operation caches, which could be a big win.

It seems to me that this dynamic language implementation technique could allow Guile programmers to write different kinds of programs. The code to generate an inline cache could even itself be controlled by a meta-object protocol, so that the user could precisely control application of her objects. The mind boggles, but pleasantly so!

Thanks to Erik Corry for provoking this thought, via a conversation at JSConf EU last year. All blame to me, of course.

as PLT_HULK would say

NOW THAT'S AN APPLICATION OF AN INLINE CACHE! HA! HA HA!

on-stack replacement in v8
by Andy Wingo, 2011-06-20
https://wingolog.org/2011/06/20/on-stack-replacement-in-v8

A recent post looked at what V8 does with a simple loop:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 1; i < 10000000; i++) {
    ret += g ();
  }
  return ret;
}

It turns out that calling f actually inlines g into the body, at runtime. But how does V8 do that, and what was the dead code about? V8 hacker Kasper Lund kindly chimed in with an explanation, which led me to write this post.

When V8 compiled an optimized version of f with g inlined, f was already running in the loop. It needed to stop the loop in order to replace the slow version with the fast version. The mechanism that it uses is to reset the stack limit, causing the overflow handler in g to be called. So the computation is stuck in a call to g, and somehow must be transformed into a call to a combined f+g.

[Ed: It seems I flubbed some of the details here; the OSR of f is actually triggered by the interrupt check in the loop back-edge in f, not the method prologue in g. Thanks to V8 hacker Vyacheslav Egorov for pointing this out. OSR on one activation at a time makes the problem a bit more tractable, but unfortunately for me it invalidates some of the details of my lovely drawing. Truth, truly a harsh mistress!]

The actual mechanics are a bit complicated, so I'm going to try a picture. Here we go:

As you can see, V8 replaces the top of the current call stack with one in which the frames have been merged together. This is called on-stack replacement (OSR). OSR takes the contexts of some contiguous range of function applications and the variables that are live in those activations, and transforms those inputs into a new context or contexts and live variable set, and splats the new activations on the stack.

I'm using weasel words here because OSR can go both ways: optimizing (in which N frames are combined to 1) and deoptimizing (typically the other way around), as Hölzle notes in "Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback". More on deoptimization some other day, perhaps.

The diagram above mentions "loop restart entry points", which is indeed the "dead code" that I mentioned previously. That code forms part of an alternate entry point to the function, used only once, when the computation is restarted after on-stack replacement.

details, godly or otherwise

Given what I know now about OSR, I'm going to try an alternate decompilation of V8's f+g optimized function. This time I am going to abuse C as a high-level assembler. (I know, I know. Play along with me :))

Let us start by defining our types.

typedef union { uint64_t bits; } Value;
typedef struct { Value map; uint64_t data[]; } Object;

All JavaScript values are of type Value. Some Values encode small integers (SMIs), and others encode pointers to Object. Here I'm going to assume we are on a 64-bit system. The data member of Object is a tail array of slots.

inline bool val_is_smi (Value v) 
{ return !(v.bits & 0x1); }

inline Value val_from_smi (int32_t i)
{ return (Value) { ((uint64_t)i) << 32 }; }

inline int32_t val_to_smi (Value v)
{ return v.bits >> 32; }

Small integers have 0 as their least significant bit, and the payload in the upper 32 bits. Did you see the union literal syntax here? (Value){ ((uint64_t)i) << 32 }? It's a C99 thing that most folks don't know about. Anyway.

inline bool val_is_obj (Value v)
{ return !val_is_smi (v); }

inline Value val_from_obj (Object *p)
{ return (Value) { ((uint64_t)p) + 1U }; }

inline Object* val_to_obj (Value v)
{ return (Object*) (v.bits - 1U); }

Values that are Object pointers have 01 as their least significant bits. We have to subtract that tag back off to recover the pointer to Object.

Numbers that are not small integers are stored as Object values, on the heap. All Object values have a map field, which points to a special object that describes the value: what type it is, how its fields are laid out, etc. Much like GOOPS, actually, though not as reflective.

In V8, the double value will be the only slot in a heap number. Therefore to access the double value of a heap number, we simply check whether the Object's map is the heap number map, and in that case return the first slot, as a double.

There is a complication though; what is the value of the heap number map? V8 actually doesn't encode it into the compiled function directly. I'm not sure why it doesn't. Instead it stores the heap number map in a slot in the root object, and stores a pointer into the middle of the root object in r13. It's as if we had a global variable like this:

Value *root;    // r13

const int STACK_LIMIT_IDX = 0;
const int HEAP_NUMBER_MAP_IDX = -7; // Really.

inline bool obj_is_double (Object* o)
{ return o->map == root[HEAP_NUMBER_MAP_IDX]; }

Indeed, the offset into the root pointer is negative, which is a bit odd. But hey, details! Let's add the functions to actually get a double from an object:

union cvt { double d; uint64_t u; };

inline double obj_to_double (Object* o)
{ return ((union cvt) { o->data[0] }).d; }

inline Object* double_to_obj (double d)
{
  Object *ret = malloc (sizeof (Value) * 2);
  ret->map = root[HEAP_NUMBER_MAP_IDX];
  ret->data[0] = ((union cvt) { d }).u;
  return ret;
}

I'm including double_to_obj here just for completeness. Also did you enjoy the union literals in this block?

So, we're getting there. Perhaps the reader recalls, but V8's calling convention specifies that the context and the function are passed in registers. Let's model that in this silly code with some global variables:

Value context;  // rsi
Value function; // rdi

Recall also that when f+g is called, it needs to check that g is actually bound to the same value. So here we go:

const Value* g_cell;
const Value expected_g = (Value) { 0x7f7b205d7ba1U };

Finally, f+g. I'm going to show the main path all in one go. Take a big breath!

Value f_and_g (Value receiver)
{
  // Stack slots (5 of them)
  Value osr_receiver, osr_unused, osr_i, osr_ret, arg_context;
  // Dedicated registers
  register int64_t i, ret;

  arg_context = context;

  // Assuming the stack grows down.
  if (&arg_context > root[STACK_LIMIT_IDX])
    // The overflow handler knows how to inspect the stack.
    // It can longjmp(), or do other things.
    handle_overflow ();

  i = 0;
  ret = 0;
  
restart_after_osr:
  if (*g_cell != expected_g)
    goto deoptimize_1;

  while (i < 10000000)
    {
      register uint64_t tmp = ret + 1;
      if ((int64_t)tmp < 0)
        goto deoptimize_2;

      i++;
      ret = tmp;

      // Check for interrupt.
      if (&arg_context > root[STACK_LIMIT_IDX])
        handle_interrupt ();
    }

  return val_from_smi (ret);

And exhale. The receiver object is passed as an argument. There are five stack slots, none of which are really used in the main path. Two locals are allocated in registers: i and ret, as we mentioned before. There's the stack check, locals initialization, the check for g, and then the loop, and the return. The first test in the loop is intended to be a jump-on-overflow check.

In the original version of this article I had forgotten to include the interrupt handler; thanks again to Vyacheslav for pointing that out. Note that calling the interrupt handler in a loop back-edge is more expensive than the handle_overflow call in the function's prologue; see line 306 of the assembly in my original article.

restart_after_osr what?

But what about OSR, and what's that label about? What happens is that when f+g is replaced on the stack, the computation needs to restart. It does so from the osr_after_inlining_g label:

osr_after_inlining_g:
  if (val_is_smi (osr_i))
    i = val_to_smi (osr_i);
  else
    {
      if (!obj_is_double (osr_i))
        goto deoptimize_3;

      double d = obj_to_double (val_to_obj (osr_i));

      i = (int64_t)trunc (d);
      if ((double) i != d || isnan (d))
        goto deoptimize_3;
    }

Here we take the value for the loop counter, as stored on the stack by OSR, and unpack it so that it is stored in the right register.

Unpacking a SMI into an integer register is simple. On the other hand unpacking a heap number has to check that the number has an integer value. I tried to represent that here with the C code. The actual assembly is just as hairy but a bit more compact; see the article for the full assembly listing.

The same thing happens for the other value that was live at OSR time, ret:

  if (val_is_smi (osr_ret))
    ret = val_to_smi (osr_ret);
  else
    {
      if (!obj_is_double (osr_ret))
        goto deoptimize_4;

      double d = obj_to_double (val_to_obj (osr_ret));

      ret = (int64_t)trunc (d);
      if ((double) ret != d || isnan (d))
        goto deoptimize_4;

      if (ret == 0 && signbit (d))
        goto deoptimize_4;
    }
  goto restart_after_osr;

Here we see the same thing, except that additionally there is a check for -0.0 at the end, because V8 could not prove to itself that ret would not be -0.0. But if all succeeded, we jump back to the restart_after_osr case, and the loop proceeds.

Finally we can imagine some deoptimization bailouts, which would result in OSR, but for deoptimization instead of optimization. They are implemented as tail calls (jumps), so we can't represent them properly in C.

deoptimize_1:
  return deopt_1 ();
deoptimize_2:
  return deopt_2 ();
deoptimize_3:
  return deopt_3 ();
deoptimize_4:
  return deopt_4 ();
}

and that's that.

I think I've gnawed all the meat that's to be had off of this bone, so hopefully that's the last we'll see of this silly loop. Comments and corrections very much welcome. Next up will probably be a discussion of how V8's optimizer works. Happy hacking.

class redefinition in guile
by Andy Wingo, 2009-11-09
https://wingolog.org/2009/11/09/class-redefinition-in-guile

Yes hello hello!

Long-time readers will perhaps recall this diagram:


figure zero: things as they were.

It comes from an article describing how Guile represents its objects in memory, with particular concern for Guile-GNOME.

I was hacking on this code recently, and realized that this representation was not as good as it could be. Our switch to the BDW garbage collector lets us be more flexible with the size of our allocations, and so we can actually allocate the "slots" of an object inline with the object itself:


figure one: things how maybe they could have been.

Alas, during the hack, I discovered a stumbling block: that this representation doesn't allow for classes to be redefined.

redefine a data type, what?

Yes, Guile's object-oriented programming system (GOOPS) allows you to redefine the types of your data. It's OK! CLOS lets you do this too; it's an old tradition. Redefining a class at runtime allows you to develop by incremental changes, without restarting your program.

Of course once you change a class, its instances probably need to change too, perhaps reallocating their slots. So we have to reintroduce the indirection -- but allowing for locality in the normal, non-redefined case. Like this:


figure two: things how they are, almost.

So updating an instance is as simple as swapping a pointer!

Almost.

Well, not really. This is something that's unique to Lisp, as far as I can tell, and not very widely known in the programming community, and hey, I didn't completely understand it -- so man, do I have a topic for a blog or what.

step one: make a hole in the box

The way redefinition works is that first you make a new class, then you magically swap the new for the old, then instances lazily update -- as they are accessed, they check that their class is still valid, and if not update themselves. It's involved, yo, so I made a bunch of pictures.


figure three: class redefinition begins with defining a new class.

So yeah, figure three shows the new class, lying in wait beside the old one. Then comes the magic:


figure four: same identity, different state.

What just happened here? Well, we just swapped the vtable and data pointers in the old and new classes. For all practical purposes, the old class is the new class, and vice versa. All purposes except one, that is: eq?. The old class maintains its identity, so that any code that references it, in a hash table for example, will see the same object, but with new state.

The class' identity is the same, but its state has changed. That's the key thing to notice here.

Now we mark the old class's data as being out of date, and the next time its instances check their class... what? Here we reach another stumbling block. The old class already has new state, so it looks fresh -- meaning that the instance will think nothing is wrong. It could be that the instance was allocated when its class declared two slots, but now the class says that instances have three slots. Badness, this; badness.

So what really needs to happen is for instances to point not to the identity of their classes, but to the state of their classes. In practice this means pointing directly to their slots. This is actually an efficiency win, as it removes an indirection for most use cases. Comme ça:


figure five: instances actually point to class state, not class identity.

As we see in the figure, a well-known slot in the class holds the redefinition information -- normally unset, but if the class is invalidated, it will allow the instance to know exactly which version of the class it is changing from and to.


figure six: new equilibrium.

And finally, figure six shows the new state of affairs -- in which slot access has been redirected for all redefined classes, and all of their instances, transitively.

efficiency

All in all, this is quite OK efficiency-wise. Instance data is normally local, and class data is one indirection away. A redefined instance will have nonlocal data, but hey, not much you can do otherwise, without a copying collector.

There is one efficiency hack worth mentioning. Accessors, discussed in an earlier article, don't need to check and see if their class is up to date or not. This is because they are removed from the old class and re-added to the new one as part of the redefinition machinery.

summary

Redefinition is complicated, but pretty neat.

really, that's the summary?

Yes.

international lisp conference 2009 -- day one
by Andy Wingo, 2009-03-23
https://wingolog.org/2009/03/23/international-lisp-conference-day-one

Greetings from Cambridge! The American one, that is. I'm at the International Lisp Conference, which is an OK set of talks, but an excellent group of almost 200 Lisp hackers. What follows is not strictly journalism, but rather reflections of a more subjective nature.

Sunday was reserved for "tutorials", four sessions of an hour and a half in length, with an expository and relaxed flavor. There were three sessions running at a time, which made for some hard choices.

The first session that I went to was "Monads for the Working Lisp Programmer", which started off by explaining monads, and in the second session looked at their uses. As a working Lisp programmer, I guess it's good to be able to talk down Haskell weenies, but something about the complexity of the "plumbing" doesn't sit right. People keep talking about separating the "plumbing" from the computation -- as if programming somehow involves clogged pipes or something. The presenters did a good job, but "different strokes for different folks". I jetted out after the first session.

Next I decided to drop in on the second session of Pascal Costanza's tutorial on CLOS, generic functions, and the meta-object protocol. It was very clearly presented, and even though I've been working with GOOPS (a CLOS-a-like) for a long time, I came away with some style tips. Also seeing his LispWorks environment was very interesting. So much of programming is workflow and environment, something that often requires you to look over the shoulder of a master. I was a bit frustrated though, as I came with questions but it seemed Pascal had a curriculum he wanted to stick to. I'll corner him later.

Actually, that's the best thing about this conference, this ability to corner smart people. So far they don't have that harried, ragged look yet.

In the afternoon, I went to Rich Hickey's Clojure tutorials (parts 2 and 3). What a treat. Hickey's instincts and good taste as a language designer are evident. All languages deserve to have his persistent data structures. His discussion of state and identity was particularly lucid, as well. (Hickey says that the talk on state and time was basically the same one he gave at InfoQ.)

Personally I don't care for the JVM, and I feel Clojure's worth as a general-purpose language is actually hampered by the lack of tail recursion, but I really think that his design ideas should percolate out to all the dynamic languages of the world.

Now it's Monday morning, and I'm sitting in on the plenary technical talks and software demonstrations. They're OK, but thus far yesterday had a better feel. I'm very much looking forward to the afternoon sessions.

ecmascript for guile
by Andy Wingo, 2009-02-22
https://wingolog.org/2009/02/22/ecmascript-for-guile

Ladies, gentlemen: behold, an ECMAScript compiler for Guile!

$ guile
scheme@(guile-user)> ,language ecmascript
Guile ECMAScript interpreter 3.0 on Guile 1.9.0
Copyright (C) 2001-2008 Free Software Foundation, Inc.

Enter `,help' for help.
ecmascript@(guile-user)> 42 + " good times!";
$1 = "42 good times!"
ecmascript@(guile-user)> [0,1,2,3,4,5].length * 7;
$2 = 42
ecmascript@(guile-user)> var zoink = {
                           qux: 12,
                           frobate: function (x) {
                              return random(x * 2.0) * this.qux;
                           }
                         };
ecmascript@(guile-user)> zoink.frobate("4.2")
$3 = 37.3717848761822

The REPL above parses ECMAScript expressions from the current input port, compiling them for Guile's virtual machine, then runs them -- just like any other Guile program.

Above you can see some of the elements of ECMAScript in action. The compiler implements most of ECMAScript 3.0, and with another few days' effort should implement the whole thing. (It's easier to implement a specification than to document differences to what people expect.)

The "frobate" example also shows integration with Guile -- the random function comes from the current module, which is helpfully printed out in the prompt, (guile-user) above.

In addition, we can import other Guile modules as JavaScript, oops, I mean ECMAScript objects:

ecmascript@(guile-user)> require ('cairo');
$1 = #<<js-module-object> b7192810>
ecmascript@(guile-user)> $1['cairo-version']();
$2 = 10800

I could automatically rename everything to names that are valid ES identifiers, but I figured that it's less confusing just to leave them as they are, and require ['%strange-names!'] to be accessed in brackets. Of course if the user wants, she can just rename them herself.

Neat hack, eh?

what the hell?

I realize that my readers might have a number of questions, especially those that have other things to do than to obsessively refresh my weblog. Well, since I had myself at my disposal, I decided to put some of these questions to me.

So, Andy, why did you implement this?

Well, I've been hacking on a compiler for Guile for the last year or so, and realized at some point that the compiler tower I had implemented gave me multi-language support for free.

But obviously I couldn't claim that Guile supported multiple languages without actually implementing another language, so that's what I did.

I chose ECMAScript because it's a relatively clean language, and one that doesn't have too large of a standard library -- because implementing standard libraries is a bit of a drag. Even this one isn't complete yet.

How does it perform? Is it as fast as those arachno-fish implementations I keep hearing about?

It's tough to tell, but it seems to be good enough. It's probably not as fast as compilers that produce native code, but because it hooks into Guile's compiler at a high level, as Guile's compiler improves and eventually gets native code compilation, it will be plenty fast. For now it feels snappy.

There is another way in which it feels much faster though, and that's development time -- having a real REPL with readline, a big set of library functions (Guile's), and fast compilation all make it seem like you're talking directly with the demon on the other side of the cursor.

It actually implements ECMAScript? And what about ES4?

ES3 is the current goal, though there are some bits that are lacking -- unimplemented parts of the core libraries, mainly. Probably there are some semantic differences as well, but those are considered bugs, not features. I'm just one man, except in interviews!

Well, there is one difference: how could you deny the full numeric tower to a language?

ecmascript@(guile-user)> 2 / 3 - 1 / 6;
$3 = 1/2

And regarding future standards of ECMAScript, who knows what will happen. ES4 looks like a much larger language. Still, Guile is well-positioned to implement it -- we already have a powerful object system with multimethod support, and a compiler and runtime written in a high-level language, which count for a lot.

Awesome! So I can run my jQuery comet stuff on Guile!!1!!

You certainly could, in theory -- if you implemented XMLHttpRequest and the DOM and all the other things that JavaScript-in-a-web-browser implements. But that's not what I'm interested in, so you won't get that implementation from me!

Where do you see this going?

I see this compiler leading to me publishing a paper at some conference!

More seriously, I think it will have several effects. One will be that users of applications with Guile extensions will now be able to extend their applications in ECMAScript in addition to Scheme. Many more people know ECMAScript than Scheme, so this is a good thing.

Also, developers that want to support ES extensions don't have to abandon Scheme to do so. There are actually many people like this, who prefer Scheme, but who have some users that prefer ECMAScript. I'm a lover, not a fighter.

The compiler will also be an example for upcoming Elisp support. I think that Guile is the best chance we have at modernizing Emacs -- we can compile Elisp to Guile's VM, write some C shims so that all existing C code works, then we replace the Elisp engine with Guile. That would bring Scheme and ECMAScript and whatever other languages are implemented to be peers of Elisp -- and provide better typing, macros, modules, etc to Emacs. Everyone wins.

So how do I give this thing a spin?

Well, it's on a branch at the moment. Either you wait the 3 or 6 months for a Guile 2.0 release, or you check it out from git:

git clone git://git.sv.gnu.org/guile.git guile
cd guile
git fetch
git checkout -b vm origin/vm
./autogen.sh && ./configure && make
./pre-inst-guile

Once you're in Guile, type ,language ecmascript to switch languages. This will be better integrated in the future.

Why haven't you answered my mail?

Mom, I've been hacking on a compiler! I'll call tonight ;-)

visualizing statistical profiles with chartprof
by Andy Wingo, 2009-02-09
https://wingolog.org/2009/02/09/visualizing-statistical-profiles-with-chartprof

Greetings, hackers of the good hack!

In recent weeks, my good hack has been Guile's compiler and virtual machine. It's almost regression-free, and we're looking for a merge to master within a few weeks.

Things are looking really good. Compiled code conses significantly less than the evaluator, loads quickly, runs quickly of course, and on top of that has all kinds of fun features. Recently we made metadata have no cost, as we write it after the program text of compiled procedures, which are normally just mmap'd from disk. (I stole this idea from a paper on Self.)

In addition to our tower of language compilers, I recently added a tower of decompilers: from value, to objcode, to bytecode, to assembly... with the possibility of adding future decompilers, potentially going all the way back to Scheme itself. We have all of the debugging information to do it nicely.

However, there are still some regressions. Probably the biggest one is that GOOPS, Guile's object system, actually loads up more slowly with the VM than with the evaluator. So I spent last week giving GOOPS a closer look.

the hunt begins

Turns out, the slowness is in the compiler. But why should the compiler be running at runtime, you ask? Well, it's the dynamic recompilation stuff I spoke of before.

GOOPS compiles implementations of methods for each set of types that it sees at runtime, which is pretty neat. The PIC paper by Hölzle, Chambers, and Ungar describes the advantages of this approach:

The presence of PIC-based [runtime] type information fundamentally alters the nature of optimization of dynamically-typed object-oriented languages. In “traditional” systems such as the current SELF compiler, type information is scarce, and consequently the compiler is designed to make the best possible use of the type information. This effort is expensive both in terms of compile time and compiled code space, since the heuristics in the compiler are tuned to spend time and space if it helps extract or preserve type information. In contrast, a PIC-based recompiling system has a veritable wealth of type information: every message has a set of likely receiver types associated with it derived from the previously compiled version’s PICs. The compiler’s heuristics and perhaps even its fundamental design should be reconsidered once the information in PICs becomes available...

(Thanks to Keith for pointing me to that paper.)

Anyway, recompilation was slow. So I then started to look at exactly why compilation was slow. I had callgrind, which is good, but doesn't give you enough information about *why* you are in those specific C procedures -- it's a C profile, not a Scheme profile. I had statprof, which is good, but doesn't give you enough information about the specific calltrees.

chartprof

So what to do? Well, since statistical profilers already have to walk the stacks, it was a simple matter to make statprof squirrel away the full stacks as they were sampled. Then, after the profiling run is done, you can process those stacks into call trees, and visualize that.
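
The stacks-to-tree step is simple enough to sketch in Python (all names here are hypothetical, not statprof's actual API): each sampled stack, outermost frame first, is folded into a tree whose nodes count cumulative hits ("cum") and samples caught in the node itself ("self").

```python
# Sketch: fold sampled call stacks into a call tree. A node's "cum" count
# is the number of samples passing through it; "self" counts samples whose
# innermost frame was that node.

def build_call_tree(samples):
    root = {"name": "<root>", "cum": 0, "self": 0, "children": {}}
    for stack in samples:
        root["cum"] += 1
        node = root
        for frame in stack:
            node = node["children"].setdefault(
                frame, {"name": frame, "cum": 0, "self": 0, "children": {}})
            node["cum"] += 1
        node["self"] += 1       # the sampler caught execution right here
    return root

samples = [["main", "compile", "match"],   # four made-up samples
           ["main", "compile", "match"],
           ["main", "compile"],
           ["main", "gc"]]
tree = build_call_tree(samples)
compile_node = tree["children"]["main"]["children"]["compile"]
```

Dividing each count by the total number of samples gives the percentages that the visualization displays.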

But how to visualize the call trees? I had some basic ideas of what I wanted, but no tool that I knew of that could present the information easily. But what I did have was an excellent toolbox: guile-cairo, and guile itself. So voilà chartprof:

Chartprof takes full call trees and produces a representation of where the time is spent. The cascading red part represents the control flow, as nested procedure invocations, and the numbers inside the red part indicate the cumulative time percentage spent in those procedure calls. The numbers out of the red part indicate "self time", when the sampler caught program execution in that procedure instead of in one of its call children.

If you click on the thumbnail on the left, you can download the whole thing. It's big: 1.2 megabytes. There's lots of information in there, is the thing. I should figure out how to prune that information, if that can be done so usefully.

I drew horizontal lines at any call that did not always dispatch to exactly one subcall. It's interesting, you do want to line up procedures and their call sites, but you don't want too many lines. This way seems to be a good compromise, though of course it's not the last word.

analysis

It seems that the culprit is a bit unfortunate. GOOPS, when it loads, enables extensibility in about 200 of Guile's primitive procedures, e.g. equal? and for-each, which allows those procedures to dispatch to methods of generic functions. Unfortunately, this slows down pattern matching in the compilers, as the pattern matcher uses equal?, and that ends up calling out to Scheme just to see if something is equal? to a symbol... badness, that.

The equal? is particularly egregious, as a call to scm_equal_p will do a number of built-in equality checks, then if they all fail it will dispatch to the equal? generic, which dispatches to a method that calls eqv?, which does some built-in checks, then dispatches to an eqv? generic, which finally returns #f just to say that 'foo is not the same as 'bar, something that should be a quick pointer comparison.
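
A toy Python analogue of that fallback chain (function names loosely modeled on the C ones; this is an illustration, not Guile's implementation) shows how much machinery runs before two distinct symbols are declared unequal:

```python
# Each layer does its own checks, fails, and dispatches onward, just to
# report that two distinct symbols are not equal -- something a single
# pointer comparison would settle.

calls = []                          # record which layers ran

def scm_equal_p(a, b):
    calls.append("equal?")
    if a is b:                      # built-in eq?-style identity check
        return True
    # ... further built-in structural checks would go here ...
    return generic_equal(a, b)      # dispatch to the equal? generic

def generic_equal(a, b):
    calls.append("generic-equal")
    return scm_eqv_p(a, b)          # the default method chains to eqv?

def scm_eqv_p(a, b):
    calls.append("eqv?")
    if a is b:
        return True
    return generic_eqv(a, b)        # dispatch to the eqv? generic

def generic_eqv(a, b):
    calls.append("generic-eqv")
    return False                    # finally: the symbols differ

foo, bar = object(), object()       # stand-ins for the symbols 'foo and 'bar
result = scm_equal_p(foo, bar)
```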

So I'm not exactly sure what I'm going to do to fix this, but at least now I know what to think about. I would have had no clue that it was the pattern matcher if it weren't for the graphical visualization that chartprof gave me. So yay for turning optimization into a tools problem.

code

The code is in (charting prof) from git guile-charting, which in turn needs guile-lib from git.

I'd be interested in hearing feedback about the visualizations, particularly if people have other ideas about how to visualize call graphs. I also have some other information that I could present somehow: the arguments to the procedure applications, and the source locations of the call sites.

Happy hacking!

Andy Wingo, "dynamic dispatch: a followup" (https://wingolog.org/2008/10/19/dynamic-dispatch-a-followup, 2008-10-19T22:40:00Z)

It seems that the 8-hash technique for dynamic dispatch that I mentioned in my last essay actually has a longer pedigree. At least 10 years before GOOPS' implementation, the always-excellent Gregor Kiczales wrote, with Luis H Rodriguez Jr.:

If we increase the size of class wrappers slightly, we can add more hash seeds to each wrapper. If n is the number of hash seeds stored in each wrapper, we can think of each generic function selecting some number x less than n and using the xth hash seed from each wrapper. Currently we store 8 hash seeds in each wrapper, resulting in very low average probe depths.

The additional hash seeds increase the probability that a generic function will be able to have a low average probe depth in its memoization table. If one set of seeds doesn't produce a good distribution, the generic function can select one of the other sets instead. In effect, we are increasing the size of class wrappers in order to decrease the size of generic function memoization tables. This tradeoff is attractive since typical systems seem to have between three and five times as many generic functions as classes.

Efficient method dispatch in PCL

So Mikael Djurfeldt, the GOOPS implementor, appears to have known about CLOS implementation strategies. But it's interesting how this knowledge percolates out -- it's not part of the computer science canon. When you read these papers, it's always "Personal communication from Dave Moon this" and "I know about this Kiczales paper that". (Now you do too.)

Also interesting about the Kiczales paper is the focus on the user, the programmer, in the face of redefinitions -- truly a different culture than the one that is dominant now.

polymorphic inline caches buzz buzz buzz

This reference comes indirectly via Keith Rarick, who writes to mention a beautiful paper by Hölzle, Chambers, and Ungar, introducing polymorphic inline caches, a mechanism to dispatch based on runtime types, as GOOPS does.

PICs take dispatch one step further: instead of indirect table lookups as GOOPS does, a PIC is a runtime-generated procedure that performs the lookups directly in code. This difference between data-driven processing and direct execution is the essence of compilation -- compilation pushes all of the caching and branching logic as close to the metal as possible.
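
Here is a miniature PIC in Python -- a sketch of the idea, not of any real implementation. The call site holds a stub; each miss falls back to the full lookup and then extends the stub's chain of direct type checks:

```python
def make_pic(generic_lookup):
    entries = []                            # (type, method) pairs seen so far

    def pic(obj, *args):
        for klass, method in entries:       # the "compiled-in" type checks
            if type(obj) is klass:
                return method(obj, *args)   # hit: straight to the method
        method = generic_lookup(type(obj))  # miss: run the full dispatch
        entries.append((type(obj), method)) # ...and cache the result inline
        return method(obj, *args)

    pic.entries = entries
    return pic

# A hypothetical generic with one method per receiver type.
methods = {int: lambda x: x + 1, str: lambda s: s.upper()}
frob = make_pic(lambda t: methods[t])
```

In a real system the chain of checks is emitted as machine code rather than a Python loop, which is exactly the data-driven-versus-direct-execution difference described above.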

Furthermore, PICs can be a source of data as well as a dispatch mechanism:

The presence of PIC-based type information fundamentally alters the nature of optimization of dynamically-typed object-oriented languages. In “traditional” systems such as the current SELF compiler, type information is scarce, and consequently the compiler is designed to make the best possible use of the type information. This effort is expensive both in terms of compile time and compiled code space, since the heuristics in the compiler are tuned to spend time and space if it helps extract or preserve type information. In contrast, a PIC-based recompiling system has a veritable wealth of type information: every message has a set of likely receiver types associated with it derived from the previously compiled version’s PICs. The compiler’s heuristics and perhaps even its fundamental design should be reconsidered once the information in PICs becomes available [...].

Optimizing Dynamically-Typed Object-Oriented Programming Languages with Polymorphic Inline Caches

The salient point is that in latent-typed languages, all of the static type analysis techniques that we know are insufficient. Only runtime analysis and runtime recompilation can capture the necessary information for efficient compilation.

Read both of these articles! But if you just read one, make it the Ungar/Chambers/Hölzle -- it is well-paced, clearly-written, and illuminating.

Happy hacking!

Andy Wingo, "dispatch strategies in dynamic languages" (https://wingolog.org/2008/10/17/dispatch-strategies-in-dynamic-languages, 2008-10-17T16:54:03Z)

In a past dispatch, I described in passing how accessors in Scheme can be efficiently dispatched into direct vector accesses. In summary, if I define a <point> type, and make an instance:

(define-class <point> ()
  (x #:init-keyword #:x #:accessor x)
  (y #:init-keyword #:y #:accessor y))

(define p (make <point> #:x 10 #:y 20))

Access to the various bits of what I call the object closure of p occurs via lazy, type-specific compilation, such that:

(y p)

internally dispatches to

(@slot-ref p 1)

So this is all very interesting and such, but how do we actually do the dispatch? It turns out that this is a quite interesting problem, one that the JavaScript people have been taking up recently in other contexts. The basic problem is: given a generic operation and a set of parameters with specific types, how do you determine the exact procedure to apply to the arguments?

dynamic dispatch from a moppy perspective

This problem is known as dynamic dispatch. GOOPS, Guile's object system, approaches it from the perspective of the meta-object protocol, the MOP. A MOP is an extensible, layered protocol that allows the user to replace and extend parts of a system's behaviour, while still allowing the system to maintain performance. (The latter is what I refer to as the "negative specification" -- the set of optimizations that the person writing the specifications doesn't say, but has in the back of her mind.)

The MOP that specifies GOOPS' behavior states that, on an abstract level, the process of applying a generic function to a particular set of arguments is performed by the apply-generic procedure, which itself is a generic function:

(apply-generic (gf <generic>) args)

The default implementation looks something like this:

(define-method (apply-generic (gf <generic>) args)
  (let ((methods (compute-applicable-methods gf args)))
    (if methods
        (apply-methods
         gf (sort-applicable-methods gf methods args) args)
        (no-applicable-method gf args))))

That is to say, first we figure out which methods actually apply to specific arguments, then sort them, then apply them. Of course, there are many ways that you might want the system to apply the methods. For some methods, you might want to invoke all applicable methods, in order from most to least specific; others, the same but in opposite order; or, in the normal case, simply apply the most specific method, and allow that method to explicitly chain up.

In effect, you want to parameterize the operation of the various components of apply-generic, depending on the type of the operation. With a MOP, you effect this parameterization by specifying that compute-applicable-methods, apply-methods, sort-applicable-methods, and no-applicable-method should themselves be generic functions. They should have default implementations that make sense, but those implementations should be replaceable by the user.

At this point the specification is awesome in its power -- completely general, completely overridable... but what about efficiency? Do you really have to go through this entire process just to invoke a method?

performance in dynamic languages

There are four general techniques for speeding up an algorithm: caching, compiling, delaying computation, and indexing.

Peter Norvig, A Retrospective on PAIP, lesson 21.

Just because the process has been specified in a particular way at a high level does not mean that the actual implementation has to perform all of the steps all of the time. My previous article showed an example of compilation, illustrating how high-level specifications can compile down to pointer math. In this article, what I'm building to is an exegesis of GOOPS' memoization algorithm, which to me is an amazing piece of work.

The first point to realize is that in the standard case, in which the generic function is an instance of <generic> and not a subclass thereof, application of a generic function to a set of arguments will map to the invocation of a single implementing method.

Furthermore, the particular method to invoke depends entirely on the concrete types of the arguments at the call site. For example, let's define a cartesian distance generic:

(define-method (distance (p1 <point>) (p2 <point>))
  (define (square z)
    (* z z))
  (sqrt (+ (square (- (x p1) (x p2)))
           (square (- (y p1) (y p2))))))

Now, if we invoke this generic on some set of arguments:

(distance a b)

The particular method to be applied can be determined entirely from the types of a and b. Specifically, if both types are subtypes of <point>, the above-defined method will apply, and if not, no method that we know about will apply.

Therefore, whatever the result of the dispatch is, we can cache (or memoize) that result, and use it the next time, avoiding invocation of the entire protocol of methods.

The only conditions that might invalidate this memoization would be adding other methods to the distance generic, or redefining methods in the meta-object protocol itself. Both of these can be detected by the runtime object system, and can then trigger cache invalidation.
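
The shape of this memoization can be sketched in Python (hypothetical names; nothing like GOOPS' actual C implementation): dispatch results are cached keyed on the concrete argument types, and adding a method invalidates the cache.

```python
class Generic:
    def __init__(self):
        self.methods = []        # (types, procedure) pairs
        self.cache = {}          # argument-type tuple -> procedure
        self.protocol_runs = 0   # how often the full protocol ran

    def add_method(self, types, proc):
        self.methods.append((types, proc))
        self.cache.clear()       # new method: memoized results may be stale

    def compute_applicable_method(self, key):
        self.protocol_runs += 1  # stands in for the whole MOP protocol
        for types, proc in self.methods:
            if all(issubclass(k, t) for k, t in zip(key, types)):
                return proc
        raise TypeError("no applicable method")

    def __call__(self, *args):
        key = tuple(type(a) for a in args)
        proc = self.cache.get(key)
        if proc is None:
            proc = self.compute_applicable_method(key)
            self.cache[key] = proc
        return proc(*args)

distance = Generic()
distance.add_method((int, int), lambda a, b: abs(a - b))
```

After the first call for a given set of types, only the cache lookup remains on the fast path.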

implementation

So, how to implement this memoization, then? The obvious place to store the cached data is on the generic itself, the same object that has a handle on all of the methods anyway. In the end, dynamic dispatch is always at least one level indirected -- even C++ people can't get around their vmethod table.

Algorithms in computer science are fundamentally about tradeoffs between space and time. If you can improve in one without affecting the other, assuming correctness, then you can say that one algorithm is better than the other. However, after the initial optimizations are made, you are left with a tradeoff between the two.

This point is particularly salient in the context of generic functions, some of which might have only one implementing method, while others might have hundreds. How to choose an algorithm that makes the right tradeoff for this wide range in input?

Guile does something really interesting in this regard. First, it recognizes that the real input to the algorithm is the set of types being dispatched, not the set of types for which a generic function is specialized.

This is based on the realization that at runtime, a given piece of code will only see a certain, reduced set of types -- perhaps even one set of types.

linear search

In the degenerate but common case in which dispatch only sees one or two sets of types, you can do a simple linear search of a vector containing typeset-method pairs, comparing argument types. If this search works, and the vector is short enough, you can dispatch with a few cmp instructions.

This observation leads to a simple formulation of the method cache, as a list of class lists, with each class list tailed by the particular method implementation.

In Algol-like pseudocode, because this algorithm is indeed implemented in C:

def dispatch(types, cache):
    for (i=0; i<len(cache); i++)
        entry=cache[i]
        x = types
        while 1:
           if null(x) && null(cdr(entry)): return car(entry)
           if null(x): break
           if car(x) != car(entry): break
           x = cdr(x); entry = cdr(entry)
    return false

(The tradeoff between vectors and lists is odd at short lengths; the former requires contiguous blocks of memory and goes through malloc, whereas the latter can take advantage of the efficient uniform allocation and marking infrastructure provided by GC. In this case Guile actually uses a vector of lists, due to access patterns.)
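
Here is a runnable Python transcription of the pseudocode above, with ordinary lists standing in for the cons lists; each cache entry is a list of classes tailed by the method for exactly those argument types:

```python
def dispatch(types, cache):
    for entry in cache:
        x, e = list(types), list(entry)
        while True:
            if not x and len(e) == 1:
                return e[0]             # reached the tail: the method
            if not x or x[0] is not e[0]:
                break                   # length or type mismatch
            x, e = x[1:], e[1:]         # cdr down both lists
    return False

class Point:
    pass

point_point_method = "method for (<point> <point>)"
cache = [[Point, Point, point_point_method]]
```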

hash cache

In the general case in which we see more than one or two different sets of types, you really want to avoid the linear search, as this code is in the hot path, called every time you dispatch a generic function.

For this reason, Guile offers a second strategy, optimistic hashing. There is still a linear search through a cache vector, but instead of starting at index 0, we start at an index determined from hashing all of the types being dispatched.

With luck, if we hit the cache, the linear search succeeds after the first check; and if not, the search continues, wrapping around at the end of the vector, and stops with a cache miss if we wrap all the way around.

So, the algorithm is about the same, but rotated:

def dispatch(types, cache):
    i = h = hash(types) % len(cache)
    do
        entry=cache[i]
        x = types
        while 1:
           if null(x) && null(cdr(entry)): return car(entry)
           if null(x): break
           if car(x) != car(entry): break
           x = cdr(x); entry = cdr(entry)
        i = (i + 1) % len(cache)
    while i != h
    return false
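
And the rotated search, again transcribed to runnable Python. I use None for unoccupied cache slots and a per-class hash table -- both assumptions for illustration, not Guile's actual representation:

```python
def hash_dispatch(types, cache, hash_of):
    n = len(cache)
    i = h = sum(hash_of[t] for t in types) % n
    while True:
        entry = cache[i]
        if entry is not None:
            x, e = list(types), list(entry)
            while True:
                if not x and len(e) == 1:
                    return e[0]         # hit
                if not x or x[0] is not e[0]:
                    break               # mismatch: keep probing
                x, e = x[1:], e[1:]
        i = (i + 1) % n                 # wrap around at the end
        if i == h:
            return False                # full cycle: cache miss

class A: pass
class B: pass

hash_of = {A: 1, B: 2}
cache = [None] * 4
cache[(hash_of[A] + hash_of[B]) % 4] = [A, B, "method for (A B)"]
```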

OK, all good. But: how to compute the hash value?

One straightforward way to do it would be to associate some random-ish variable with each type object, given that they are indeed first-class objects, and simply add together all of those values. This strategy would not distinguish hash values on argument order, but otherwise would be pretty OK.

The standard source for this random-ish number would be the address of the class. There are two problems with this strategy: one, the lowest bits are likely to be the same for all classes, as they are aligned objects, but we can get around that one. The more important problem is that this strategy is likely to produce collisions in the cache vector between different sets of argument types, especially if dispatch sees a dozen or more combinations.

Here is where Guile does some wild trickery that I didn't fully appreciate until this morning when following all of the code paths. It associates a vector of eight random integers with each type object, called the "hashset". These values are the possible hash values of the class.

Then, when adding a method to the dispatch cache, Guile first tries rehashing with the first element of each type's hashset. If there are no collisions, the cache is saved, along with the index into the hashset. If there are collisions, it continues on, eventually selecting the cache and hashset index with the fewest collisions.

This necessitates storing the hashset index on the generic itself, so that dispatch knows how the cache has been hashed.
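
The selection step can be sketched in Python (contrived numbers and hypothetical names): given the sets of argument types the cache must hold, try each hashset index and keep the one producing the fewest collisions.

```python
N_HASHSETS = 8   # eight random integers per class, as described above

def best_hashset_index(typesets, cache_size, hashsets):
    best_idx, best_collisions = 0, None
    for idx in range(N_HASHSETS):
        slots = [sum(hashsets[klass][idx] for klass in ts) % cache_size
                 for ts in typesets]
        collisions = len(slots) - len(set(slots))
        if best_collisions is None or collisions < best_collisions:
            best_idx, best_collisions = idx, collisions
        if best_collisions == 0:
            break                # perfect scatter; no need to look further
    return best_idx

# Contrived hashsets: seed 0 makes both typesets collide, seed 1 does not.
hashsets = {"a": [3, 0, 0, 0, 0, 0, 0, 0],
            "b": [3, 1, 0, 0, 0, 0, 0, 0]}
typesets = [("a",), ("b",)]
```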

I think this strategy is really neat! Actually I wrote this whole article just to get to here. So if you missed my point and actually are interested, let me know and I can see about explaining better.

fizzleout

Once you know exactly what types are running through your code, it's not just about dispatch -- you can compile your code given those types, as if you were in a static language but with much less work. That's what Psyco does, and I plan on getting around to the same at some point.

Of course there are many more applications of this in the dynamic languages community, but since the Fortran family of languages has had such a stranglehold on programmers' minds for so many years, the techniques remain obscure to many.

Thankfully, as the years pass, other languages approach Lisp more and more. Here's to 50 years of the Buddhist virus!

Andy Wingo, "object closure and the negative specification" (https://wingolog.org/2008/04/22/object-closure-and-the-negative-specification, 2008-04-22T21:58:58Z)

Guile-GNOME was the first object-oriented framework that I had ever worked with in Scheme. I came to it with all kinds of bogus ideas, mostly inherited from my C to Python formational trajectory. I'd like to discuss one of those today: the object closure. That is, if an object is code bound up with data, how does the code have access to data?

In C++, object closure is a non-problem. If you have an object, w, and you want to access some data associated with it, you dereference the widget structure to reach the member that you need:

char *str = w->name;

Since the compiler knows the type of w, it knows the exact layout of the memory pointed to by w. The ->name dereference compiles into a memory fetch from a fixed offset from the widget pointer.

In contrast, data access in Python is computationally expensive. A simple expression like w.name must perform the following steps:

  1. look up the class of w (call it W)

  2. loop through all of the classes in W's "method resolution order" --- an ordered set of all of W's superclasses --- to see if the class defines a "descriptor" for this property. In some cases, this descriptor might be called to get the value for name.

  3. find the "dictionary", a hash table, associated with w. If the dictionary contains a value for name, return that.

  4. otherwise if there was a descriptor, call the descriptor to see what to do.

This process is run every time you see a . between two letters in python. OK, so getattr does have an opcode to itself in CPython's VM instruction set, and the above code is implemented mostly in C (see Objects/object.c:PyObject_GenericGetAttr). But that's about as fast as it can possibly get, because the structure of the Python language definition prohibits any implementation of Python from ever having enough information to implement the direct memory access that is possible in C++.
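
A simplified Python rendering of those steps may make them concrete; the real lookup handles more cases, but the shape is the same: MRO scan, data descriptors first, then the instance dict, then the fallback.

```python
def generic_getattr(obj, name):
    meta_attr = None
    for klass in type(obj).__mro__:             # steps 1-2: scan the MRO
        if name in klass.__dict__:
            meta_attr = klass.__dict__[name]
            break
    descr_get = getattr(type(meta_attr), "__get__", None)
    if descr_get and hasattr(type(meta_attr), "__set__"):
        return descr_get(meta_attr, obj, type(obj))  # data descriptor wins
    inst_dict = getattr(obj, "__dict__", {})    # step 3: the instance dict
    if name in inst_dict:
        return inst_dict[name]
    if meta_attr is not None:                   # step 4: descriptor or
        if descr_get:                           # plain class attribute
            return descr_get(meta_attr, obj, type(obj))
        return meta_attr
    raise AttributeError(name)

class W:
    @property                   # a descriptor, found in step 2
    def name(self):
        return "prop:" + self._n

w = W()
w._n = "x"

class V:
    pass

v = V()
v.name = "dictval"              # lives in the instance dict (step 3)
```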

But, you claim, that's just what you get when you program in a dynamic language! What do you want to do, go back to C++?

straw man says nay

"First, do no harm", said a practitioner of another profession. Fundamental data structures should be chosen in such a way that needed optimizations are possible. Constructs such as Python's namespaces-as-dicts actively work against important optimizations, effectively putting an upper bound on how fast code can run.

So for example in the case of the object closure, if we are to permit direct memory access, we should allow data to be allocated at a fixed offset into the object's memory area.

Then, the basic language constructs that associate names with values should be provided in such a way that the compiler can determine what the offset is for each data element.

In dynamic languages, types and methods are defined and redefined at runtime. New object layouts come into being, and methods which operated on layouts of one type will see objects of new types as the program evolves. All of this means that to maintain this direct-access characteristic, the compiler must be present at runtime as well.

So, in my silly w.name example, there are two cases: one, in which the getattr method is seeing the combination of the class W and the slot name for the first time, and one in which we have seen this combination already. In the first case, the compiler runs, associating this particular combination of types with a new procedure, newly compiled to perform the direct access corresponding to where the name slot is allocated in instances of type W. Once this association is established, or looked up as in the second case, we jump into the compiled access procedure.
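
The two cases can be sketched like this (hypothetical names; the real mechanism lives in the runtime's dispatch machinery, not in Python closures): on first sight of a (class, slot) combination, "compile" a direct accessor -- here just a closure over a fixed index -- and reuse it on every later lookup.

```python
compiled_accessors = {}

def lookup_or_compile(klass, slot):
    key = (klass, slot)
    proc = compiled_accessors.get(key)
    if proc is None:                         # first case: compile
        index = klass.layout.index(slot)     # offset fixed at compile time
        proc = lambda instance: instance.data[index]
        compiled_accessors[key] = proc
    return proc                              # second case: cache hit

class W:
    layout = ("x", "name")                   # slot layout for W instances
    def __init__(self, x, name):
        self.data = [x, name]

w = W(10, "widget")
name_of = lookup_or_compile(W, "name")
```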

Note that at this point, we haven't specified what the relationship is between layouts and subclassing. We could further specify that subclasses cannot alter the layout of slots defined by superclasses. Or, we could just leave it as it is, which is what Guile does.

Guile, you say? That slow, interpreted Scheme implementation? Well yes, I recently realized (read: was told) that Guile in fact implements this exact algorithm for dispatching its generic functions. Slot access does indeed compile down to direct access, as far as can be done in a semi-interpreted Scheme, anyway. The equivalent of the __mro__ traversal mentioned in the above description of python's getattr, which would be performed by slot-ref, is compiled out in Guile's slot accessor generics.

In fact, as a theoretical aside, since Guile dispatches lazily on the exact types of the arguments given to generic functions (and not just the specializer types declared on the individual methods), it can lazily compile methods knowing exactly what types they are operating on, with all the possibilities for direct access and avoidance of typechecking that that entails. But this optimization has not yet entered the realm of practice.

words on concision

Python did get one thing right, however: objects' code access their data via a single character.

It is generally true that we tend to believe that the expense of a programming construct is proportional to the amount of writer's cramp that it causes us (by "belief" I mean here an unconscious tendency rather than a fervent conviction). Indeed, this is not a bad psychological principle for language designers to keep in mind. We think of addition as cheap partly because we can notate it with a single character: "+". Even if we believe that a construct is expensive, we will often prefer it to a cheaper one if it will cut our writing effort in half.

Guy Steele, Debunking the 'Expensive Procedure Call' Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO (p.9)

Since starting with Guile, over 5 years ago now, I've struggled a lot with object-oriented notation. The problem has been to achieve that kind of Python-like concision while maintaining schemeliness. I started with the procedural slot access procedures:

(slot-ref w 'name)
(slot-set! w 'name "newname")

But these procedures are ugly and verbose. Besides that, since they are not implemented as generic functions, they prevent the lazy compilation mentioned above.

GOOPS, Guile's object system, does allow you to define slot accessor generic functions. So when you define the class, you pass the #:accessor keyword inside the slot definition:

(define-class <foo> ()
  (bar #:init-keyword #:bar #:accessor bar))

(define x (make <foo> #:bar 3))
(bar x) => 3
(set! (bar x) 4)

Now for me, typographically, this is pretty good. In addition, it's compilable, as mentioned above, and it's mappable: one can (map bar list-of-x), which compares favorably to the Python equivalent, [x.bar for x in list_of_x].

My problem with this solution, however, is its interaction with namespaces and modules. Suppose that your module provides the type, <foo>, or, more to the point, <gtk-window>. If <gtk-window> has 54 slots, and you define accessors for all of those slots, you have to export 54 more symbols as part of your module's interface.

This heavy "namespace footprint" is partly psychological, and partly real.

It is "only" psychological inasmuch as methods of generic functions do not "occupy" a whole name; they only specify what happens when a procedure is called with particular types of arguments. Thus, if opacity is an accessor, it doesn't occlude other procedures named opacity, it just specifies what happens when you call (opacity x) for certain types of x. It does conflict with other types of interface exports however (variables, classes, ...), although classes have their own <typographic-convention>. *Global-variables* do as well, and other kinds of exports are not common. So in theory the footprint is small.

On the other hand, there are real impacts to reading code written in this style. You read the code and think, "where does bar come from?" This mental computation is accompanied with machine computation. First, because in a Scheme like Guile that starts from scratch every time it's run, the accessor procedures have to be allocated and initialized every time the program runs. (The alternatives would be an emacs-like dump procedure, or R6RS-like separate module compilation.) Second, because the (drastically) increased number of names in the global namespace slows down name resolution.

lexical accessors

Recently, I came upon a compromise solution that works well for me: the with-accessors macro. For example, to scale the opacity of a window by a ratio, you could do it like this:

(define (scale-opacity w ratio)
  (with-accessors (opacity)
    (set! (opacity w)
          (* (opacity w) ratio))))

This way you have all of the benefits of accessors, with the added benefit that you (and the compiler) can see lexically where the opacity binding comes from.

Well, almost all of the benefits, anyway: for various reasons, for this construct to be implemented with accessors, Guile would need to support subclasses of generic functions, which it does not yet. But the user-level code is correct.

Note that opacity works on instances of any type that has an opacity slot, not just windows.

Also note that the fact that we allow slots to be allocated in the object's memory area does not prohibit other slot allocations. In the case of <gtk-window>, the getters and setters for the opacity slot actually manipulate the opacity GObject property. As you would expect, no memory is allocated for the slot in the Scheme wrapper.

For posterity, here is a defmacro-style definition of with-accessors, for Guile:

(define-macro (with-accessors names . body)
  `(let (,@(map (lambda (name)
                  `(,name ,(make-procedure-with-setter
                            (lambda (x) (slot-ref x name))
                            (lambda (x y) (slot-set! x name y)))))
                names))
     ,@body))

final notes

Interacting with a system with a meta-object protocol has been a real eye-opener for me. Especially interesting has been the interplay between the specification, which specifies the affordances of the object system, and the largely unwritten "negative specification", which is the set of optimizations that the specification hopes to preserve. Interested readers may want to check out Gregor Kiczales' work on meta-object protocols, the canonical work being his "The Art of the Metaobject Protocol". All of Kiczales' work is beautiful, except the aspect-oriented programming stuff.

For completeness, I should mention the java-dot notation, which has been adopted by a number of lispy languages targetting the JVM or the CLR. Although I guess it meshes well with the underlying library systems, I find it to be ugly and non-Schemey.

And regarding Python, lest I be accused of ignoring __slots__: the getattr lookup process described is the same, even if your class defines __slots__. The __slots__ case is handled by descriptors in step 2. This is specified in the language definition. If slots were not implemented using descriptors, then you would still have to do the search to see if there were descriptors, although some version of the lazy compilation technique could apply.

Andy Wingo, "allocate, memory (part 2 of n)" (https://wingolog.org/2008/04/11/allocate-memory-part-of-n, 2008-04-11T18:11:17Z)

In my last installment, I started by saying something about a patch, then got lost in the vagaries of garbage collection. Hopefully in this writing product I can rein in the dangling pointers.

bit twiddling

If you are programming in an environment with a garbage collector, and you want to expose C resources to your managed environment, you will have to cooperate with the garbage collector. Specifically, you will need to know when an object becomes garbage, so that you can deallocate its resources.

In Guile, the historic way to do this is via a specific kind of boxed value, the "small object", or "SMOB". A SMOB is a double- or quad-word object whose first word is a tag designating the type of the object, and the rest is for C code to manipulate. SMOB types have to be registered with the Guile runtime, and have type-specific free, printing, and marking functions.

SMOBs are impoverished objects, however. There are only 8 bits in which to store the SMOB type, and they must be registered manually in C. It would be impossible to associate one SMOB type with each GType, for instance. So for Guile-GNOME, which is where I'm really going with all of this, you have to wrap C objects on two levels: one generic SMOB for GTypeInstances, and one more object-oriented wrapper that exposes the GType.

This two-level wrapping is ugly, confusing, and bug-prone. Thinking about this, and looking at the (gnome gobject) module with an eye to a stable, supportable interface, I wondered: is there not some other way to get free() notifications from the garbage collector?

As you might guess, the answer is yes. But to fully explain, in my most verbose fashion, we'll have to take a look at how objects are represented in the Lisp family of languages. The following discussion is specific to Guile, but shares fundamental characteristics with Common Lisp and other standard Lisp systems.

objects in scheme

Guile's object system, GOOPS, is derived from TinyCLOS, via STklos. This includes a full meta-object protocol, so all details of e.g. instance, slot, and class allocation are fully extensible, while maintaining compilability. The following graphic describes the memory layout produced by the default allocation protocol:

click for the svg

Instances contain two important words, one pointing to their class, and one pointing to a data array. The data array is as long as the number of Scheme objects that need to be associated with the instance: the set of slots associated with the object. Of course, other slot allocation strategies are possible, for example storing slot values in a hash table as Python does by default.
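To make the layout concrete, here is a minimal GOOPS sketch; the class and slot names are hypothetical, and it assumes the `(oop goops)` module:

```scheme
(use-modules (oop goops))

;; Two slots, both with the default #:instance allocation: each one
;; gets a cell in the instance's data array.
(define-class <point> ()
  (x #:init-value 0)
  (y #:init-value 0))

;; make allocates the instance header -- the class pointer and the
;; data pointer -- plus a two-element data array for x and y.
(define p (make <point>))
```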

By way of illustration, the process of slot access, via (slot-ref foo 'bar), goes like this:

  1. Get the class of the object (dereferencing the vtable pointer)

  2. Find the allocation information for the slot

    • It is part of the class' "slot-definitions" slot

  3. If the slot is allocated in the data array, there is a fast-path array access

  4. Otherwise the slot-definition provides getter and setter procedures for the slot

In effect, the class tells you how to read the instance. The allocation lookup process can be optimized: if the class of the object is known beforehand, either via inference or declaration, the lookup can be performed at compilation time. Otherwise, and more likely, an accessor can be made that compiles getters and setters on the first lookup.
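The steps above can be sketched with GOOPS introspection; the `<foo>` class is hypothetical, and this again assumes the `(oop goops)` module:

```scheme
(use-modules (oop goops))

(define-class <foo> ()
  (bar #:init-value 42))

(define foo (make <foo>))

;; Step 2: the slot-definition list hangs off the class...
(map slot-definition-name (class-slots <foo>)) ;=> (bar)

;; ...and steps 3-4: slot-ref uses that allocation information to
;; read the slot's cell out of the instance's data array.
(slot-ref foo 'bar) ;=> 42
```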

and then I was like, anyway..

It turns out that two little-known facts enable us to shove C pointers into GOOPS objects. First, slots in the default data array may be allocated either as Scheme values (the normal case), or as raw, untagged words. The latter case allows us to put random C data in a Scheme object without confusing the garbage collector or the interpreter. Second, classes have such a "raw word" slot containing a pointer to a free function that will be called when instances are freed, ostensibly to free the data array. However we can override this value to perform type-specific deallocation, and then free the data array.

This realization let me collapse the old memory layout:

click for the svg

into this:

click for the svg

The sum total is that now a wrapper for a GObject is just 3 words: the vtable and data pointers, and a 1-element array to hold the GTypeInstance* pointer. That is, 12 or 24 bytes, for 32- or 64-bit words, respectively. (There are a couple of inefficiencies that increase this number of words to about 6, but those will go away with time.)

That patch took me about 3.5 months to bake fully, and is present in the newly-released Guile-GNOME 2.15.97. I haven't had time to update the online documentation yet, but the number of exported procedures and variables is down by about 100 or 150, with no loss in expressive power. I'm pretty pleased!