wingolog

coarse or lazy?

7 August 2022 9:44 AM (garbage collection | immix | guile | gnu | igalia | evacuation | concurrency | sweeping)

sweeping, coarse and lazy

One of the things that had perplexed me about the Immix collector was how to effectively defragment the heap via evacuation while keeping just 2-3% of space as free blocks for an evacuation reserve. The original Immix paper states:

To evacuate the object, the collector uses the same allocator as the mutator, continuing allocation right where the mutator left off. Once it exhausts any unused recyclable blocks, it uses any completely free blocks. By default, immix sets aside a small number of free blocks that it never returns to the global allocator and only ever uses for evacuating. This headroom eases defragmentation and is counted against immix's overall heap budget. By default immix reserves 2.5% of the heap as compaction headroom, but [...] is fairly insensitive to values ranging between 1 and 3%.

To Immix, a "recyclable" block is partially full: it contains surviving data from a previous collection, but also some holes in which to allocate. But when would you have recyclable blocks at evacuation-time? Evacuation occurs as part of collection. Collection usually occurs when there's no more memory in which to allocate. At that point any recyclable block would have been allocated into already, and won't become recyclable again until the next trace of the heap identifies the block's surviving data. Of course after the next trace they could become "empty", if no object survives, or "full", if all lines have survivor objects.

In general, after a full allocation cycle, you don't know much about the heap. If you could easily know where the live data and the holes were, a garbage collector's job would be much easier :) Any algorithm that starts from the assumption that you know where the holes are can't be used before a heap trace. So, I was not sure what the Immix paper meant here about allocating into recyclable blocks.

Thinking on it again, I realized that Immix might trigger collection early sometimes, before it has exhausted the previous cycle's set of blocks in which to allocate. As we discussed earlier, there is a case in which you might want to trigger an early compaction: when a large object allocator runs out of blocks to decommission from the immix space. And if one evacuating collection didn't yield enough free blocks, you might trigger the next one early, reserving some recyclable and empty blocks as evacuation targets.

when do you know what you know: lazy and eager

Consider a basic question, such as "how many bytes in the heap are used by live objects". In general you don't know! Indeed you often never know precisely. For example, concurrent collectors often have some amount of "floating garbage" which is unreachable data but which survives across a collection. And of course you don't know the difference between floating garbage and precious data: if you did, you would have collected the garbage.

Even the idea of "when" is tricky in systems that allow parallel mutator threads. Unless the program has a total ordering of mutations of the object graph, there's no one timeline with respect to which you can measure the heap. Still, Immix is a stop-the-world collector, and since such collectors synchronously trace the heap while mutators are stopped, these are times when you can exactly compute properties about the heap.

Let's return to the question of measuring live bytes. For an evacuating semi-space, knowing the number of live bytes after a collection is trivial: all survivors are packed into to-space. But for a mark-sweep space, you would have to compute this information. You could compute it at mark-time, while tracing the graph, but doing so takes time, which means delaying the time at which mutators can start again.

Alternately, for a mark-sweep collector, you can compute free bytes at sweep-time. This is the phase in which you go through the whole heap and return any space that wasn't marked in the last collection to the allocator, allowing it to be used for fresh allocations. This is the point in the garbage collection cycle in which you can answer questions such as "what is the set of recyclable blocks": you know what is garbage and you know what is not.

Though you could sweep during the stop-the-world pause, you don't have to; sweeping only touches dead objects, so it is correct to allow mutators to continue and then sweep as the mutators run. There are two general strategies: spawn a thread that sweeps as fast as it can (concurrent sweeping), or make mutators sweep as needed, just before they allocate (lazy sweeping). But this introduces a lag between when you know and what you know—your count of total live heap bytes describes a time in the past, not the present, because mutators have moved on since then.

For most collectors with a sweep phase, deciding between eager (during the stop-the-world phase) and deferred (concurrent or lazy) sweeping is very easy. You don't immediately need the information that sweeping allows you to compute; it's quite sufficient to wait until the next cycle. Moving work out of the stop-the-world phase is a win for mutator responsiveness (latency). Usually people implement lazy sweeping, as it is naturally incremental with the mutator, naturally parallel for parallel mutators, and any sweeping overhead due to cache misses can be mitigated by immediately using swept space for allocation. The case for concurrent sweeping is less clear to me, but if you have cores that would otherwise be idle, sure.
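
To make the lazy option concrete, here is a little C sketch of what a lazy-sweeping allocation slow path might look like. The block layout and the helpers are invented for illustration; they aren't the code of any particular collector.

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE (32 * 1024)

    /* Hypothetical per-block state: a "swept" flag plus a bump region
       describing the current hole. */
    struct block {
      uint8_t swept;      /* swept since the last collection? */
      uint8_t *free;      /* next free byte in the current hole */
      uint8_t *limit;     /* end of the current hole */
      uint8_t data[BLOCK_SIZE];
    };

    /* Assumed helpers, provided elsewhere by the collector. */
    struct block *next_block_to_sweep(void);   /* NULL when all are swept */
    void sweep_block(struct block *b);         /* rebuild free/limit from marks */

    /* Lazy sweeping: the mutator that runs out of space sweeps just enough
       of the heap to satisfy its own allocation, instead of the collector
       sweeping everything during the pause. */
    void *allocate_slow(size_t bytes) {
      struct block *b;
      while ((b = next_block_to_sweep())) {
        if (!b->swept) {
          sweep_block(b);                      /* touches only dead space */
          b->swept = 1;
        }
        if ((size_t)(b->limit - b->free) >= bytes) {
          void *obj = b->free;
          b->free += bytes;
          return obj;
        }
      }
      return NULL;                             /* time to collect (or grow) */
    }

Note that the sweeping work is charged to the allocating mutator, which is what makes it naturally incremental and naturally parallel.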

eager coarse sweeping

Immix is interesting in that it chooses to sweep eagerly, during the stop-the-world phase. Instead of sweeping over irregularly-sized objects, however, it sweeps over its "line mark" array: one byte for each 128-byte "line" in the mark space. For 32 kB blocks, there will be 256 bytes of line marks per block, and the line mark bytes in each 4 MB slab of the heap are packed contiguously. Therefore you get relatively good locality, but this just mitigates a cost that other collectors don't have to pay. So what does eager sweeping over these coarse 128-byte regions buy Immix?
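
Concretely, an eager sweep over the line mark bytes might look something like the following sketch; the names and the block classification are mine, not the paper's, but the constants match the figures above.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SIZE 128
    #define BLOCK_SIZE (32 * 1024)
    #define LINES_PER_BLOCK (BLOCK_SIZE / LINE_SIZE)   /* 256 lines, 256 mark bytes */

    enum block_kind { BLOCK_EMPTY, BLOCK_RECYCLABLE, BLOCK_FULL };

    /* During the pause, walk a block's line marks and classify it.  The marks
       are left in place: the allocator consults them later to find holes (runs
       of unmarked lines), and they are reset at the next mark phase. */
    enum block_kind classify_block(const uint8_t line_marks[LINES_PER_BLOCK]) {
      size_t marked = 0;
      for (size_t i = 0; i < LINES_PER_BLOCK; i++)
        if (line_marks[i])
          marked++;
      if (marked == 0)
        return BLOCK_EMPTY;        /* can go to the lospace, or be reserved for evacuation */
      if (marked == LINES_PER_BLOCK)
        return BLOCK_FULL;
      return BLOCK_RECYCLABLE;     /* has holes to allocate into next cycle */
    }

Because the line marks for a whole slab are contiguous, this walk is mostly sequential memory traffic.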

Firstly, eager sweeping buys you eager identification of empty blocks. If your large object space needs to steal blocks from the mark space, but the mark space doesn't have enough empties, it can just trigger collection and then it knows if enough blocks are available. If no blocks are available, you can grow the heap or signal out-of-memory. If the lospace (large object space) runs out of blocks before the mark space has used all recyclable blocks, that's no problem: evacuation can move the survivors of fragmented blocks into these recyclable blocks, which have also already been identified by the eager coarse sweep.

Without eager empty block identification, if the lospace runs out of blocks, firstly you don't know how many empty blocks the mark space has. Sweeping is a kind of wavefront that moves through the whole heap; empty blocks behind the wavefront will be identified, but those ahead of the wavefront will not. Such a lospace allocation would then have to either wait for a concurrent sweeper to advance, or perform some lazy sweeping work. The expected latency of a lospace allocation would thus be higher, without eager identification of empty blocks.

Secondly, eager sweeping might reduce allocation overhead for mutators. If allocation just has to identify holes and not compute information or decide on what to do with a block, maybe it go brr? Not sure.

lines, lines, lines

The original Immix paper also notes a relative insensitivity of the collector to line size: 64 or 256 bytes could have worked just as well. This was a somewhat surprising result to me but I think I didn't appreciate all the roles that lines play in Immix.

Obviously line size affects the worst-case fragmentation, though this is mitigated by evacuation (which evacuates objects, not lines). This I got from the paper. In this case, smaller lines are better.

Line size affects allocation-time overhead for mutators, though which way I don't know: scanning for holes will be easier with fewer lines in a block, but smaller lines would contain more free space and thus result in fewer collections. I can only imagine though that with smaller line sizes, average hole size would decrease and thus medium-sized allocations would be harder to service. Something of a wash, perhaps.

However if we ask ourselves the thought experiment, why not just have 16-byte lines? How crazy would that be? I think the impediment to having such a precise line size would mainly be Immix's eager sweep, as a fine-grained traversal of the heap would process much more data and incur possibly-unacceptable pause time overheads. But, in such a design you would do away with some other downsides of coarse-grained lines: a side table of mark bytes would make the line mark table redundant, and you eliminate much possible "dark matter" hidden by internal fragmentation in lines. You'd need to defer sweeping. But then you lose eager identification of empty blocks, and perhaps also the ability to evacuate into recyclable blocks. What would such a system look like?

Readers that have gotten this far will be pleased to hear that I have made some investigations in this area. But, this post is already long, so let's revisit this in another dispatch. Until then, happy allocations in all regions.

unintentional concurrency

20 July 2022 9:26 PM (garbage collection | immix | guile | gnu | igalia | evacuation | concurrency)

Good evening, gentle hackfolk. Last time we talked about heuristics for when you might want to compact a heap. Compacting garbage collection is nice and tidy and appeals to our orderly instincts, and it enables heap shrinking and reallocation of pages to large object spaces and it can reduce fragmentation: all very good things. But evacuation is more expensive than just marking objects in place, and so a production garbage collector will usually just mark objects in place, and only compact or evacuate when needed.

Today's post goes into more of the details!

dedication

Just because it's been, oh, a couple decades, I would like to reintroduce a term I learned from Marnanel years ago on advogato, a nerdy group-blog kind of site. As I recall, there is a word that originates in the Oxbridge social environment, "narg", from "Not A Real Gentleman", and which therefore denotes things that not-real-gentlemen do: nerding out about anything that's not, like, fox-hunting or golf; or generally spending time on something not because it will advance you in conventional hierarchies but because you just can't help it, because you love it, because it is just your thing. Anyway, in the spirit of pursuits that are really not What One Does With One's Time, this post is dedicated to the word "nargery".

side note, bis: immix-style evacuation versus mark-compact

In my last post I described Immix-style evacuation, and noted that it might take a few cycles to fully compact the heap, and that it has a few pathologies: the heap might never reach full compaction, and Immix might run out of free blocks in which to evacuate.

With these disadvantages, why bother? Why not just do a single mark-compact pass and be done? I implicitly asked this question last time but didn't really answer it.

For some people the answer will be, yep, yebo, mark-compact is the right answer. And yet, there are a few reasons that one might choose to evacuate a fraction of the heap instead of compacting it all at once.

The first reason is object pinning. Mark-compact systems assume that all objects can be moved; you can't usefully relax this assumption. Most algorithms "slide" objects down to lower addresses, squeezing out the holes, and therefore every live object's address needs to be available to use when sliding down other objects with higher addresses. And yet, it would be nice sometimes to prevent an object from being moved. This is the case, for example, when you grant a foreign interface (e.g. a C function) access to a buffer: if garbage collection happens while in that foreign interface, it would be nice to be able to prevent garbage collection from moving the object out from under the C function's feet.

Another reason to want to pin an object is because of conservative root-finding. Guile currently uses the Boehm-Demers-Weiser collector, which conservatively scans the stack and data segments for anything that looks like a pointer to the heap. The garbage collector can't update such a global root in response to compaction, because you can't be sure that a given word is a pointer and not just an integer with an inconvenient value. In short, objects referenced by conservative roots need to be pinned. I would like to support precise roots at some point but part of my interest in Immix is to allow Guile to move to a better GC algorithm, without necessarily requiring precise enumeration of GC roots. Optimistic partial evacuation allows for the possibility that any given evacuation might fail, which makes it appropriate for conservative root-finding.

Finally, as moving objects has a cost, it's reasonable to want to only incur that cost for the part of the heap that needs it. In any given heap, there will likely be some data that stays live across a series of collections, and which, once compacted, can't be profitably moved for many cycles. Focussing evacuation on only the part of the heap with the lowest survival rates avoids wasting time on copies that don't result in additional compaction.

(I should admit one thing: sliding mark-compact compaction preserves allocation order, whereas evacuation does not. The memory layout of sliding compaction is more optimal than evacuation.)

multi-cycle evacuation

Say a mutator runs out of memory, and therefore invokes the collector. The collector decides for whatever reason that we should evacuate at least part of the heap instead of marking in place. How much of the heap can we evacuate? The answer depends primarily on how many free blocks you have reserved for evacuation. These are known-empty blocks that weren't allocated into during the last cycle. If you don't have any, you can't evacuate! So probably you should keep some around, even when performing in-place collections. The Immix papers suggest 2% and that works for me too.

Then you evacuate some blocks. Hopefully the result is that after this collection cycle, you have more free blocks. But you haven't compacted the heap, at least probably not on the first try: not into 2% of total space. Therefore you tell the mutator to put any empty blocks it finds as a result of lazy sweeping during the next cycle onto the evacuation target list, and then the next cycle you have more blocks to evacuate into, and more and more and so on until after some number of cycles you fall below some overall heap fragmentation low-watermark target, at which point you can switch back to marking in place.
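
Here's a sketch of the bookkeeping I have in mind, in C; the fragmentation measure and the thresholds are just placeholders for whatever heuristic you prefer.

    #include <stdbool.h>
    #include <stddef.h>

    struct gc_policy {
      size_t empty_blocks;     /* known-empty blocks, usable as evacuation targets */
      double fragmentation;    /* e.g. free space outside empty blocks / heap size */
      bool evacuating;         /* are we in a defragmentation episode? */
    };

    #define FRAG_HIGH_WATERMARK 0.10   /* start evacuating above this */
    #define FRAG_LOW_WATERMARK  0.05   /* return to in-place marking below this */

    /* Decide, at the start of a collection, whether to evacuate or mark in place. */
    void decide_collection_kind(struct gc_policy *p) {
      if (!p->evacuating && p->fragmentation > FRAG_HIGH_WATERMARK)
        p->evacuating = true;
      else if (p->evacuating && p->fragmentation < FRAG_LOW_WATERMARK)
        p->evacuating = false;
    }

    /* Called when a mutator's lazy sweep turns up an empty block: while
       defragmenting, reserve it as an evacuation target for the next cycle
       instead of allocating into it.  (List management elided.) */
    void note_empty_block(struct gc_policy *p) {
      p->empty_blocks++;
      /* if (p->evacuating) push onto the evacuation target list;
         else push onto the normal allocation list. */
    }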

I don't know how this works in practice! In my test setup, which triggers compaction at 10% fragmentation and continues until it drops below 5%, it rarely takes more than 3 cycles of evacuation for the heap to drop to effectively 0% fragmentation. Of course I had to introduce fragmented allocation patterns into the microbenchmarks to even cause evacuation to happen at all. I look forward to some day soon testing with real applications.

concurrency

Just as a terminological note, in the world of garbage collectors, "parallel" refers to multiple threads being used by a garbage collector. Parallelism within a collector is essentially an implementation detail; when the world is stopped for collection, the mutator (the user program) generally doesn't care if the collector uses 1 thread or 15. On the other hand, "concurrent" means the collector and the mutator running at the same time.

Different parts of the collector can be concurrent with the mutator: for example, sweeping, marking, or evacuation. Concurrent sweeping is just a detail, because it only visits dead objects. Concurrent marking is interesting, because it can significantly reduce stop-the-world pauses by performing most of the computation while the mutator is running. It's tricky, as you might imagine; the collector traverses the object graph while the mutator is, you know, mutating it. But there are standard techniques to make this work. Concurrent evacuation is a nightmare. It's not that you can't implement it; you can. But it's very very hard to get an overall performance win from concurrent evacuation/copying.

So if you are looking for a good bargain in the marketplace of garbage collector algorithms, it would seem that you need to avoid concurrent copying/evacuation. It's an expensive product that would seem to not buy you very much.

All that is just a prelude to an observation that there is a funny source of concurrency even in some systems that don't see themselves as concurrent: mutator threads marking their own roots. To recall, when you stop the world for a garbage collection, all mutator threads have to somehow notice the request to stop, reach a safepoint, and then stop. Then the collector traces the roots from all mutators and everything they reference, transitively. Then you let the threads go again. Thing is, once you get more than a thread or four, stopping threads can take time. You'd be tempted to just have threads notice that they need to stop, then traverse their own stacks at their own safepoint to find their roots, then stop. But, this introduces concurrency between root-tracing and other mutators that might not have seen the request to stop. For marking, this concurrency can be fine: you are just setting mark bits, not mutating the roots. You might need to add an additional mark pattern that can be distinguished from marked-last-time and marked-the-time-before-but-dead-now, but that's a detail. Fine.
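
One way to get that "additional mark pattern" is to rotate the byte value that means "marked in the current collection", so that stale marks never need to be cleared and never collide with fresh ones. Here is a hedged sketch with invented names; a parallel marker would additionally want an atomic compare-and-swap on the mark byte.

    #include <stdint.h>

    /* 0 means "never marked"; 1, 2 and 3 rotate as the "live this cycle" value.
       Three values, not two, so that an object marked last time and an object
       marked the time before (but dead now) are both distinguishable from
       objects marked in the current cycle. */
    static uint8_t current_epoch = 1;

    void begin_collection(void) {
      current_epoch = (uint8_t)(current_epoch % 3 + 1);   /* 1 -> 2 -> 3 -> 1 */
      /* No sweep over the heap to clear mark bytes is needed. */
    }

    /* Safe for a mutator to run on its own roots at its own safepoint, even
       while other mutators are still running: it only sets mark state and
       never moves anything. */
    int mark_object(uint8_t *mark_byte) {
      if (*mark_byte == current_epoch)
        return 0;                     /* already marked this cycle */
      *mark_byte = current_epoch;
      return 1;                       /* newly marked: push onto the trace queue */
    }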

But if you instead start an evacuating collection, the gates of hell open wide and toothy maws and horns fill your vision. One thread could be stopping and evacuating the objects referenced by its roots, while another hasn't noticed the request to stop and is happily using the same objects: chaos! You are trying to make a minor optimization to move some work out of the stop-the-world phase but instead everything falls apart.

Anyway, this whole article was really to get here and note that you can't do ragged stops with evacuation without supporting full concurrent evacuation. Otherwise, you need to postpone root traversal until all threads are stopped. Perhaps this is another argument that evacuation is expensive, relative to marking in place. In practice I haven't seen the ragged-stop effect make much of a difference, but perhaps that is because evacuation is infrequent in my test cases.

Zokay? Zokay. Welp, this evening's nargery was indeed nargy. Happy hacking to all collectors out there, and until next time.

an optimistic evacuation of my wordhoard

21 June 2022 12:21 PM (garbage collection | immix | guile | gnu | igalia | mmap | evacuation | compaction)

Good morning, mallocators. Last time we talked about how to split available memory between a block-structured main space and a large object space. Given a fixed heap size, making a new large object allocation will steal available pages from the block-structured space by finding empty blocks and temporarily returning them to the operating system.

Today I'd like to talk more about nothing, or rather, why might you want nothing rather than something. Given an Immix heap, why would you want it organized in such a way that live data is packed into some blocks, leaving other blocks completely free? How bad would it be if instead the live data were spread all over the heap? When might it be a good idea to try to compact the heap? Ideally we'd like to be able to translate the answers to these questions into heuristics that can inform the GC when compaction/evacuation would be a good idea.

lospace and the void

Let's start with one of the more obvious points: large object allocation. With a fixed-size heap, you can't allocate new large objects if you don't have empty blocks in your paged space (the Immix space, for example) that you can return to the OS. To obtain these free blocks, you have four options.

  1. You can continue lazy sweeping of recycled blocks, to see if you find an empty block. This is a bit time-consuming, though.

  2. Otherwise, you can trigger a regular non-moving GC, which might free up blocks in the Immix space but which is also likely to free up large objects, which would result in fresh empty blocks.

  3. You can trigger a compacting or evacuating collection. Immix can't actually compact the heap all in one go, so you would preferentially select evacuation-candidate blocks by choosing the blocks with the least live data (as measured at the last GC), hoping that little data will need to be evacuated.

  4. Finally, for environments in which the heap is growable, you could just grow the heap instead. In this case you would configure the system to target a heap size multiplier rather than a heap size, which would scale the heap to be e.g. twice the size of the live data, as measured at the last collection.

If you have a growable heap, I think you will rarely choose to compact rather than grow the heap: you will either collect or grow. Under constant allocation rate, the rate of empty blocks being reclaimed from freed lospace objects will be equal to the rate at which they are needed, so if collection doesn't produce any, then that means your live data set is increasing and so growing is a good option. Anyway let's put growable heaps aside, as heap-growth heuristics are a separate gnarly problem.

The question becomes, when should large object allocation force a compaction? Absent growable heaps, the answer is clear: when allocating a large object fails because there are no empty pages, but the statistics show that there is actually ample free memory. Good! We have one heuristic, and one with an optimum: you could compact in other situations but from the point of view of lospace, waiting until allocation failure is the most efficient.
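
In code, the heuristic might look like the following sketch; the structure, the names, and the notion of "ample" (here a made-up 20% slack) are all placeholders.

    #include <stdbool.h>
    #include <stddef.h>

    struct heap_stats {
      size_t heap_bytes;            /* fixed heap size */
      size_t live_bytes_estimate;   /* as measured at the last collection */
      size_t empty_blocks;          /* blocks that could be decommissioned right now */
    };

    enum gc_kind { GC_NONE, GC_MARK_IN_PLACE, GC_EVACUATE };

    /* Called when a large-object allocation of `bytes` can't find enough
       empty blocks to decommission. */
    enum gc_kind on_lospace_allocation_failure(const struct heap_stats *s,
                                               size_t bytes) {
      size_t free_estimate = s->heap_bytes - s->live_bytes_estimate;
      if (s->empty_blocks == 0 && free_estimate > bytes + s->heap_bytes / 5)
        return GC_EVACUATE;          /* plenty of memory, but it's fragmented: compact */
      if (free_estimate > bytes)
        return GC_MARK_IN_PLACE;     /* an ordinary collection may free enough */
      return GC_NONE;                /* grow the heap, or report out-of-memory */
    }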

shrinkage

Moving on, another use of empty blocks is when shrinking the heap. The collector might decide that it's a good idea to return some memory to the operating system. For example, I enjoyed this recent paper on heuristics for optimum heap size, which advocates that you size the heap in proportion to the square root of the allocation rate, and that as a consequence, when/if the application reaches a dormant state, it should promptly return memory to the OS.

Here, we have a similar heuristic for when to evacuate: when we would like to release memory to the OS but we have no empty blocks, we should compact. We use the same evacuation candidate selection approach as before, also, aiming for maximum empty block yield.

fragmentation

What if you go to allocate a medium object, say 4kB, but there is no hole that's 4kB or larger? In that case, your heap is fragmented. The smaller your heap size, the more likely this is to happen. We should compact the heap to make the maximum hole size larger.

side note: compaction via partial evacuation

The evacuation strategy of Immix is... optimistic. A mark-compact collector will compact the whole heap, but Immix will only be able to evacuate a fraction of it.

It's worth dwelling on this a bit. As described in the paper, Immix reserves around 2-3% of overall space for evacuation overhead. Let's say you decide to evacuate: you start with 2-3% of blocks being empty (the target blocks), and choose a corresponding set of candidate blocks for evacuation (the source blocks). Since Immix is a one-pass collector, it doesn't know how much data is live when it starts collecting. It may not know that the blocks that it is evacuating will fit into the target space. As specified in the original paper, if the target space fills up, Immix will mark in place instead of evacuating; an evacuation candidate block with marked-in-place objects would then be non-empty at the end of collection.

In fact if you choose a set of evacuation candidates hoping to maximize your empty block yield, based on an estimate of live data instead of limiting yourself to the number of target blocks, I think it's possible to actually fill the targets before the source blocks empty, leaving you with no empty blocks at the end! (This can happen due to inaccurate live data estimations, or via internal fragmentation within blocks.) The only way to avoid this is to never select more evacuation candidate blocks than you have target blocks. If you are lucky, you won't have to use all of the target blocks, and so at the end you will end up with more free blocks than you started with, and a subsequent evacuation will be more effective. The defragmentation result in that case would still be pretty good, but the yield in free blocks is not great.
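
Here is what the conservative rule might look like in code; `blocks` is assumed to be sorted so the sparsest blocks (least live data, as estimated at the last collection) come first, and the names are invented.

    #include <stdbool.h>
    #include <stddef.h>

    struct block_meta {
      size_t live_bytes_estimate;    /* from the previous collection */
      bool evacuation_candidate;
    };

    /* Conservative candidate selection: never admit more candidates than
       there are reserved empty target blocks, so evacuation can never run
       out of space.  An optimistic policy would instead budget on the live
       byte estimates and accept the risk of having to mark some candidates
       in place. */
    size_t select_candidates(struct block_meta *blocks, size_t nblocks,
                             size_t target_blocks) {
      size_t selected = 0;
      for (size_t i = 0; i < nblocks && selected < target_blocks; i++) {
        blocks[i].evacuation_candidate = true;
        selected++;
      }
      return selected;
    }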

In a production garbage collector I would still be tempted to be optimistic and select more evacuation candidate blocks than available empty target blocks, because it will require fewer rounds to compact the whole heap, if that's what you wanted to do. It would be a relatively rare occurrence to start an evacuation cycle. If you ran out of space while evacuating, in a production GC I would just temporarily commission some overhead blocks for evacuation and release them promptly after evacuation is complete. If you have a small heap multiplier in your Immix space, occasional partial evacuation in a long-running process would probably reach a steady state with blocks being either full or empty. Fragmented blocks would represent newer objects and evacuation would periodically sediment these into longer-lived dense blocks.

mutator throughput

Finally, the shape of the heap has its inverse in the shape of the holes into which the mutator can allocate. It's most efficient for the mutator if the heap has as few holes as possible: ideally just one large hole per block, which is the limit case of an empty block.

The opposite extreme would be having every other "line" (in Immix terms) be used, so that free space is spread across the heap in a vast spray of one-line holes. Even if fragmentation is not a problem, perhaps because the application only allocates objects that pack neatly into lines, having to stutter all the time to look for holes is overhead for the mutator. Also, the result is that contemporaneous allocations are more likely to be placed farther apart in memory, leading to more cache misses when accessing data. Together, allocator overhead and access overhead lead to lower mutator throughput.

When would this situation get so bad as to trigger compaction? Here I have no idea. There is no clear maximum. If compaction were free, we would compact all the time. But it's not; there's a tradeoff between the cost of compaction and mutator throughput.

I think here I would punt. If the heap is being actively resized based on allocation rate, we'll hit the other heuristics first, and so we won't need to trigger evacuation/compaction based on mutator overhead. You could measure this, though, in terms of average or median hole size, or average or maximum number of holes per block. Since evacuation is partial, all you need to do is to identify some "bad" blocks and then perhaps evacuation becomes attractive.

gc pause

Welp, that's some thoughts on when to trigger evacuation in Immix. Next time, we'll talk about some engineering aspects of evacuation. Until then, happy consing!

blocks and pages and large objects

20 June 2022 2:59 PM (garbage collection | immix | guile | gnu | igalia | mmap)

Good day! In a recent dispatch we talked about the fundamental garbage collection algorithms, also introducing the Immix mark-region collector. Immix mostly leaves objects in place but can move objects if it thinks it would be profitable. But when would it decide that this is a good idea? Are there cases in which it is necessary?

I promised to answer those questions in a followup article, but I didn't say which followup :) Before I get there, I want to talk about paged spaces.

enter the multispace

We mentioned that Immix divides the heap into blocks (32kB or so), and that no object can span multiple blocks. "Large" objects -- defined by Immix to be more than 8kB -- go to a separate "large object space", or "lospace" for short.

Though the implementation of a large object space is relatively simple, I found that it has some points that are quite subtle. Probably the most important of these points relates to heap size. Consider that if you just had one space, implemented using mark-compact maybe, then the procedure to allocate a 16 kB object would go:

  1. Try to bump the allocation pointer by 16kB. Is it still within range? If so we are done.

  2. Otherwise, collect garbage and try again. If after GC there isn't enough space, the allocation fails.

In step (2), collecting garbage could decide to grow or shrink the heap. However when evaluating collector algorithms, you generally want to avoid dynamically-sized heaps.

cheatery

Here is where I need to make an embarrassing admission. In my role as co-maintainer of the Guile programming language implementation, I have long noodled around with benchmarks, comparing Guile to Chez, Chicken, and other implementations. It's good fun. However, I only realized recently that I had a magic knob that I could turn to win more benchmarks: simply make the heap bigger. Make it start bigger, make it grow faster, whatever it takes. For a program that does its work in some fixed amount of total allocation, a bigger heap will require fewer collections, and therefore generally take less time. (Some amount of collection may be good for performance as it improves locality, but this is a marginal factor.)

Of course I didn't really go wild with this knob but it now makes me doubt all benchmarks I have ever seen: are we really using benchmarks to select for fast implementations, or are we in fact selecting for implementations with cheeky heap size heuristics? Consider even any of the common allocation-heavy JavaScript benchmarks, DeltaBlue or Earley or the like; to win these benchmarks, web browsers are incentivised to have large heaps. In the real world, though, a more parsimonious policy might be more appreciated by users.

Java people have known this for quite some time, and are therefore used to fixing the heap size while running benchmarks. For example, people will measure the minimum amount of memory that can allow a benchmark to run, and then configure the heap to be a constant multiplier of this minimum size. The MMTK garbage collector toolkit can't even grow the heap at all currently: it's an important feature for production garbage collectors, but as they are just now migrating out of the research phase, heap growth (and shrinking) hasn't yet been a priority.

lospace

So now consider a garbage collector that has two spaces: an Immix space for allocations of 8kB and below, and a large object space for, well, larger objects. How do you divide the available memory between the two spaces? Could the balance between immix and lospace change at run-time? If you never had large objects, would you be wasting space at all? Conversely is there a strategy that can also work for only large objects?

Perhaps the answer is obvious to you, but it wasn't to me. After much reading of the MMTK source code and pondering, here is what I understand the state of the art to be.

  1. Arrange for your main space -- Immix, mark-sweep, whatever -- to be block-structured, and able to dynamically decommission or recommission blocks, perhaps via MADV_DONTNEED. This works if the blocks are even multiples of the underlying OS page size.

  2. Keep a counter of however many bytes the lospace currently has.

  3. When you go to allocate a large object, increment the lospace byte counter, and then round it up to a number of blocks to decommission from the main paged space. If this is more than are currently decommissioned, find some empty blocks and decommission them. (There is a sketch of this allocation path after the list.)

  4. If no empty blocks were found, collect, and try again. If the second try doesn't work, then the allocation fails.

  5. Now that the paged space has shrunk, lospace can allocate. You can use the system malloc, but probably better to use mmap, so that if these objects are collected, you can just MADV_DONTNEED them and keep them around for later re-use.

  6. After GC runs, explicitly return the memory for any object in lospace that wasn't visited when the object graph was traversed. Decrement the lospace byte counter and possibly return some empty blocks to the paged space.
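
To make the counter-and-decommission dance concrete, here is a hedged sketch of steps 2 through 5 in C. The helpers (find_empty_block, collect_garbage, lospace_map) are assumed to exist elsewhere; madvise with MADV_DONTNEED is one way to decommission, as mentioned in step 1.

    #include <stddef.h>
    #include <sys/mman.h>

    #define BLOCK_SIZE (32 * 1024)

    static size_t lospace_bytes;           /* step 2: bytes held by the lospace */
    static size_t blocks_decommissioned;

    /* Assumed helpers. */
    void *find_empty_block(void);          /* NULL if no known-empty block */
    void collect_garbage(void);            /* stop-the-world collection */
    void *lospace_map(size_t bytes);       /* mmap-backed large allocation */

    static size_t blocks_needed(size_t bytes) {
      return (bytes + BLOCK_SIZE - 1) / BLOCK_SIZE;   /* step 3: round up */
    }

    void *allocate_large(size_t bytes) {
      lospace_bytes += bytes;
      size_t need = blocks_needed(lospace_bytes);
      for (int attempt = 0; attempt < 2; attempt++) {
        while (blocks_decommissioned < need) {
          void *block = find_empty_block();
          if (!block)
            break;
          /* Return the block's pages to the OS; the address range stays
             reserved so the block can be recommissioned later. */
          madvise(block, BLOCK_SIZE, MADV_DONTNEED);
          blocks_decommissioned++;
        }
        if (blocks_decommissioned >= need)
          return lospace_map(bytes);       /* step 5 */
        if (attempt == 0)
          collect_garbage();               /* step 4: collect and retry once */
      }
      lospace_bytes -= bytes;              /* the allocation fails */
      return NULL;
    }

Step 6 is the mirror image: after a collection, subtract the bytes of any dead lospace objects and recommission blocks as the counter allows.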

There are some interesting aspects about this strategy. One is, the memory that you return to the OS doesn't need to be contiguous. When allocating a 50 MB object, you don't have to find 50 MB of contiguous free space, because any set of blocks that adds up to 50 MB will do.

Another aspect is that this adaptive strategy can work for any ratio of large to non-large objects. The user doesn't have to manually set the sizes of the various spaces.

This strategy does assume that address space is larger than heap size, but only by a factor of 2 (modulo fragmentation for the large object space). Therefore our risk of running afoul of user resource limits and kernel overcommit heuristics is low.

The one underspecified part of this algorithm is... did you see it? "Find some empty blocks". If the main paged space does lazy sweeping -- only scanning a block for holes right before the block will be used for allocation -- then after a collection we don't actually know very much about the heap, and notably, we don't know what blocks are empty. (We could know it, of course, but it would take time; you could traverse the line mark arrays for all blocks while the world is stopped, but this increases pause time. The original Immix collector does this, however.)

In the system I've been working on, instead I have it so that if a mutator finds an empty block, it puts it on a separate list, and then takes another block, only allocating into empty blocks once all blocks are swept. If the lospace needs blocks, it sweeps eagerly until it finds enough empty blocks, throwing away any nonempty blocks. This causes the next collection to happen sooner, but that's not a terrible thing; this only occurs when rebalancing lospace versus paged-space size, because if you have a constant allocation rate on the lospace side, you will also have a complementary rate of production of empty blocks by GC, as they are recommissioned when lospace objects are reclaimed.
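
A sketch of that empty-block discipline, with invented names; the point is just that empties are set aside rather than allocated into while unswept blocks remain.

    #include <stdbool.h>
    #include <stddef.h>

    struct block;                             /* opaque block descriptor */

    /* Assumed helpers. */
    struct block *pop_unswept_block(void);    /* next block to lazily sweep */
    bool sweep_finds_block_empty(struct block *b);
    void push_block(struct block **list, struct block *b);
    struct block *pop_block(struct block **list);

    static struct block *empty_blocks;        /* set aside; lospace may claim these */

    /* Mutator block acquisition: skip over empties while unswept blocks remain,
       so that the lospace can decommission them; fall back to the empties only
       once everything is swept. */
    struct block *acquire_block_for_allocation(void) {
      struct block *b;
      while ((b = pop_unswept_block())) {
        if (sweep_finds_block_empty(b))
          push_block(&empty_blocks, b);       /* reserve it instead of using it */
        else
          return b;                           /* recyclable: allocate into its holes */
      }
      return pop_block(&empty_blocks);        /* all swept: now use the empties */
    }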

What if your main paged space has ample space for allocating a large object, but there are no empty blocks, because live objects are equally peppered around all blocks? In that case, often the application would be best served by growing the heap, but maybe not. In any case in a strict-heap-size environment, we need a solution.

But for that... let's pick up another day. Until then, happy hacking!

defragmentation

15 June 2022 12:47 PM (garbage collection | immix | guile | gnu | igalia)

Good morning, hackers! Been a while. It used to be that I had long blocks of uninterrupted time to think and work on projects. Now I have two kids; the longest such time-blocks are on trains (too infrequent, but it happens) and in a less effective but more frequent fashion, after the kids are sleeping. As I start writing this, I'm in an airport waiting for a delayed flight -- my first since the pandemic -- so we can consider this to be the former case.

It is perhaps out of mechanical sympathy that I have been using my reclaimed time to noodle on a garbage collector. Managing space and managing time have similar concerns: how to do much with little, efficiently packing different-sized allocations into a finite resource.

I have been itching to write a GC for years, but the proximate event that pushed me over the edge was reading about the Immix collection algorithm a few months ago.

on fundamentals

Immix is a "mark-region" collection algorithm. I say "algorithm" rather than "collector" because it's more like a strategy or something that you have to put into practice by making a concrete collector, the other fundamental algorithms being copying/evacuation, mark-sweep, and mark-compact.

To build a collector, you might combine a number of spaces that use different strategies. A common choice would be to have a semi-space copying young generation, a mark-sweep old space, and maybe a treadmill large object space (a kind of copying collector, logically; more on that later). Then you have heuristics that determine what object goes where, when.

On the engineering side, there are quite a number of choices to make too: probably you make some parts of your collector parallel, maybe the collector and the mutator (the user program) can run concurrently, and so on. Things get complicated, but the fundamental algorithms are relatively simple, and present interesting fundamental tradeoffs.


[figure 1 from the immix paper]

For example, mark-compact is most parsimonious regarding space usage -- for a given program, a garbage collector using a mark-compact algorithm will require less memory than one that uses mark-sweep. However, mark-compact algorithms all require at least two passes over the heap: one to identify live objects (mark), and at least one to relocate them (compact). This makes them less efficient in terms of overall program throughput and can also increase latency (GC pause times).

Copying or evacuating spaces can be more CPU-efficient than mark-compact spaces, as reclaiming memory avoids traversing the heap twice; a copying space copies objects as it traverses the live object graph instead of after the traversal (mark phase) is complete. However, a copying space's minimum heap size is quite high, and it only reaches competitive efficiencies at large heap sizes. For example, if your program needs 100 MB of space for its live data, a semi-space copying collector will need at least 200 MB of space in the heap (a 2x multiplier, we say), and will only run efficiently at something more like 4-5x. It's a reasonable tradeoff to make for small spaces such as nurseries, but as a mature space, it's so memory-hungry that users will be unhappy if you make it responsible for a large portion of your memory.

Finally, mark-sweep is quite efficient in terms of program throughput, because like copying it traverses the heap in just one pass, and because it leaves objects in place instead of moving them. But! Unlike the other two fundamental algorithms, mark-sweep leaves the heap in a fragmented state: instead of having all live objects packed into a contiguous block, memory is interspersed with live objects and free space. So the collector can run quickly but the allocator stops and stutters as it accesses disparate regions of memory.

allocators

Collectors are paired with allocators. For mark-compact and copying/evacuation, the allocator consists of a pointer to free space and a limit. Objects are allocated by bumping the allocation pointer, a fast operation that also preserves locality between contemporaneous allocations, improving overall program throughput. But for mark-sweep, we run into a problem: say you go to allocate a 1-kilobyte byte array; do you actually have space for that?
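
Before getting to mark-sweep's answer, here is roughly what the bump-pointer fast path looks like, as a sketch; the alignment and the slow-path helper are placeholders.

    #include <stddef.h>
    #include <stdint.h>

    struct bump_allocator {
      uintptr_t ptr;      /* next free byte */
      uintptr_t limit;    /* end of the contiguous free region */
    };

    void *bump_refill(struct bump_allocator *a, size_t bytes);  /* assumed: get more space or GC */

    /* The fast path is a compare and an add, and consecutive allocations end
       up next to each other in memory. */
    static inline void *bump_allocate(struct bump_allocator *a, size_t bytes) {
      size_t aligned = (bytes + 7) & ~(size_t)7;   /* 8-byte alignment */
      if (a->limit - a->ptr < aligned)
        return bump_refill(a, aligned);
      void *obj = (void *)a->ptr;
      a->ptr += aligned;
      return obj;
    }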

Generally speaking, mark-sweep allocators solve this problem via freelist allocation: the allocator has an array of lists of free objects, one for each "size class" (say 2 words, 3 words, and so on up to 16 words, then more sparsely up to the largest allocatable size maybe), and services allocations from the appropriate size class's freelist. This prevents the 1 kB free space that we need from being "used up" by a 16-byte allocation that could just as well have gone elsewhere. However, freelists prevent objects allocated around the same time from being deterministically placed in nearby memory locations. This increases variance and decreases overall throughput, both for the allocation operations themselves and for pointer-chasing in the course of the program's execution.
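
For contrast, a sketch of the freelist approach; the size-class spacing here is simplistic (2 to 16 words, one class per word), and sizes above that would go through a different path, not shown.

    #include <stddef.h>

    #define WORD sizeof(void *)
    #define MIN_WORDS 2
    #define MAX_WORDS 16
    #define NUM_SIZE_CLASSES (MAX_WORDS - MIN_WORDS + 1)

    /* A free object's own storage holds the freelist link. */
    struct freelist_node { struct freelist_node *next; };

    static struct freelist_node *freelists[NUM_SIZE_CLASSES];

    void *freelist_refill(size_t size_class);   /* assumed: sweep more, or GC */

    static size_t size_class_for(size_t bytes) {
      size_t words = (bytes + WORD - 1) / WORD;
      if (words < MIN_WORDS)
        words = MIN_WORDS;
      return words - MIN_WORDS;                 /* caller ensures words <= MAX_WORDS */
    }

    void *freelist_allocate(size_t bytes) {
      size_t klass = size_class_for(bytes);
      struct freelist_node *node = freelists[klass];
      if (!node)
        return freelist_refill(klass);
      freelists[klass] = node->next;            /* pop the first free object */
      return node;
    }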

Also, in a mark-sweep collector, we can still reach a situation where there is enough space on the heap for an allocation, but that free space is broken up into too many pieces: the heap is fragmented. For this reason, many systems that perform mark-sweep collection can choose to compact, if heuristics show it might be profitable. Because the usual strategy is mark-sweep, though, they still use freelist allocation.

on immix and mark-region

Mark-region collectors are like mark-sweep collectors, except that they do bump-pointer allocation into the holes between survivor objects.

Sounds simple, right? To my mind, though, the fundamental challenge in implementing a mark-region collector is how to handle fragmentation. Let's take a look at how Immix solves this problem.


[part of figure 2 from the immix paper]

Firstly, Immix partitions the heap into blocks, which might be 32 kB in size or so. No object can span multiple blocks. Block size should be chosen to be a nice power-of-two multiple of the system page size, not so small that common object allocations wouldn't fit. "Large" objects -- greater than 8 kB, for Immix -- go to a separate space that is managed in a different way.

Within a block, Immix divides space into lines -- maybe 128 bytes long. Objects can span lines. Any line that does not contain (a part of) an object that survived the previous collection is part of a hole. A hole is a contiguous span of free lines in a block.

On the allocation side, Immix does bump-pointer allocation into holes. If a mutator doesn't have a hole currently, it scans the current block (obtaining one if needed) for the next hole, via a side-table of per-line mark bits: one bit per line. Lines without the mark are in holes. Scanning for holes is fairly cheap, because the line size is not too small. Note that there are also per-object mark bits; just because you've marked a line doesn't mean that you've traced all objects on that line.
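
A sketch of hole-finding over the line mark table (using a byte per line here for simplicity; the principle is the same with bits):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SIZE 128
    #define BLOCK_SIZE (32 * 1024)
    #define LINES_PER_BLOCK (BLOCK_SIZE / LINE_SIZE)

    struct hole { size_t start_line, nlines; };

    /* Find the next run of unmarked lines at or after line `from`.  The
       allocator then bump-allocates into the corresponding byte range of the
       block: [start_line * LINE_SIZE, (start_line + nlines) * LINE_SIZE). */
    bool next_hole(const uint8_t line_marks[LINES_PER_BLOCK],
                   size_t from, struct hole *out) {
      size_t i = from;
      while (i < LINES_PER_BLOCK && line_marks[i]) i++;    /* skip marked lines */
      if (i == LINES_PER_BLOCK)
        return false;                                      /* no hole left in this block */
      out->start_line = i;
      while (i < LINES_PER_BLOCK && !line_marks[i]) i++;   /* extend the hole */
      out->nlines = i - out->start_line;
      return true;
    }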

Allocating into a hole has good expected performance as well, as it's bump-pointer, and the minimum size isn't tiny. In the worst case of a hole consisting of a single line, you have 128 bytes to work with. This size is large enough for the majority of objects, given that most objects are small.

mitigating fragmentation

Immix still has some challenges regarding fragmentation. There is some loss in that a single (piece of an) object can keep a line marked, wasting any free space on that line. Also, when an object can't fit into a hole, any space left in that hole is lost, at least until the next collection. This loss could also occur for the next hole, and the next and the next and so on until Immix finds a hole that's big enough. In a mark-sweep collector with lazy sweeping, these free extents could instead be placed on freelists and used when needed, but in Immix there is no such facility (by design).

One mitigation for fragmentation risks is "overflow allocation": when allocating an object larger than a line (a medium object), and Immix can't find a hole before the end of the block, Immix allocates into a completely free block. So actually mutator threads allocate into two blocks at a time: one for small objects and medium objects if possible, and the other for medium objects when necessary.
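
In code, the two-block scheme might look like this sketch; the refill helpers stand in for the hole-scanning and block-acquisition machinery.

    #include <stddef.h>

    #define LINE_SIZE 128

    struct bump { char *ptr, *limit; };

    /* Per-mutator allocation state: the current hole for small objects, plus
       a bump region within a completely free overflow block for medium
       objects that don't fit in the hole. */
    struct mutator {
      struct bump hole;
      struct bump overflow;
    };

    /* Assumed slow paths. */
    void *refill_hole(struct mutator *m, size_t bytes);
    void *refill_overflow(struct mutator *m, size_t bytes);

    static void *bump_alloc(struct bump *b, size_t bytes) {
      if ((size_t)(b->limit - b->ptr) < bytes)
        return NULL;
      void *obj = b->ptr;
      b->ptr += bytes;
      return obj;
    }

    void *immix_alloc(struct mutator *m, size_t bytes) {
      void *obj = bump_alloc(&m->hole, bytes);
      if (obj)
        return obj;
      if (bytes > LINE_SIZE) {                 /* medium object: use the overflow block */
        obj = bump_alloc(&m->overflow, bytes);
        return obj ? obj : refill_overflow(m, bytes);
      }
      return refill_hole(m, bytes);            /* small object: find the next hole */
    }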

Another mitigation is that large objects are allocated into their own space, so an Immix space will never be asked to hold objects larger than, say, 8 kB.

The other mitigation is that Immix can choose to evacuate instead of mark. How does this work? Is it worth it?

stw

This question about the practical tradeoffs involving evacuation is the one I wanted to pose when I started this article; I have gotten to the point of implementing this part of Immix and I have some doubts. But, this article is long enough, and my plane is about to land, so let's revisit this on my return flight. Until then, see you later, allocators!

webassembly: the new kubernetes?

13 December 2021 3:50 PM (webassembly | kubernetes | openstack | containerization | faas | lambda | igalia)

I had an "oh, duh, of course" moment a few weeks ago that I wanted to share: is WebAssembly the next Kubernetes?

katers gonna k8s

Kubernetes promises a software virtualization substrate that allows you to solve a number of problems at the same time:

  • Compared to running services on bare metal, Kubernetes ("k8s") lets you use hardware more efficiently. K8s lets you run many containers on one hardware server, and lets you just add more servers to your cluster as you need them.

  • The "cloud of containers" architecture efficiently divides up the work of building server-side applications. Your database team can ship database containers, your backend team ships java containers, and your product managers wire them all together using networking as the generic middle-layer. It cuts with the grain of Conway's law: the software looks like the org chart.

  • The container abstraction is generic enough to support lots of different kinds of services. Go, Java, C++, whatever -- it's not language-specific. Your dev teams can use what they like.

  • The operations team responsible for the k8s servers that run containers doesn't have to trust the containers themselves. There is some sandboxing and security built in.

K8s itself is an evolution of a previous architecture, OpenStack. OpenStack had each container be a full virtual machine, with a whole kernel and operating system and everything. K8s instead generally uses containers, which don't require a kernel of their own. The result is that they are lighter-weight -- think Docker versus VirtualBox.

In a Kubernetes deployment, you still have the kernel at a central place in your software architecture. The fundamental mechanism of containerization is the Linux kernel process, with private namespaces. These containers are then glued together by TCP and UDP sockets. However, though having one or more kernel processes per container scales better than having full virtual machines, it doesn't generally scale to millions of containers. And processes do have some start-up time -- you can't spin up a container for each request to a high-performance web service. These technical constraints lead to certain kinds of system architectures, with generally long-lived components that keep some kind of state.

k8s <=? w9y

Server-side WebAssembly is in a similar space as Kubernetes -- or rather, WebAssembly is similar to processes plus private namespaces. WebAssembly gives you a good abstraction barrier and (can give) high security isolation. It's even better in some ways because WebAssembly provides "allowlist" security -- it has no capabilities to start with, requiring that the "host" that runs the WebAssembly explicitly delegate some of its own capabilities to the guest WebAssembly module. Compare to processes which by default start with every capability and then have to be restricted.

Like Kubernetes, WebAssembly also gets you Conway's-law-affine systems. Instead of shipping containers, you ship WebAssembly modules -- and some metadata about what kinds of things they need from their environment (the 'imports'). And WebAssembly is generic -- it's a low level virtual machine that anything can compile to.

But, in WebAssembly you get a few more things. One is fast start. Because memory is data, you can arrange to create a WebAssembly module that starts with its state pre-initialized in memory. Such a module can start in microseconds -- fast enough to create one on every request, in some cases, just throwing away the state afterwards. You can run function-as-a-service architectures more effectively on WebAssembly than on containers. Another is that the virtualization is provided entirely in user-space. One process can multiplex between many different WebAssembly modules. This lets one server do more. And, you don't need to use networking to connect WebAssembly components; they can transfer data in memory, sometimes even without copying.

(A digression: this lightweight in-process aspect of WebAssembly makes it so that other architectures are also possible, e.g. this fun hack to sandbox a library linked into Firefox. They actually shipped that!)

I compare WebAssembly to K8s, but really it's more like processes and private namespaces. So one answer to the question as initially posed is that no, WebAssembly is not the next Kubernetes; that next thing is waiting to be built, though I know of a few organizations that have started already.

One thing does seem clear to me though: WebAssembly will be at the bottom of the new thing, and therefore the near-term trajectory of WebAssembly is likely to follow that of Kubernetes, which means...

  • Champagne time for analysts!

  • The Gartner ✨✨Magic Quadrant✨✨™®© rides again

  • IBM spins out a new WebAssembly division

  • Accenture starts asking companies about their WebAssembly migration plan

  • The Linux Foundation tastes blood in the waters

And so on. I see turbulent waters in the near-term future. So in that sense, in which Kubernetes is not essentially a technical piece of software but rather a nexus of frothy commercial jousting, then yes, certainly: we have a fun 5 years or so ahead of us.

cross-module inlining in guile

13 May 2021 11:25 AM (guile | scheme | inlining | partial evaluation | peval | compilers)

Greetings, hackers of spaceship Earth! Today's missive is about cross-module inlining in Guile.

a bit of history

Back in the day... what am I saying? I always start these posts with loads of context. Probably you know it all already. 10 years ago, Guile's partial evaluation pass extended the macro-writer's bill of rights to Schemers of the Guile persuasion. This pass makes local function definitions free in many cases: if they should be inlined and constant-folded, you are confident that they will be. peval lets you write clear programs with well-factored code and still have good optimization.

The peval pass did have a limitation, though, which wasn't its fault. In Guile, modules have historically been a first-order concept: modules are a kind of object with a hash table inside, which you build by mutating. I speak crassly but that's how it is. In such a world, it's hard to reason about top-level bindings: what module do they belong to? Could they ever be overridden? When you have a free reference to a, and there's a top-level definition of a in the current compilation unit, is that the a that's being referenced, or could it be something else? Could the binding be mutated in the future?

During the Guile 2.0 and 2.2 cycles, we punted on this question. But for 3.0, we added the notion of declarative modules. For these modules, bindings which are defined once in a module and which are not mutated in the compilation unit are declarative bindings, which can be reasoned about lexically. We actually translate them to a form of letrec*, which then enables inlining via peval, contification, and closure optimization -- in descending order of preference.

The upshot is that with Guile 3.0, top-level bindings are no longer optimization barriers, in the case of declarative modules, which are compatible enough with historic semantics and usage that they are on by default.

However, module boundaries have still been an optimization barrier. Take (srfi srfi-1), a little utility library on lists. One definition in the library is xcons, which is cons with arguments reversed. It's literally (lambda (cdr car) (cons car cdr)). But does the compiler know that? Would it know that (car (xcons x y)) is the same as y? Until now, no, because no part of the optimizer will look into bindings from outside the compilation unit.

mr compiler, tear down this wall

But no longer! Guile can now inline across module boundaries -- in some circumstances. This feature will be part of a future Guile 3.0.8.

There are actually two parts of this. One is that the compiler can identify a set of "inlinable" values from a declarative module. An inlinable value is a small copyable expression. A copyable expression has no identity (it isn't a fresh allocation), and doesn't reference any module-private binding. Notably, lambda expressions can be copyable, depending on what they reference. The compiler then extends the module definition that's residualized in the compiled file to include a little procedure that, when passed a name, will return the Tree-IL representation of that binding. The design of that was a little tricky; we want to avoid overhead when using the module outside of the compiler, even relocations. See compute-encoding in that module for details.

With all of that, we can call ((module-inlinable-exports (resolve-interface '(srfi srfi-1))) 'xcons) and get back the Tree-IL equivalent of (lambda (cdr car) (cons car cdr)). Neat!

The other half of the facility is the actual inlining. Here we lean on peval again, causing <module-ref> forms to trigger an attempt to copy the term from the imported module to the residual expression, limited by the same effort counter as the rest of peval.

The end result is that we can be absolutely sure that constants in imported declarative modules will inline into their uses, and fairly sure that "small" procedures will inline too.

caveat: compiled imported modules

There are a few caveats about this facility, and they are sufficiently sharp that I should probably fix them some day. The first one is that for an imported module to expose inlinable definitions, the imported module needs to have been compiled already, not loaded from source. When you load a module from source using the interpreter instead of compiling it first, the pipeline is optimized for minimizing the latency between when you ask for the module and when it is available. There's no time to analyze the module to determine which exports are inlinable and so the module exposes no inlinable exports.

This caveat is mitigated by automatic compilation, enabled by default, which will compile imported modules as needed.

It could also be fixed by topologically ordering the module compilation sequence; this would allow some parallelism in the build, though less than before, and for module graphs with cycles (these exist!) you'd still have some weirdness.

caveat: abi fragility

Before Guile supported cross-module inlining, the only inlining across modules was explicit, facilitated by macros. If you write a module that has a define-inlinable export and you think about its ABI, then you know to consider any definition referenced by the inlinable export, and you know by experience that its code may be copied into other compilation units. Guile doesn't automatically recompile a dependent module when a macro that it uses changes, currently anyway. Admittedly this situation leans more on culture than on tools, which could be improved.

However, with automatically inlinable exports, this changes. Any definition in a module could be inlined into its uses in other modules. This may alter the ABI of a module in unexpected ways: you think that module C depends on module B, but after inlining it may depend on module A as well. Updating module B might not update the inlined copies of values from B into C -- as in the case of define-inlinable, but less lexically apparent.

At higher optimization levels, even private definitions in a module can be inlinable; these may be referenced if an exported macro from the module expands to a term that references a module-private variable, or if an inlinable exported binding references the private binding. But these optimization levels are off by default precisely because I fear the bugs.

Probably what this cries out for is some more sensible dependency tracking in build systems, but that is a topic for another day.

caveat: reproducibility

When you make a fresh checkout of Guile from git and build it, the build proceeds in the following way.

Firstly, we build libguile, the run-time library implemented in C.

Then we compile a "core" subset of Scheme files at optimization level -O1. This subset should include the evaluator, reader, macro expander, basic run-time, and compilers. (There is a bootstrap evaluator, reader, and macro expander in C, to start this process.) Say we have source files S0, S1, S2 and so on; generally speaking, these files correspond to Guile modules M0, M1, M2 etc. This first build produces compiled files C0, C1, C2, and so on. When compiling a file S2 implementing module M2, which happens to import M1 and M0, it may be M1 and M0 are provided by compiled files C1 and C0, or possibly they are loaded from the source files S1 and S0, or C1 and S0, or S1 and C0.

The bootstrap build uses make for parallelism, with each compile process starting afresh, importing all the modules that comprise the compiler and then using them to compile the target file. As the build proceeds, more and more module imports can be "serviced" by compiled files instead of source files, making the build go faster and faster. However, this introduces system-specific nondeterminism as to the set of compiled files available when compiling any other file. This strategy works because it doesn't really matter whether module M1 is provided by compiled file C1 or source file S1; the compiler and the interpreter implement the same language.

Once the compiler is compiled at optimization level -O1, Guile then uses that freshly built compiler to build everything at -O2. We do it in this way because building some files at -O1 then all files at -O2 takes less time than going straight to -O2. If this sounds weird, that's because it is.
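In terms of the (system base compile) interface, the two phases look roughly like this; "S2.scm" stands in for any of the source files above, and I'm assuming the #:optimization-level keyword argument that recent versions of compile-file accept (check the (system base compile) documentation):

(use-modules (system base compile))

;; Bootstrap phase: compile quickly at -O1, with the bootstrap
;; evaluator, reader, and expander doing the heavy lifting.
(compile-file "S2.scm" #:optimization-level 1)

;; Second phase: once the compiler itself is compiled, recompile
;; everything at -O2 using that freshly built compiler.
(compile-file "S2.scm" #:optimization-level 2)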

The resulting build is reproducible... mostly. There is a bug in which some unique identifiers generated as part of the implementation of macros can be non-reproducible in some cases; disabling parallel builds seems to solve the problem. The issue is that gensym (or equivalent) might be called a different number of times depending on whether you are loading a compiled module or whether you need to read and macro-expand it. The resulting compiled files are equivalent under alpha-renaming but not bit-identical. This is a bug to fix.
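To illustrate the flavor of the problem (the names below are purely illustrative):

;; The same expansion site can observe a different gensym counter in
;; different builds, depending on whether intervening modules were
;; loaded compiled or had to be read and expanded from source.
(gensym "tmp")   ; => tmp123 in one build, tmp124 in another
;; The two results are alpha-equivalent, but the compiled files that
;; embed them are not bit-identical.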

Anyway, at optimization level -O1, Guile will record inlinable definitions. At -O2, Guile will actually try to do cross-module inlining. We run into two issues when compiling Guile; one is if we are in the -O2 phase, and we compile a module M which uses module N, and N is not in the set of "core" modules. In that case depending on parallelism and compile order, N may be loaded from source, in which case it has no inlinable exports, or from a compiled file, in which case it does. This is not a great situation for the reliability of this optimization. I think probably in Guile we will switch so that all modules are compiled at -O1 before compiling at -O2.

The second issue is more subtle: inlinable bindings are recorded after optimization of the Tree-IL. This is better than recording inlinable bindings before optimization, as a term that is not inlinable due to its size in its initial form may become small enough to inline after optimization. However, at -O2, optimization includes cross-module inlining! A term that is inlinable at -O1 may become non-inlinable at -O2 because it gets slightly larger, or vice versa: terms that are too large at -O1 could shrink at -O2. We don't even have a guarantee of reaching a fixed point if we repeatedly recompile all inputs at -O2, because we allow non-shrinking optimizations.
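A contrived example of the size sensitivity (the procedure is invented for illustration):

;; Before optimization this body has several operands, which counts
;; against the inlining size budget; constant folding shrinks it to
;; (* x 16), which may then be small enough to record as inlinable.
(define (scale x)
  (* x (+ 1 1 2 4 8)))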

I think this probably calls for a topological ordering of module compilation inside Guile, and perhaps in other projects as well. That would at least give us reproducibility, provided we avoid the feedback loop of keeping around -O2 files compiled from a previous round, even if they are "up to date" (their corresponding source file didn't change).

and for what?

People who have worked on inliners will know what I mean when I say that a good inliner is like a combine harvester: ruthlessly efficient, a qualitative improvement compared to not having one, but there is a pointy end with lots of whirling blades and it's important to stop at the end of the row. You do develop a sense of what will and won't inline, and I think Dybvig's "Macro writer's bill of rights" encompasses this sense. Luckily people don't lose fingers or limbs to inliners, but inliners can maim expectations, and cross-module inlining more so.

Still, what it buys us is the freedom to be abstract. I can define a module like:

(define-module (elf)
  #:export (ET_NONE ET_REL ET_EXEC ET_DYN ET_CORE))

(define ET_NONE		0)		; No file type
(define ET_REL		1)		; Relocatable file
(define ET_EXEC		2)		; Executable file
(define ET_DYN		3)		; Shared object file
(define ET_CORE		4)		; Core file

And if a module uses my (elf) module and references ET_DYN, I know that the module boundary doesn't prevent the value from being inlined as a constant (and possibly unboxed, etc).
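For instance, a hypothetical importer (the module name and procedure are invented for illustration):

;; With cross-module inlining, the reference to ET_DYN can be folded to
;; the literal 3 at compile time, as if it had been written directly.
(define-module (elf inspect)
  #:use-module (elf)
  #:export (shared-object?))

(define (shared-object? type)
  (eqv? type ET_DYN))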

I took a look, and on our usual microbenchmark suite, cross-module inlining doesn't make a difference. But that's both a historical oddity and a bug: firstly, the benchmark suite comes from an old Scheme world that didn't have modules, and so won't benefit from cross-module inlining; secondly, Scheme definitions from the "default" environment that aren't explicitly recognized as primitives aren't inlined, as the (guile) module isn't declarative. (Probably we should fix the latter at some point.)

But still, I'm really excited about this change! Guile developers use modules heavily and have been stepping around this optimization boundary for years. I count 100 direct uses of define-inlinable in Guile, a number of them inside macros, and many of these are to explicitly hack around the optimization barrier. I really look forward to seeing if we can remove some of these over time, to go back to plain old define and just trust the compiler to do what's needed.

by the numbers

I ran a quick analysis of the modules included in Guile to see what the impact was. Of the 321 files that define modules, 318 of them are declarative, and 88 contain inlinable exports (27% of the total). Of the 6519 total bindings exported by declarative modules, 566 are inlinable (8.7%). Of the inlinable exports, 388 (69%) are functions (lambda expressions), 156 (28%) are constants, and 22 (4%) are "primitives" referenced by value and not by name, meaning definitions like (define head car) (instead of re-exporting car as head).

On the use side, 90 declarative modules import inlinable bindings (29%), resulting in about 1178 total attempts to copy inlinable bindings. 902 of those attempts copy lambda expressions in operator position, which means that peval will attempt to inline their code; 46 of these attempts fail, perhaps due to size or effort constraints. 191 other attempts end up inlining constant values. 20 inlining attempts fail, perhaps because a lambda is used for a value. Somehow, 19 copied inlinable values get elided because they are evaluated only for their side effects, probably to clean up let-bound values that become unused due to copy propagation.

All in all, an interesting endeavor, and one to improve on going forward. Thanks for reading, and catch you next time!

guile's reader, in guile

11 April 2021 7:51 PM (guile | scheme | bootstrapping | read | parsing)

Good evening! A brief(ish?) note today about some Guile nargery.

the arc of history

Like many language implementations that started life when you could turn on the radio and expect to hear Def Leppard, Guile has a bottom half and a top half. The bottom half is written in C and exposes a shared library and an executable, and the top half is written in the language itself (Scheme, in the case of Guile) and somehow loaded by the C code when the language implementation starts.

Since 2010 or so we have been working at replacing bits written in C with bits written in Scheme. Last week's missive was about replacing the implementation of dynamic-link from using the libltdl library to using Scheme on top of a low-level dlopen wrapper. I've written about rewriting eval in Scheme, and more recently about how the road to getting the performance of C implementations in Scheme has been sometimes long.

These rewrites have a quixotic aspect to them. I feel something in my gut about rightness and wrongness and I know at a base level that moving from C to Scheme is the right thing. Much of it is completely irrational and can be out of place in a lot of contexts -- like if you have a task to get done for a customer, you need to sit and think about minimal steps from here to the goal and the gut doesn't have much of a role to play in how you get there. But it's nice to have a project where you can do a thing in the way you'd like, and if it takes 10 years, that's fine.

But besides the ineffable motivations, there are concrete advantages to rewriting something in Scheme. I find Scheme code to be more maintainable, yes, and more secure relative to the common pitfalls of C, obviously. It decreases the amount of work I will have when one day I rewrite Guile's garbage collector. But also, Scheme code gets things that C can't have: tail calls, resumable delimited continuations, run-time instrumentation, and so on.

Taking delimited continuations as an example, five years ago or so I wrote a lightweight concurrency facility for Guile, modelled on Parallel Concurrent ML. It lets millions of fibers exist on a system. When a fiber would otherwise block on an I/O operation (a read or a write), instead it suspends its continuation and arranges to restart it when the operation becomes possible.
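A minimal sketch of what that looks like with the Fibers library; the API here is from my reading of the Fibers manual, so treat the details as an assumption:

(use-modules (fibers)
             (fibers channels))

(run-fibers
 (lambda ()
   (let ((ch (make-channel)))
     ;; spawn-fiber creates a lightweight fiber; blocking operations
     ;; inside it suspend just that fiber, not an OS thread.
     (spawn-fiber
      (lambda ()
        (put-message ch "hello from a fiber")))
     ;; get-message suspends the main fiber until a message arrives.
     (display (get-message ch))
     (newline))))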

A lot had to change in Guile for this to become a reality. Firstly, delimited continuations themselves. Later, a complete rewrite of the top half of the ports facility in Scheme, to allow port operations to suspend and resume. Many of the barriers to resumable fibers were removed, but the Fibers manual still names quite a few.

Scheme read, in Scheme

Which brings us to today's note: I just rewrote Guile's reader in Scheme too! The reader is the bit that takes a stream of characters and parses it into S-expressions. It was in C, and now is in Scheme.

One of the primary motivators for this was to allow read to be suspendable. With this change, read-eval-print loops are now implementable on fibers.

Another motivation was to finally fix a bug in which Guile couldn't record source locations for some kinds of datums. It used to be that Guile would use a weak-key hash table to associate datums returned from read with source locations. But this only works for fresh values, not for immediate values like small integers or characters, nor does it work for globally unique non-immediates like keywords and symbols. So for these, we just wouldn't have any source locations.

A robust solution to that problem is to return annotated objects rather than using a side table. Since Scheme's macro expander is already set to work with annotated objects (syntax objects), a new read-syntax interface would do us a treat.
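A small sketch of the difference, assuming the read-syntax interface described above:

(define port (open-input-string "(+ 1 2)"))

;; read-syntax returns an annotated syntax object; the source location
;; travels with the object instead of living in a weak side table.
(define stx (read-syntax port))

;; syntax->datum strips the annotations back off when you just want
;; the plain S-expression.
(syntax->datum stx)  ; => (+ 1 2)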

With read in C, this was hard to do. But with read in Scheme, it was no problem to implement. Adapting the expander to expect source locations inside syntax objects was a bit fiddly, though, and the resulting increase in source location information makes the output files bigger by a few percent -- due somewhat to the increased size of the .debug_lines DWARF data, but also due to serialized source locations for syntax objects in macros.

Speed-wise, switching to read in Scheme is a regression, currently. The old reader could parse around 15 or 16 megabytes per second when recording source locations on this laptop, or around 22 or 23 MB/s with source locations off. The new one parses more like 10.5 MB/s, or 13.5 MB/s with positions off, when in the old mode where it uses a weak-key side table to record source locations. The new read-syntax runs at around 12 MB/s. We'll be noodling at these in the coming months, but unlike when the original reader was written, at least now the reader is mainly used only at compile time. (It still has a role when reading s-expressions as data, so there is still a reason to make it fast.)

As is the case with eval, we still have a C version of the reader available for bootstrapping purposes, before the Scheme version is loaded. Happily, with this rewrite I was able to remove all of the cruft from the C reader related to non-default lexical syntax, which simplifies maintenance going forward.

An interesting aspect of attempting to make a bug-for-bug rewrite is that you find bugs and unexpected behavior. For example, it turns out that since the dawn of time, Guile always read #t and #f without requiring a terminating delimiter, so reading "(#t1)" would result in the list (#t 1). Weird, right? Weirder still, when the #true and #false aliases were added to the language, Guile decided to support them by default, but in an oddly backwards-compatible way... so "(#false1)" reads as (#f 1) but "(#falsa1)" reads as (#f alsa1). Quite a few more things like that.
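Seen from the REPL, the historical behavior looks like this (illustrative only; don't rely on it):

;; #t and #f have never required a terminating delimiter...
(call-with-input-string "(#t1)" read)      ; => (#t 1)
;; ...and #true/#false were grafted on in a backwards-compatible way.
(call-with-input-string "(#false1)" read)  ; => (#f 1)
(call-with-input-string "(#falsa1)" read)  ; => (#f alsa1)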

All in all it would seem to be a successful rewrite, introducing no new behavior, even producing the same errors. However, this is not the case for backtraces, which can expose the guts of read in cases where that previously wouldn't happen because the C stack was opaque to Scheme. Probably we will simply need to add more sensible error handling around callers to read, as a backtrace isn't a good user-facing error anyway.

OK enough rambling for this evening. Happy hacking to all and to all a good night!

sign of the times

8 April 2021 7:09 PM (gnu | guile | libtool | ffi)

Hello all! There is a mounting backlog of things that landed in Guile recently and to avoid having to eat the whole plate in one bite, I'm going to try to send some shorter missives over the next weeks.

Today's is about a silly thing, dynamic-link. This interface is dlopen, but "portable". See, back in the day -- like, 1998 -- there were lots of kinds of systems and how to make and load a shared library portably was hard. You'd have people with AIX and Solaris and all kinds of weird compilers and linkers filing bugs on your project if you hard-coded a GNU toolchain invocation when creating loadable extensions, or hard-coded dlopen or similar to use them.

Libtool provided a solution to create portable loadable libraries, which involved installing .la files alongside the .so files. You could use libtool to link them to a library or an executable, or you could load them at run-time via the libtool-provided libltdl library.

But, the .la files were a second source of truth, and thus a source of bugs. If a .la file is present, so is an .so file, and you could always just use the .so file directly. For linking against an installed shared library on modern toolchains, the .la files are strictly redundant. Therefore, all GNU/Linux distributions just delete installed .la files -- Fedora, Debian, and even Guix do so.

Fast-forward to today: there has been a winnowing of platforms, and a widening of the GNU toolchain (in which I include LLVM as well as it has a mostly-compatible interface). The only remaining toolchain flavors are GNU and Windows, from the point of view of creating loadable shared libraries. Whether you use libtool or not to create shared libraries, the result can be loaded either way. And from the user side, dlopen is the universally supported interface, outside of Windows; even Mac OS fixed their implementation a few years back.

So in Guile we have been in an unstable equilibrium: creating shared libraries by including a probably-useless libtool into the toolchain, and loading them by using a probably-useless libtool-provided libltdl.

But the use of libltdl has not been without its costs. Because libltdl intends to abstract over different platforms, it encourages you to leave off the extension when loading a library, instead promising to try a platform-specific set such as .so, .dll, .dylib etc as appropriate. In practice the abstraction layer was under-maintained and we always had some problems on Mac OS, for example.

Worse, as ltdl would search through the path for candidates, it would only report the last error it saw from the underlying dlopen interface. It was almost always the case that if A and B were in the search path, and A/foo.so failed to load because of a missing dependency, the error you would get as a user would instead be "file not found", because ltdl swallowed the first error and kept trucking to try to load B/foo.so which didn't exist.

In summary, this is a case where the benefits of an abstraction layer decline over time. For a few years now, libltdl hasn't been paying for itself. Libtool is dead, for all intents and purposes (last release in 2015); best to make plans to migrate away, somehow.

In the case of the dlopen replacement, in Guile we ended up rewriting the functionality in Scheme. The underlying facility is now just plain dlopen, for which we shim a version of dlopen on Windows, inspired by the implementation in cygwin. There are still platform-specific library extensions, but that is handled easily on the Scheme layer.
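For what it's worth, the user-facing pattern stays much the same; here is a small sketch using Guile's documented FFI, with libm and cos chosen purely for illustration (library naming is platform-dependent):

(use-modules (system foreign))

;; dynamic-link adds the platform-specific extension (.so, .dll, ...)
;; at the Scheme level; underneath it is now just plain dlopen.
(define libm (dynamic-link "libm"))

;; Resolve a symbol and wrap it as a Scheme procedure.
(define c-cos
  (pointer->procedure double
                      (dynamic-func "cos" libm)
                      (list double)))

(c-cos 0.0)  ; => 1.0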

Looking forward, I think it's probably time to replace Guile's use of libtool to create its libraries and executables. I loathe the fact that libtool puts shell scripts in the place of executables in build directories and stashes the actual executables elsewhere -- like, visceral revulsion. There is no need for that nowadays. Not sure what to replace it with, nor on what timeline.

And what about autotools? That, my friends, would be a whole nother blog post. Until then, & probably sooner, happy hacking!

here we go again

25 March 2021 12:22 PM (fsf | gnu | rms | fire | hackers)

Around 18 months ago, Richard Stallman was forced to resign from the Free Software Foundation board of directors and as president. It could have been anything -- at that point he already had a history of behaving in a way that was particularly alienating to women -- but in the end it was his insinuation that it was somehow OK if his recently-deceased mentor Marvin Minsky, then in his 70s or 80s, had sex with a 17-year-old on Jeffrey Epstein's private island. A weird hill to stake one's reputation on, to say the least.

At the time I was relieved that we would finally be getting some leadership renewal at the FSF, and hopeful that we could get some mission renewal as well. I was also looking forward to seeing what the practical implications would be for the GNU project, as more people agreed that GNU was about software freedom and not about its founder.

But now we're back! Not only has RMS continued through this whole time to insist that he runs the GNU project -- something that is simply not the case, in my estimation -- but this week, a majority of a small self-selected group of people, essentially a subset of current and former members of the FSF board of directors and including RMS himself, elected to reinstate RMS to the board of the Free Software Foundation. Um... read the room, FSF voting members? What kind of message are you sending?

In this context I can only agree with the calls for the entire FSF board to resign. The board is clearly not fit for purpose, if it can make choices like this.

dissociation?

I haven't (yet?) signed the open letter because I would be in an inconsistent position if I did so. The letter enjoins people to "refuse to contribute to projects related to the FSF and RMS"; as a co-maintainer of GNU Guile, which has its origins in the heady 1990s of the FSF but has nothing to do any more with RMS, but whose copyrights are entirely held by the FSF, is hosted on FSF-run servers, and is even obliged (GPLv3 §5d, as referenced by LGPLv3) to print out Copyright (C) 1995-2021 Free Software Foundation, Inc. when it starts, I must admit that I contribute to a project that is "related to the FSF". But I don't see how Guile could continue this association, if the FSF board continues as it is. It's bad for contributors and for the future of the project.

It would be very tricky to disentangle Guile from the FSF -- consider hosting, for example -- so it's not the work of a day, but it's something to think about.

Of course I would rather that the FSF wouldn't allow itself to be seen as an essentially misogynist organization. So clean house, FSF!

on the nature of fire

Reflecting on how specifically we could have gotten here -- I don't know. I don't know the set of voting members at the FSF, what discussions were had, or who voted what. But, having worked as a volunteer on GNU projects for almost two decades now, I have a guess. RMS and his closest supporters see themselves as guardians of the flame of free software -- a lost world of the late 70s MIT AI lab, reborn in a flurry of mid-80s hacking, but for 25 years or so, slipping further and further away. These are dark times, in their view, and having the principled founder in a leadership role can only be a good thing.

(Of course, the environment in the AI lab was only good for some. The treatment of Margaret Hamilton as recounted in Levy's Hackers shows that not all were welcome. If this were just one story, I would discount it, but looking back, it does seem to be part of a pattern.)

But is that what the FSF is for today? If so, Guile should certainly leave. I'm not here for software as performative nostalgia -- I'm here to have fun with friends and start a fire. The FSF should look to do the same -- look at the world we are in, look where the energy is now, and engage in real conversations about success and failure and tactics. There is a world to win and doubling down on RMS won't get us there from here.