wingolog

on yakshave, on color, on cosines, on glitchen

14 November 2014 4:49 PM (jpeg | scheme | guile | colorspaces | yaks | photos | dct | compression)

Hold on to your butts, kids, because this is epic.

on yaks

As in all great epics, our prideful, stubborn hero starts in a perfectly acceptable state of things, decides on a lark to make a small excursion, and comes back much much later to inflict upon you pictures from his journey.

So. I have a web photo gallery but I don't take many pictures these days. Dealing with photos is a bit of a drag, and the ways that are easier like Instagram or what-not give me the (peer, corporate, government: choose 3) surveillance hives. So, I had vague thoughts that I should update my web gallery. Yakpoint 1.

At the same time, my web gallery was written for mod_python on the server, and I don't like hacking in Python any more and kinda wanted to switch away from Apache. Yakpoint 2.

So I rewrote the server-side part in Scheme. (Yakpoint 3.) It worked fine but I found I needed the ability to get the dimensions of files on the server, so I wrote a quick-and-dirty JPEG parser. Yakpoint 4.

I needed EXIF data as well, as the original version displayed EXIF data, and for that I used a binding to libexif that I had written a few years ago when I thought about starting this project (Yakpoint -1). However I found some crashers in the library, because it had never really been tested in production, and instead of fixing them I said "what the hell, I'll just write an EXIF parser". (Yakpoint 5.) So I did and adapted the web gallery to use it (Yakpoint 6, for the adaptation.)

At this point, I looked back, and looked forward, and looked all around, and all was good, but what was with this uneasiness I was feeling? And indeed, I hadn't actually made anything better, and I wasn't taking more photos, and the workflow was the same.

I was also concerned about the client side of things, which was still in Python and using some breakage-prone legacy libraries to do the photo scaling and transformations and what-not, and relied on a desktop application (f-spot) of dubious future. So I started to look at what it would take to port that script to Scheme (Yakpoint 7). Well it used some legacy libraries to copy files over SSH (gnome-vfs; switching away from that would be Yakpoint 8) and I didn't want to make a Scheme GIO binding (Yakpoint 9, narrowly avoided), and I then -- and then, dear reader -- so then I said "well WTF my caching story on the server is crap anyway, I never know when the sqlite database has changed or not so I never know what responses I can cache, what I really want is a functional datastore" (Yakpoint 10), which is what I have with Git and Tekuti (Yakpoint of yore), and so why not just store my photos in Git like I do in Tekuti for blog posts and serve them from there, indexing as needed? Of course I'd need some other server software (Yakpoint of fore, by which I meantersay the future), but then I could just git push to update my photo gallery, and I wouldn't have to endure the horror that is GVFS shelling out to ssh in a FUSE daemon (Yakpoint of ne'er).

So. After mulling over these thoughts for a while I decided, during an autumnal walk on the Salève in which we had the greatest views of Mont Blanc everrrrr and yet where are the photos?, that really what I needed was new photo management software, not just a web gallery. I should be able to share photos from my phone or from my desktop, fix them up either place, tag and such, and OK woo hoo! Such is the future! And the present for many people? Thing is, I also needed good permissions management (Yakpoint what, 10 I guess?), because you know a dude just out of college is not the same as that dude many years later. Which means serving things over HTTPS (Yakpoints 11-47) in such a way that the app has some good control over who gets what.

Well. Anyway. My mind ran ahead, and runs ahead, and yet we haven't actually tasted the awesome sauce yet. So! The photo management software, wherever it lives, needs to rotate photos at least, and scale them down to a few resolutions. I smell a yak! I looked at jpegtran, which can do some lossless rotations but oddly isn't available as a library; and really I don't like shelling out for core program functionality, because every time I deal with the file system it's the wild west of concurrent mutation. If naming things is one of the two hardest problems in computer science, the file system is the worst because you have to give a global name to every intermediate value.

At the same time to scale images, what was I to do? Make a binding to libjpeg? Well I started (Yakpoint 48) but for reals kids, libjpeg is not fun. It works great and is really clever but

  1. it's approximately impossible to use from a dynamic ffi; you want a compiler to verify that you are using the right structure definitions

  2. there has been an inane ABI and format break imposed by the official IJG libjpeg, one which other implementations have not followed -- but how could you know which one you are using?

  3. the error handling facility encourages longjmp in C programs; somewhat terrifying

  4. off-heap image manipulation libraries always interact poorly with GC, because the GC only sees the small pointer to the off-heap image, and so doesn't GC often enough

  5. I have zero guarantee that libjpeg won't change ABI in weird ways, and I don't want to touch this software for the next 10 years

  6. I want to do jpegtran-like lossless transformations, but that's not available as a library, and it's totes ridics that binding libjpeg does not help you out here

  7. it's still an unsafe C library, battle-tested yes, but terrifyingly unsafe, and I'd be putting it on my server and who knows?

Friends, I arrived at the pasture, and I, I chose the yak less shaven. I took my lame JPEG parser and turned it into a full decoder (Yakpoint 49), realized it wasn't much more work to do an encoder (Yakpoint 50), and implemented the lossless transformations (Yakpoint 51).

on haters

Before we go on, I know some people would think "what is this kid about". I mean, custom gallery software, a custom JPEG library of all things, all bespoke, why don't you just use off-the-shelf solutions? Why aren't you normal and use a normal language and what about the best practices and where's your business case and I can't go on about this because there's a technical term for people that say this kind of thing and it's "hater".

Thing is, when did a hater ever make anything cool? Come to think of it, when did a hater make anything at all? In my experience the most vocal haters have nothing behind their names except a long series of pseudonymous rants in other people's comment boxes. So friends, in the joyful spirit of earning-anew, let's talk about JPEG!

on color

JPEG is a funny thing. Photos are our lives and our memories, our first steps and our friends, and yet I for one didn't know very much about them. My mental model that "a JPEG is a rectangle of pixels" doesn't turn out to be quite right.

If you actually look in a normal JPEG, you see three planes of information. If I take this image, for example:

If I decode it, actually I get three images. Here's the first one:

This is just the greyscale version of the image. So, storytime! Remember black and white television? We had an old one that got moved around the house sometimes, like if Mom was working at something in the kitchen. We also had a color one in the living room, and you could watch one or the other and they showed the same stuff. Strange when you think about it though -- one being in color and the other not. Well it turns out that color was literally just added on, both historically and technically. The main broadcast was still in black and white, and then in one part of the frequency band there were separate color signals, which color TVs would pick up, mix with the black and white signal, and come out with color. Wikipedia notes that "color TV" was really just "colored TV", which is a phrase whose cleverness I respect. Big ups to the W P.

In the context of JPEG, this black-and-white signal is sometimes called "luma", but is more precisely called Y', where the "prime" (the apostrophe) indicates that the signal has gamma correction applied.

In the image above, I replaced the color planes (sometimes collectively called the "chroma") with zeroes, while losslessly keeping the luma. Below is the first color plane, with the Y' plane replaced with a uniform 50% luma, and the other color plane replaced with zeros.

This color signal is technically known as CB, which may be very imperfectly understood as the bluish component of the color. Well the original image wasn't very blue, so we don't see very much here.

Indeed, our eyes have a harder time seeing differences in color than differences in intensity. Apparently this goes all the way down to biology -- we have more receptors in our eyes for "black and white" and fewer for color.

Early broadcasters took advantage of this difference in perception by actually devoting more bandwidth in their broadcasts to luma than to chroma; if you check the Wikipedia page you will see that the area in the spectrum allocation devoted to color is much smaller than the area devoted to intensity. So it is in JPEG: the above image being half-width indicates that actually we're just encoding one CB sample for every two Y' samples.

Finally, here we have the CR color plane, which can loosely be thought of as the "redness" of the image.

These test images and crops preserve the actual encoding of this photo as it came from my camera, without re-encoding. That's partly why there's not much interesting going on; with the megapixels these days, it's hard to fit much of anything in a few hundred pixels square. This particular camera is sub-sampling in the horizontal direction, but it's also common to subsample vertically as well, producing color planes that are half-width and half-height. In my limited investigations I have found that cameras tend to sub-sample just in the X direction, producing what they call 4:2:2 images, and that standard software encoders subsample in both, producing 4:2:0.

Incidentally, properly scaling up the color planes is quite an irritating endeavor -- the standard indicates that the color is sampled between the locations of the Y' samples ("centered" chroma), but these images originally have EXIF data that indicates that the color samples are taken at the position of the first Y' sample ("co-sited" chroma). I'm pretty sure libjpeg doesn't delve into the EXIF to check this though, so it would seem that all renderings I have seen of these photos are subtly off.

But how do you get proper color out of these strange luma and chroma things? Well, the Y'CBCR colorspace is really just the same color cube as RGB, except rotated: the Y' axis traverses the diagonal from (0, 0, 0) (black) to (255, 255, 255) (white). CB and CR are perpendicular to that diagonal, pointing towards blue or red respectively. So to go back to RGB, you multiply by a matrix to rotate the cube.
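For the concretely minded, here is that rotation as used in JPEG (the JFIF flavor of Y'CBCR), sketched in Scheme; a real decoder would clamp the results to [0, 255]:

(define (ycbcr->rgb y cb cr)
  ;; 8-bit samples; the chroma planes are centered on 128.
  (let ((cb (- cb 128))
        (cr (- cr 128)))
    (values (+ y (* 1.402 cr))                     ; R
            (- y (* 0.344136 cb) (* 0.714136 cr))  ; G
            (+ y (* 1.772 cb)))))                  ; B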

It's not a very intuitive color system, as you can see from the images above. For one thing, at zero or full luma, the chroma axes have no meaning; black and white can have no hue. Indeed if you imagine trying to fit a cube corner-down into a similar-sized box, you end up either having empty space in the box, or you have to cut off corners from the cube, or both. Cut corners means that bits of the Y'CBCR signal are wasted; empty space means there are RGB colors that are not representable in Y'CBCR. I'm not sure, but I think both are true for the particular formulation of Y'CBCR used in JPEG.

There's more to say about color here but frankly I don't know enough to do so, even though I worked in digital video for many years. If this is something you are mildly interested in, I highly, highly recommend watching Wim Taymans' presentation at this year's GStreamer conference. He takes a look at color in video that is constructive, building up from biology through math to engineering. His is a principled approach rather than a list of rules. It really clarified a number of things for me (and opened doors to unknown unknowns beyond).

on cosines

Where were we? Right, JPEG. So the proper way to understand what JPEG is is to understand the encoding process. We've covered colorspace conversion from RGB to Y'CBCR and sub-sampling. Next, the image canvas is divided into equal-sized "macroblocks". (These are called "minimum coded units" (MCUs) in the JPEG context, but in video they are usually called macroblocks, and it's a better name.) Without sub-sampling, each macroblock will contain one 8-sample-by-8-sample block for each component (Y', CB, CR) of the image. In my images above, the canvas space corresponding to one chroma block is the space of two luma blocks, so the macroblocks will be 16 samples wide and 8 samples tall, and contain two Y' blocks and one each of CB and CR. If the image canvas can't be evenly divided into macroblocks, it is padded to fit, usually by duplicating the last column or row of samples.
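As a sketch of the bookkeeping involved -- the helper names here are invented for illustration, not taken from any real library -- the macroblock size falls out of the maximum sampling factors, and the canvas is rounded up to fit:

(define (macroblock-size components)
  ;; components: list of (x-factor . y-factor) sampling pairs,
  ;; e.g. ((2 . 1) (1 . 1) (1 . 1)) for the 4:2:2 case above.
  (let ((max-x (apply max (map car components)))
        (max-y (apply max (map cdr components))))
    (values (* 8 max-x) (* 8 max-y))))

(define (padded-size width height mb-width mb-height)
  ;; Round the canvas up to a whole number of macroblocks.
  (values (* mb-width (ceiling (/ width mb-width)))
          (* mb-height (ceiling (/ height mb-height)))))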

Then to make a JPEG, each block is encoded separately, then the whole thing is just written out to a file, and you're done!

This description glosses over a couple of important points, but it's a good big-picture view to have in mind. The pipeline goes from RGB pixels, to a padded RGB canvas, to separate Y'CBCR planes, to a possibly subsampled set of those planes, to macroblocks, to encoded macroblocks, to the file. Decoding is the reverse. It's a totally doable, comprehensible thing, and that was one of the big takeaways for me from this project. I took photography classes in high school and it was really cool to see how to shoot, develop, and print film, and this is similar in many ways. The real "film" is raw-format data, which some cameras produce, but understanding JPEG is like understanding enlargers and prints and fixer baths and such things. It's smelly and dark but pretty cool stuff.

So, how do you encode a block? Well peoples, this is a kinda cool thing. Maybe you remember from some math class that, given n uniformly spaced samples, you can always represent that series as a sum of n cosine functions of equally spaced frequencies. In each little 8-by-8 block, that's what we do: a "forward discrete cosine transformation" (FDCT), which is just multiplying together some matrices for every point in the block. The FDCT is completely separable in the X and Y directions, so the space of 8 horizontal frequencies crossed with the space of 8 vertical frequencies yields 64 total coefficients, which is not coincidentally the number of samples in a block.
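Concretely, the 1-D version of that transform looks something like the following naive sketch; the 2-D FDCT is just this applied to each row and then to each column of the block:

(define pi (* 4 (atan 1)))

(define (fdct-1d f)
  ;; f is a vector of 8 samples; returns a vector of 8 coefficients.
  ;; F(u) = (C(u)/2) * sum over x of f(x) * cos((2x+1) u pi / 16),
  ;; where C(0) = 1/sqrt(2) and C(u) = 1 otherwise.
  (let ((out (make-vector 8)))
    (do ((u 0 (+ u 1))) ((= u 8) out)
      (let lp ((x 0) (sum 0))
        (if (= x 8)
            (vector-set! out u (* (if (zero? u) (/ 1 (sqrt 2)) 1) 1/2 sum))
            (lp (+ x 1)
                (+ sum (* (vector-ref f x)
                          (cos (/ (* (+ (* 2 x) 1) u pi) 16))))))))))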

Funny thing about those coefficients: each one corresponds to a particular horizontal and vertical frequency. We can map these out as a space of functions; for example giving a non-zero coefficient to (0, 0) in the upper-left block of an 8-block-by-8-block grid, and so on, yielding a 64-by-64 pixel representation of the meanings of the individual coefficients. That's what I did in the test strip above. Here is the luma example, scaled up without smoothing:

The upper-left corner corresponds to a frequency of 0 in both X and Y. The lower-right is a frequency of 4 "hertz", oscillating from highest to lowest value in both directions four times over the 8-by-8 block. I'm actually not sure why there are some greyish pixels around the right and bottom borders; it's not a compression artifact, as I constructed these DCT arrays programmatically. Anyway. Point is, your lover's smile, your sunny days, your raw urban graffiti, your child's first steps, all of these are reified in your photos as a sum of cosine coefficients.

The odd thing is that what is reified into your pictures isn't actually all of the coefficients there are! Firstly, because the coefficients are rounded to integers. Mathematically, the FDCT is a lossless operation, but in the context of JPEG it is not because the resulting coefficients are rounded. And they're not just rounded to the nearest integer; they are probably quantized further, for example to the nearest multiple of 17 or even 50. (These numbers seem exaggerated, but keep in mind that the range of coefficients is about 8 times the range of the original samples.)

The choice of what quantization factors to use is a key part of JPEG, and it's subjective: low quantization results in near-indistinguishable images, but in middle compression levels you want to choose factors that trade off subjective perception with file size. A higher quantization factor leads to coefficients with fewer bits of information that can be encoded into less space, but results in a worse image in general.
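In code, the quantization step is almost embarrassingly small -- this little sketch is where all the loss in "lossy" lives:

(define (quantize coef q)
  ;; Round to the nearest multiple of q, expressed in units of q.
  (round (/ coef q)))

(define (dequantize coef q)
  (* coef q))

;; With a factor of 50, any coefficient from 475 to 524 decodes
;; back to 500: (dequantize (quantize 483 50) 50) => 500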

JPEG proposes a standard quantization matrix, with one number for each frequency (coefficient). Here it is for luma:

(define *standard-luma-q-table*
  #(16 11 10 16 24 40 51 61
    12 12 14 19 26 58 60 55
    14 13 16 24 40 57 69 56
    14 17 22 29 51 87 80 62
    18 22 37 56 68 109 103 77
    24 35 55 64 81 104 113 92
    49 64 78 87 103 121 120 101
    72 92 95 98 112 100 103 99))

This matrix is used for "quality 50" when you encode an 8-bit-per-sample JPEG. You can see that lower frequencies (the upper-left part) are quantized less harshly, and vice versa for higher frequencies (the bottom right).

(define *standard-chroma-q-table*
  #(17 18 24 47 99 99 99 99
    18 21 26 66 99 99 99 99
    24 26 56 99 99 99 99 99
    47 66 99 99 99 99 99 99
    99 99 99 99 99 99 99 99
    99 99 99 99 99 99 99 99
    99 99 99 99 99 99 99 99
    99 99 99 99 99 99 99 99))

For chroma (CB and CR) we see that quantization is much more harsh in general. So not only will we sub-sample color, we will also throw away more high-frequency color variation. It's interesting to think about, but also makes sense in some way; again in photography class we did an exercise where we shaded our prints with colored pencils, and the results were remarkable. My poor, lazy coloring skills somehow rendered leaves lifelike in different hues of green; really though, they were shades of grey, colored in imprecisely. "Colored TV" indeed.

With this knowledge under our chapeaux, we can now say what the "JPEG quality" setting actually is: it's simply that pair of standard quantization matrices scaled up or down. Towards "quality 100", the matrix approaches all-ones, for no quantization, and thus minimal loss (though you still have some rounding, often subsampling as well, and RGB-to-Y'CBCR gamut loss). Towards "quality 0" they scale to a matrix full of large values, for harsh quantization.
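For the record, here is (roughly) the scaling that the IJG software applies, sketched with the integer arithmetic of the original:

(define (scale-q-table table quality)
  ;; table: a vector of 64 factors, like the ones above.
  ;; Quality 50 returns the table unchanged; higher qualities
  ;; scale it down towards all-ones, lower ones scale it up.
  (let ((scale (if (< quality 50)
                   (quotient 5000 quality)
                   (- 200 (* 2 quality))))
        (out (make-vector 64)))
    (do ((i 0 (+ i 1))) ((= i 64) out)
      (let ((q (quotient (+ (* (vector-ref table i) scale) 50) 100)))
        (vector-set! out i (max 1 (min 255 q)))))))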

This understanding also explains those wavey JPEG artifacts you get on low-quality images. Those artifacts look like waves because they are waves. They usually occur at sharp intensity transitions, which like a cymbal crash cause lots of high frequencies that then get harshly quantized. Incidentally I suspect (but don't know) that this is the same reason that cymbals often sound bad in poorly-encoded MP3s, because of harsh quantization in the frequency domain.

Finally, the coefficients are written out to a file as a stream of bits. Each file gets a huffman code allocated to it, which ideally is built from the distribution of quantized coefficient sizes seen in all of the blocks of an image. There are usually different encodings for luma and chroma, to reflect their different quantizations. Reading and writing this bitstream is a bit of a headache but the algorithm is specified in the JPEG standard, and all you have to do is implement it. Notably, though, there is special support for encoding a run of zero-valued coefficients, which happens often after quantization. There are rarely wavey bits in a blue blue sky.
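A sketch of that zero-run encoding, leaving out the actual huffman codes and the escape for runs longer than 15 zeros: walk the 63 AC coefficients of a block (in zigzag order, low frequencies first), emitting (run . value) pairs, with an end-of-block marker once only zeros remain.

(define (ac->run-length-pairs ac)
  ;; ac: list of the 63 AC coefficients, in zigzag order.
  (let lp ((ac ac) (run 0) (out '()))
    (cond
     ((null? ac)
      ;; Trailing zeros collapse into a single end-of-block symbol.
      (reverse (if (zero? run) out (cons 'end-of-block out))))
     ((zero? (car ac))
      (lp (cdr ac) (+ run 1) out))
     (else
      (lp (cdr ac) 0 (cons (cons run (car ac)) out))))))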

on transforms

It's terribly common for photos to be wrongly oriented. Unfortunately, the way that many editors fix photo rotation is by setting a bit in the EXIF information of the JPEG. This is ineffectual, as web browsers don't look in the EXIF information, and silly, because it turns out you can losslessly rotate most JPEG images anyway.

Consider that the body of a JPEG is an array of macroblocks. To rotate an image, you just have to rearrange those macroblocks, then rearrange the blocks inside the macroblocks (e.g. swap the two Y' blocks in my above example), then transform the blocks themselves.

The lossless transformations that you can do on a block are transposition, vertical flipping, and horizontal flipping.

Transposition flips a block along its downward-sloping diagonal. To do so, you just swap the coefficients at (u, v) with the coefficients at (v, u). Easy peasey.

Flipping is trickier. Consider the enlarged DCT image from above. What would it take to horizontally flip the function at (0, 1)? Instead of going from light to dark, you want it to go from dark to light. Simple: you just negate the coefficients! But you only want to negate those coefficients that are "odd" in the X direction, which are those coefficients whose column is odd. And actually that's all there is to it. Flipping vertically is the same, but for coefficients whose row is odd.

I said "most images" above because those whose size is not evenly divided by the macroblock size can't be losslessly rotated -- you will end up seeing some of the hidden data that falls off the edge of the canvas. Oh well. Most raw images are properly dimensioned, and if you're downscaling, you already have to re-encode anyway.

But that's just flipping and transposition, you say! What about rotation? Well it turns out that you can express rotation in terms of these operations: rotating 90 degrees clockwise is just a transpose and a horizontal flip (in that order). Together, flipping horizontally, flipping vertically, and transposing form a group, in the same way that flipping and flopping form a group for mattresses. Yeah!
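Here is the whole group, sketched over a block of coefficients stored row-major in a 64-element vector:

(define (transpose-block! block)
  ;; Swap the coefficient at (u, v) with the one at (v, u).
  (do ((v 0 (+ v 1))) ((= v 8) block)
    (do ((u (+ v 1) (+ u 1))) ((= u 8))
      (let* ((a (+ (* v 8) u))
             (b (+ (* u 8) v))
             (tmp (vector-ref block a)))
        (vector-set! block a (vector-ref block b))
        (vector-set! block b tmp)))))

(define (hflip-block! block)
  ;; Negate the coefficients in odd columns.
  (do ((i 0 (+ i 1))) ((= i 64) block)
    (when (odd? (modulo i 8))
      (vector-set! block i (- (vector-ref block i))))))

(define (vflip-block! block)
  ;; Negate the coefficients in odd rows.
  (do ((i 0 (+ i 1))) ((= i 64) block)
    (when (odd? (quotient i 8))
      (vector-set! block i (- (vector-ref block i))))))

(define (rotate-block-90! block)
  ;; 90 degrees clockwise: transpose, then flip horizontally.
  (hflip-block! (transpose-block! block)))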

on scheme

I wrote this library in Scheme because that's my language of choice these days. I didn't run into any serious impedance mismatches; Guile has a generic multi-dimensional array facility that made it possible to express many of these operations as generic folds, unfolds, or maps over arrays. The huffman coding part was a bit irritating, but all in all things were pretty good. The speed is pretty bad, but I haven't optimized it at all, and it gives me a nice test case for the compiler. Anyway, it's been fun and it suits my needs. Check out the project page if you're interested. Yes, to shave a yak you have to get a bit bovine and smelly, but yaks live in awesome places!

Finally I will leave you with a glitch, one of many that I have produced over the last couple weeks. Comments and corrections welcome below. Happy hacking!

generators in firefox now twenty-two times faster

14 November 2014 8:41 AM (spidermonkey | firefox | mozilla | javascript | es6 | ecmascript | v8 | jandem | bloomberg | igalia)

It's with great pleasure that I can announce that, thanks to Mozilla's Jan de Mooij, the new ES6 generator functions are twenty-two times faster in Firefox!

Some back-story, for the unawares. There's a new version of JavaScript coming, ECMAScript 6 (ES6). Among the new features that ES6 brings are generator functions: functions that can suspend. Firefox's JavaScript engine, SpiderMonkey, has had support for generators for many years, long before other engines. This support was upgraded to the new ES6 standard last year, thanks to sponsorship from Bloomberg, and was shipped out to users in Firefox 26.

The generators implementation in Firefox 26 was quite basic. As you probably know, modern JavaScript implementations have a number of tiered engines. In the case of SpiderMonkey there are three tiers: the interpreter, the baseline compiler, and the optimizing compiler. Code begins execution in the interpreter, which is the quickest engine to start. If a piece of code is hot -- meaning that lots of time is being spent there -- then it will "tier up" to the next level, where it is analyzed, possibly optimized, and then compiled to machine code.

Unfortunately, generators in SpiderMonkey have always been stuck at the lowest tier, the interpreter. This is because of SpiderMonkey's choice of implementation strategy for generators. Generators were implemented as "floating interpreter stack frames": heap-allocated objects whose shape was exactly the same as a stack frame in the interpreter. This had the advantage of being fairly cheap to implement in the beginning, but ultimately it made them unable to take advantage of JIT compilation, as JIT code runs on its own stack which has a different layout. The previous strategy also relied on trampolining through a helper written in C++ to resume generators, which killed optimization opportunities.

The solution was to represent suspended generator objects as snapshots of the state of a stack frame, instead of as stack frames themselves. In order for this to be efficient, last year we did a number of block scope optimizations to try and reduce the amount of state that a generator frame would have to restore. Finally, around March of this year we were at the point where we could refactor the interpreter to implement generators on the normal interpreter stack, with normal interpreter bytecodes, with the vision of being able to JIT-compile those bytecodes.

I ran out of time before I could land that patchset; although the patches were where we wanted to go, they actually caused generators to be even slower and so they languished in Bugzilla for a few more months. Sad monkey. It was with delight, then, that a month or so ago I saw that SpiderMonkey JIT maintainer Jan de Mooij was interested in picking up the patches. Since then he has been hacking off and on at getting my old patches into shape, and ended up applying them all.

He went further, optimizing stack frames to not reserve space for "aliased" locals (locals allocated on the scope chain), speeding up object literal creation in the baseline compiler, and finally implementing baseline JIT compilation for generators.

So, after all of that perf nargery, what's the upshot? Twenty-two times faster! In this microbenchmark:

function *g(n) {
    for (var i=0; i<n; i++)
        yield i;
}
function f() {
    var t = new Date();
    var it = g(1000000);
    for (var i=0; i<1000000; i++)
        it.next();
    print(new Date() - t);
}
f();

Before, it took SpiderMonkey 980 milliseconds to complete on Jan's machine. After? Only 43! It's actually marginally faster than V8 at this point, which has (temporarily, I think) regressed to 45 milliseconds on this test. Anyway. Competition is great and as a committer to both projects I find it very satisfactory to have good implementations on both sides.

As in V8, in SpiderMonkey generators cannot yet reach the highest tier of optimization. I'm somewhat skeptical that it's necessary, too, as you expect generators to suspend fairly frequently. That said, a yield point in a generator is, from the perspective of the optimizing compiler, not much different from a call site, in that it causes all locals to be saved. The difference is that locals may have unboxed representations, so we would have to box those values when saving the generator state, and unbox on restore.

Thanks to Bloomberg for funding the initial work, and big, big thanks to Mozilla's Jan de Mooij for picking up where we left off. Happy hacking with generators!

ffconf 2014

9 November 2014 5:36 PM (javascript | javascriptcore | spidermonkey | v8 | self-hosting | bootstrapping | igalia)

Last week I had the great privilege of speaking at ffconf in Brighton, UK. It was lovely. The town put on a full demonstration of its range of November weather patterns, from blue skies to driving rain to hail (!) to sea-spray to drizzle and back again. Good times.

The conference itself was quite pleasant as well, and from the speaker perspective it was amazing. A million thanks to Remy and Julie for making it such a pleasure. ffconf is mostly a front-end development conference, so it's not directly related to the practice of my work on JS implementations -- perhaps you're unaware, but there aren't so many browser implementors that actually develop content for their browsers, and indeed fewer JS implementors that actually write JS. Me, I sling C++ all day long and the most JavaScript I write is for tests. When in the weeds, sometimes we forget we're building an amazing runtime and that people do inspiring things with it, so it's nice to check in with front-end folks at their conferences to see what people are excited about.

My talk was about the part of JavaScript implementations that are written in JavaScript itself. This is an area that isn't so well known, and it has its amusing quirks. I think it can be interesting to a curious JS hacker who wants to spelunk down a bit to see what's going on in their browsers. Intrepid explorers might even find opportunities to contribute patches. Anyway, nerdy stuff, but that's basically how I roll.

The slides are here: without images (350kB PDF) or with images (3MB PDF).

I haven't been to the UK in years, and being in a foreign country where everyone speaks my native language was quite refreshing. At the same time there was an awkward incident in which I was reminded that though close, American and English just aren't the same. I made this silly joke that if you get a polyfill into a JS implementation, then shucks, you have a "gollyfill", 'cause golly it made it in! In the US I think "golly" is just one of those milquetoast profanities, "golly" instead of "god" like saying "shucks" instead of "shit". Well in the UK that's a thing too I think, but there is also another less fortunate connotation, in which "golly" as a noun can be a racial slur. Check the Wikipedia if you're as ignorant as I was. I think everyone present understood that wasn't my intention, but if that is not the case I apologize. With slides though it's less clear, so I've gone ahead and removed the joke from the slides. It's probably not a ball to take and run with.

I do have a ball that you can run with, though! And actually, this was another terrible joke that wasn't bad enough to inflict upon my audience, but that fate now gives me the opportunity to use. So never fear, there are still bad puns in the slides. But, you'll have to click through to the PDF for optimal groaning.

Happy hacking, JavaScripters, and until next time.

ffs ssl

17 October 2014 2:33 PM (ssl | tls | ffs | meta | https | http | hsts | pki | networking)

I just set up SSL/TLS on my web site. Everything can be had via https://wingolog.org/, and things appear to work. However the process of transitioning even a simple web site to SSL is so clownshoes bad that it's amazing anyone ever does it. So here's an incomplete list of things that can go wrong when you set up TLS on a web site.

You search "how to set up https" on the Googs and click the first link. It takes you here which tells you how to use StartSSL, which generates the key in your browser. Whoops, your private key is now known to another server on this internet! Why do people even recommend this? It's the worst of the worst of Javascript crypto.

OK so you decide to pay for a certificate, assuming that will be better, and because who knows what's going on with StartSSL. You've heard of RapidSSL so you go to rapidssl.com. WTF their price is 49 dollars for a stupid certificate? Your domain name was only 10 dollars, and domain name resolution is an actual ongoing service, unlike certificate issuance that just happens one time. You can't believe it so you click through to the prices to see, and you get this:

Whatttttttttt

OK so I'm using Epiphany on Debian and I think that uses the system root CA list which is different from what Chrome or Firefox do but Jesus this is shaking my faith in the internet if I can't connect to an SSL certificate provider over SSL.

You remember hearing something on Twitter about cheaper certs, and oh ho ho, it's rapidsslonline.com, not just RapidSSL. WTF. OK. It turns out Geotrust and RapidSSL and Verisign are all owned by Symantec anyway. So you go and you pay. Paying is the first thing you have to do on rapidsslonline, before anything else happens. Welp, cross your fingers and take out your credit card, cause SSLanta Clause is coming to town.

Recall, distantly, that SSL has private keys and public keys. To create an SSL certificate you have to generate a key on your local machine, which is your private key. That key shouldn't leave your control -- that's why the DigitalOcean page is so bogus. The certification authority (CA) then needs to receive your public key and then return it signed. You don't know how to do this, because who does? So you Google and copy and paste command line snippets from a website. Whoops!

Hey neat it didn't delete your home directory, cool. Let's assume that your local machine isn't rooted and that your server isn't rooted and that your hosting provider isn't rooted, because that would invalidate everything. Oh what so the NSA and the five eyes have an ongoing program to root servers? Um, well, water under the bridge I guess. Let's make a key. You google "generate ssl key" and this is the first result.

# openssl genrsa -des3 -out foo.key 1024

Whoops, you just made a 1024-bit key! I don't know if those are even accepted by CAs any more. Happily if you leave off the 1024, it defaults to 2048 bits, which I guess is good.

Also you just made a key with a password on it (that's the -des3 part). This is eminently pointless. In order to use your key, your web server will need the decrypted key, which means it will need the password to the key. Adding a password does nothing for you. If you lost your private key but you did have it password-protected, you're still toast: the available encryption cyphers are meant to be fast, not hard to break. Any serious attacker will crack it directly. And if they have access to your private key in the first place, encrypted or not, you're probably toast already.

OK. So let's say you make your key, and make what's called the "CSR" (certificate signing request), to ask for the cert.

# openssl req -new -key foo.key -out foo.csr

Now you're presented with a bunch of pointless-looking questions like your country code and your "organization". Seems pointless, right? Well now I have to live with this confidence-inspiring dialog, because I left off the organization:

Don't mess up, kids! But wait there's more. You send in your CSR, finally figure out how to receive mail for hostmaster@yourdomain.org because that's what "verification" means (not, god forbid, control of the actual web site), and you get back a certificate. Now the fun starts!

How are you actually going to serve SSL? The truly paranoid use an out-of-process SSL terminator. Seems legit except if you do that you lose any kind of indication about what IP is connecting to your HTTP server. You can use a more HTTP-oriented terminator like bud but then you have to mess with X-Forwarded-For headers and you only get them on the first request of a connection. You could just enable mod_ssl on your Apache, but that code is terrifying, and do you really want to be running Apache anyway?

In my case I ended up switching over to nginx, which has a startlingly underspecified configuration language, but for which the Debian defaults are actually not bad. So you uncomment that part of the configuration, cross your fingers, Google a bit to remind yourself how systemd works, and restart the web server. Haich Tee Tee Pee Ess ahoy! But did you remember to disable the NULL authentication method? How can you test it? What about the NULL encryption method? These are actual things that are configured into OpenSSL, and specified by standards. (What is the use of a secure communications standard that does not provide any guarantee worth speaking of?) So you google, copy and paste some inscrutable incantation into your config, turn them off. Great, now you are a dilettante tweaking your encryption parameters, I hope you feel like a fool because I sure do.

Except things are still broken if you allow RC4! So you better make sure you disable RC4, which incidentally is exactly the opposite of the advice that people were giving out three years ago.

OK, so you took your certificate that you got from the CA and your private key and mashed them into place and it seems the web browser works. Thing is though, the key that signs your certificate is possibly not in the actual root set of signing keys that browsers use to verify the key validity. If you put just your key on the web site without the "intermediate CA", then things probably work but browsers will make an additional request to get the intermediate CA's key, slowing down everything. So you have to concatenate the text files with your key and the one with the intermediate CA's key. They look the same, just a bunch of numbers, but don't get them in the wrong order because apparently the internet says that won't work!

But don't put in too many keys either! In this image we have a cert for jsbin.com with one intermediate CA:

And here is the same but with a different root that signed the GeoTrust Global CA certificate. Apparently there was a time in which the GeoTrust cert hadn't been added to all of the root sets yet, and it might not hurt to include them all:

Thing is, the first one shows up "green" in Chrome (yay), but the second one shows problems ("outdated security settings" etc etc etc). Why? Because the link from Equifax to Geotrust uses a SHA-1 signature, and apparently that's not a good idea any more. Good times? (Poor Remy last night was doing some basic science on the internet to bring you these results.)

Or is Chrome denying you the green because it was RapidSSL that signed your certificate with SHA-1 and not SHA-256? It won't tell you! So you Google and apply snakeoil and beg your CA to reissue your cert, hopefully they don't charge for that, and eventually all is well. Chrome gives you the green.

Or does it? Probably not, if you're switching from a web site that is also available over HTTP. Probably you have some images or CSS or Javascript that's being loaded over HTTP. You fix your web site to have scheme-relative URLs (like //wingolog.org/ instead of http://wingolog.org/), and make sure that your software can deal with it all (I had to patch Guile :P). Update all the old blog posts! Edit all the HTMLs! And finally, green! You're golden!

Or not! Because if you left on SSLv3 support you're still broken! TLSv1.0, which is actually greater than SSLv3 for no good reason, also has problems; and then TLSv1.1 also has problems, so you better stick with just TLSv1.2. Except, except, older Android phones don't support TLSv1.2, and neither does the Googlebot, so you don't get the rankings boost you were going for in the first place. So you upgrade your phone because that's a thing you want to do with your evenings, and send snarky tweets into the ether about scumbag google wanting to promote HTTPS but not supporting the latest TLS version.

So finally, finally, you have a web site that offers HTTPS and HTTP access. You're good, right? Except no! (Catching on to the pattern?) Because what happens is that people just type in web addresses to their URL bars like "google.com" and leave off the HTTP, because why type those stupid things. So you arrange for http://www.wobsite.com to redirect to https://www.wobsite.com for users that have visited the HTTPS site. Except no! Because any network attacker can simply strip the redirection from the HTTP site.

The "solution" for this is called HTTP Strict Transport Security, or HSTS. Once a visitor visits your HTTPS site, the server sends a response that tells the browser never to fetch HTTP from this site. Except that doesn't work the first time you go to a web site! So if you're Google, you friggin add your name to a static list in the browser. EXCEPT EVEN THEN watch out for the Delorean.

And what if instead they go to wobsite.com instead of the www.wobsite.com that you configured? Well, better enable HSTS for the whole site, but to do anything useful with such a web request you'll need a wildcard certificate to handle the multiple URLs, and those run like 150 bucks a year, for a one-bit change. Or, just get more single-domain certs and tack them onto your cert, using the precision tool cat, but don't do too many, because if you do you will overflow the initial congestion window of the TCP connection and you'll have to wait for an ACK on your certificate before you can actually exchange keys. Don't know what that means? Better look it up and be an expert, or your wobsite's going to be slow!

If your security goals are more modest, as they probably are, then you could get burned the other way: you could enable HSTS, something could go wrong with your site (an expired certificate perhaps), and then people couldn't access your site at all, even if they have no security needs, because HTTP is turned off.

Now you start to add secure features to your web app, safe with the idea you have SSL. But better not forget to mark your cookies as secure, otherwise they could be leaked in the clear, and better not forget that your website might also be served over HTTP. And better check up on when your cert expires, and better have a plan for embedded browsers that don't have useful feedback to the user about certificate status, and what about your CA's audit trail, and better stay on top of the new developments in security! Did you read it? Did you read it? Did you read it?

It's a wonder anything works. Indeed I wonder if anything does.

high-performance packet filtering with pflua

2 September 2014 10:15 AM (lua | networks | packet filtering | luajit | trace compilation | jit | igalia | snabb | pcap | pflang | pflua)

Greets! I'm delighted to be able to announce the release of Pflua, a high-performance packet filtering toolkit written in Lua.

Pflua implements the well-known libpcap packet filtering language, which we call pflang for short.

Unlike other packet filtering toolkits, which tend to use the libpcap library to compile pflang expressions to bytecode to be run by the kernel, Pflua is a completely new implementation of pflang.

why lua?

At this point, regular readers are asking themselves why this Schemer is hacking on a Lua project. The truth is that I've always been looking for an excuse to play with the LuaJIT high-performance Lua implementation.

LuaJIT is a tracing compiler, which is different from other JIT systems I have worked on in the past. Among other characteristics, tracing compilers only emit machine code for branches that are taken at run-time. Tracing seems a particularly appropriate strategy for the packet filtering use case, as you end up with linear machine code that reflects the shape of actual network traffic. This has the potential to be much faster than anything static compilation techniques can produce.

The other reason for using Lua was because it was an excuse to hack with Luke Gorrie, who for the past couple years has been building the Snabb Switch network appliance toolkit, also written in Lua. A common deployment environment for Snabb is within the host virtual machine of a virtualized server, with Snabb having CPU affinity and complete control over a high-performance 10Gbit NIC, which it then routes to guest VMs. The administrator of such an environment might want to apply filters on the kinds of traffic passing into and out of the guests. To this end, we plan on integrating Pflua into Snabb so as to provide a pleasant, expressive, high-performance filtering facility.

Given its high performance, it is also reasonable to deploy Pflua on gateway routers and load-balancers, within virtualized networking appliances.

implementation

Pflua compiles pflang expressions to Lua source code, which is then optimized at run-time to native machine code.

There are actually two compilation pipelines in Pflua. The main one is fairly traditional. First, a custom parser produces a high-level AST of a pflang filter expression. This AST is lowered to a primitive AST, with a limited set of operators and ways in which they can be combined. This representation is then exhaustively optimized, folding constants and tests, inferring ranges of expressions and packet offset values, hoisting assertions that post-dominate success continuations, etc. Finally, we residualize Lua source code, performing common subexpression elimination as we go.

For example, if we compile the simple Pflang expression ip or ip6 with the default compilation pipeline, we get the Lua source code:

return function(P,length)
   if not (length >= 14) then return false end
   do
      local v1 = ffi.cast("uint16_t*", P+12)[0]
      if v1 == 8 then return true end
      do
         do return v1 == 56710 end
      end
   end
end

The other compilation pipeline starts with bytecode for the Berkeley packet filter VM. Pflua can load up the libpcap library and use it to compile a pflang expression to BPF. In any case, whether you start from raw BPF or from a pflang expression, the BPF is compiled directly to Lua source code, which LuaJIT can gnaw on as it pleases. Compiling ip or ip6 with this pipeline results in the following Lua code:

return function (P, length)
   local A = 0
   if 14 > length then return 0 end
   A = bit.bor(bit.lshift(P[12], 8), P[12+1])
   if (A==2048) then goto L2 end
   if not (A==34525) then goto L3 end
   ::L2::
   do return 65535 end
   ::L3::
   do return 0 end
   error("end of bpf")
end

We like the independence and optimization capabilities afforded by the native pflang pipeline. Pflua can hoist and eliminate bounds checks, whereas BPF is obligated to check that every packet access is valid. Also, Pflua can work on data in network byte order, whereas BPF must convert to host byte order. Both of these restrictions apply not only to Pflua's BPF pipeline, but also to all other implementations that use BPF (for example the interpreter in libpcap, as well as the JIT compilers in the BSD and Linux kernels).

However, though Pflua does a good job in implementing pflang, it is inevitable that there may be bugs or differences of implementation relative to what libpcap does. For that reason, the libpcap-to-bytecode pipeline can be a useful alternative in some cases.

performance

When Pflua hits the sweet spots of the LuaJIT compiler, performance screams.


(full image, analysis)

This synthetic benchmark runs over a packet capture of a ping flood between two machines and compares the following pflang implementations:

  1. libpcap: The user-space BPF interpreter from libpcap

  2. linux-bpf: The old Linux kernel-space BPF compiler from 2011. We have adapted this library to work as a loadable user-space module (source)

  3. linux-ebpf: The new Linux kernel-space BPF compiler from 2014, also adapted to user-space (source)

  4. bpf-lua: BPF bytecodes, cross-compiled to Lua by Pflua.

  5. pflua: Pflang compiled directly to Lua by Pflua.

To benchmark a pflang implementation, we use the implementation to run a set of pflang expressions over saved packet captures. The result is a corresponding set of benchmark scores measured in millions of packets per second (MPPS). The first set of results is thrown away as a warmup. After warmup, the run is repeated 50 times within the same process to get multiple result sets. Each run checks to see that the filter matches the expected number of packets, to verify that each implementation does the same thing, and also to ensure that the loop is not dead.

In all cases the same Lua program is used to drive the benchmark. We have tested a native C loop when driving libpcap and gotten similar results, so we consider that the LuaJIT interface to C is not a performance bottleneck. See the pflua-bench project for more on the benchmarking procedure and a more detailed analysis.

The graph above shows that Pflua can stream in packets from memory and run some simple pflang filters over them at close to the memory bandwidth of this machine (100 Gbit/s). Because all of the filters are actually faster than the accept-all case, probably due to work causing prefetching, we actually don't know how fast the filters themselves can run. In any case, in this ideal situation, we're running at a handful of nanoseconds per packet. Good times!


(full image, analysis)

It's impossible to make real-world tests right now, especially since we're running over packet captures and not within a network switch. However, we can get more realistic. In the above test, we run a few filters over a packet capture from wingolog.org, which mostly operates as a web server. Here we see again that Pflua beats all of the competition. Oddly, the new Linux JIT appears to fare marginally worse than the old one. I don't know why that would be.

Sadly, though, the last tests aren't running at that amazing flat-out speed we were seeing before. I spent days figuring out why that is, and that's part of the subject of my last section here.

on lua, on luajit

I implement programming languages for a living. That doesn't mean I know everything there is to know about everything, or that everything I think I know is actually true -- in particular, I was quite ignorant about trace compilers, as I had never worked with one, and I hardly knew anything about Lua at all. With all of those caveats, here are some ignorant first impressions of Lua and LuaJIT.

LuaJIT has a ridiculously fast startup time. It also compiles really quickly: under a minute. Neither of these should be important but they feel important. Of course, LuaJIT is not written in Lua, so it doesn't have the bootstrap challenges that Guile has; but still, a fast compilation is refreshing.

LuaJIT's FFI is great. Five stars, would program again.

As a compilation target, Lua is OK. On the plus side, it has goto and efficient bit operations over 32-bit numbers. However, and this is a huge downer, the result range of bit operations is the signed int32 range, not the unsigned range. This means that bit.band(0xffffffff, x) might be negative. No one in the history of programming has ever wanted this. There are sensible meanings for negative results to bit operations, but only if an argument was negative. Grr. Otherwise, Lua shares the same concerns as other languages whose numbers are defined as 64-bit doubles.

Sometimes people get upset that Lua starts its indexes (in "arrays" or strings) with 1 instead of 0. It's foreign to me, so it's sometimes a challenge, but it can work as well as anything else. The problem comes in when working with the LuaJIT FFI, which starts indexes with 0, leading me to make errors as I forget which kind of object I am working on.

As a language to implement compilers, Lua desperately misses a pattern matching facility. Otherwise, a number of small gripes but no big ones; tables and closures abound, which leads to relatively terse code.

Finally, how well does trace compilation work for this task? I offer the following graph.


(full image, analysis)

Here the tests are paired. The first test of a pair, for example the leftmost portrange 0-6000, will match most packets. The second test of a pair, for example the second-from-the-left portrange 0-5, will reject all packets. The generated Lua code will be very similar, except for some constants being different. See portrange-0-6000.md for an example.

The Pflua performance of these filters is very different: the one that matches is slower than the one that doesn't, even though in most cases the non-matching filter will have to do more work. For example, a non-matching filter probably checks both src and dst ports, whereas a successful one might not need to check the dst.

It hurts to see Pflua's performance be less than the Linux JIT compilers, and even less than libpcap at times. I scratched my head for a long time about this. The Lua code is fine, and actually looks much like the BPF code. I had taken a look at the generated assembly code for previous traces and it looked fine -- some things that were not as good as they should be (e.g. a fair bit of conversions between integers and doubles, where these traces have no doubles), but things were OK. What changed?

Well. I captured the traces for portrange 0-6000 to a file, and dove in. Trace 66 contains the inner loop. It's interesting to see that there's a lot of dynamic checks in the beginning of the trace, although the loop itself is not bad (scroll down to see the word LOOP:), though with the double conversions I mentioned before.

It seems that trace 66 was captured for a packet whose src port was within range. Later, we end up compiling a second trace if the src port check fails: trace 67. The trace starts off with an absurd amount of loads and dynamic checks -- to a similar degree as trace 66, even though trace 66 dominates trace 67. It seems that there is a big penalty for transferring from one trace to another, even though they are both compiled.

Finally, once trace 67 is done -- and recall that all it has to do is check the destination port, and then update the counters from the inner loop -- it jumps back to the top of trace 66 instead of the top of the loop, repeating all of the dynamic checks in trace 66! I can only think this is a current deficiency of LuaJIT, and not with trace compilation in general, although the amount of state transfer points to a lack of global analysis that you would get in a method JIT. I'm sure that values are being transferred that are actually dead.

This explains the good performance for the match-nothing cases: the first trace that gets compiled residualizes the loop expecting that all tests fail, and so only matching cases or variations incur the trace transfer-and-re-loop cost.

It could be that the Lua code that Pflua residualizes is in some way not idiomatic or not performant; tips in that regard are appreciated.

conclusion

I was going to pass some possible slogans by our marketing department, but we don't really have one, so I pass them on to you and you can tell me what you think:

  • "Pflua: A Totally Adequate Pflang Implementation"

  • "Pflua: Sometimes Amazing Performance!!!!1!!"

  • "Pflua: Organic Artisanal Network Packet Filtering"

Pflua was written by Igalians Diego Pino, Javier Muñoz, and myself for Snabb Gmbh, fine purveyors of high-performance networking solutions. If you are interested in getting Pflua in a Snabb context, we'd be happy to talk; drop a note to the snabb-devel forum. For Pflua in other contexts, file an issue or drop me a mail at wingo@igalia.com. Happy hackings with Pflua, the totally adequate pflang implementation!

a wingolog user's manual

27 August 2014 8:37 AM (meta | comments)

Greetings, dear readers!

Welcome to my little corner of the internet. This is my place to share and write about things that are important to me. I'm delighted that you stopped by.

Unlike a number of other personal sites on the tubes, I have comments enabled on most of these blog posts. It's gratifying to me to hear when people enjoy an article. I also really appreciate it when people bring new information or links or things I hadn't thought of.

Of course, this isn't like some professional peer-reviewed journal; it's above all a place for me to write about my wanderings and explorations. Most of the things I find on my way have already been found by others, but they are no less new to me. As Goethe said, quoted in the introduction to The Joy of Cooking: "That which thy forbears have bequeathed to thee, earn it anew if thou wouldst possess it."

In that spirit I would enjoin my more knowledgeable correspondents to offer their insights with the joy of earning-anew, and particularly to recognize and banish the spectre of that moldy, soul-killing "well-actually" response that is present on so many other parts of the internet.

I've had a good experience with comments on this site, and I'm a bit lazy, so I take an optimistic approach to moderation. By default, comments are posted immediately. Every so often -- more often after a recent post, less often in between -- I unpublish comments that I don't feel contribute to the piece, or which I don't like for whatever reason. It's somewhat arbitrary, but hey, welcome to my corner of the internet.

This has the disadvantage that some unwanted comments end up published, then they go away. If you notice this happening to someone else's post, it's best to just ignore it, and in particular to not "go meta" and ask in the comments why a previous comment isn't there any more. If it happens to you, I'd ask you to re-read this post and refrain from unwelcome comments in the future. If you think I made an error -- it can happen -- let me know privately.

Finally, and it really shouldn't have to be said, but racism, sexism, homophobia, transphobia, and ableism are not welcome here. If you see such a comment that I should delete and have missed, let me know privately. However even among well-meaning people, and that includes me, there are ways of behaving that reinforce subtle bias. Please do point out such instances in articles or comments, either publicly or privately. Working on ableist language is a particular challenge of mine.

You can contact me via comments (anonymous or not), via email (wingo@pobox.com), twitter (@andywingo), or IRC (wingo on freenode). Thanks for reading, and happy hacking :)

revisiting common subexpression elimination in guile

25 August 2014 9:48 AM (guile | compilers | cse | scheme | gnu | effects analysis | cps | ssa | scalar replacement)

A couple years ago I wrote about a common subexpression pass that I implemented in Guile 2.0.

To recap, Guile 2.0 has a global, interprocedural common subexpression elimination (CSE) pass.

In the context of compiler optimizations, "global" means that it works across basic block boundaries. Basic blocks are simple, linear segments of code without control-flow joins or branches. Working only within basic blocks is called "local". Working across basic blocks requires some form of understanding of how values can flow between blocks, for example flow analysis.

"Interprocedural" means that Guile 2.0's CSE operates across closure boundaries. Guile 2.0's CSE is "context-insensitive", in the sense that any possible effect of a function is considered to occur at all call sites; there are newer CSE passes in the literature that separate effects of different call sites ("context-sensitive"), but that's not a Guile 2.0 thing. Being interprocedural was necessary for Guile 2.0, as its intermediate language could not represent (e.g.) loops directly.

The conclusion of my previous article was that although CSE could do cool things, in Guile 2.0 it was ultimately limited by the language that it operated on. Because the Tree-IL direct-style intermediate language didn't define order of evaluation, didn't give names to intermediate values, didn't have a way of explicitly representing loops and other kinds of first-order control flow, and couldn't precisely specify effects, the results, well, could have been better.

I know you all have been waiting for the last 27 months for an update, probably forgoing meaningful social interaction in the meantime because what if I posted a followup while you were gone? Be at ease, fictitious readers, because that day has finally come.

CSE over CPS

The upcoming Guile 2.2 has a more expressive language for the optimizer to work on, called continuation-passing style (CPS). CPS explicitly names all intermediate values and control-flow points, and can integrate nested functions into first-order control-flow via "contification". At the same time, the Guile 2.2 virtual machine no longer penalizes named values, which was another weak point of CSE in Guile 2.0. Additionally, the CPS intermediate language enables more fine-grained effects analysis.

All of these points mean that CSE has the possibility to work better in Guile 2.2 than in Guile 2.0, and indeed it does. The shape of the algorithm is a bit different, though, and I thought some compiler nerds might be interested in the details. I'll follow up in the next section with some things that the new CSE pass can do that the old one couldn't.

So, by way of comparison, the old CSE pass was a once-through depth-first visit of the nested expression tree. As the visit proceeded, the pass built up an "environment" of available expressions -- for example, that (car a) was evaluated and bound to b, and so on. This environment could be consulted to see if an expression was already present in the environment. If so, the environment would be traversed from the most-recently-added entry back to the found expression, to see if any intervening expression invalidated the result. Control-flow joins would cause recomputation of the environment, so that it only held valid values.

This simple strategy works for nested expressions without complex control-flow. CPS, on the other hand, can have loops and other control flow that Tree-IL cannot express, so for it to build up a set of "available expressions" requires a full-on flow analysis. So that's what the pass does: a flow analysis over the labelled expressions in a function to compute the set of "available expressions" for each label. A labelled expression a is available at label b if a dominates b, and no intervening expression could have invalidated the results. An expression invalidates a result if it may write to a memory location that the result may have read. The code, such as it is, may be found here.

Once you have the set of available expressions for a function, you can proceed to the elimination phase. First, you start by creating an "eliminated variable" map, which initially maps each variable to itself, and an "equivalent expressions" table, which maps "keys" to a set of labels and bound variables. Then you visit each expression in a function, again in topologically sorted order. For each expression, you compute a "key", which is some unique representation of an expression that can be compared by structural equality. Keys that compare as equal are equivalent, and are subject to elimination.

For example, consider a call to the add primitive with variables labelled b and c as arguments. Imagine that b maps to a in the eliminated variable table. The expression as a whole would then have a key representation as the list (primcall add a c). If this key is present in the equivalent expression table, you check to see if any of the equivalent labels is available at the current label. If so, hurrah! You mark the outputs of the current label as being replaced by the outputs of the equivalent label. Otherwise you add the key to the equivalent table, associated with the current label.
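By way of a concrete sketch, here is roughly what the key computation could look like, assuming expressions are represented as simple lists and the eliminated-variable map is an eq? hash table -- a toy model, not Guile's actual data structures:

(use-modules (ice-9 match))

;; Rename an expression's arguments through the eliminated-variable
;; map, so that equivalent expressions produce equal? keys.
(define (expression-key exp subst)
  (match exp
    (('primcall op args ...)
     `(primcall ,op ,@(map (lambda (v) (hashq-ref subst v v)) args)))
    (_ #f)))

With b mapped to a in subst, (expression-key '(primcall add b c) subst) yields (primcall add a c), the key from the example above.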

This simple algorithm is enough to recursively eliminate common subexpressions. Sometimes the recursive aspect (i.e. noticing that b should be replaced by a), along with the creation of a common key, causes the technique to be called global value numbering (GVN), but CSE seems a better name to me.

The algorithm as outlined above eliminates expressions that bind values. However not all expressions do that; some are used as control-flow branches. For this reason, Guile also computes a "truthy table" with another flow analysis pass. This analysis computes, for each program point, the set of branches that have been taken to reach it. In the elimination phase, if a branch is reached that is equivalent to a previously taken branch, we consult the truthy table to see which continuation the previous branch may have taken. If it can be proven to have taken just one of the legs, the test is elided and replaced with a direct jump.
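In source-level terms -- the pass itself works on CPS, so this is only a sketch of the effect -- the transformation looks something like:

;; Before: the inner test is equivalent to the outer one, and the
;; truthy table proves that only its true leg can be reached here.
(if (pair? x)
    (if (pair? x)
        (car x)
        (error "unreachable"))
    '())

;; After: the redundant test is elided and replaced with a direct
;; jump to the true continuation.
(if (pair? x)
    (car x)
    '())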

A few things to note before moving on. First, the "compute an analysis, then transform the function" sequence is quite common in this sort of problem. It leads to some challenges regarding space for the analysis; my last article deals with these in more detail.

Secondly, the rewriting phase assumes that a value that is available may be substituted, and that the result would be a proper CPS term. This isn't always the case; see the discussion at the end of the article on CSE in Guile 2.0 about CPS, SSA, dominators, and scope. In essence, the scope tree doesn't necessarily reflect the dominator tree, so not all transformations you might like to make are syntactically valid. In Guile 2.2's CSE pass, we work around the issue by concurrently rewriting the scope tree to reflect the dominator tree. It's something I am seeing more and more and it gives me some pause as to the suitability of CPS as an intermediate language.

Also, consider the clobbering part of analysis, where e.g. an expression that writes a value to memory has to invalidate previously read values. Currently this is implemented by traversing all available expressions. This is suboptimal and could be quadratic in the end. A better solution is to compute a dependency graph for expressions, which links together operations on the same regions of memory; see LLVM's memory dependency analysis for an idea of how to do this.

Finally, note that this algorithm is global but intraprocedural, meaning that it doesn't propagate values across closure boundaries. It's possible to extend it to be interprocedural, though it's less necessary in the presence of contification.

scalar replacement via fabricated expressions

Let's say you get to an expression at label L, (cons a b). It binds a result c. You determine you haven't seen it before, so you add (primcall cons a b) → L, c to your equivalent expressions set. Cool. We won't be able to replace a future instance of (cons a b) with c, because that doesn't preserve object identity of the newly allocated memory, but it's definitely a cool fact, yo.

What if we add an additional mapping to the table, (car c) → L, a? That way any expression at which L is available would replace (car c) with a, which would be pretty neat. To do so, you would have to add the &read effect to the cons call's effects analysis, but since the cons wasn't really up for elimination anyway it's all good.

Similarly, for (set-car! c d) we can add a mapping of (car c) → d. Again we have to add the &read effect to the set-car, but that's OK too because the write invalidated previous reads anyway.

The same sort of transformation holds for other kinds of memory that Guile knows how to allocate and mutate. Taken together, they form a sort of store-to-load forwarding and scalar replacement that can entirely eliminate certain allocations, and many accesses as well. To actually eliminate the allocations requires a bit more work, but that will be the subject of the next article.
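In source-level terms, the net effect is something like the following sketch, taking the liberty of also fabricating the analogous (cdr c) → L, b mapping:

;; Before: c is allocated and then immediately picked apart.
(let ((c (cons a b)))
  (+ (car c) (cdr c)))

;; After: the reads are forwarded to the values that built the pair.
(let ((c (cons a b)))
  (+ a b))

The cons itself sticks around until some later pass notices that c is now unused; that's the "bit more work" mentioned above.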

future work

So, that's CSE in Guile 2.2. It works pretty well. In the future I think it's probably worth considering an abstract heap-style analysis of effects; in the end, the precision of CSE is limited to how precisely we can model the effects of expressions.

The trick of using CSE to implement scalar replacement is something I haven't seen elsewhere, though I doubt that it is novel. To fully remove the intermediate allocations needs a couple more tricks, which I will write about in my next nargy dispatch. Until then, happy hacking!

on gnu and on hackers

18 August 2014 9:07 PM (gnu | fsf | ghm | munich | hacienda | knock | gnunet | guix | guile | misogyny | sexism | intent)

Greetings, gentle hackfolk. 'Tis a lovely waning light as I write this here in Munich, Munich the green, Munich full of dogs and bikes, Munich the summer-fresh.

Last weekend was the GNU hackers meeting up in Garching, a village a few metro stops north of town. Garching is full of quiet backs and fruit trees and small gardens bursting with blooms and beans, as if an eddy of Christopher Alexander whirled out and settled into this unlikely place. My French suburb could learn a thing or ten. We walked back from the hack each day, ate stolen apples and corn, and schemed the nights away.

The program of GHM this year was great. It started off with a bang, as GNUnet hackers Julian Kirsch and Christian Grothoff broke the story that the Five-Eyes countries (US, UK, Canada, Australia, NZ) regularly port-scan the entire internet, looking for vulnerabilities. They then proceed to exploit those vulnerabilities, in regular hack-a-thons, trying to own as many boxes in as many countries as they can. They then use them as launchpads for attacks and for exfiltration of information from other networks.

The presentation that broke this news also proposed a workaround based on port-knocking, Knock. Knock embeds the hash of a pre-shared key with some other information into the 32-bit initial sequence number of a TCP connection. Unlike previous incarnations of port-knocking, Knock also authenticates the first n payload bytes, so that the connection isn't vulnerable to hijacking (e.g. via GCHQ "quantum injection", where well-placed evil routers race the true destination server to provide the first response packet of a connection). Leaking the pwn-the-internet documents with Laura Poitras at the same time as the Knock unveiling was a pretty slick move!

I was also really impressed by Christian's presentation on the GNU name system. GNS is a replacement for DNS whose naming structure mirrors our social naming structure. For example, www.alice.gnu would be my friend Alice, and www.alice.bob.gnu would be Alice's friend Bob. With some integration, it can work on normal desktops and mobile devices. There are lots more details, so check gnunet.org/gns for more information.

Of course, a new naming system does need some operating system support. In this regard Ludovic Courtès' update on Guix was particularly impressive. Guix is a Nix-like system whose goal is reproducible, user-controlled GNU/Linux systems. A couple years ago I didn't think much of it, but now it's actually booting on raw hardware, not just under virtualization, and things seem to be rolling forth as if on rails. Guix manages to be technically innovative at the same time as being GNU-centered, so it can play a unique role in propagating GNU work like GNS.

and yet.

But now, as the dark clouds race above and the light truly goes, we arrive to the point I really wanted to discuss. GNU has a terrible problem with gender balance, and with diversity in general. Of about 70 attendees at this recent GHM, only two were women. We talk the talk about empowering users and working for freedom but, to a first approximation, it's really just a bunch of dudes that think the exact same things.

There are many reasons for this, of course. Some people like to focus on what's called the "pipeline problem" -- that there aren't as many women coming out of computer science programs as men. While true, the proportion of women CS graduates is much higher than the proportion of women at GHM events, so something must be happening in between. And indeed, the attrition rates of women in the tech industry are higher than that of men -- often because we men make it a needlessly unpleasant place for women to be. Sometimes it's even dangerous. The incidence of sexual harassment and assault in tech, especially at events, is something terrible. Scroll down in that linked page to June, July, and August 2014, and ask yourself whether that's OK. (Hint: hell no.)

And so you would think that people who consider themselves to be working for some abstract liberatory principle, as GNU is, would be happy to take a stand against this kind of asshaberdashery. There you would be wrong. Voilà a timeline of an incident.

timeline

March 2014
Someone at the FSF asks a GHM organizer to add an anti-harassment policy to GHM. The organizer does so and puts one on the web page, copying the text from Libreplanet's web site. The policy posted is:

Offensive or overly explicit sexual language or imagery is inappropriate during the event, including presentations.

Participants violating these rules may be sanctioned or expelled from the meeting at the discretion of the organizers.

Harassment includes offensive comments related to gender, sexual orientation, disability, appearance, body size, race, religion, sexual images in public spaces, deliberate intimidation, stalking, harassing photography or recording, persistent disruption of talks or other events, repeated unsolicited physical contact, or sexual attention.

Monday, 11 August 2014
The first mention of the policy is made on the mailing list, in a mail with details about the event that also includes the line:

An anti-harrasment policy applies at GHM: http://gnu.org/ghm/policy.html

Monday, 11 August 2014
A speaker writes the list to say:

Since I do not desire to be denounced, prosecuted and finally sanctioned or expelled from the event (especially considering the physical pain and inconvenience of attending due to my very recent accident) I withdraw my intention to lecture "Introducing GNU Posh" at the GHM, as it is not compliant with the policy described in the page above.

Please remove the talk from the official schedule. Thanks.

PS: for those interested, I may perform the talk off-event in case we find a suitable place, we will see..

The resulting thread goes totes clownshoes and rages on up until the event itself.
Friday, 15 August 2014
Sheepish looks between people that flamed each other over the internet. Hallway-track discussion starts up quickly though. Disagreeing people are not yet rational enough to have a conversation (myself included).
Saturday, 16 August 2014
In the morning and lunch break, people start to really discuss the issues (spontaneously). It turns out that the original mail that sparked the thread was based, to an extent, on a misunderstanding: that "offensive or overly explicit sexual language or imagery" was parsed (by a few non-native English speakers) as "offensive language or ...", which people thought was too broad. Once this misunderstanding was removed, there were still people that thought that any policy at all was unneeded, and others that were concerned that someone could say something without intending offense, but then be kicked out of the event. Arguments back and forth. Some people wonder why others can be hurt by "just words". Some discussion of rape culture as continuum between physical violence and cultural tropes. One of the presentations after lunch is by a GNU hacker. He starts his talk by stating his hope that he won't be seen as "offensive or part of rape culture or something". His microphone wasn't on, so once he gets it on he repeats the joke. I stomp out, slam the door, and tweet a few angry things. Later in the evening the presenter and I discuss the issue. He apologizes to me.
Sunday, 17 August 2014
A closed meeting for GNU maintainers to round up the state of GNU and GHM. No women present. After dealing with a number of banalities, we finally broach the topic of the harassment policy. More opposition to the policy Sunday than Saturday lunch. Eventually a proposal is made to replace "offensive" with "disparaging", and people start to consent to that. We run out of time and have to leave; resolution unclear.
Monday, 18 August 2014
GHM organizer updates the policy to remove the words "offensive or" from the beginning of the harassment policy.

I think anyone involved would agree on this timeline.

thoughts

The problems seen over the last week with this anti-harassment policy are entirely to do with the men. It was a man who decided that he should withdraw his presentation because he found it offensive that he could be perceived as offensive. It was men who willfully misread the policy, comparing it to such examples as "I should have the right to offend Microsoft supporters", "if I say the wrong word I will go to jail", and who raised the ignorant, vacuous spectre of "political correctness" to argue that they should be able to say what they want, in a GNU conference, no matter who they hurt, no matter what the effects. That they are able to argue this position from a status-quo perspective is the definition of privilege.

Now, there is ignorance, and there is malice. Both must be opposed, but the former may find a cure. Although I didn't begin my contribution to the discussion in the smoothest way, linking to an amusing article on the alchemy of intent that is probably misunderstood, it ended up that one of the main points was about intent. I know Ralph (say) and Ralph is a great person and so how could it be that anything Ralph would say could be a slur? You know he wouldn't mean it like that!

To that, we of course have to say that as GNU grows, not everyone knows that Ralph is a great person. In the end, what would it mean for someone to claim to be anti-racist while saying racist things all the time? You would have to call them racist, right? Or if you just said something one time, but refused to own up to your error, and instead persisted in repeating a really racist slur -- you would be racist right? But I know you... But the thing that you said...

But then to be honest I wonder sometimes. If someone repeats a joke trivializing rape culture, after making sure that the microphone is picking up his words -- I mean, that's a misogynist action, right? Put aside the question of whether the person is, in essence, misogynist or not. They are doing misogynist things. How do I know that this person isn't going to do it again, private apology or not?

And how do I know that this community isn't going to permit it again? That remark was made to a room of 40 dudes or so. Not one woman was present. Although there was some discussion afterwards, if people left because of the phrase, it was only two or three. How can we then say that GNU is not a misogynist community -- is not a community that tolerates misogyny?

Given all of this, what do you expect? Do you expect to grow GNU into a larger organization in the future, rich and strong and diverse? If that's not your goal, you are not my colleague. And if it is your goal, why do you put up with this kind of behavior?

The discussion on intent and offense seems to have had its effect in the removal of "offensive or" from the anti-harassment policy language. I think it's terrible, though -- if you don't trust someone who says they were offended by sexual language or imagery, why would you trust them when they report sexual harassment or assault? I can only imagine this leading to some sort of argument where the person who has had the courage to report such an incident finds himself or herself in a public witness box, justifying that the incident was offensive. "I'm sorry my words offended you, but that was not my intent, and anyway the words were not offensive." Lolnope.

There were so many other wrong things about this -- a suggestion that we the GNU cabal (lamentably, a bunch of dudes) should form a committee to make the wording less threatening to us; that we're just friends anyway; that illegal things are illegal anyway... it's as if the Code of Conduct FAQ that Ashe Dryden assembled were a Bingo card and we all lost.

Finally I don't think I trusted the organizers enough with this policy. Both organizers expressed skepticism about the policy in such terms that if I personally hadn't won the privilege lottery (white male "western" hetero already-established GNU maintainer) I wouldn't feel comfortable bringing up a concern to them.

In the future I will not be attending any conferences without strong, consciously applied codes of conduct, and I enjoin you to do the same.

conclusion


Propagandhi, "Refusing to Be a Man", Less Talk, More Rock (1996)

There is no conclusion yet -- working for the better world we all know is possible is a process, as people and as a community.

To outsiders, to outsiders everywhere, please keep up the vocal criticisms. Thank you for telling your story.

To insiders, to insiders everywhere, this is your problem. The problem is you. Own it.

flow analysis in guile

1 July 2014 8:00 AM (guile | compilers | igalia | flow analysis | dfa | cfa | kcfa | cps | gnu)

Greets, and welcome back to the solipsism! I've been wandering the wilderness with my Guile hackings lately, but I'm finally ready to come back to civilization. Hopefully you will enjoy my harvest of forest fruit.

Today's article is about flow analysis and data structures. Ready? Let's rock!

flow analysis

Many things that a compiler would like to know can be phrased as a question of the form, "What do I know about the data flowing through this particular program point?" Some things you might want to know are:

  1. The set of variables that must be live.

  2. The set of variables that happen to be live. This is the same as (1) except it includes variables that aren't needed but haven't been clobbered by anything.

  3. The set of expressions whose results are still valid (i.e., haven't been clobbered by anything else).

  4. An upper and lower bound on the range of numeric variables.

Et cetera. I'll talk about specific instances of flow analysis problems in the future, but today's article is a bit more general.

The first thing to note about these questions is that they don't necessarily need or have unique answers. If GCC decides that it can't prove anything about the ranges of integers in your program, it's not the end of the world -- it just won't be able to do some optimizations that it would like to do.

At the same time, there are answers that are better and worse than others, and answers that are just invalid. Consider a function of the form:

int f():
  int a = 1
  int b = 2
  int c = a + b
  int d = b + c
  ...
  int z = x + y
  return z

In this function, there are 27 different expressions, including the return, and 27 different program points. (You can think of a program point as a labelled sub-expression. In this example none of the expressions have sub-expressions.) If we number the program points in order from 0 to 26, we will have a program that first executes expression 0 (int a = 1), then 1, and so on to the end.

Let's plot some possible solutions to the live variable flow-analysis problem for this program.

Here we see two solutions to the problem (in light and dark blue), along with a space of invalid solutions (in red). The Y axis corresponds to the variables of the program, starting with a on the bottom and finishing with z on the top.

For example, consider position 4 in the program, corresponding to int e = c + d. It is marked in the graph with a vertical bar. After position 4, the only values that are used in the rest of the program are d and e. These are the variables that are contained within the light-blue area. It wouldn't be invalid to consider a, b, and c to be live also, but it also wouldn't be as efficient to allocate space and reason about values that won't contribute to the answer. The dark blue space holds those values that may harmlessly be considered to be live, but which actually aren't live.

It would, however, be invalid to consider the variable f to be live after position 4, because it hasn't been defined yet. This area of the variable space is represented in red on the graph.

Of course, the space of all possible solutions isn't possible to represent nicely on a two-dimensional graph; we're only able to show two with colors, and that not very well as they overlap. This difficulty cuts close to the heart of the data-flow problem: that it ultimately requires computing a two-dimensional answer, which necessarily takes time and space O(n^2) in program size.

Or does it?

classical flow analysis frameworks

The classical way to do flow analysis is to iterate a set of data-flow equations over a finite lattice until you reach a fixed point.

That's a pithy sentence that deserves some unpacking. If you're already comfortable with what it means, you can skip a couple sections.

Still here? Cool, me too. Let's take a simple example of sign analysis. The problem is to determine, for the integer variables of a program, at every point in the program, which ones may be negative (-), which ones may be zero (0), and which may be positive (+). All of these are conceptually bit-flags.

For example, in this program:

int f(int x):
 L0:  while (x >= 0)
 L1:    int y = x - 1
 L2:    x = y
 L3:  return x

We can assign the flags -0+ to the argument x as the initial state that flows into L0, because we don't know what it is ahead of time, and it is the only variable in scope. We start by representing the initial state of the solution as a map from program points to variable states:

state := {L0: {x: -0+}, L1: Ø, L2: Ø, L3: Ø}

In this notation, Ø indicates a program point that hasn't been visited yet.

Now we iterate through all labels in the program, propagating state to their successors. Here is where the specific problem being solved "hooks in" to the generic classical flow analysis framework: before propagating to a successor, a flow equation transforms the state that flows into a program point to a state that flows out, to the particular successor. In this case we could imagine equations like this:

visit_test(expr, in, true_successor, false_successor):
  if expr matches "if var >= 0":
    # On the true branch, var is not negative.
    propagate(in + {var: in[var] - -}, true_successor)
    # On the false branch, var is not zero and not positive.
    propagate(in + {var: in[var] - 0+}, false_successor)
  else if ...

visit_expr(expr, in, successor):
  if expr matches "left = right - 1":
    if in[right] has +:
      if in[right] has 0:
        # Subtracting one from a non-negative arg may be negative.
        propagate(in + {left: in[right] + -}, successor)
      else
        # Subtracting one from a positive arg may be 0.
        propagate(in + {left: in[right] + 0}, successor)
    else:
      # Subtracting one from a nonpositive arg will be negative.
      propagate(in + {left: -}, successor)
  else if expr matches "left = right":
    propagate(in + {left: in[right]}, successor)
  ...

The meat of classical data-flow analysis is the meet operation:

propagate(out, successor):
  if state[successor] is Ø:
    state[successor] = out
  else
    state[successor] = meet(out, state[successor])

# A version of meet for sign analysis
meet(out, in):
  return intersect_vars_and_union_values(out, in)
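The intersect_vars_and_union_values helper isn't spelled out above; a minimal Scheme sketch, assuming states are association lists from variables to lists of sign flags, might read:

(use-modules (srfi srfi-1))  ; filter-map, lset-union

;; Keep only variables present in both states, and union their
;; flag sets: any sign seen on any incoming path survives.
(define (intersect-vars-and-union-values out in)
  (filter-map (lambda (entry)
                (let ((other (assq (car entry) in)))
                  (and other
                       (cons (car entry)
                             (lset-union eqv? (cdr entry) (cdr other))))))
              out))

Applied to the L2→L0 meet below, with out = ((x - 0 +) (y - 0 +)) and in = ((x - 0 +)), it drops y (not present in both states) and unions x's flags.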

Let's run this algorithm by hand over the example program. Starting from the initial state, we propagate the L0→L1 and L0→L3 edges:

visit_test("if x <= 0", {x: -0+}, L1, L3)
→ propagate({x: 0+}, L1)
→ state[L1] = {x: 0+}
→ propagate({x: -}, L3)
→ state[L3] = {x: -}

Neat. Let's keep going. The successor of L1 is L2:

visit_expr("y = x - 1", {x: 0+}, L2)
→ propagate({x: 0+, y: -0+}, L2)
→ state[L2] = {x: 0+, y: -0+}

L2→L0 is a back-edge, returning to the top of the loop:

visit_expr("x = y", {x: 0+, y: -0+}, L0)
→ propagate({x: -0+, y: -0+}, L0)
→ state[L0] = meet({x: -0+, y: -0+}, state[L0])
→ state[L0] = meet({x: -0+, y: -0+}, {x: -0+})
→ state[L0] = {x: -0+}

Finally, L3 has no successors, so we're done with this iteration. The final state is:

{L0: {x: -0+},
 L1: {x: 0+},
 L2: {x: 0+, y: -0+},
 L3: {x: -}}

which indeed corresponds with what we would know intuitively.

fixed points and lattices

Each of the steps in our example flow analysis was deterministic: the result was calculated from the inputs and nothing else. However the backwards branch in the loop, L2→L0, could have changed inputs that were used by the previous L0→L1 and L0→L3 forward edges. So what we really should do is iterate the calculation to a fixed point: start it over again, and run it until the state doesn't change any more.

It's easy to see in this case that running it again won't end up modifying the state. But do we know that in all cases? How do we know that iteration would terminate at all? It turns out that a few simple conditions are sufficient.

The first thing to ensure is that the state space being explored is finite. Here we can see this is the case, because there are only so many ways you can combine -, 0, and +. Each one may be present or not, and so we have 2^n = 2^3 = 8 possible states. The elements of the state array will be a set with at most one entry for each variable, so the whole state space is finite. That at least ensures that an answer exists.

Next, the "meet" operation has to be commutative, associative, and idempotent. The above example used intersect_vars_and_union_values. We intersect vars because it only makes sense to talk about a variable at a program point if the variable dominates the program point. It didn't make sense to propagate y on the L2→L0 branch, for example. It's usually a good idea to model a data-flow problem using sets, as set union and intersection operations fulfill these commutative, associative, and distributive requirements.

Finally, the state being modelled should have a partial order, and functions that add information along control-flow edges -- above, visit_test and visit_expr -- should preserve this partial ordering. That is to say, visit_test and visit_expr should be monotonic. This means that no matter on what control paths data propagates, we keep building towards an answer with more information, making forward progress. This condition is also easily fulfilled with sets, or more generally with any lattice. (A lattice is nothing more than a data type that fulfills these conditions.)

Iterating the data-flow equations until the state stops changing will find a fixed point of the lattice. Whether you find the greatest or least fixed point is another question; I can't help linking to Paul Khuong's old article on Québécois student union strikes for a lovely discussion.
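In code, the outer driver is just iteration to quiescence. A sketch, assuming a visit procedure that propagates state from one label and returns whether the state changed:

(use-modules (srfi srfi-1))  ; fold

;; Iterate the flow equations until no visit changes the state:
;; at that point we have a fixed point of the lattice.
(define (solve labels visit)
  (let loop ()
    (when (fold (lambda (label changed?)
                  (or (visit label) changed?))
                #f
                labels)
      (loop))))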

Another question is, how many iterations are necessary to reach a fixed point? I would first note that although in our walk-through we iterated in forward order (L0, L1, L2, L3), we could have visited nodes in any order and the answer would be the same. I'll cut to the chase and say that if:

  1. you represent your state with bitvectors

  2. the control-flow graph is reducible (has only natural loops)

  3. the meet operation on values is bitvector union or intersection

  4. you visit the program points in topologically sorted order

If these conditions are fulfilled, then you will reach a fixed point after LC + 2 iterations, where LC is the "loop-connectedness number" of your graph. You can ensure (1), (3), and (4) by construction. (Reverse post-order numbering is an easy way to fulfill (4).) (2) can be ensured by using programming languages without goto (a for loop is always a natural loop) but can be violated by optimizing compilers (for example, via contification).

Loop connectedness is roughly equivalent to the maximum nesting level of loops in the program, which has experimentally been determined to rarely exceed 3. Therefore in practice, data-flow analysis requires a number of steps that is O(n * 5) = O(n) in program size.

For more information on data-flow analysis, including full proofs and references, see Carl Offner's excellent, excellent manuscript "Notes on Graph Algorithms used in Optimizing Compilers". I don't know of any better free resource than that. Thanks, Carl!

an aside: the kCFA algorithms

I just finished describing what I called "classical" data-flow analysis. By that I mean to say that people have been doing it since the 1970s, which is classical enough as far as our industry goes. However with the rise of functional languages in the 1980s, it became unclear how to apply classical data-flow analysis on a language like Scheme. Let's hear it from the horse's mouth:

This brings us to the summer of 1984. The mission was to build the world's most highly-optimising Scheme compiler. We wanted to compete with C and Fortran. The new system was T3, and the compiler was to be called Orbit. We all arrived at WRL and split up responsibility for the compiler. Norman was going to do the assembler. Philbin was going to handle the runtime (as I recall). Jonathan was project leader and (I think) wrote the linker. Kranz was to do the back end. Kelsey, the front end. I had passed the previous semester at CMU becoming an expert on data-flow analysis, a topic on which I completely grooved. All hot compilers do DFA. It is necessary for all the really cool optimisations, like loop-invariant hoisting, global register allocation, global common subexpression elimination, copy propagation, induction-variable elimination. I knew that no Scheme or Lisp compiler had ever provided these hot optimisations. I was burning to make it happen. I had been writing 3D graphics code in T, and really wanted my floating-point matrix multiplies to get the full suite of DFA optimisation. Build a DFA module for T, and we would certainly distinguish ourselves from the pack. So when we divided up the compiler, I told everyone else to back off and loudly claimed DFA for my own. Fine, everyone said. You do the DFA module. Lamping signed up to do it with me.

Lamping and I spent the rest of the summer failing. Taking trips to the Stanford library to look up papers. Hashing things out on white boards. Staring into space. Writing little bits of experimental code. Failing. Finding out *why* no one had ever provided DFA optimisation for Scheme. In short, the fundamental item the classical data-flow analysis algorithms need to operate is not available in a Scheme program. It was really depressing. I was making more money than I'd ever made in my life ($600/week). I was working with *great* guys on a cool project. I had never been to California before, so I was discovering San Francisco, my favorite city in the US and second-favorite city in the world. Silicon Valley in 1984 was beautiful, not like the crowded strip-mall/highway hell hole it is today. Every day was perfect and beautiful when I biked into work. I got involved with a gorgeous redhead. And every day, I went in to WRL, failed for 8 hours, then went home.

It was not a good summer.

At the end of the summer, I slunk back to CMU with my tail between my legs, having contributed not one line of code to Orbit.

Olin Shivers, A history of T

It took him another 7 years, but Shivers stuck with it, and in the end came out with the family of algorithms known as k-CFA. Instead of focusing on loops, which Scheme doesn't have syntactically, Shivers used continuation-passing style to ruthlessly simplify Scheme into a dialect consisting of not much more than function calls, and focused his attention on function calls. The resulting family of flow algorithms can solve flow equations even in the presence of higher-order functions -- a contribution to computer science born out of necessity, failure, and stubbornness.

With all those words, you'd think that I'd be itching to use k-CFA in Guile, and I'm not. Unfortunately even the simplest, least expressive version (0-CFA) is O(n^2); 1-CFA is exponential. I don't have time for that. Instead, Guile is able to use classical DFA because it syntactically distinguishes labelled continuations and functions, and contifies functions to continuations where possible, which makes the Scheme DFA problem exactly the same as in any other language.

n times what?

Now that we have established that the number of visit operations is O(n), it remains to be seen what the individual complexity of a visit operation is in order to determine the total complexity. The naïve thing is just to use bitvectors, with each of the bitvectors having as many entries as the program has variables, times however many bits we are using.

This leads to O(|L|*|V|) space and time complexity, where |L| is the number of program points (labels) and |V| is the number of variables. As the number of variables is generally proportional to the size of program, we can approximate this as O(n^2).

In practice, this means that we can apply data-flow analysis to programs up to about 10000 labels in size. Sign analysis on a 10000-label function would require 10000^2 * 3 / 8 = 37.5 MB of memory, which is already a bit hefty. It gets worse if you need to represent more information. I was recently doing some flow-sensitive type and range inference, storing 12 bytes per variable per program point; for a 10000-label function, that's more than a gigabyte of memory. Badness.

shared tails

Although it was the type inference case that motivated this investigation, sign inference is similar but simpler, so let's go with that. The visit_expr and visit_test functions above are only ever going to add additional information about the variables that are used in or defined by an expression; in practice this is a small finite number. What if we chose a representation of state that could exploit this fact by only adding O(1) amounts of data, sharing a common tail with preceding expressions?

If we draw a control-flow graph for the sign analysis program, we get something like:

The goal is to create a data structure that looks like the dominator tree. For "normal" control-flow edges -- those whose destinations have only one predecessor -- we can avoid the "meet" operations, and just copy the predecessor's out set to the successor's in set. We then define "meet" as an adjoin operation that effectively conses the new information onto a shared tail, if it wasn't there already. The first iteration through the CFG will initialize the shared tail of a given control-flow join to the set of variables flowing into the join's dominator. Subsequent information will adjoin (cons) on new incoming values. In this case the resulting data structure ends up looking like:

Here the italic references like L1 indicate shared structure, and the tuples annotating the edges represent additional information flow, beyond that information that was already present in the successor's dominator.
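A sketch of that adjoining flavor of "meet", assuming states are association lists whose tails are shared between program points:

;; Cons new information onto the shared tail, unless it's already
;; present, in which case return the state unchanged -- so callers
;; can detect "no change" by pointer (eq?) comparison.
(define (adjoin-state state var val)
  (if (equal? (assq-ref state var) val)
      state
      (acons var val state)))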

Of course, you can implement this with linked lists and it will work fine. The problem there will be lookup speed -- when your visit operation (visit_expr or visit_test) goes to look up the sign of a variable, or the same happens via the meet operation, you get O(n) lookup penalties. Anticipating this, I implemented this with a version of Phil Bagwell's vhashes, which promise O(log n) variable lookup. See Guile's documentation, or Bagwell's excellent paper.
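For the curious, the vhash API from Guile's (ice-9 vlist) module looks like this in practice:

(use-modules (ice-9 vlist))

;; Each vhash shares its tail with the vhash it was built from.
(define s0 (vhash-consq 'x '(0 +) vlist-null))
(define s1 (vhash-consq 'y '(- 0 +) s0))

(vhash-assq 'x s1)  ; ⇒ (x 0 +), found via the shared tail
(vhash-assq 'z s1)  ; ⇒ #f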

Note that you can't remove items from sets once they have been added in a shared-tail flow analysis; to keep the meet function monotonic, you have to instead insert tombstone entries. Not so nice, but it is what it is.

A shared-tail flow analysis consumes only O(1) additional memory per node, leading to O(n) space complexity. I have some measured space and time graphs below that show this experimentally as well.

space and time

Unfortunately, lookup time on branchy vhashes is really terrible: O(log n) in the best case, and O(n) at worst. This is especially acute because there is no easy way to do unions or intersections on vhashes -- you end up having to compute the unshared heads of the two vhashes you are merging, and looking up elements in one in the other... I could go on, but I think you believe me when I say it gets complicated and slow. It's possible to beat a bitvector approach in time for relatively "big" problems like type analysis, but for common subexpression elimination where I was just storing a bit per expression, it was tough to beat the speed of bitvectors.

I started looking for another solution, and in the end came on a compromise that I am much happier with, and again it's Phil Bagwell to the rescue. Instead of relying on vhashes that explicitly share state, I use Clojure-style persistent sparse bit-sets and bit-maps that share state opportunistically.

Guile's intset module implements a bitvector as a functional tree whose branches are vectors and whose leaves are fixnums. Each leaf represents one range of 32 integers, and each branch on top of it increases the range by a factor of 8. Branches can be sparse, so not all integers in the range of an intset need leaves.

As you would expect, adjoining an element onto such a tree is O(log n). Intersecting is much faster than vhashes though, as intsets partition the key space into power-of-two blocks. Intsets try hard to share state, so that if your adjoin would return the same value, the result is the same object, at the same address. This allows sub-trees to be compared for equality via pointer comparison, which is a great fast-path for intersection and union.
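In practice that usage looks something like the following; the intset module is internal to the compiler, so take the exact names with a grain of salt:

(use-modules (language cps intset))

(define a (intset-add (intset-add empty-intset 1) 2))
(define b (intset-add a 2))  ; element already present...
(eq? a b)                    ; ⇒ #t: same object, same address

;; Shared sub-trees compare eq?, giving union and intersection
;; a fast path when the operands overlap heavily.
(define c (intset-union a (intset-add empty-intset 3)))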

Likewise, Guile's new intmap module allows the association of larger values with integer keys.

science! fetch your white coats and lab books!

I had the chance to actually test the system with all three of these data structures, so I compiled one of Guile's bigger files and recorded the memory used and time taken when solving a 1-bit flow analysis problem. This file has around 600 functions, many of them small nested functions, many of them macro-generated, some of them quite loopy, and one big loopless one (6000 labels) to do the initialization.

First, a plot of how many bytes are consumed per label while solving this 1-bit DFA.

Note that the X axis is on a log scale.

The first thing that pops out at me from these graphs is that the per-label overhead of vhashes is indeed constant. This is a somewhat surprising result for me; I thought that iterated convergence would make this overhead depend on the size of the program being compiled.

Secondly we see that the bitvector approach, while quadratic in overall program size, is still smaller until we get to about 1000 labels. It's hard to beat the constant factor for packed bitvectors! Note that I restricted the Y range, and the sizes for the bitvector approach are off the charts for N > 1024.

The intset size is, as we expected, asymptotically worse than vhashes, but overall not bad. It stays on the chart at least. Surprisingly, intsets are often better than vhashes for small functions, where we can avoid allocating branches at all -- note the "shelves" in the intset memory usage, at 32 and 256 entries, respectively, corresponding to the sizes that require additional levels in the tree. Things keep on rising with n, but sublinearly (again, recall that the X axis is on a log scale).

Next, a plot of how many nanoseconds it takes per label to solve the DFA equation above.

Here we see, as expected, intsets roundly beating vhashes for all n greater than about 200 or so, and showing sublinear dependence on program size.

The good results for vhashes for the largest program are because the largest program in this file doesn't have any loops, and hardly any branching either. It's the best case for vhashes: all appends and no branches. Unfortunately, it's not the normal case.

It's not quite fair to compare intsets to bitvectors, as Guile's bitvectors are implemented in C and intsets are implemented in Scheme, which runs on a bytecode VM right now. But still, the results aren't bad, with intsets even managing to beat bitvectors for the biggest function. The gains there probably pay for the earlier losses.

This is a good result, considering that the goal was to reduce the space complexity of the algorithm. The 1-bit case is also the hardest case; when the state size grows, as in type inference, the gains of using structure-sharing trees grow accordingly.

conclusion

Let's wrap up this word-slog. A few things to note.

Before all this DFA work in Guile, I had very little appreciation of the dangers of N-squared complexity. I mean, sometimes I had to think about it, but not often, especially if your constant factors are low, or so I thought. But I got burned by it; hopefully the next time, if any, will be a long time coming.

I was happily, pleasantly surprised at the expressiveness and power of Bagwell/Clojure-style persistent data structures when applied to the kinds of problems that I work on. Space-sharing can make a fundamental difference to the characteristics of an algorithm, and Bagwell's data structures can do that well. Intsets simplified my implementations because I didn't have to reason much about space-sharing on my own -- finding the right shared tail for vhashes is, as I said, an unmitigated mess.

Finally I would close by saying that I was happy to fail in such interesting (to me) ways. It has been a pleasant investigation and I hope I have been able to convey some of the feeling of it. If you want to see what the resulting algorithm looks like in practice, see compute-truthy-expressions.

Until next time, happy hacking!

effects analysis in guile

18 May 2014 7:19 PM (guile | compilers | optimization | cps | ssa | anf | cse | v8 | igalia)

OK kids, so I had a bit of time recently. I've been hacking on Guile's new CPS-based compiler, which should appear in a stable release in a few months. I have a few things to write about, but today's article is on effects analysis.

what it is, yo

The job of the optimization phase of a compiler is to remove, replace and reorder the sub-expressions of a program. The optimizer recognizes ways in which the program can be better, and then needs to check if those transformations are valid. A transformation is valid if a program exhibits the same behavior both before and after the transformation.

Effects analysis is a proof technique that builds a conservative model of how expressions can affect each other. It can be used to prove the validity of some transformations. For example, if we determine that an expression A reads the first field in a vector, and expression B sets the second field in a vector, then we can use effects analysis to show that the expressions don't affect each others' value and can be freely reordered, for example to hoist either one out of a loop.

In effects analysis, we model the program state coarsely and conservatively with a limited set of categories. The precise set of categories depends on the domain. In graphics, for example, you might have a state bit for the current coordinate system, the current color, the current fill mode, etc. In general-purpose compilation, the effects are mostly about memory and (sometimes) exceptions.

In Guile we model four kinds of effects: type checks (T), allocations (A), reads (R), and writes (W). Each of these effects is allocated to a bit. An expression can have any or none of these effects.

For the purposes of Guile, a "type check" is the possibility that this expression will throw an exception if its arguments are somehow out of range. For example, cons will never throw an exception, except in out-of-memory situations, but + may throw an exception if its arguments aren't numbers. However if an expression is dominated by a copy of itself -- that is, if we see that (+ a b) (which may throw an exception) is dominated by (+ a b), perhaps after CSE -- then we can determine that only the first will exhibit the type-check effects, and that the second will not.

Allocation indicates that an expression allocates a fresh object. In Scheme (and many other languages), each allocated object has its own identity: (eq? (cons 1 2) (cons 1 2)) must be false. Note that this restriction does not apply to constant literals, in Scheme; (eq? '(1 . 2) '(1 . 2)) may or may not be true. In Guile both results are possible: the literals are the same object when compiled (and thus deduplicated), but different when interpreted, as the objects are then just the ones returned from `read', which are distinct. Anyway, we use this allocation bit to indicate that an expression allocates a fresh object with a fresh identity.

The remaining effect kinds are "read" and "write", which indicate reads or writes to memory. So there are just 4 kinds of effects.

Allocation, read, and write effects are associated at run-time with particular memory locations. At compile-time we cannot in general know which of these locations are distinct, and which are actually the same. To simplify the problem, we simply record the type of the object, knowing that (say) a pair and a vector will never be at the same location. We also record the field in the object in the case of objects with multiple fields. There are special catch-all values to indicate "all memory kinds", when we really don't know what an expression will do (which is the case for all expression kinds without specific support in the effect analyzer), and for "all fields" when we don't know which field we are accessing.
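By way of illustration, such effects could be packed into a fixnum like so. This is a hypothetical encoding for flavor only; the real constants live in Guile's effects-analysis module and differ in detail:

;; Kind bits sit lowest; the location (memory kind plus field)
;; sits above them.
(define &type-check 1)
(define &allocate   2)
(define &read       4)
(define &write      8)

(define &effect-kind-bits 4)
(define &memory-kind-mask #b11111)

(define (make-location memory-kind field)
  ;; The field sits above the memory-kind bits.  A negative field
  ;; means "the whole object", making the location negative.
  (logior memory-kind (ash field 5)))

(define (make-effect kinds location)
  (logior kinds (ash location &effect-kind-bits)))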

One example of the use of this analysis is in common subexpression elimination (CSE). If at a program point P we have a set of available expressions A, then to determine which members of A are still available after the expression at P, you subtract members of A that are clobbered by P. Computation of A at each P plus value numbering is most of CSE; but more on that in some later dispatch. Anyway, here's the definition of effect-clobbers?.

(define (effect-clobbers? a b)
  "Return true if A clobbers B.  This is the case
if A might be a write, and B might be a read or a
write to the same location as A."
  (define (locations-same?)
    (let ((a (ash a (- &effect-kind-bits)))
          (b (ash b (- &effect-kind-bits))))
      (or (eqv? &unknown-memory-kinds
                (logand a &memory-kind-mask))
          (eqv? &unknown-memory-kinds
                (logand b &memory-kind-mask))
          (and (eqv? (logand a &memory-kind-mask)
                     (logand b &memory-kind-mask))
               ;; A negative field indicates "the
               ;; whole object".  Non-negative fields
               ;; indicate only part of the object.
               (or (< a 0) (< b 0) (= a b))))))
  (and (not (zero? (logand a &write)))
       (not (zero? (logand b (logior &read &write))))
       (locations-same?)))
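A pass might consult it like so -- a sketch, where effects-of stands in for however you obtain an expression's effect bitfield:

;; Two expressions may be freely reordered only if neither
;; clobbers the other, in either direction.
(define (can-reorder? effects-of exp-a exp-b)
  (let ((fx-a (effects-of exp-a))
        (fx-b (effects-of exp-b)))
    (and (not (effect-clobbers? fx-a fx-b))
         (not (effect-clobbers? fx-b fx-a)))))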

This analysis is simple, small, and fast. It's also coarse and imprecise -- if you are reading from and writing to two vectors at once, you're almost sure to miss some optimization opportunities as accesses to all vectors are conflated into one bit. Oh well. If you get into this situation, you'll know it, and be able to invest a bit more time into alias analysis; there's lots of literature out there. A simple extension would be to have alias analysis create another mapping from expression to equivalence class, and to use those equivalence classes in the locations-same? check above.

Of course this assumes that expressions just access one object. This is the case for Guile's CPS intermediate language, because in CPS, as in SSA or ANF, expressions don't have subexpressions.

This contrasts with direct-style intermediate languages, in which expressions may have nested subexpressions. If you are doing effects analysis on such a language, it's probably more appropriate to allocate a bit to each kind of effect on each kind of object, so that you can usefully union effects for a tree of expressions. Since you don't have to do this for CPS, we can allocate a fixed bit-budget towards more precision as to which field of an object is being accessed. The inability to be precise as to which field was being accessed due to the direct-style IL was one of the problems in Guile's old CSE pass.

Finally, a note about type checks. Guile includes type checks as part of the effects analysis for two reasons. The first is the obvious case of asking whether an expression is effect-free or not, which can lead to some optimizations in other parts of the compiler. The other is to express the potential for elision of duplicate expressions, if one dominates the other. But it's also possible to remove type checks in more cases than that: one can run a type inference pass to remove type-check effects if we can prove that the arguments to an expression are in range. Obviously this is more profitable for dynamically-typed languages, but the same considerations apply to any language with sum types.

Guile's effects analysis pass is here. V8 seems to have two effects analysis passes; one is in effects.h and typing.cc, and operates over the direct-style AST, and the other is in the value numbering pass (hydrogen-instructions.h and hydrogen-gvn.h; search for GVNFlag).

More on how effects analysis is used in Guile in a future missive. Until then, happy hacking.