amores prohibidos

23 September 2015 10:12 PM (spain | france | europe | immigration)

It was with anxious trepidation that today, after having been officially resident in Spain for 10 years, working and paying taxes all that time, I went to file a request for Spanish nationality.

See, being a non-European resident in Europe is a precarious thing. If ever something happens "back home" with your family or to those that you love and you need to go help out, you might not be able to come back. Sure, if you keep your official residence in Europe maybe you can make it fly under the radar, but officially to keep your right of residence you need to reside, continually. It doesn't matter that you have all your life in Spain, or France, or wheresoever: if you have to leave for a year, you start over at day 1, if you are able to get back in.

In my case I moved away from the US when I was 22. I worked in Namibia for a couple years after college teaching in a middle school, and moved directly from there to Barcelona when a company started up around a free software project I had been working on. It was a more extreme version of the established practice of American diaspora: you go to college far away from home to be away from your parents, then upon graduation your first job takes you far away again, and as the years go by you have nothing left to go back to. Your parents move into a smaller house, perhaps in a different town, your town changes, everyone moved away anyway, and where is home? What makes a home? What am I doing here and if I stopped, is there somewhere to go back to, or is it an ever-removing onward?

I am 35 now. While it's true that there will always be something in my soul that pines for the smell of a mountain stream bubbling down an Appalachian hollow, there's another part of my heart that is twined to Europe: where I spent the all of my working life up to now, where I lived and found love and ultimately married. I say Europe and not specifically Barcelona because... well. My now-wife was living in Paris when we got together. I made many, many journeys on the overnight Talgo train in those days. She moved down to Barcelona with me for a couple years, and when her studies as an interpreter from Spanish and French moved her back to France, I went with her.

That move was a couple years ago. Since we didn't actually know how much time would be required there or if we would be in Switzerland or France I kept my official residence in Spain, and kept on as a Spanish salaried worker. I was terrified of the French paperwork to set up as a freelancer, even though with the "long-term residency-EU" permit it would at least be possible to make that transition. We lived a precarious life in Geneva for a while before finally settling in France.

A note about that. We put 12 months of rent (!!!) in an escrow account, as a guarantee that allowed us to be able to rent our house. In France this is illegal: a landlord is only allowed to ask for a couple months or so. However in France you usually have a co-signer on a lease, and usually it's against your parent's house. So even if you are 45, you often have your parents signing off on your lease. We wouldn't have been able to find anything if we weren't willing to do this -- one of many instances of the informal but very real immigrant tax.

All this time I was a salaried Spanish worker. This made it pretty weird for me in France. I had to pretend I was there on holiday to get covered by health care, and although there is a European health card, it's harder to get if you are an immigrant: the web page seems to succeed but then they email you an error and don't tell why. The solution is to actually pass by the office with your residence permit, something that nationals don't need. And anyway this doesn't cover having a family doctor, despite the fact that I was paying for it in Spain.

This is one instance of the general pattern of immigrants using the health care system less than nationals. If you are British, say, then you know your rights and you know how the NHS works and you make it work for you. If you are an immigrant, maybe English is your second language, probably you're poor, you're ignorant of the system, you don't have family members or a big support system to tell you how the system works, you might not speak or write the language well, and probably all your time is spent working anyway because that's why you're there.

In my case I broke my arm a couple years ago while snowboarding in France. (Sounds posh but it's not really.) If all my papers were in order and I understood the system I would probably have probably walked out without paying anything. As it was I paid some thousands of euros out of my pocket, and that is my sole interaction with health care over the course of the last 5 years I think. I still have to get the plate taken out of my arm; should have done that a year ago. It hurts sometimes.

There is a popular idea about immigrants scrounging on benefits, and as a regular BBC radio 1 listener I hear that phrase in the voice of their news presenters inciting their listeners to ignorant resentment of immigrants with their racist implications that we are somehow "here" for "their" things. Beyond being implausible that an immigrant would actually receive benefits at all, it's unlikely that they would be able to continue to do so, given that residence is predicated on work.

In the US where there are no benefits the phrase is usually reduced to "immigrants are stealing our jobs", a belief encouraged by the class of people that employ immigrants: the owners. If you encourage a general sentiment of "immigrants are bad, let's make immigrants' life difficult", you will have cheaper, more docile workers. The extreme form of this is the American H1B visa, in which if you quit your job, for whatever reason, even if your boss was sexually harrassing you, you have only one week to find another job or you're deported back to your "home". Whatever "home" means.

And besides, owners only hire workers if they produce surplus value. If the worker doesn't pay off, you fire them. Wealth transfer from workers to owners is in general from immigrants to nationals, because if you are national, maybe you inherited your house and could spend your money starting your business. Maybe you know how to get the right grants. You speak the language and have the contacts. Maybe you inherited the business itself.

I go through all this detail because when you were born in a place and grew up in a place and have never had to deal with what it is like being an immigrant, you don't know. You hear a certain discourse, almost always of the form "the horde is coming", but you don't know. And those that are affected the most have no say in the matter.

Of course, it would be nice to pass over to the other side, to have EU citizenship. Spanish would do, but any other Schengen citizenship would at least take away that threat of deportation or, what is equivalent, denial of re-entry. So I assembled all the documentation: my birth certificate from the US, with its apostille, and the legal Spanish translation. My criminal record check in the US, with its apostille, and the legal translation. The certificates that I had been continually resident, my social security payments, my payslips, the documents accrediting me as a co-owner of my company, et cetera.

All prepared, all checked, I go to the records department to file it, and after a pleasantly short half-hour wait I give the documents to the official.

Who asks if I have an appointment -- but I thought the papers could be presented and then they'd give me an appointment for the interview?

No matter, she could give me an appointment -- for May.


And then some months later there would be a home visit by the police.

And then they'd assess my answers on a test to determine that I had sufficient "cultural integration", but because it was a new measure they didn't have any details on what that meant yet.

And then they'd give me a number some 6 months later.

And then maybe they would decide after some months.

So, 2018? 2019, perhaps?

This morning the streets of Barcelona were packed with electoral publicity, almost all of it urging a vote for independence. After the shock and the sadness of the nationality paperwork things wore off, I have been riding the rest of the day on a burning anger. I've never, never been able to vote in a local election, and there is no near prospect of my ever being able to do so.

As kids we are sold on a story of a fictional first-person-plural, the "we" of state, and we look forward to coming of age as if told by some benevolent patriarch, arm outstretched, "Some day, this will all be yours." Today was the day that this was replaced in my mind by the slogan pasted all around Barcelona a few years ago, "no vas a tener una casa en la puta vida" (you'll never own a house in your fucking life). It's profoundly sad. My wife and I will probably be between the two countries for many years, but being probably forever third-class non-citizens: "in no day will you ever belong to a place."

I should note before finishing that I don't want to hear "it could be worse" or anything else from non-immigrants. We have much less political power than you do and I doubt that you understand what it is like. What needs to happen is a revaluing of the nature of citizenship: countries are for the people that are in them, not for some white-pride myth of national identity or only for those that were born there or even for people who identify with the country but don't live there. Anything else is inhuman. 10+ years to simply *be* is simply wrong.

As it is, I need to reduce the precarious aspect of my life so I will probably finally change my domicile to France. It's a loss to me: I lose the Spanish nationality process, all my familiarity with the Spanish system, the easy life of being a salaried employee. I know my worth and it's a loss to Spain too. Probably I'll end up cutting all ties there; too bad. And I count myself lucky to be able to do this, due to the strange "long term-EU" residency permit I got a few years ago. But I'm trading a less precarious life for having to set up a business, figure out social security, all in French -- and the nationality clock starts over again.

At least I won't have to swear allegiance to a king.

developing v8 with guix

4 August 2015 4:23 PM (v8 | guix | nix | functional package management | guile | igalia)

a guided descent into hell

It all started off so simply. My primary development machine is a desktop computer that I never turn off. I suspend it when I leave work, and then resume it when I come back. It's always where I left it, as it should be.

I rarely update this machine because it works well enough for me, and anyway my focus isn't the machine, it's the things I do on it. Mostly I work on V8. The setup is so boring that I certainly didn't imagine myself writing an article about it today, but circumstances have forced my hand.

This machine runs Debian. It used to run the testing distribution, but somehow in the past I needed something that wasn't in testing so it runs unstable. I've been using Debian for some 16 years now, though not continuously, so although running unstable can be risky, usually it isn't, and I've unborked it enough times that I felt pretty comfortable.

Perhaps you see where this is going!

I went to install something, I can't even remember what it was now, and the downloads failed because I hadn't updated in a while. So I update, install the thing, and all is well. Except my instant messaging isn't working any more because there are a few moving parts (empathy / telepathy / mission control / gabble / dbus / whatwhat), and the install must have pulled in something that broke one of them. No biggie, this happens. Might as well go ahead and update the rest of the system while I'm at it and get a reboot to make sure I'm not running old software.

Most Debian users know that you probably shouldn't do a dist-upgrade from an old system -- you upgrade and then you dist-upgrade. Or perhaps this isn't even true, it's tribal lore to avoid getting eaten by the wild beasts of bork that roam around the village walls at night. Anyway that's what I did -- an upgrade, let it chunk for a while, then a dist-upgrade, check the list to make sure it didn't decide to remove one of my kidneys to satisfy the priorities of the bearded demon that lives inside apt-get, OK, let it go, all is well, reboot. Swell.

Or not! The computer restarts to a blank screen. Ha ha ha you have been bitten by a bork-beast! Switch to a terminal and try to see what's going on with GDM. It's gone! Ha ha ha! Your organs are being masticated as we speak! How does that feel! Try to figure out which package is causing it, happily with another computer that actually works. Surely this will be fixed in some update coming soon. Oh it's something that's going to take a few weeks!!!! Ninth level, end of the line, all passengers off!

my gods

I know how we got here, I love Debian, but it is just unacceptable and revolting that software development in 2015 is exposed to an upgrade process which (1) can break your system (2) by default and (3) can't be rolled back. The last one is the killer: who would design software this way? If you make a system like this in 2015 I'd say you're committing malpractice.

Well yesterday I resolved that this would be the last time this happens to me. Of course I could just develop in a virtual machine, and save and restore around upgrades, but that's kinda trash. Or I could use btrfs and be able to rewind changes to the file system, but then it would rewind everything, not just the system state.

Fortunately there is a better option in the form of functional package managers, like Nix and Guix. Instead of upgrading your system by mutating /usr, Nix and Guix store all files in a content-addressed store (/nix/store and /gnu/store, respectively). A user accesses the store via a "profile", which is a forest of symlinks into the store.

For example, on my machine with a NixOS system installation, I have:

$ which ls

$ ls -l /run/current-system/sw/bin/ls
lrwxrwxrwx 1 root nixbld 65 Jan  1  1970
  /run/current-system/sw/bin/ls ->

$ ldd /nix/store/wc472nw0kyw0iwgl6352ii5czxd97js2-coreutils-8.23/bin/ls (0x00007fff5d3c4000) => /nix/store/c2p56z920h4mxw12pjw053sqfhhh0l0y-acl-2.2.52/lib/ (0x00007fce99d5d000) => /nix/store/la5imi1602jxhpds9675n2n2d0683lbq-glibc-2.20/lib/ (0x00007fce999c0000) => /nix/store/jd3gggw5bs3a6sbjnwhjapcqr8g78f5c-attr-2.4.47/lib/ (0x00007fce997bc000)
  /nix/store/la5imi1602jxhpds9675n2n2d0683lbq-glibc-2.20/lib/ (0x00007fce99f65000)

Content-addressed linkage means that files in the store are never mutated: they will never be overwritten by a software upgrade. Never. Never will I again gaze in horror at the frozen beardcicles of a Debian system in the throes of "oops I just deleted all your programs, like that time a few months ago, wasn't that cool, it's really cold down here, how do you like my frozen facial tresses and also the horns".

At the same time, I don't have to give up upgrades. Paradoxically, immutable software facilitates change and gives me the freedom to upgrade my system without anxiety and lost work.

nix and guix

So, there's Nix and there's Guix. Both are great. I'll get to comparing them, but first a digression on the ways they can be installed.

Both Nix and Guix can be installed either as the operating system of your computer, or just as a user-space package manager. I would actually recommend to people to start with the latter way of working, and move on to the OS if you feel comfortable. The fundamental observation here is that because /nix/store doesn't depend on or conflict with /usr, you can run Nix or Guix as a user on a (e.g.) Debian system with no problems. You can have a forest of symlinks in ~/.guix-profile/bin that links to nifty things you've installed in the store and that's cool, you don't have to tell Debian.

and now look at me

In my case I wanted to also have the system managed by Nix or Guix. GuixSD, the name of the Guix OS install, isn't appropriate for me yet because it doesn't do GNOME. I am used to GNOME and don't care to change, so I installed NixOS instead. It works fine. There have been some irritations -- for example it just took me 30 minutes to figure out how to install dict, with a local wordnet dictionary server -- but mostly it has the packages I need. Again, I don't recommend starting with the OS install though.

GuixSD, the OS installation of Guix, is a bit harder even than NixOS. It has fewer packages, though what it does have tends to be more up-to-date than Nix. There are two big things about GuixSD though. One is that it aims to be fully free, including avoiding non-free firmware. Because they build deterministic build products from source, Nix and Guix can offer completely reproducible builds, which is swell for software reliability. Many reliability people also care a lot about software freedom and although Nix does support software freedom very well, it also includes options to turn on the Flash plugin, for example, and of course includes the Linux kernel with all of the firmware. Well GuixSD eschews non-free firmware, and uses the Linux-Libre kernel. For myself I have a local build on another machine that uses the stock Linux kernel with firmware for my Intel wireless device, and I was really discouraged from even sharing the existence of this hack. I guess it makes sense, it takes a world to make software freedom, but that particular part is not my fight.

The other thing about Guix is that it's really GNU-focused. This is great but also affects the product in some negative ways. They use "dmd" as an init system, for example, which is kinda like systemd but not. One consequence of this is that GuixSD doesn't have an implementation of the org.freedesktop.login1 seat management interface, which these days is implemented by part of systemd, which in turn precludes a bunch of other things GNOME-related. At one point I started working on a fork of systemd that pulled logind out to a separate project, which makes sense to me for distros that want seat management but not systemd, but TBH I have no horse in the systemd race and in fact systemd works well for me. But, a system with elogind would also work well for me. Anyway, the upshot is that unless you care a lot about the distro itself or are willing to adapt to e.g. Xfce or Xmonad or something, NixOS is a more pragmatic choice.

i'm on a horse

I actually like Guix's tools better than Nix's, and not just because they are written in Guile. Guix also has all the tools I need for software development, so I prefer it and ended up installing it as a user-space package manager on this NixOS system. Sounds bizarre but it actually works pretty well.

So, the point of this article is to be a little guide of how to build V8 with Guix. Here we go!

up and running with guix

First, check the manual. It's great and well-written and answers many questions and in fact includes all of this.

Now, I assume you're on an x86-64 Linux system, so we're going to use the awesome binary installation mechanism. Check it out: because everything in /gnu/store is linked directly to each other, all you have to do is to copy a reified /gnu/store onto a working system, then copy a sqlite thing into /var, and you've installed Guix. Sweet, eh? And actually you can take a running system and clone it onto other systems in that way, and Guix even provides a tool to generate such a tarball for you. Neat stuff.

cd /tmp
tar xf guix-binary-0.8.3.x86_64-linux.tar.xz
mv var/guix /var/ && mv gnu /

This Guix installation has a built-in profile for the root user, so let's go ahead and add a link from ~root to the store.

ln -sf /var/guix/profiles/per-user/root/guix-profile \

Since we're root, we can add the bin/ part of the Guix profile to our environment.

export PATH="$HOME/.guix-profile/bin:$HOME/.guix-profile/sbin:$PATH"

Perhaps we add that line to our ~root/.bash_profile. Anyway, now we have Guix. Or rather, we almost have Guix -- we need to start the daemon that actually manages the store. Create some users:

groupadd --system guixbuild

for i in `seq -w 1 10`; do
  useradd -g guixbuild -G guixbuild           \
          -d /var/empty -s `which nologin`    \
          -c "Guix build user $i" --system    \

And now run the daemon:

guix-daemon --build-users-group=guixbuild

If your host distro uses systemd, there's a unit that you can drop into the systemd folder. See the manual.

A few more things. One, usually when you go to install something, you'll want to fetch a pre-built copy of that software if it's available. Although Guix is fundamentally a build-from-source distro, Guix also runs a continuous builder service to make sure that binaries are available, if you trust the machine building the binaries of course. To do that, we tell the daemon to trust

guix archive --authorize < ~root/.guix-profile/share/guix/

as a user

OK now we have Guix installed. Running Guix commands will install things into the store as needed, and populate the forest of symlinks in the current user's $HOME/.guix-profile. So probably what you want to do is to run, as your user:

/var/guix/profiles/per-user/root/guix-profile/bin/guix \
  package --install guix

This will make Guix available in your own user's profile. From here you can begin to install software; for example, if you run

guix package --install emacs

You'll then have an emacs in ~/.guix-profile/bin/emacs which you can run. Pretty cool stuff.

back on the horse

So what does it mean for software development? Well, when I develop software, I usually want to know exactly what the inputs are, and to not have inputs to the build process that I don't control, and not have my build depend on unrelated software upgrades on my system. That's what Guix provides for me. For example, when I develop V8, I just need a few things. In fact I need these things:

;; Save as ~/src/profiles/v8.scm
(use-package-modules gcc llvm base python version-control less ccache)

 (list clang
       (list gcc-4.9 "lib")

This set of Guix packages is what it took for me to set up a V8 development environment. I can make a development environment containing only these packages and no others by saving the above file as v8.scm and then sourcing this script:

~/.guix-profile/bin/guix package -p ~/src/profiles/v8 -m ~/src/profiles/v8.scm
eval `~/.guix-profile/bin/guix package -p ~/src/profiles/v8 --search-paths`
export GYP_DEFINES='linux_use_bundled_gold=0 linux_use_gold_flags=0 linux_use_bundled_binutils=0'
export CXX='ccache clang++'
export CC='ccache clang'
export LD_LIBRARY_PATH=$HOME/src/profiles/v8/lib

Let's take this one line at a time. The first line takes my manifest -- the set of packages that collectively form my build environment -- and arranges to populate a symlink forest at ~/src/profiles/v8.

$ ls -l ~/src/profiles/v8/
total 44
dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 bin
dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 etc
dr-xr-xr-x  4 root guixbuild  4096 Jan  1  1970 include
dr-xr-xr-x  2 root guixbuild 12288 Jan  1  1970 lib
dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 libexec
-r--r--r--  2 root guixbuild  4138 Jan  1  1970 manifest
lrwxrwxrwx 12 root guixbuild    59 Jan  1  1970 sbin -> /gnu/store/1g78hxc8vn7q7x9wq3iswxqd8lbpfnwj-glibc-2.21/sbin
dr-xr-xr-x  6 root guixbuild  4096 Jan  1  1970 share
lrwxrwxrwx 12 root guixbuild    58 Jan  1  1970 var -> /gnu/store/1g78hxc8vn7q7x9wq3iswxqd8lbpfnwj-glibc-2.21/var
lrwxrwxrwx 12 root guixbuild    82 Jan  1  1970 x86_64-unknown-linux-gnu -> /gnu/store/wq6q6ahqs9rr0chp97h461yj8w9ympvm-binutils-2.25/x86_64-unknown-linux-gnu

So that's totally scrolling off the right for you, that's the thing about Nix and Guix names. What it means is that I have a tree of software, and most directories contain a union of links from various packages. It so happens that sbin though just has links from glibc, so it links directly into the store. Anyway. The next line in my arranges to point my shell into that environment.

$ guix package -p ~/src/profiles/v8 --search-paths
export PATH="/home/wingo/src/profiles/v8/bin:/home/wingo/src/profiles/v8/sbin"
export CPATH="/home/wingo/src/profiles/v8/include"
export LIBRARY_PATH="/home/wingo/src/profiles/v8/lib"
export LOCPATH="/home/wingo/src/profiles/v8/lib/locale"
export PYTHONPATH="/home/wingo/src/profiles/v8/lib/python2.7/site-packages"

Having sourced this into my environment, my shell's ls for example now points into my new profile:

$ which ls

Neat. Next we have some V8 defines. On x86_64 on Linux, v8 wants to use some binutils things that it bundles itself, but oddly enough for months under Debian I was seeing spurious intermittent segfaults while linking with their bundled gold linker binary. I don't want to use their idea of what a linker is anyway, so I set some defines to make v8's build tool use Guix's linker. (Incidentally, figuring out what those defines were took spelunking through makefiles, to gyp files, to the source of gyp itself, to the source of the standard shlex Python module to figure out what delimiters shlex.split actually splits on... yaaarrggh!)

Then some defines to use ccache, then a strange thing: what's up with that LD_LIBRARY_PATH?

Well. I'm not sure. However the normal thing for dynamic linking under Linux is that you end up with binaries that are just linked against e.g., whereever the system will find That's not what we want in Guix -- we want to link against a specific version of every dependency, not just any old version. Guix's builders normally do this when building software for Guix, but somehow in this case I haven't managed to make that happen, so the binaries that are built as part of the build process can end up not specifying the path of the libraries they are linked to. I don't know whether this is an issue with v8's build system, that it doesn't want to work well with Nix / Guix, or if it's something else. Anyway I hack around it by assuming that whatever's in my artisanally assembled symlink forest ("profile") is the right thing, so I set it as the search path for the dynamic linker. Suggestions welcome here.

And from here... well it just works! I've gained the ability to precisely specify a reproducible build environment for the software I am working on, which is entirely separated from the set of software that I have installed on my system, which I can reproduce precisely with a script, and yet which is still part of my system -- I'm not isolated from it by container or VM boundaries (though I can be; see NixOps for more in that direction).

OK I lied a little bit. I had to apply this patch to V8:

$ git diff
diff --git a/build/standalone.gypi b/build/standalone.gypi
index 2bdd39d..941b9d7 100644
--- a/build/standalone.gypi
+++ b/build/standalone.gypi
@@ -98,7 +98,7 @@
         ['OS=="win"', {
           'gomadir': 'c:\\goma\\goma-win',
         }, {
-          'gomadir': '<!(/bin/echo -n ${HOME}/goma)',
+          'gomadir': '<!(/usr/bin/env echo -n ${HOME}/goma)',
         ['host_arch!="ppc" and host_arch!="ppc64" and host_arch!="ppc64le"', {
           'host_clang%': '1',

See? Because my system is NixOS, there is no /bin/echo. It does helpfully install a /usr/bin/env though, which other shell invocations in this build script use, so I use that instead. I mention this as an example of what works and what workarounds there are.

dpkg --purgatory

So now I have NixOS as my OS, and I mostly use Guix for software development. This is a new setup and we'll see how it works in practice.

Installing NixOS on top of Debian was a bit irritating. I ended up making a bootable USB installation image, then installing over to my Debian partition, happy in the idea that it wouldn't conflict with my system. But in that I forgot about /etc and /var and all that. So I copied /etc to /etc-debian, just as a backup, and NixOS appeared to install fine. However it wouldn't boot, and that's because some systemd state from my old /etc which was still in place conflicted with... something? In the end I redid the install, moving my old /usr, /etc and such directories to backup names and letting NixOS have control. That worked fine.

I have GuixSD on a laptop but I really don't recommend it right now -- not unless you have time and are willing to hack on it. But that's OK, install NixOS and you'll be happy on the system side, and if you want Guix you can install it as a user.

Comments and corrections welcome, and happy hacking!

loop optimizations in guile

28 July 2015 8:10 AM (licm | loop peeling | code motion | loop inversion | loop identification | guile | cps | ssa | code hoisting | effects analysis)

Sup peeps. So, after the slog to update Guile's intermediate language, I wanted to land some new optimizations before moving on to the next thing. For years I've been meaning to do some loop optimizations, and I was finally able to land a few of them.

loop peeling

For a long time I have wanted to do "loop peeling". Loop peeling means peeling off the first iteration of a loop. If you have a source program that looks like this:

while foo:

Loop peeling turns it into this:

if foo:
  while foo:

You wouldn't think that this is actually an optimization, would you? Well on its own, it's not. But if you combine it with common subexpression elimination, then it means that the loop body is now dominated by all effects and all loop-invariant expressions that must be evaluated for the expression to loop.

In dynamic languages, this is most useful when one source expression expands to a number of low-level steps. So for example if your language runtime implements top-level variable references in three parts, one where it gets a reference to a mutable box, then it checks if the box has a value, and and the third where it unboxes it, then we would have:

if foo:
  bar_location = lookup("bar")
  bar_value = dereference(bar_location)
  if bar_value is null: throw NotFound("bar")

  baz_location = lookup("baz")
  baz_value = dereference(baz_location)
  if baz_value is null: throw NotFound("baz")

  while foo:
    bar_value = dereference(bar_location)

    baz_value = dereference(baz_location)

The result is that we have hoisted the lookups and null checks out of the loop (if a box can never transition from full back to empty). It's a really powerful transformation that can even hoist things that traditional loop-invariant code motion can't, but more on that later.

Now, the problem with loop peeling is that usually values will escape your loop. For example:

while foo:
  x = qux()
  if x then return x

In this little example, there is a value x, and the return x statement is actually not in the loop. It's syntactically in the loop, but the underlying representation that the compiler uses looks more like this:

function qux(k):
  label loop_header():
    fetch(foo) -gt; loop_test
  label loop_test(foo_value):
    if foo_value then -> exit else -> body
  label body():
    fetch(x) -gt; have_x
  label have_x(x_value):
    if x_value then -> return_x else -> loop_header
  label return_x():
    values(x) -> k
  label exit():

This is the "CPS soup" I described in my last post. Only the bold parts are in the loop; notably, the return is outside the loop. Point being, if we peel off the first iteration, then there are two possible values for x that we would return:

if foo:
  x1 = qux()
  if x1 then return x1
  while foo:
    x2 = qux()
    if x2 then return x2

I have them marked as x1 and x2. But I've also duplicated the return x terms, which is not what we want. We want to peel off the first iteration, which will cause code growth equal to the size of the loop body, but we don't want to have to duplicate everything that's after the loop. What we have to do is re-introduce a join point that defines x:

if foo:
  x1 = qux()
  if x1 then join(x1)
  while foo:
    x2 = qux()
    if x2 then join(x2)
label join(x)
  return x

Here I'm playing fast and loose with notation because the real terms are too gnarly. What I'm trying to get across is that for each value that flows out of a loop, you need a join point. That's fine, it's a bit more involved, but what if your loop exits to two different points, but one value is live in both of them? A value can only be defined in one place, in CPS or SSA. You could re-place a whole tree of phi variables, in SSA parlance, with join blocks and such, but it's just too hard.

However we can still get the benefits of peeling in most cases if we restrict ourselves to loops that exit to only one continuation. In that case the live variable set is the intersection of all variables defined in the loop that are live at the exit points. Easy enough, and that's what we have in Guile now. Peeling causes some code growth but the loops are smaller so it should still be a win. Check out the source, if that's your thing.

loop-invariant code motion

Usually when people are interested in moving code out of loops they talk about loop-invariant code motion, or LICM. Contrary to what you might think, LICM is complementary to peeling: some things that peeling+CSE can hoist are not hoistable by LICM, and vice versa.

Unlike peeling, LICM does not cause code growth. Instead, for each expression in a loop, LICM tries to hoist it out of the loop if it can. An expression can be hoisted if all of these conditions are true:

  1. It doesn't cause the creation of an observably new object. In Scheme, the definition of "observable" is quite subtle, so in practice in Guile we don't hoist expressions that can cause any allocation. We could use alias analysis to improve this.

  2. The expression cannot throw an exception, or the expression is always evaluated for every loop iteration.

  3. The expression makes no writes to memory, or if it writes to memory, other expressions in the loop cannot possibly read from that memory. We use effects analysis for this.

  4. The expression makes no reads from memory, or if it reads from memory, no other expression in the loop can clobber those reads. Again, effects analysis.

  5. The expression uses only loop-invariant variables.

This definition is inductive, so once an expression is hoisted, the values it defines are then considered loop-invariant, so you might be able to hoist a whole chain of values.

Compared to loop peeling, this has the gnarly aspect of having to explicitly reason about loop invariance and manually move code, which is a pain. (Really LICM would be better named "artisanal code motion".) However it causes no code growth, which is a plus, though like peeling it can increase register pressure. But the big difference is that LICM can hoist effect-free expressions that aren't always executed. Consider:

while foo:
  x = qux() ? "hi" : "ho"

Here for some reason it could be faster to cache "hi" or "ho" in registers, which is what LICM allows:

hi, ho = "hi", "ho"
while foo:
  x = qux() ? hi : ho

On the other hand, LICM alone can't hoist the if baz is null checks in this example from above:

while foo:

The issue is that the call to bar() might not return, so the error that might be thrown if baz is null shouldn't be observed until bar is called. In general we can't hoist anything that might throw an exception past some non-hoisted code that might throw an exception. This specific situation happens in Guile but there are similar ones in any language, I think.

More formally, LICM will hoist effectful but loop-invariant expressions that postdominate the loop header, whereas peeling hoists those expressions that dominate all back-edges. I think? We'll go with that. Again, the source.

loop inversion

Loop inversion is a little hack to improve code generation, and again it's a little counterintuitive. If you have this loop:

while n < x:

Loop inversion turns it into:

if n < x:
  while n < x

The goal is that instead of generating code that looks like this:

  test n, x;
  branch-if-greater-than-or-equal done;
  x = x + 1
  goto header

You make something that looks like this:

  test n, x;
  branch-if-greater-than-or-equal done;
  x = x + 1
  test n, x;
  branch-if-less-than header;

The upshot is that the loop body now contains one branch instead of two. It's mostly helpful for tight loops.

It turns out that you can express this transformation on CPS (or SSA, or whatever), but that like loop peeling the extra branch introduces an extra join point in your program. If your loop exits to more than one label, then we have the same problems as loop peeling. For this reason Guile restricts loop inversion (which it calls "loop rotation" at the moment; I should probably fix that) to loops with only one exit continuation.

Loop inversion has some other caveats, but probably the biggest one is that in Guile it doesn't actually guarantee that each back-edge is a conditional branch. The reason is that usually a loop has some associated loop variables, and it could be that you need to reshuffle those variables when you jump back to the top. Mostly Guile's compiler manages to avoid shuffling, allowing inversion to have the right effect, but it's not guaranteed. Fixing this is not straightforward, since the shuffling of values is associated with the predecessor of the loop header and not the loop header itself. If instead we reshuffled before the header, that might work, but each back-edge might have a different shuffling to make... anyway. In practice inversion seems to work out fine; I haven't yet seen a case where it doesn't work. Source code here.

loop identification

One final note: what is a loop anyway? Turns out this is a somewhat hard problem, especially once you start trying to identify nested loops. Guile currently does the simple thing and just computes strongly-connected components in a function's flow-graph, and says that a loop is a non-trivial SCC with a single predecessor. That won't tease apart loop nests but oh wells! I spent a lot of time last year or maybe two years ago with that "Loop identification via D-J graphs" paper but in the end simple is best, at least for making incremental steps.

Okeysmokes, until next time, loop on!

cps soup

27 July 2015 2:43 PM (guile | cps | ssa | compilers | contification | cse | closure optimization | clojure | bagwell | persistent data structures)

Hello internets! This blog goes out to my long-time readers who have followed my saga hacking on Guile's compiler. For the rest of you, a little history, then the new thing.

In the olden days, Guile had no compiler, just an interpreter written in C. Around 8 years ago now, we ported Guile to compile to bytecode. That bytecode is what is currently deployed as Guile 2.0. For many reasons we wanted to upgrade our compiler and virtual machine for Guile 2.2, and the result of that was a new continuation-passing-style compiler for Guile. Check that link for all the backstory.

So, I was going to finish documenting this intermediate language about 5 months ago, in preparation for making the first Guile 2.2 prereleases. But something about it made me really unhappy. You can catch some foreshadowing of this in my article from last August on common subexpression elimination; I'll just quote a paragraph here:

In essence, the scope tree doesn't necessarily reflect the dominator tree, so not all transformations you might like to make are syntactically valid. In Guile 2.2's CSE pass, we work around the issue by concurrently rewriting the scope tree to reflect the dominator tree. It's something I am seeing more and more and it gives me some pause as to the suitability of CPS as an intermediate language.

This is exactly the same concern that Matthew Fluet and Stephen Weeks had back in 2003:

Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.

As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y (i.e. without putting y in the scope of x)

[MLton-devel] CPS vs SSA

To be honest I think all this talk about dominators is distracting. Dominators are but a lightweight flow analysis, and I usually find myself using full-on flow analysis to compute the set of optimizations that I can do on a piece of code. In fact the only use I had for dominators in the nested CPS language was to rewrite scope trees! The salient part of Weeks' observation is that nested scope trees are the problem, not that dominators are the solution.

So, after literally years of hemming and hawing about this, I finally decided to remove nested scope trees from Guile's CPS intermediate language. Instead, a function is now a collection of labelled continuations, with one distinguished entry continuation. There is no more $letk term to nest continuations in each other. A program is now represented as a "soup" -- basically a map from labels to continuation bodies, again with a distinguished entry. As an example, consider this expression:

  return add(x, 1)

If we rewrote it in continuation-passing style, we'd give the function a name for its "tail continuation", ktail, and annotate each expression with its continuation:

function(ktail, x):
  add(x, 1) -> ktail

Here the -> ktail means that the add expression passes its values to the continuation ktail.

With nested CPS, it could look like:

function(ktail, x):
  letk have_one(one): add(x, one) -> ktail
    load_constant(1) -> have_one

Here the label have_one is in a scope, as is the value one. With "CPS soup", though, it looks more like this:

function(ktail, x):
  label have_one(one): add(x, one) -> ktail
  label main(x): load_constant(1) -> have_one

It's a subtle change, but it took a few months to make so it's worth pointing out what's going on. The difference is that there is no scope tree for labels or variables any more. A variable can be used at a label if it flows to the label, in a flow analysis sense. Indeed, determining the set of variables that can be used at a label requires flow analysis; that's what Weeks was getting at in his 2003 mail about the advantages of SSA, which are really the advantages of an intermediate language without nested scope trees.

The question arises, though, now that we've decided on CPS soup, how should we represent a program as a value? We've gone from a nested term to a graph term, and we need to find a way to represent it somehow that facilitates looking up labels by name, and facilitates tree rewrites.

In Guile's IR, labels and variables are both integers, so happily enough, we have such a data structure: Clojure-style maps specialized for integer keys.

Friends, if there has been one realization or revolution for me in the last year, it has been Clojure-style data structures. Here's why. In compilers, I often have to build up some kind of analysis, then use that analysis to transform data. Often I need to keep the old term around while I build a new one, but it would be nice to share state between old and new terms. With a nested tree, if a leaf changed you'd have to rebuild all surrounding terms, which is gnarly. But with Clojure-style data structures, more and more I find myself computing in terms of values: build up this value, transform this map to that set, fold over this map -- and yes, you can fold over Guile's intmaps -- and so on. By providing an expressive data structure for which I can control performance characteristics by using transients if needed, these data structures make my programs more about data and less about gnarly machinery.

As a concrete example, the old contification pass in Guile, I didn't have the mental capacity to understand all the moving parts in such a way that I could compute an optimal contification from the beginning; instead I had to iterate to a fixed point, as Kennedy did in his "Compiling with Continuations, Continued" paper. With the new CPS soup language and with Clojure-style data structures, I could actually fit more of the algorithm into my head, with the result that Guile now contifies optimally while avoiding the fixed-point transformation. Also, the old pass used hash tables to represent the analysis, which I found incredibly confusing to reason about -- I totally buy Rich Hickey's argument that place-oriented programming is the source of many evils in programs, and hash tables are nothing if not a place party. Using functional maps let me solve harder problems because they are easier for me to reason about.

Contification isn't an isolated case, either. For example, we are able to do the complete set of optimizations from the "Optimizing closures in O(0) time" paper, including closure sharing, which I think makes Guile unique besides Chez Scheme. I wasn't capable of doing it on the old representation because it was just too hard for me to think about, because my data structures weren't right.

This new "CPS soup" language is still a first-order CPS language in that each term specifies its continuation, and that variable names appear in the continuation of a definition, not the definition itself. This effectively makes every variable a phi variable, in the sense of SSA, and you have to do some work to get to a variable's definition. It could be that still this isn't the right number of names; consider this function:

function foo(k, x):
  label have_y(y) bar(y) -> k
  label y_is_two() load_constant(2) -> have_y
  label y_is_one() load_constant(1) -> have_y
  label main(x) if x -> y_is_one else -> y_is_two

Here there is no distinguished name for the value load_constant(1) versus load_constant(2): both are possible values for y. If we ended up giving them names, we'd have to reintroduce actual phi variables for the joins, which would basically complete the transformation to SSA. Until now though I haven't wanted those names, so perhaps I can put this off. On the other hand, every term has a label, which simplifies many things compared to having to contain terms in basic blocks, as is usually done in SSA. Yet another chapter in CPS is SSA is CPS is SSA, it seems.

Welp, that's all the nerdery for right now. Talk at yall later!

Pfmatch, a packet filtering language embedded in Lua

3 July 2015 11:05 AM (lua | pflua | bpf | pflang | igalia | snabb | compilers | dsl | edsl)

Greets, hackers! I just finished implementing a little embedded language in Lua and wanted to share it with you. First, a bit about the language, then some notes on how it works with Lua to reach the high performance targets of Snabb Switch.

the pfmatch language

Pfmatch is a language designed for filtering, classifying, and dispatching network packets in Lua. Pfmatch is built on the well-known pflang packet filtering language, using the fast pflua compiler for LuaJIT.

Here's an example of a simple pfmatch program that just divides up packets depending on whether they are TCP, UDP, or something else:

match {
   tcp => handle_tcp
   udp => handle_udp
   otherwise => handle_other

Unlike pflang filters written for such tools as tcpdump, a pfmatch program can dispatch packets to multiple handlers, potentially destructuring them along the way. In contrast, a pflang filter can only say "yes" or "no" on a packet.

Here's a more complicated example that passes all non-IP traffic, drops all IP traffic that is not going to or coming from certain IP addresses, and calls a handler on the rest of the traffic.

match {
   not ip => forward
   ip src => incoming_ip
   ip dst => outgoing_ip
   otherwise => drop

In the example above, the handlers after the arrows (forward, incoming_ip, outgoing_ip, and drop) are Lua functions. The part before the arrow (not ip and so on) is a pflang expression. If the pflang expression matches, its handler will be called with two arguments: the packet data and the length. For example, if the not ip pflang expression is true on the packet, the forward handler will be called.

It's also possible for the handler of an expression to be a sub-match:

match {
   not ip => forward
   ip src => {
      tcp => incoming_tcp(&ip[0], &tcp[0])
      udp => incoming_udp(&ip[0], &ucp[0])
      otherwise => incoming_ip(&ip[0])
   ip dst => {
      tcp => outgoing_tcp(&ip[0], &tcp[0])
      udp => outgoing_udp(&ip[0], &ucp[0])
      otherwise => outgoing_ip(&ip[0])
   otherwise => drop

As you can see, the handlers can also have additional arguments, beyond the implicit packet data and length. In the above example, if not ip doesn't match, then ip src matches, then tcp matches, then the incoming_tcp function will be called with four arguments: the packet data as a uint8_t* pointer, its length in bytes, the offset of byte 0 of the IP header, and the offset of byte 0 of the TCP header. An argument to a handler can be any arithmetic expression of pflang; in this case &ip[0] is actually an extension. More on that later. For language lawyers, check the syntax and semantics over in our source repo.

Thanks especially to my colleague Katerina Barone-Adesi for long backs and forths about the language design; they really made it better. Fistbump!

pfmatch and lua

The challenge of designing pfmatch is to gain expressiveness, compared to writing filters by hand, while not endangering the performance targets of Pflua and Snabb Switch. These days Snabb is on target to give ASIC-driven network appliances a run for their money, so anything we come up with cannot sacrifice speed.

In practice what this means is compile, don't interpret. Using the pflua compiler allows us to generalize the good performance that we have gotten on pflang expressions to a multiple-dispatch scenario. It's a pretty straightword strategy. Naturally though, the interface with Lua is more complex now, so to understand the performance we should understand the interaction with Lua.

How does one make two languages interoperate, anyway? With pflang it's pretty clear: you compile pflang to a Lua function, and call the Lua function to match on packets. It returns true or false. It's a thin interface. Indeed with pflang and pflua you could just match the clauses in order:

not_ip = pf.compile('not ip')
incoming = pf.compile('ip src')
outgoing = pf.compile('ip dst')

function handle(packet, len)
   if not_ip(packet, len) then return forward(packet, len)
   elseif incoming(packet, len) then return incoming_ip(packet, len)
   elseif outgoing(packet, len) then return outgoing_ip(packet, len)
   else return drop(packet, len) end

But not only is this tedious, you don't get easy access to the packet itself, and you're missing out on opportunities for optimization. For example, if the packet fails the not_ip check, we don't need to check if it's an IP packet in the incoming check. Compiling a pfmatch program takes advantage of pflua's optimizer to produce good code for the match expression as a whole.

If this were Scheme I would make the right-hand side of an arrow be an expression and implement pfmatch as a macro; see Racket's match documentation for an example. In Lua or other languages that's harder to do; you would have to parse Lua, and it's not clear which parts of the production as a whole are the host language (Lua) and which are the embedded language (pfmatch).

Instead, I think embedding host language snippets by function name is a fine solution. It seems fairly clear that incoming_ip, for example, is some kind of function. It's easy to parse identifiers in an embedded language, both for humans and for programs, so that takes away a lot of implementation headache and cognitive overhead.

We are left with a few problems: how to map names to functions, what to do about the return value of match expressions, and how to tie it all together in the host language. Again, if this were Scheme then I'd use macros to embed expressions into the pfmatch term, and their names would be scoped into whatever environment the match term was defined. In Lua, the best way to implement a name/value mapping is with a table. So we have:

local handlers = {
   forward = function(data, len)
   drop = function(data, len)
   incoming_ip = function(data, len)
   outgoing_ip = function(data, len)

Then we will pass the handlers table to the matcher function, and the matcher function will call the handlers by name. LuaJIT will mostly take care of the overhead of the table dispatch. We compile the filter like this:

local match = require('pf.match')

local dispatcher = match.compile([[match {
   not ip => forward
   ip src => incoming_ip
   ip dst => outgoing_ip
   otherwise => drop

To use it, you just invoke the dispatcher with the handlers, data, and length, and the return value is whatever the handler returns. Here let's assume it's a boolean.

function loop(self)
   local i, o = self.input.input, self.output.output
   while not link.empty() do
      local pkt = link.receive(i)
      if dispatcher(handlers,, pkt.length) then
         link.transmit(o, pkt)

Finally, we're ready for an example of a compiled matcher function. Here's what pflua does with the match expression above:

local cast = require("ffi").cast
return function(self,P,length)
   if length < 14 then return self.forward(P, len) end
   if cast("uint16_t*", P+12)[0] ~= 8 then return self.forward(P, len) end
   if length < 34 then return self.drop(P, len) end
   if P[23] ~= 6 then return self.drop(P, len) end
   if cast("uint32_t*", P+26)[0] == 67305985 then return self.incoming_ip(P, len) end
   if cast("uint32_t*", P+30)[0] == 134678021 then return self.outgoing_ip(P, len) end
   return self.drop(P, len)

The result is a pretty good dispatcher. There are always things to improve, but it's likely that the function above is better than what you would write by hand, and it will continue to get better as pflua improves.

Getting back to what I mentioned earlier, when we write filtering code by hand, we inevitably end up writing interpreters for some kind of filtering language. Network functions are essentially linguistic in nature: static appliances are no good because network topologies change, and people want solutions that reflect their problems. Usually this means embedding an interpreter for some embedded language, for example BPF bytecode or iptables rules. Using pflua and pfmatch expressions, we can instead compile a filter suited directly for the problem at hand -- and while we're at it, we can forget about worrying about pesky offsets, constants, and bit-shifts.


I'm optimistic about pfmatch or something like it being a success, but there are some challenges too.

One challenge is that pflang is pretty weird. For example, attempting to access ip[100] will abort a filter immediately on a packet that is less than 100 bytes long, not including L2 encapsulation. It's wonky semantics, and in the context of pfmatch, aborting the entire pfmatch program would obviously be the wrong thing. That would abort too much. Instead it should probably just fail the pflang test in which that packet access appears. To this end, in pfmatch we turn those aborts into local expression match failures. However, this leads to an inconsistency with pflang. For example in (ip[100000] == 0 or (1==1)), instead of ip[100000] causing the whole pflang match to fail, it just causes the local test to fail. This leaves us with 1==1, which passes. We abort too little.

This inconsistency is probably a bug. We want people to be able to test clauses with vanilla pflang expressions, and have the result match the pfmatch behavior. Due to limitations in some of pflua's intermediate languages, it's likely to persist for a while. It is the only inconsistency that I know of, though.

Pflang is also underpowered in many ways. It has terrible IPv6 support; for example, tcp[0] only matches IPv4 packets, and at least as implemented in libpcap, most payload access on IPv6 packets does the wrong thing regarding chained extension headers. There is no facility in the language for binding names to intermediate results, there is no linguistic facility for talking about fragmentation, no ability to address IP source and destination addresses in arithmetic expressions by name, and so on. We can solve these in pflua with extensions to the language, but that introduces incompatibilities with pflang.

You might wonder why to stick with pflang, after all of this. If this is you, Juho Snellman wrote a great article on this topic, just for you: What's wrong with pcap filters.

Pflua's optimizer has mostly helped us, but there have been places where it could be more helpful. When compiling just one expression, you can often end up figuring out which branches are dead-ends, which helps the rest of the optimization to proceed. With more than one successful branch, we had to make a few improvements to the optimizer to actually get decent results. We also had to relax one restriction on the optimizer: usually we only permit transformations that make the code smaller. This way we know we're going in the right direction and will eventually terminate. However because of reasons™ we did decide to allow tail calls to be duplicated, so instead of having just one place in the match function that tail-calls a handler, you can end up with multiple calls. I suspect using a tracing compiler will largely make this moot, as control-flow splits effectively lead to trace duplication anyway, and making sure control-flow joins later doesn't effectively counter that. Still, I suspect that the resulting trace shape will rejoin only at the loop head, instead of in some intermediate point, which is probably OK.


With all of these concerns, is pfmatch still a win? Yes, probably! We're going to start using it when building Snabb apps, and will see how it goes. We'll probably end up adding a few more pflang extensions before we're done. If it's something you're in to, snabb-devel is the place to try it out, and see you on the bug tracker. Happy packet hacking!

arrow functions coming to chrome 45!

18 June 2015 4:41 PM (v8 | chrome | javascript | es6 | chromium | igalia | bloomberg)

It's been a long time coming, but I just flipped the bit in V8 that will ship arrow functions in Chrome 45! Woo hoo!

You probably know, but arrow functions are a new way to write functions in JavaScript. They look like this:

// Two arguments, body implicitly returned.
(x, y) => x + y

// With just one argument, no parentheses needed.
x => x * 2

// Body can have braces too; in that case use "return".
x => { return x * 2 }

Relative to the other kind of function that is written like function (x) { return x * 2 }, arrow functions don't define this or arguments in their bodies, instead capturing these values from the environment. There are a couple of other minor differences, too, but instead of writing about them here I'll just point to the great article by Jason Orendorff of the SpiderMonkey team.

Arrow functions are part of the JavaScript language standard that was called "ECMAScript 6" or ES6, and I guess you could still call it that. It seems like a silly thing for the committee to do to throw away all their branding like that but they decided to rename it ECMAScript 2015, which I'm sure is a link that the pedants are glad I have included. The upshot is that the standard is now final, gold master, etched in stone, which from an implementor's perspective is a relief. You can practically feel the anxiety ebbing away by the happy rate at which commits bubble out of source repositories and into shipping browsers, free from the fear that some spec change will force the hack-stream to change course.

From the V8 side, our arrow function implementation has also been a long time coming. My colleague Adrián Pérez did the first half of the work, and I picked up on the back end of things. It seems like such a small feature and in many ways it is, but still it took a long time. Now I know that my readers are a bunch of nerds and many of you like implementing languages, so you might appreciate these nargish points.

One of the first bits is that arrow functions are hard to parse. Consider, this is a valid JavaScript expression:


It's a "comma expression" that will evaluate x then y and its result will be the result of evaluating y. But add an arrow on after the end and you get not an expression but a formal parameter list:


Now you might think, well OK, when you see an arrow, rewind the input stream and parse in "arrow function mode". Indeed that would be fine, but not in combination with some additional ES6 features, optional and destructuring arguments. Optional arguments look like this:


The =42 part is the expression that will be evaluated to give x a value, if the function is called with no arguments. Note that this bit is still under implementation in V8 so you can't try it in your browser. An optional argument initializer is an expression and not a value, so you can also have:


Combined, this makes rewinding the token stream a proposition of exponential complexity, which is a no-go for a production JavaScript parser. Parsers are on the hot path for page-load times and no browser vendor wants to introduce a pathological case into their page load.

Instead, V8 does something I hadn't seen before. It keeps an open mind about whether something is a comma expression or a formal parameter list of an arrow function, and only makes a decision when it sees the => (or not). As it parses, V8 records places that it would signal an error for either a parameter list or for an expression, and then when that superimposed wave function collapses it checks that the production is valid, signalling the appropriate error if not. I thought this was a really neat trick, so if you're into that thing see expression classifier to see those details.

The other thing that's tricky about arrow functions is the this binding. In JavaScript, this is basically a hidden parameter passed to a function when it is called. Calling a function like o.f() passes the value of o to f as its this parameter. If instead f() is called directly, like with no dot before the call, then undefined is passed as this. Also for sloppy-mode functions, if the passed this value isn't an object, then the global object instead is assigned to this. Finally outside a function, this is bound to the global object.

OK, I know all of you know these things. Thing is, you always have a this, and although it's like a variable it's not a valid variable name, and before ES6 nothing could capture its value, because each function has its own this value. Perhaps you see where I'm going with this (ahem) now. Arrow functions introduce a function scope that doesn't have a this value, and that indeed might capture some other scope's this value, forcing it to be context-allocated. Other parts of ES6 can actually force assignment to this, like a super call, and that assignment can actually come from within an arrow function. Zounds! A simple concept, but there was a lot of incidental complexity in V8 around the implementation. Between Adrián and myself it took like three months to fix this usage in V8 to always just go through the (possibly context-allocated) variable, and there are still probably some devtools bugs to find in the upcoming weeks.

Performance-wise, arrow functions are just like functions. They should be just as fast as if you wrote them with function. So use them with joy, use them with abandon, use them judiciously -- however you decide you use them, don't let perf influence your decision one way or the other.

That's about it! Like all of my JS engine work over the past couple years, this hacking was sponsored by fabulous folks over at Bloomberg, so big ups to them. From me and Adrián at Igalia, until next time! We leave you to puzzle out what this bit of JavaScript evaluates to:


Happy hacking!

state of js implementations, 2014 edition

9 December 2014 10:29 AM (javascript | v8 | jsc | spidermonkey | igalia | webengineshackfest)

I gave a short talk about the state of JavaScript implementations this year at the Web Engines Hackfest.

29 minutes, vorbis or mp3; slides (PDF)

The talk goes over a bit of the history of JS implementations, with a focus on performance and architecture. It then moves on to talk about what happened in 2014 and some ideas about where 2015 might be going. Have a look if that's a thing you are in to. Thanks to Adobe, Collabora, and Igalia for sponsoring the event.

there are no good constant-time data structures

2 December 2014 10:01 PM (programming | algorithms | security | timing attacks | guile)

Imagine you have a web site that people can access via a password. No user name, just a password. There are a number of valid passwords for your service. Determining whether a password is in that set is security-sensitive: if a user has a valid password then they get access to some secret information; otherwise the site emits a 404. How do you determine whether a password is valid?

The go-to solution for this kind of problem for most programmers is a hash table. A hash table is a set of key-value associations, and its nice property is that looking up a value for a key is quick, because it doesn't have to check against each mapping in the set.

Hash tables are commonly implemented as an array of buckets, where each bucket holds a chain. If the bucket array is 32 elements long, for example, then keys whose hash is H are looked for in bucket H mod 32. The chain contains the key-value pairs in a linked list. Looking up a key traverses the list to find the first pair whose key equals the given key; if no pair matches, then the lookup fails.

Unfortunately, storing passwords in a normal hash table is not a great idea. The problem isn't so much in the hash function (the hash in H = hash(K)) as in the equality function; usually the equality function doesn't run in constant time. Attackers can detect differences in response times according to when the "not-equal" decision is made, and use that to break your passwords.

Edit: Some people are getting confused by my use of the term "password". Really I meant something more like "secret token", for example a session identifier in a cookie. I thought using the word "password" would be a useful simplification but it also adds historical baggage of password quality, key derivation functions, value of passwords as an attack target for reuse on other sites, etc. Mea culpa.

So let's say you ensure that your hash table uses a constant-time string comparator, to protect against the hackers. You're safe! Or not! Because not all chains have the same length, "interested parties" can use lookup timings to distinguish chain lookups that take 2 comparisons compared to 1, for example. In general they will be able to determine the percentage of buckets for each chain length, and given the granularity will probably be able to determine the number of buckets as well (if that's not a secret).

Well, as we all know, small timing differences still leak sensitive information and can lead to complete compromise. So we look for a data structure that takes the same number of algorithmic steps to look up a value. For example, bisection over a sorted array of size SIZE will take ceil(log2(SIZE)) steps to get find the value, independent of what the key is and also independent of what is in the set. At each step, we compare the key and a "mid-point" value to see which is bigger, and recurse on one of the halves.

One problem is, I don't know of a nice constant-time comparison algorithm for (say) 160-bit values. (The "passwords" I am thinking of are randomly generated by the server, and can be as long as I want them to be.) I would appreciate any pointers to such a constant-time less-than algorithm. However a bigger problem is that the time it takes to access memory is not constant; accessing element 0 of the sorted array might take more or less time than accessing element 10. In algorithms we typically model access on a more abstract level, but in hardware there's a complicated parallel and concurrent protocol of low-level memory that takes a non-deterministic time for any given access. "Hot" (more recently accessed) memory is faster to read than "cold" memory.

Non-deterministic memory access leaks timing information, and in the case of binary search the result is disaster: the attacker can literally bisect the actual values of all of the passwords in your set, by observing timing differences. The worst!

You could get around this by ordering "passwords" not by their actual values but by their cryptographic hashes (e.g. by their SHA256 values). This would force the attacker to bisect not over the space of password values but of the space of hash values, which would protect actual password values from the attacker. You still leak some timing information about which paths are "hot" and which are "cold", but you don't expose actual passwords.

It turns out that, as far as I am aware, it is impossible to design a key-value map on common hardware that runs in constant time and is sublinear in the number of entries in the map. As Zooko put it, running in constant time means that the best case and the worst case run in the same amount of time. Of course this is false for bucket-and-chain hash tables, but it's false for binary search as well, as "hot" memory access is faster than "cold" access. The only plausible constant-time operation on a data structure would visit each element of the set in the same order each time. All constant-time operations on data structures are linear in the size of the data structure. Thems the breaks! All you can do is account for the leak in your models, as we did above when ordering values by their hash and not their normal sort order.

Once you have resigned yourself to leaking some bits of the password via timing, you would be fine using normal hash tables as well -- just use a cryptographic hashing function and a constant-time equality function and you're good. No constant-time less-than operator need be invented. You leak something on the order of log2(COUNT) bits via timing, where COUNT is the number of passwords, but since that's behind a hash you can't use it to bisect on actual key values. Of course, you have to ensure that the hash table isn't storing values in sorted order and short-cutting early. This sort of detail isn't usually part of the contract of stock hash table implementations, so you probably still need to build your own.

Edit: People keep mentioning Cuckoo hashing for some reason, despite the fact that it's not a good open-hashing technique in general (Robin Hood hashes with linear probing are better). Thing is, any operation on a data structure that does not touch all of the memory in the data structure in exactly the same order regardless of input leaks cache timing information. That's the whole point of this article!

An alternative is to encode your data structure differently, for example for the "key" to itself contain the value, signed by some private key only known to the server. But this approach is limited by network capacity and the appropriateness of copying for the data in question. It's not appropriate for photos, for example, as they are just too big.

Edit: Forcing constant-time on the data structure via sleep() or similar calls is not a good mitigation. This either massively slows down your throughput, or leaks information via side channels. Remote attackers can measure throughput instead of latency to determine how long an operation takes.

Corrections appreciated from my knowledgeable readers :) I was quite disappointed when I realized that there were no good constant-time data structures and would be happy to be proven wrong. Thanks to Darius Bacon, Zooko Wilcox-O'Hearn, Jan Lehnardt, and Paul Khuong on Twitter for their insights; all mistakes are mine.

scheme workshop 2014

27 November 2014 5:48 PM (scheme | guile | igalia | gnu | adaptive optimization | javascript)

I just got back from the US, and after sleeping for 14 hours straight I'm in a position to type about stuff again. So welcome back to the solipsism, France and internet! It is good to see you on a properly-sized monitor again.

I had the enormously pleasurable and flattering experience of being invited to keynote this year's Scheme Workshop last week in DC. Thanks to John Clements, Jason Hemann, and the rest of the committee for making it a lovely experience.

My talk was on what Scheme can learn from JavaScript, informed by my work in JS implementations over the past few years; you can download the slides as a PDF. I managed to record audio, so here goes nothing:

55 minutes, vorbis or mp3

It helps to follow along with the slides. Some day I'll augment my slide-rendering stuff to synchronize a sequence of SVGs with audio, but not today :)

The invitation to speak meant a lot to me, for complicated reasons. See, Scheme was born out of academic research labs, and to a large extent that's been its spiritual core for the last 40 years. My way to the temple was as a money-changer, though. While working as a teacher in northern Namibia in the early 2000s, fleeing my degree in nuclear engineering, trying to figure out some future life for myself, for some reason I was recording all of my expenses in Gnucash. Like, all of them, petty cash and all. 50 cents for a fat-cake, that kind of thing.

I got to thinking "you know, I bet I don't spend any money on Tuesdays." See, there was nothing really to spend money on in the village besides fat cakes and boiled eggs, and I didn't go into town to buy things except on weekends or later in the week. So I thought that it would be neat to represent that as a chart. Gnucash didn't have such a chart but I knew that they were implemented in Guile, as part of this wave of Scheme consciousness that swept the GNU project in the nineties, and that I should in theory be able to write it myself.

Problem was, I also didn't have internet in the village, at least then, and I didn't know Scheme and I didn't really know Gnucash. I think what I ended up doing was just monkey-typing out something that looked like the rest of the code, getting terrible errors but hey, it eventually worked. I submitted the code, many years ago now, some of the worst code you'll read today, but they did end up incorporating it into Gnucash and to my knowledge that report is still there.

I got more into programming, but still through the back door, so to speak. I had done some free software work before going to Namibia, on GStreamer, and wanted to build a programmable modular synthesizer with it. I read about Supercollider, and decided I wanted to do something like that but with the "unit generators" defined in GStreamer and orchestrated with Scheme. If I knew then that Scheme could be fast, I probably would have started on an entirely different course of things, but that did at least result in gainful employment doing unrelated GStreamer things, if not a synthesizer.

Scheme became my dominant language for writing programs. It was fun, and the need to re-implement a bunch of things wasn't a barrier at all -- rather a fun challenge. After a while, though, speed was becoming a problem. It became apparent that the only way to speed up Guile would be to replace its AST interpreter with a compiler. Thing is, I didn't know how to write one! Fortunately there was previous work by Keisuke Nishida, jetsam from the nineties wave of Scheme consciousness. I read and read that code, mechanically monkey-typed it into compilation, and slowly reworked it into Guile itself. In the end, around 2009, Guile was faster and I found myself its co-maintainer to boot.

Scheme has been a back door for me for work, too. I randomly met Kwindla Hultman-Kramer in Namibia, and we found Scheme to be a common interest. Some four or five years later I ended up working for him with the great folks at Oblong. As my interest in compilers grew, and it grew as I learned more about Scheme, I wanted something closer there, and that's what I've been doing in Igalia for the last few years. My first contact there was a former Common Lisp person, and since then many contacts I've had in the JS implementation world have been former Schemers.

So it was a delight when the invitation came to speak (keynote, no less!) the Scheme Workshop, behind the altar instead of in the foyer.

I think it's clear by now that Scheme as a language and a community isn't moving as fast now as it was in 2000 or even 2005. That's good because it reflects a certain maturity, and makes the lore of the tribe easier to digest, but bad in that people tend to ossify and focus on past achievements rather than future possibility. Ehud Lamm quoted Nietzche earlier today on Twitter:

By searching out origins, one becomes a crab. The historian looks backward; eventually he also believes backward.

So it is with Scheme and Schemers, to an extent. I hope my talk at the conference inspires some young Schemer to make an adaptively optimized Scheme, or to solve the self-hosted adaptive optimization problem. Anyway, as users I think we should end the era of contorting our code to please compilers. Of course some discretion in this area is always necessary but there's little excuse for actively bad code.

Happy hacking with Scheme, and au revoir!

on yakshave, on color, on cosines, on glitchen

14 November 2014 4:49 PM (jpeg | scheme | guile | colorspaces | yaks | photos | dct | compression)

Hold on to your butts, kids, because this is epic.

on yaks

As in all great epics, our prideful, stubborn hero starts in a perfectly acceptable state of things, decides on a lark to make a small excursion, and comes back much much later to inflict upon you pictures from his journey.

So. I have a web photo gallery but I don't take many pictures these days. Dealing with photos is a bit of a drag, and the ways that are easier like Instagram or what-not give me the (peer, corporate, government: choose 3) surveillance hives. So, I had vague thoughts that I should update my web gallery. Yakpoint 1.

At the same time, my web gallery was written for mod_python on the server, and I don't like hacking in Python any more and kinda wanted to switch away from Apache. Yakpoint 2.

So I rewrote the server-side part in Scheme. (Yakpoint 3.) It worked fine but I found I needed the ability to get the dimensions of files on the server, so I wrote a quick-and-dirty JPEG parser. Yakpoint 4.

I needed EXIF data as well, as the original version displayed EXIF data, and for that I used a binding to libexif that I had written a few years ago when I thought about starting this project (Yakpoint -1). However I found some crashers in the library, because it had never really been tested in production, and instead of fixing them I said "what the hell, I'll just write an EXIF parser". (Yakpoint 5.) So I did and adapted the web gallery to use it (Yakpoint 6, for the adaptation.)

At this point, I looked back, and looked forward, and looked all around, and all was good, but what was with this uneasiness I was feeling? And indeed, I hadn't actually made anything better, and I wasn't taking more photos, and the workflow was the same.

I was also concerned about the client side of things, which was still in Python and using some breakage-prone legacy libraries to do the photo scaling and transformations and what-not, and relied on a desktop application (f-spot) of dubious future. So I started to look at what it would take to port that script to Scheme (Yakpoint 7). Well it used some legacy libraries to copy files over SSH (gnome-vfs; switching away from that would be Yakpoint 8) and I didn't want to make a Scheme GIO binding (Yakpoint 9, narrowly avoided), and I then -- and then, dear reader -- so then I said "well WTF my caching story on the server is crap anyway, I never know when the sqlite database has changed or not so I never know what responses I can cache, what I really want is a functional datastore" (Yakpoint 10), which is what I have with Git and Tekuti (Yakpoint of yore), and so why not just store my photos in Git like I do in Tekuti for blog posts and serve them from there, indexing as needed? Of course I'd need some other server software (Yakpoint of fore, by which I meantersay the future), but then I could just git push to update my photo gallery, and I wouldn't have to endure the horror that is GVFS shelling out to ssh in a FUSE daemon (Yakpoint of ne'er).

So. After mulling over these thoughts for a while I decided, during an autumnal walk on the Salève in which we had the greatest views of Mont Blanc everrrrr and yet where are the photos?, that really what I needed was new photo management software, not just a web gallery. I should be able to share photos from my phone or from my desktop, fix them up either place, tag and such, and OK woo hoo! Such is the future! And the present for many people? Thing is, I also needed good permissions management (Yakpoint what, 10 I guess?), because you know a dude just out of college is not the same as that dude many years later. Which means serving things over HTTPS (Yakpoints 11-47) in such a way that the app has some good control over who gets what.

Well. Anyway. My mind ran ahead, and runs ahead, and yet we haven't actually tasted the awesome sauce yet. So! The photo management software, whereever it lives, needs to rotate photos at least, and scale them down to a few resolutions. I smell a yak! I looked at jpegtran which can do some lossless rotations but it's not available as a library, which is odd; and really I don't like shelling out for core program functionality, because every time I deal with the file system it's the wild west of concurrent mutation. If naming things is one of the two hardest problems in computer science, the file system is the worst because you have to give a global name to every intermediate value.

At the same time to scale images, what was I to do? Make a binding to libjpeg? Well I started (Yakpoint 48) but for reals kids, libjpeg is not fun. It works great and is really clever but

  1. it's approximately impossible to use from a dynamic ffi; you want a compiler to verify that you are using the right structure definitions

  2. there has been an inane ABI and format break imposed by the official IJG libjpeg but which other implementations have not followed, but how could you know which one you are using?

  3. the error handling facility encourages longjmp in C programs; somewhat terrifying

  4. off-heap image manipulation libraries always interact poorly with GC, because the GC only sees the small pointer to the off-heap image, and so doesn't GC often enough

  5. I have zero guarantee that libjpeg won't change ABI in weird ways, and I don't want to touch this software for the next 10 years

  6. I want to do jpegtran-like lossless transformations, but that's not available as a library, and it's totes ridics that binding libjpeg does not help you out here

  7. it's still an unsafe C library, battle-tested yes, but terrifyingly unsafe, and I'd be putting it on my server and who knows?

Friends, I arrived at the pasture, and I, I chose the yak less shaven. I took my lame JPEG parser and turned it into a full decoder (Yakpoint 49), realized it wasn't much more work to do an encoder (Yakpoint 50), and implemented the lossless transformations (Yakpoint 51).

on haters

Before we go on, I know some people would think "what is this kid about". I mean, custom gallery software, a custom JPEG library of all things, all bespoke, why don't you just use off-the-shelf solutions? Why aren't you normal and use a normal language and what about the best practices and where's your business case and I can't go on about this because there's a technical term for people that say this kind of thing and it's "hater".

Thing is, when did a hater ever make anything cool? Come to think of it, when did a hater make anything at all? In my experience the most vocal haters have nothing behind their names except a long series of pseudonymous rants in other people's comment boxes. So friends, in the joyful spirit of earning-anew, let's talk about JPEG!

on color

JPEG is a funny thing. Photos are our lives and our memories, our first steps and our friends, and yet I for one didn't know very much about them. My mental model that "a JPEG is a rectangle of pixels" doesn't turn out to be quite right.

If you actually look in a normal JPEG, you see three planes of information. If I take this image, for example:

If I decode it, actually I get three images. Here's the first one:

This is just the greyscale version of the image. So, storytime! Remember black and white television? We had an old one that got moved around the house sometimes, like if Mom was working at something in the kitchen. We also had a color one in the living room, and you could watch one or the other and they showed the same stuff. Strange when you think about it though -- one being in color and the other not. Well it turns out that color was literally just added on, both historically and technically. The main broadcast was still in black and white, and then in one part of the frequency band there were separate color signals, which color TVs would pick up, mix with the black and white signal, and come out with color. Wikipedia notes that "color TV" was really just "colored TV", which is a phrase whose cleverness I respect. Big ups to the W P.

In the context of JPEG, this black-and-white signal is sometimes called "luma", but is more precisely called Y', where the "prime" (the apostrophe) indicates that the signal has gamma correction applied.

In the image above, I replaced the color planes (sometimes collectively called the "chroma") with zeroes, while losslessly keeping the luma. Below is the first color plane, with the Y' plane replaced with a uniform 50% luma, and the other color plane replaced with zeros.

This color signal is technically known as CB, which may be very imperfectly understood as the bluish component of the color. Well the original image wasn't very blue, so we don't see very much here.

Indeed, our eyes have a harder time seeing differences in color than differences in intensity. Apparently this goes all the way down to biology -- we have more receptors in our eyes for "black and white" and fewer for color.

Early broadcasters took advantage of this difference in perception by actually devoting more bandwidth in their broadcasts to luma than to chroma; if you check the Wikipedia page you will see that the area in the spectrum allocation devoted to color is much smaller than the area devoted to intensity. So it is in JPEG: the above image being half-width indicates that actually we're just encoding one CB sample for every two Y' samples.

Finally, here we have the CR color plane, which can loosely be thought of as the "redness" of the image.

These test images and crops preserve the actual encoding of this photo as it came from my camera, without re-encoding. That's partly why there's not much interesting going on; with the megapixels these days, it's hard to fit much of anything in a few hundred pixels square. This particular camera is sub-sampling in the horizontal direction, but it's also common to subsample vertically as well, producing color planes that are half-width and half-height. In my limited investigations I have found that cameras tend to sub-sample just in the X direction, producing what they call 4:2:2 images, and that standard software encoders subsample in both, producing 4:2:0.

Incidentally, properly scaling up the color planes is quite an irritating endeavor -- the standard indicates that the color is sampled between the locations of the Y' samples ("centered" chroma), but these images originally have EXIF data that indicates that the color samples are taken at the position of the first Y' sample ("co-sited" chroma). I'm pretty sure libjpeg doesn't delve into the EXIF to check this though, so it would seem that all renderings I have seen of these photos are subtly off.

But how do you get proper color out of these strange luma and chroma things? Well, the Y'CBCR colorspace is really just the same color cube as RGB, except rotated: the Y' axis traverses the diagonal from (0, 0, 0) (black) to (255, 255, 255) (white). CB and CR are perpendicular to that diagonal, pointing towards blue or red respectively. So to go back to RGB, you multiply by a matrix to rotate the cube.

It's not a very intuitive color system, as you can see from the images above. For one thing, at zero or full luma, the chroma axes have no meaning; black and white can have no hue. Indeed if you imagine trying to fit a cube corner-down into a similar-sized box, you end up either having empty space in the box, or you have to cut off corners from the cube, or both. Cut corners means that bits of the Y'CBCR signal are wasted; empty space means there are RGB colors that are not representable in Y'CBCR. I'm not sure, but I think both are true for the particular formulation of Y'CBCR used in JPEG.

There's more to say about color here but frankly I don't know enough to do so, even though I worked in digital video for many years. If this is something you are mildly interested in, I highly, highly recommend watching Wim Taymans' presentation at this year's GStreamer conference. He takes a look at color in video that is constructive, building up from biology through math to engineering. His is a principled approach rather than a list of rules. It really clarified a number of things for me (and opened doors to unknown unknowns beyond).

on cosines

Where were we? Right, JPEG. So the proper way to understand what JPEG is is to understand the encoding process. We've covered colorspace conversion from RGB to Y'CBCR and sub-sampling. Next, the image canvas is divided into equal-sized "macroblocks". (These are called "minimum coded units" (MCUs) in the JPEG context, but in video they are usually called macroblocks, and it's a better name.) Without sub-sampling, each macro-block will contain one 8-sample-by-8-sample block for each component (Y', CB, CR) of the image. In my images above, the canvas space corresponding to one chroma block is the space of two luma blocks, so the macroblocks will be 16 samples wide and 8 samples tall, and contain two Y' blocks and one each of CB and CR. If the image canvas can't be evenly divided into macroblocks, it is padded to fit, usually by duplicating the last column or row of samples.

Then to make a JPEG, each block is encoded separately, then the whole thing is just written out to a file, and you're done!

This description glosses over a couple of important points, but it's a good big-picture view to have in mind. The pipeline goes from RGB pixels, to a padded RGB canvas, to separate Y'CBCR planes, to a possibly subsampled set of those planes, to macroblocks, to encoded macroblocks, to the file. Decoding is the reverse. It's a totally doable, comprehensible thing, and that was one of the big takeaways for me from this project. I took photography classes in high school and it was really cool to see how to shoot, develop, and print film, and this is similar in many ways. The real "film" is raw-format data, which some cameras produce, but understanding JPEG is like understanding enlargers and prints and fixer baths and such things. It's smelly and dark but pretty cool stuff.

So, how do you encode a block? Well peoples, this is a kinda cool thing. Maybe you remember from some math class that, given n uniformly spaced samples, you can always represent that series as a sum of n cosine functions of equally spaced frequencies. In each litle 8-by-8 block, that's what we do: a "forward discrete cosine transformation" (FDCT), which is just multiplying together some matrices for every point in the block. The FDCT is completely separable in the X and Y directions, so the space of 8 horizontal coefficients multiplies by the space of 8 vertical coefficients at each column to yield 64 total coefficients, which is not coincidentally the number of samples in a block.

Funny thing about those coefficients: each one corresponds to a particular horizontal and vertical frequency. We can map these out as a space of functions; for example giving a non-zero coefficient to (0, 0) in the upper-left block of a 8-block-by-8-block grid, and so on, yielding a 64-by-64 pixel representation of the meanings of the individual coefficients. That's what I did in the test strip above. Here is the luma example, scaled up without smoothing:

The upper-left corner corresponds to a frequency of 0 in both X and Y. The lower-right is a frequency of 4 "hertz", oscillating from highest to lowest value in both directions four times over the 8-by-8 block. I'm actually not sure why there are some greyish pixels around the right and bottom borders; it's not a compression artifact, as I constructed these DCT arrays programmatically. Anyway. Point is, your lover's smile, your sunny days, your raw urban graffiti, your child's first steps, all of these are reified in your photos as a sum of cosine coefficients.

The odd thing is that what is reified into your pictures isn't actually all of the coefficients there are! Firstly, because the coefficients are rounded to integers. Mathematically, the FDCT is a lossless operation, but in the context of JPEG it is not because the resulting coefficients are rounded. And they're not just rounded to the nearest integer; they are probably quantized further, for example to the nearest multiple of 17 or even 50. (These numbers seem exaggerated, but keep in mind that the range of coefficients is about 8 times the range of the original samples.)

The choice of what quantization factors to use is a key part of JPEG, and it's subjective: low quantization results in near-indistinguishable images, but in middle compression levels you want to choose factors that trade off subjective perception with file size. A higher quantization factor leads to coefficients with fewer bits of information that can be encoded into less space, but results in a worse image in general.

JPEG proposes a standard quantization matrix, with one number for each frequency (coefficient). Here it is for luma:

(define *standard-luma-q-table*
  #(16 11 10 16 24 40 51 61
    12 12 14 19 26 58 60 55
    14 13 16 24 40 57 69 56
    14 17 22 29 51 87 80 62
    18 22 37 56 68 109 103 77
    24 35 55 64 81 104 113 92
    49 64 78 87 103 121 120 101
    72 92 95 98 112 100 103 99))

This matrix is used for "quality 50" when you encode an 8-bit-per-sample JPEG. You can see that lower frequencies (the upper-left part) are quantized less harshly, and vice versa for higher frequencies (the bottom right).

(define *standard-chroma-q-table*
  #(17 18 24 47 99 99 99 99
    18 21 26 66 99 99 99 99
    24 26 56 99 99 99 99 99
    47 66 99 99 99 99 99 99
    99 99 99 99 99 99 99 99
    99 99 99 99 99 99 99 99
    99 99 99 99 99 99 99 99
    99 99 99 99 99 99 99 99))

For chroma (CB and CR) we see that quantization is much more harsh in general. So not only will we sub-sample color, we will also throw away more high-frequency color variation. It's interesting to think about, but also makes sense in some way; again in photography class we did an exercise where we shaded our prints with colored pencils, and the results were remarkable. My poor, lazy coloring skills somehow rendered leaves lifelike in different hues of green; really though, they were shades of grey, colored in imprecisely. "Colored TV" indeed.

With this knowledge under our chapeaux, we can now say what the "JPEG quality" setting actually is: it's simply that pair of standard quantization matrices scaled up or down. Towards "quality 100", the matrix approaches all-ones, for no quantization, and thus minimal loss (though you still have some rounding, often subsampling as well, and RGB-to-Y'CBCR gamut loss). Towards "quality 0" they scale to a matrix full of large values, for harsh quantization.

This understanding also explains those wavey JPEG artifacts you get on low-quality images. Those artifacts look like waves because they are waves. They usually occur at sharp intensity transitions, which like a cymbal crash cause lots of high frequencies that then get harshly quantized. Incidentally I suspect (but don't know) that this is the same reason that cymbals often sound bad in poorly-encoded MP3s, because of harsh quantization in the frequency domain.

Finally, the coefficients are written out to a file as a stream of bits. Each file gets a huffman code allocated to it, which ideally is built from the distribution of quantized coefficient sizes seen in all of the blocks of an image. There are usually different encodings for luma and chroma, to reflect their different quantizations. Reading and writing this bitstream is a bit of a headache but the algorithm is specified in the JPEG standard, and all you have to do is implement it. Notably, though, there is special support for encoding a run of zero-valued coefficients, which happens often after quantization. There are rarely wavey bits in a blue blue sky.

on transforms

It's terribly common for photos to be wrongly oriented. Unfortunately, the way that many editors fix photo rotation is by setting a bit in the EXIF information of the JPEG. This is ineffectual, as web browsers don't look in the EXIF information, and silly, because it turns out you can losslessly rotate most JPEG images anyway.

Consider that the body of a JPEG is an array of macroblocks. To rotate an image, you just have to rearrange those macroblocks, then rearrange the blocks inside the macroblocks (e.g. swap the two Y' blocks in my above example), then transform the blocks themselves.

The lossless transformations that you can do on a block are transposition, vertical flipping, and horizontal flipping.

Transposition flips a block along its downward-sloping diagonal. To do so, you just swap the coefficients at (u, v) with the coefficients at (v, u). Easy peasey.

Flipping is trickier. Consider the enlarged DCT image from above. What would it take to horizontally flip the function at (0, 1)? Instead of going from light to dark, you want it to go from dark to light. Simple: you just negate the coefficients! But you only want to negate those coefficients that are "odd" in the X direction, which are those coefficients whose column is odd. And actually that's all there is to it. Flipping vertically is the same, but for coefficients whose row is odd.

I said "most images" above because those whose size is not evenly divided by the macroblock size can't be losslessly rotated -- you will end up seeing some of the hidden data that falls off the edge of the canvas. Oh well. Most raw images are properly dimensioned, and if you're downscaling, you already have to re-encode anyway.

But that's just flipping and transposition, you say! What about rotation? Well it turns out that you can express rotation in terms of these operations: rotating 90 degrees clockwise is just a transpose and a horizontal flip (in that order). Together, flipping horizontally, flipping vertically, and transposing form a group, in the same way that flipping and flopping form a group for mattresses. Yeah!

on scheme

I wrote this library in Scheme because that's my language of choice these days. I didn't run into any serious impedance mismatches; Guile has a generic multi-dimensional array facility that made it possible to express many of these operations as generic folds, unfolds, or maps over arrays. The huffman coding part was a bit irritating, but all in all things were pretty good. The speed is pretty bad, but I haven't optimized it at all, and it gives me a nice test case for the compiler. Anyway, it's been fun and it suits my needs. Check out the project page if you're interested. Yes, to shave a yak you have to get a bit bovine and smelly, but yaks live in awesome places!

Finally I will leave you with a glitch, one of many that I have produced over the last couple weeks. Comments and corrections welcome below. Happy hacking!