requiem for a stringref
Good day, comrades. Today’s missive is about strings!
a problem for java
Imagine you want to compile a program to WebAssembly, with the new GC support for WebAssembly. Your WebAssembly program will run on web browsers and render its contents using the DOM API: Document.createElement, Document.createTextNode, and so on. It will also use DOM interfaces to read parts of the page and read input from the user.
How do you go about representing your program in WebAssembly? The GC support gives you the ability to define a number of different kinds of aggregate data types: structs (records), arrays, and functions-as-values. Earlier versions of WebAssembly gave you 32- and 64-bit integers, floating-point numbers, and opaque references to host values (externref). This is what you have in your toolbox. But what about strings?
WebAssembly’s historical answer has been to throw its hands in the air and punt the problem to its user. This isn’t so bad: the direct user of WebAssembly is a compiler developer and can fend for themself. Using the primitives above, it’s clear we should represent strings as some kind of array.
The source language may impose specific requirements regarding string representations: for example, in Java, you will want to use an (array i16), because Java’s strings are specified as sequences of UTF-16¹ code units, and Java programs are written assuming that random access to a code unit is constant-time.
Similarly, Java is going to need to perform a number of gnarly operations on its strings, for example, locale-specific collation. This is a hard problem whose solution basically amounts to shipping a copy of libICU in their WebAssembly module; that’s a lot of binary size, and it’s not even clear how to compile libICU in such a way that works on GC-managed arrays rather than linear memory.
Thinking about it more, there’s also the problem of regular expressions. A high-performance regular expression engine is a lot of investment, and not really portable from the native world to WebAssembly, as the main techniques require just-in-time code generation, which is unavailable on Wasm.
This is starting to sound like a terrible system: big binaries, lots of copying, suboptimal algorithms, and a likely ongoing functionality gap. What to do?
a solution for java
(Does this sound too complicated or too distasteful to implement? Disabuse yourself of the notion: it’s happening already. V8 does this and other JS/Wasm engines will be forced to follow, as users file bug reports that such-and-such an app is slow on e.g. Firefox but fast on Chrome, and so on and so on. It’s the same dynamic that led asm.js adoption.)
beyond the web?
But does that mean you can’t run Java on Fastly’s cloud? Does the Java compiler have to actually implement all of those things that we were trying to avoid? Will Java actually implement those things? I think the answers to all of those questions is “no”, but also that I expect a pretty crappy outcome.
First of all, it’s not technically required that Java implement its own strings in terms of (array i16). A Java-to-Wasm/GC compiler can keep the strings-as-opaque-host-values paradigm, and instead have these string routines provided by an auxiliary WebAssembly module that itself probably uses (array i16), effectively polyfilling what the browser would give you. The effort of creating this module can be shared between e.g. Java and C#, and the run-time costs for instantiating the module can be amortized over a number of Java users within a process.
However, I don’t expect such a module to be of good quality. It doesn’t seem possible to implement a good regular expression engine that way, for example. And, absent a very good run-time system with an adaptive compiler, I don’t expect the low-level per-codepoint operations to be as efficient with a polyfill as they are on the browser.
Instead, I could see non-web WebAssembly hosts being pressured into implementing their own built-in UTF-16¹ module which has accelerated compilation, a native regular expression engine, and so on. It’s nice to have a portable fallback but in the long run, first-class UTF-16¹ will be everywhere.
So when I say UTF-16¹, I really mean WTF-16: sequences of any 16-bit code units, without the UTF-16 requirement that surrogate code units be properly paired. In this way, WTF-16 encodes a larger language than UTF-16: not just USV codepoints, but also surrogate codepoints.
The existence of WTF-16 is a consequence of a kind of original sin, originating in the choice to expose 16-bit code unit access to the Java programmer, and which everyone agrees should be somehow firewalled off from the rest of the world. The usual way to do this is to prohibit WTF-16 from being transferred over the network or stored to disk: a message sent via an HTTP POST, for example, will never include a surrogate codepoint, and will either replace it with the U+FFFD replacement codepoint or throw an error.
You only have a few options. You can eagerly check that a string is valid UTF-16; this might be a potentially expensive O(n) check, but perhaps this is acceptable. (This check may be faster in the future.) Or, you can replace surrogate codepoints with U+FFFD, when accessing string contents; lossy, but preserves your language’s semantic domain. Or, you can extend your language’s semantics to somehow deal with surrogate codepoints.
My point is that if you want to use JS strings in a non-Java-like language, your language will need to define what to do with invalid UTF-16. Ideally the browser will give you a way to put your policy into practice: replace with U+FFFD, error, or pass through.
beyond java? (reprise) (feat: snakes)
Here we arrive to the second, thornier problem, which is less about domain and more about idiom: in Python, we expect to be able to access strings by codepoint index. This is the case not only to access string contents, but also to refer to positions in strings, for example when extracting a substring. These operations need to be fast (or fast enough anyway; CPython doesn’t have a very high performance baseline to meet).
However, the web platform doesn’t give us O(1) access to string codepoints. Usually a codepoint just takes up one 16-bit code unit, so the (zero-indexed) 5th codepoint of JS string s may indeed be at s.codePointAt(5), but it may also be at offset 6, 7, 8, 9, or 10. You get the point: finding the nth codepoint in a JS string requires a linear scan from the beginning.
Our O(1) random access problem remains, though. Are we stuck?
what does the good world look like
How should a language represent its strings, anyway? Here we depart from a precise gathering of requirements for WebAssembly strings, but in a useful way, I think: we should build abstractions not only for what is, but also for what should be. We should favor a better future; imagining the ideal helps us design the real.
I keep returning to Henri Sivonen’s authoritative article, It’s Not Wrong that “🤦🏼♂️”.length == 7, But It’s Better that “🤦🏼♂️”.len() == 17 and Rather Useless that len(“🤦🏼♂️”) == 5. It is so good and if you have reached this point, pop it open in a tab and go through it when you can. In it, Sivonen argues (among other things) that random access to codepoints in a string is not actually important; he thinks that if you were designing Python today, you wouldn’t include this interface in its standard library. Users would prefer extended grapheme clusters, which is variable-length anyway and a bit gnarly to compute; storage wants bytes; array-of-codepoints is just a bad place in the middle. Given that UTF-8 is more space-efficient than either UTF-16 or array-of-codepoints, and that it embraces the variable-length nature of encoding, programming languages should just use that.
As a model for how strings are represented, array-of-codepoints is outdated, as indeed is UTF-16. Outdated doesn’t mean irrelevant, of course; there is lots of Python code out there and we have to support it somehow. But, if we are designing for the future, we should nudge our users towards other interfaces.
I have two final data-points to indicate that UTF-8 is the way. One is that Swift used to use UTF-16 to represent its strings, but was able to switch to UTF-8. To adapt to the newer performance model of UTF-8, Swift maintainers designed new APIs to allow users to request a view on a string: treat this string as UTF-8, or UTF-16, or a sequence of codepoints, or even a sequence of extended grapheme clusters. Their users appear to be happy, and I expect that many languages will follow Swift’s lead.
Secondly, as a maintainer of the Guile Scheme implementation, I also want to switch to UTF-8. Guile has long used Python’s representation strategy: array of codepoints, with an optimization if all codepoints are “narrow” (less than 256). The Scheme language exposes codepoint-at-offset (string-ref) as one of its fundamental string access primitives, and array-of-codepoints maps well to this idiom. However, we do plan to move to UTF-8, with a Swift-like breadcrumbs strategy for accelerating per-codepoint access. We hope to lower memory consumption, simplify the implementation, and have general (but not uniform) speedups; some things will be slower but most should be faster. Over time, users will learn the performance model and adapt to prefer string builders / iterators (“string ports”) instead of string-ref.
a solution for webassembly in the browser?
Focussing on providing JS strings to Wasm/GC is only really good for Java and friends; the cost of mapping charCodeAt semantics to, say, Python’s strings is likely too high.
JS strings are only present on browsers (and Node and such).
I see the outcome being that Java will have to keep its implementation that uses (array i16) when targetting the edge, and use JS strings on the browser. I think that polyfills will not have acceptable performance. On the edge there will be a binary size penalty and a performance and functionality gap, relative to the browser. Some edge Wasm implementations will be pushed to implement fast JS strings by their users, even though they don’t have JS on the host.
If the JS string builtins proposal were a local maximum, I could see putting some energy into it; it does make the Java case a bit better. However I think it’s likely to be an unstable saddle point; if you are going to infect the edge with WTF-16 anyway, you might as well step back and try to solve a problem that is a bit more general than Java on JS.
stringref: a solution for webassembly?
I think WebAssembly should just bite the bullet and try to define a string data type, for languages that use GC. It should support UTF-8 and UTF-16 views, like Swift’s strings, and support some kind of iterator API that decodes codepoints.
What I describe is what the stringref proposal gives you. We don’t yet have consensus on this proposal in the Wasm standardization group, and we may never reach there, although I think it’s still possible. As I understand them, the objections are two-fold:
WebAssembly is an instruction set, like AArch64 or x86. Strings are too high-level, and should be built on top, for example with (array i8).
I think the first objection is a bit easier to overcome. Firstly, WebAssembly now defines quite a number of components that don’t map to machine ISAs: typed and extensible locals, memory.copy, and so on. You could have defined memory.copy in terms of primitive operations, or required that all local variables be represented on an explicit stack or in a fixed set of registers, but WebAssembly defines higher-level interfaces that instead allow for more efficient lowering to machine primitives, in this case SIMD-accelerated copies or machine-specific sets of registers.
Similarly with garbage collection, there was a very interesting “continuation marks” proposal by Ross Tate that would give a low-level primitive on top of which users could implement root-finding of stack values. However when choosing what to include in the standard, the group preferred a more high-level facility in which a Wasm module declares managed data types and allows the WebAssembly implementation to do as it sees fit. This will likely result in more efficient systems, as a Wasm implementation can more easily use concurrency and parallelism in the GC implementation than a guest WebAssembly module could do.
So, the criteria of what to include in the Wasm standard is not “what is the most minimal primitive that can express this abstraction”, or even “what looks like an ARMv8 instruction”, but rather “what makes Wasm a good compilation target”. Wasm is designed for its compiler-users, not for the machines that it runs on, and if we manage to find an abstract definition of strings that works for Wasm-targetting toolchains, we should think about adding it.
The second objection is trickier. When you compile to Wasm, you need a good model of what the performance of the Wasm code that you emit will be. Different Wasm implementations may use different stringref representations; requesting a UTF-16 view on a string that is already UTF-16 will be cheaper than doing so on a string that is UTF-8. In the worst case, requesting a UTF-16 view on a UTF-8 string is a linear operation on one system but constant-time on another, which in a loop over string contents makes the former system quadratic: a real performance failure that we need to design around.
The stringref proposal tries to reify as much of the cost model as possible with its “view” abstraction; the compiler can reason that any cost may incur then rather than when accessing a view. But, this abstraction can leak, from a performance perspective. What to do?
i’m on a horse
I may be off in some of my predictions about where things will go, so who knows. In the meantime, in the time that it takes other people to reach the same conclusions, stringref is in a kind of hiatus.
The Scheme-to-Wasm compiler that I work on does still emit stringref, but it is purely a toolchain concept now: we have a post-pass that lowers stringref to WTF-8 via (array i8), and which emits calls to host-supplied conversion routines when passing these strings to and from the host. When compiling to Hoot’s built-in Wasm virtual machine, we can leave stringref in, instead of lowering it down, resulting in more efficient interoperation with the host Guile than if we had to bounce through byte arrays.
So, we wait for now. Not such a bad situation, at least we have GC coming soon to all the browsers. appy hacking to all my stringfolk, and until next time!
Comments are closed.