wingologA mostly dorky weblog by Andy Wingo2015-07-03T11:05:11Ztekutihttps://wingolog.org/feed/atomAndy Wingohttps://wingolog.org/Pfmatch, a packet filtering language embedded in Luahttps://wingolog.org/2015/07/03/pfmatch-a-packet-filtering-language-embedded-in-lua2015-07-03T11:05:11Z2015-07-03T11:05:11Z

Greets, hackers! I just finished implementing a little embedded language in Lua and wanted to share it with you. First, a bit about the language, then some notes on how it works with Lua to reach the high performance targets of Snabb Switch.

the pfmatch language

Pfmatch is a language designed for filtering, classifying, and dispatching network packets in Lua. Pfmatch is built on the well-known pflang packet filtering language, using the fast pflua compiler for LuaJIT.

Here's an example of a simple pfmatch program that just divides up packets depending on whether they are TCP, UDP, or something else:

match {
   tcp => handle_tcp
   udp => handle_udp
   otherwise => handle_other
}

Unlike pflang filters written for such tools as tcpdump, a pfmatch program can dispatch packets to multiple handlers, potentially destructuring them along the way. In contrast, a pflang filter can only say "yes" or "no" on a packet.

Here's a more complicated example that passes all non-IP traffic, drops all IP traffic that is not going to or coming from certain IP addresses, and calls a handler on the rest of the traffic.

match {
   not ip => forward
   ip src 1.2.3.4 => incoming_ip
   ip dst 5.6.7.8 => outgoing_ip
   otherwise => drop
}

In the example above, the handlers after the arrows (forward, incoming_ip, outgoing_ip, and drop) are Lua functions. The part before the arrow (not ip and so on) is a pflang expression. If the pflang expression matches, its handler will be called with two arguments: the packet data and the length. For example, if the not ip pflang expression is true on the packet, the forward handler will be called.

It's also possible for the handler of an expression to be a sub-match:

match {
   not ip => forward
   ip src 1.2.3.4 => {
      tcp => incoming_tcp(&ip[0], &tcp[0])
      udp => incoming_udp(&ip[0], &ucp[0])
      otherwise => incoming_ip(&ip[0])
   }
   ip dst 5.6.7.8 => {
      tcp => outgoing_tcp(&ip[0], &tcp[0])
      udp => outgoing_udp(&ip[0], &ucp[0])
      otherwise => outgoing_ip(&ip[0])
   }
   otherwise => drop
}

As you can see, the handlers can also have additional arguments, beyond the implicit packet data and length. In the above example, if not ip doesn't match, then ip src 1.2.3.4 matches, then tcp matches, then the incoming_tcp function will be called with four arguments: the packet data as a uint8_t* pointer, its length in bytes, the offset of byte 0 of the IP header, and the offset of byte 0 of the TCP header. An argument to a handler can be any arithmetic expression of pflang; in this case &ip[0] is actually an extension. More on that later. For language lawyers, check the syntax and semantics over in our source repo.

Thanks especially to my colleague Katerina Barone-Adesi for long backs and forths about the language design; they really made it better. Fistbump!

pfmatch and lua

The challenge of designing pfmatch is to gain expressiveness, compared to writing filters by hand, while not endangering the performance targets of Pflua and Snabb Switch. These days Snabb is on target to give ASIC-driven network appliances a run for their money, so anything we come up with cannot sacrifice speed.

In practice what this means is compile, don't interpret. Using the pflua compiler allows us to generalize the good performance that we have gotten on pflang expressions to a multiple-dispatch scenario. It's a pretty straightword strategy. Naturally though, the interface with Lua is more complex now, so to understand the performance we should understand the interaction with Lua.

How does one make two languages interoperate, anyway? With pflang it's pretty clear: you compile pflang to a Lua function, and call the Lua function to match on packets. It returns true or false. It's a thin interface. Indeed with pflang and pflua you could just match the clauses in order:

not_ip = pf.compile('not ip')
incoming = pf.compile('ip src 1.2.3.4')
outgoing = pf.compile('ip dst 5.6.7.8')

function handle(packet, len)
   if not_ip(packet, len) then return forward(packet, len)
   elseif incoming(packet, len) then return incoming_ip(packet, len)
   elseif outgoing(packet, len) then return outgoing_ip(packet, len)
   else return drop(packet, len) end
end

But not only is this tedious, you don't get easy access to the packet itself, and you're missing out on opportunities for optimization. For example, if the packet fails the not_ip check, we don't need to check if it's an IP packet in the incoming check. Compiling a pfmatch program takes advantage of pflua's optimizer to produce good code for the match expression as a whole.

If this were Scheme I would make the right-hand side of an arrow be an expression and implement pfmatch as a macro; see Racket's match documentation for an example. In Lua or other languages that's harder to do; you would have to parse Lua, and it's not clear which parts of the production as a whole are the host language (Lua) and which are the embedded language (pfmatch).

Instead, I think embedding host language snippets by function name is a fine solution. It seems fairly clear that incoming_ip, for example, is some kind of function. It's easy to parse identifiers in an embedded language, both for humans and for programs, so that takes away a lot of implementation headache and cognitive overhead.

We are left with a few problems: how to map names to functions, what to do about the return value of match expressions, and how to tie it all together in the host language. Again, if this were Scheme then I'd use macros to embed expressions into the pfmatch term, and their names would be scoped into whatever environment the match term was defined. In Lua, the best way to implement a name/value mapping is with a table. So we have:

local handlers = {
   forward = function(data, len)
      ...
   end,
   drop = function(data, len)
      ...
   end,
   incoming_ip = function(data, len)
      ...
   end,
   outgoing_ip = function(data, len)
      ...
   end
}

Then we will pass the handlers table to the matcher function, and the matcher function will call the handlers by name. LuaJIT will mostly take care of the overhead of the table dispatch. We compile the filter like this:

local match = require('pf.match')

local dispatcher = match.compile([[match {
   not ip => forward
   ip src 1.2.3.4 => incoming_ip
   ip dst 5.6.7.8 => outgoing_ip
   otherwise => drop
}]])

To use it, you just invoke the dispatcher with the handlers, data, and length, and the return value is whatever the handler returns. Here let's assume it's a boolean.

function loop(self)
   local i, o = self.input.input, self.output.output
   while not link.empty() do
      local pkt = link.receive(i)
      if dispatcher(handlers, pkt.data, pkt.length) then
         link.transmit(o, pkt)
      end
   end
end

Finally, we're ready for an example of a compiled matcher function. Here's what pflua does with the match expression above:

local cast = require("ffi").cast
return function(self,P,length)
   if length < 14 then return self.forward(P, len) end
   if cast("uint16_t*", P+12)[0] ~= 8 then return self.forward(P, len) end
   if length < 34 then return self.drop(P, len) end
   if P[23] ~= 6 then return self.drop(P, len) end
   if cast("uint32_t*", P+26)[0] == 67305985 then return self.incoming_ip(P, len) end
   if cast("uint32_t*", P+30)[0] == 134678021 then return self.outgoing_ip(P, len) end
   return self.drop(P, len)
end

The result is a pretty good dispatcher. There are always things to improve, but it's likely that the function above is better than what you would write by hand, and it will continue to get better as pflua improves.

Getting back to what I mentioned earlier, when we write filtering code by hand, we inevitably end up writing interpreters for some kind of filtering language. Network functions are essentially linguistic in nature: static appliances are no good because network topologies change, and people want solutions that reflect their problems. Usually this means embedding an interpreter for some embedded language, for example BPF bytecode or iptables rules. Using pflua and pfmatch expressions, we can instead compile a filter suited directly for the problem at hand -- and while we're at it, we can forget about worrying about pesky offsets, constants, and bit-shifts.

challenges

I'm optimistic about pfmatch or something like it being a success, but there are some challenges too.

One challenge is that pflang is pretty weird. For example, attempting to access ip[100] will abort a filter immediately on a packet that is less than 100 bytes long, not including L2 encapsulation. It's wonky semantics, and in the context of pfmatch, aborting the entire pfmatch program would obviously be the wrong thing. That would abort too much. Instead it should probably just fail the pflang test in which that packet access appears. To this end, in pfmatch we turn those aborts into local expression match failures. However, this leads to an inconsistency with pflang. For example in (ip[100000] == 0 or (1==1)), instead of ip[100000] causing the whole pflang match to fail, it just causes the local test to fail. This leaves us with 1==1, which passes. We abort too little.

This inconsistency is probably a bug. We want people to be able to test clauses with vanilla pflang expressions, and have the result match the pfmatch behavior. Due to limitations in some of pflua's intermediate languages, it's likely to persist for a while. It is the only inconsistency that I know of, though.

Pflang is also underpowered in many ways. It has terrible IPv6 support; for example, tcp[0] only matches IPv4 packets, and at least as implemented in libpcap, most payload access on IPv6 packets does the wrong thing regarding chained extension headers. There is no facility in the language for binding names to intermediate results, there is no linguistic facility for talking about fragmentation, no ability to address IP source and destination addresses in arithmetic expressions by name, and so on. We can solve these in pflua with extensions to the language, but that introduces incompatibilities with pflang.

You might wonder why to stick with pflang, after all of this. If this is you, Juho Snellman wrote a great article on this topic, just for you: What's wrong with pcap filters.

Pflua's optimizer has mostly helped us, but there have been places where it could be more helpful. When compiling just one expression, you can often end up figuring out which branches are dead-ends, which helps the rest of the optimization to proceed. With more than one successful branch, we had to make a few improvements to the optimizer to actually get decent results. We also had to relax one restriction on the optimizer: usually we only permit transformations that make the code smaller. This way we know we're going in the right direction and will eventually terminate. However because of reasons™ we did decide to allow tail calls to be duplicated, so instead of having just one place in the match function that tail-calls a handler, you can end up with multiple calls. I suspect using a tracing compiler will largely make this moot, as control-flow splits effectively lead to trace duplication anyway, and making sure control-flow joins later doesn't effectively counter that. Still, I suspect that the resulting trace shape will rejoin only at the loop head, instead of in some intermediate point, which is probably OK.

future

With all of these concerns, is pfmatch still a win? Yes, probably! We're going to start using it when building Snabb apps, and will see how it goes. We'll probably end up adding a few more pflang extensions before we're done. If it's something you're in to, snabb-devel is the place to try it out, and see you on the bug tracker. Happy packet hacking!

Andy Wingohttps://wingolog.org/high-performance packet filtering with pfluahttps://wingolog.org/2014/09/02/high-performance-packet-filtering-with-pflua2014-09-02T10:15:49Z2014-09-02T10:15:49Z

Greets! I'm delighted to be able to announce the release of Pflua, a high-performance packet filtering toolkit written in Lua.

Pflua implements the well-known libpcap packet filtering language, which we call pflang for short.

Unlike other packet filtering toolkits, which tend to use the libpcap library to compile pflang expressions bytecode to be run by the kernel, Pflua is a completely new implementation of pflang.

why lua?

At this point, regular readers are asking themselves why this Schemer is hacking on a Lua project. The truth is that I've always been looking for an excuse to play with the LuaJIT high-performance Lua implementation.

LuaJIT is a tracing compiler, which is different from other JIT systems I have worked on in the past. Among other characteristics, tracing compilers only emit machine code for branches that are taken at run-time. Tracing seems a particularly appropriate strategy for the packet filtering use case, as you end up with linear machine code that reflects the shape of actual network traffic. This has the potential to be much faster than anything static compilation techniques can produce.

The other reason for using Lua was because it was an excuse to hack with Luke Gorrie, who for the past couple years has been building the Snabb Switch network appliance toolkit, also written in Lua. A common deployment environment for Snabb is within the host virtual machine of a virtualized server, with Snabb having CPU affinity and complete control over a high-performance 10Gbit NIC, which it then routes to guest VMs. The administrator of such an environment might want to apply filters on the kinds of traffic passing into and out of the guests. To this end, we plan on integrating Pflua into Snabb so as to provide a pleasant, expressive, high-performance filtering facility.

Given its high performance, it is also reasonable to deploy Pflua on gateway routers and load-balancers, within virtualized networking appliances.

implementation

Pflua compiles pflang expressions to Lua source code, which are then optimized at run-time to native machine code.

There are actually two compilation pipelines in Pflua. The main one is fairly traditional. First, a custom parser produces a high-level AST of a pflang filter expression. This AST is lowered to a primitive AST, with a limited set of operators and ways in which they can be combined. This representation is then exhaustively optimized, folding constants and tests, inferring ranges of expressions and packet offset values, hoisting assertions that post-dominate success continuations, etc. Finally, we residualize Lua source code, performing common subexpression elimination as we go.

For example, if we compile the simple Pflang expression ip or ip6 with the default compilation pipeline, we get the Lua source code:

return function(P,length)
   if not (length >= 14) then return false end
   do
      local v1 = ffi.cast("uint16_t*", P+12)[0]
      if v1 == 8 then return true end
      do
         do return v1 == 56710 end
      end
   end
end

The other compilation pipeline starts with bytecode for the Berkeley packet filter VM. Pflua can load up the libpcap library and use it to compile a pflang expression to BPF. In any case, whether you start from raw BPF or from a pflang expression, the BPF is compiled directly to Lua source code, which LuaJIT can gnaw on as it pleases. Compiling ip or ip6 with this pipeline results in the following Lua code:

return function (P, length)
   local A = 0
   if 14 > length then return 0 end
   A = bit.bor(bit.lshift(P[12], 8), P[12+1])
   if (A==2048) then goto L2 end
   if not (A==34525) then goto L3 end
   ::L2::
   do return 65535 end
   ::L3::
   do return 0 end
   error("end of bpf")
end

We like the independence and optimization capabilities afforded by the native pflang pipeline. Pflua can hoist and eliminate bounds checks, whereas BPF is obligated to check that every packet access is valid. Also, Pflua can work on data in network byte order, whereas BPF must convert to host byte order. Both of these restrictions apply not only to Pflua's BPF pipeline, but also to all other implementations that use BPF (for example the interpreter in libpcap, as well as the JIT compilers in the BSD and Linux kernels).

However, though Pflua does a good job in implementing pflang, it is inevitable that there may be bugs or differences of implementation relative to what libpcap does. For that reason, the libpcap-to-bytecode pipeline can be a useful alternative in some cases.

performance

When Pflua hits the sweet spots of the LuaJIT compiler, performance screams.


(full image, analysis)

This synthetic benchmark runs over a packet capture of a ping flood between two machines and compares the following pflang implementations:

  1. libpcap: The user-space BPF interpreter from libpcap

  2. linux-bpf: The old Linux kernel-space BPF compiler from 2011. We have adapted this library to work as a loadable user-space module (source)

  3. linux-ebpf: The new Linux kernel-space BPF compiler from 2014, also adapted to user-space (source)

  4. bpf-lua: BPF bytecodes, cross-compiled to Lua by Pflua.

  5. pflua: Pflang compiled directly to Lua by Pflua.

To benchmark a pflang implementation, we use the implementation to run a set of pflang expressions over saved packet captures. The result is a corresponding set of benchmark scores measured in millions of packets per second (MPPS). The first set of results is thrown away as a warmup. After warmup, the run is repeated 50 times within the same process to get multiple result sets. Each run checks to see that the filter matches the the expected number of packets, to verify that each implementation does the same thing, and also to ensure that the loop is not dead.

In all cases the same Lua program is used to drive the benchmark. We have tested a native C loop when driving libpcap and gotten similar results, so we consider that the LuaJIT interface to C is not a performance bottleneck. See the pflua-bench project for more on the benchmarking procedure and a more detailed analysis.

The graph above shows that Pflua can stream in packets from memory and run some simple pflang filters them at close to the memory bandwidth on this machine (100 Gbit/s). Because all of the filters are actually faster than the accept-all case, probably due to work causing prefetching, we actually don't know how fast the filters themselves can run. At any case, in this ideal situation, we're running at a handful of nanoseconds per packet. Good times!


(full image, analysis)

It's impossible to make real-world tests right now, especially since we're running over packet captures and not within a network switch. However, we can get more realistic. In the above test, we run a few filters over a packet capture from wingolog.org, which mostly operates as a web server. Here we see again that Pflua beats all of the competition. Oddly, the new Linux JIT appears to fare marginally worse than the old one. I don't know why that would be.

Sadly, though, the last tests aren't running at that amazing flat-out speed we were seeing before. I spent days figuring out why that is, and that's part of the subject of my last section here.

on lua, on luajit

I implement programming languages for a living. That doesn't mean I know everything there is to know about everything, or that everything I think I know is actually true -- in particular, I was quite ignorant about trace compilers, as I had never worked with one, and I hardly knew anything about Lua at all. With all of those caveats, here are some ignorant first impressions of Lua and LuaJIT.

LuaJIT has a ridiculously fast startup time. It also compiles really quickly: under a minute. Neither of these should be important but they feel important. Of course, LuaJIT is not written in Lua, so it doesn't have the bootstrap challenges that Guile has; but still, a fast compilation is refreshing.

LuaJIT's FFI is great. Five stars, would program again.

As a compilation target, Lua is OK. On the plus side, it has goto and efficient bit operations over 32-bit numbers. However, and this is a huge downer, the result range of bit operations is the signed int32 range, not the unsigned range. This means that bit.band(0xffffffff, x) might be negative. No one in the history of programming has ever wanted this. There are sensible meanings for negative results to bit operations, but only if an argument was negative. Grr. Otherwise, Lua shares the same concerns as other languages whose numbers are defined as 64-bit doubles.

Sometimes people get upset that Lua starts its indexes (in "arrays" or strings) with 1 instead of 0. It's foreign to me, so it's sometimes a challenge, but it can work as well as anything else. The problem comes in when working with the LuaJIT FFI, which starts indexes with 0, leading me to make errors as I forget which kind of object I am working on.

As a language to implement compilers, Lua desperately misses a pattern matching facility. Otherwise, a number of small gripes but no big ones; tables and closures abound, which leads to relatively terse code.

Finally, how well does trace compilation work for this task? I offer the following graph.


(full image, analysis)

Here the tests are paired. The first test of a pair, for example the leftmost portrange 0-6000, will match most packets. The second test of a pair, for example the second-from-the-left portrange 0-5, will reject all packets. The generated Lua code will be very similar, except for some constants being different. See portrange-0-6000.md for an example.

The Pflua performance of these filters is very different: the one that matches is slower than the one that doesn't, even though in most cases the non-matching filter will have to do more work. For example, a non-matching filter probably checks both src and dst ports, whereas a successful one might not need to check the dst.

It hurts to see Pflua's performance be less than the Linux JIT compilers, and even less than libpcap at times. I scratched my head for a long time about this. The Lua code is fine, and actually looks much like the BPF code. I had taken a look at the generated assembly code for previous traces and it looked fine -- some things that were not as good as they should be (e.g. a fair bit of conversions between integers and doubles, where these traces have no doubles), but things were OK. What changed?

Well. I captured the traces for portrange 0-6000 to a file, and dove in. Trace 66 contains the inner loop. It's interesting to see that there's a lot of dynamic checks in the beginning of the trace, although the loop itself is not bad (scroll down to see the word LOOP:), though with the double conversions I mentioned before.

It seems that trace 66 was captured for a packet whose src port was within range. Later, we end up compiling a second trace if the src port check fails: trace 67. The trace starts off with an absurd amount of loads and dynamic checks -- to a similar degree as trace 66, even though trace 66 dominates trace 67. It seems that there is a big penalty for transferring from one trace to another, even though they are both compiled.

Finally, once trace 67 is done -- and recall that all it has to do is check the destination port, and then update the counters from the inner loop) -- it jumps back to the top of trace 66 instead of the top of the loop, repeating all of the dynamic checks in trace 66! I can only think this is a current deficiency of LuaJIT, and not with trace compilation in general, although the amount of state transfer points to a lack of global analysis that you would get in a method JIT. I'm sure that values are being transferred that are actually dead.

This explains the good performance for the match-nothing cases: the first trace that gets compiled residualizes the loop expecting that all tests fail, and so only matching cases or variations incur the trace transfer-and-re-loop cost.

It could be that the Lua code that Pflua residualizes is in some way not idiomatic or not performant; tips in that regard are appreciated.

conclusion

I was going to pass some possible slogans by our marketing department, but we don't really have one, so I pass them on to you and you can tell me what you think:

  • "Pflua: A Totally Adequate Pflang Implementation"

  • "Pflua: Sometimes Amazing Performance!!!!1!!"

  • "Pflua: Organic Artisanal Network Packet Filtering"

Pflua was written by Igalians Diego Pino, Javier Muñoz, and myself for Snabb Gmbh, fine purveyors of high-performance networking solutions. If you are interested in getting Pflua in a Snabb context, we'd be happy to talk; drop a note to the snabb-devel forum. For Pflua in other contexts, file an issue or drop me a mail at wingo@igalia.com. Happy hackings with Pflua, the totally adequate pflang implementation!