Benchmarks: How not to do them

2009/02/17

Our competitor VirtualLogix published a paper titled “A practical look at micro-kernels and virtual machine monitors” at the IEEE Consumer Communications and Networking Conference (CCNC) in Las Vegas last January. It paints a picture in which, compared to their VLX system, microkernels (read “OKL4”) are an inferior basis for virtualization on CE devices.

This may come as a surprise to those who know that OKL4 is deployed in 250 million mobile phone handsets and VLX in zero! Do you really think our customers, who include leading chipset suppliers and handset OEMs, would deploy such a poor system?

We’ll have a look at how the facts stack up. It turns out that the paper is full of flawed methodology and incorrect conclusions.

Unfortunately, I cannot point to an on-line copy of the paper. As participants of CCNC’09 we have an electronic copy, but the copyright is owned by the IEEE and their rules don’t allow us to re-distribute it. Furthermore, the CCNC proceedings aren’t even up on IEEE Xplore yet. (But note that my CCNC’09 paper is available on-line.) However, VirtualLogix have been busy distributing copies to potential customers, so some of you will have seen it; others will have to wait until it’s up on IEEE Xplore (or ask VirtualLogix for a copy).

There is so much wrong with the paper that it’s hard to work out where to start picking it apart. But given that the most damaging assertions relate to performance, let’s have a look at the benchmarks presented. We can make a few interesting observations.

Evaluation Platform

For one, the paper claims that it is comparing the inherent suitability of what they call “hypervisors” (represented by their VLX system) with “microkernels” (represented by OKL4, as if OKL4 wasn’t a hypervisor, but that’s for a future blog). To end up with a fair comparison, what should you use: the best representative, i.e. a mature, highly-optimised version, or a recent port that hasn’t been optimised?

Well, their choice was to use OKL4 on an ARM11 platform. OKL4 has supported ARM11 for a while, but no serious effort had been put into tuning its performance. Our ARM9 version, on the other hand, has been out there for a long time. In fact, we published its performance years ago and challenged everyone to match it. No-one has come forward, least of all VirtualLogix.

Clearly, to establish performance limits, OKL4 on ARM9 should be the starting point. Ignoring it and using ARM11 is an inadequate approach which I would not let any of my undergraduate students get away with.

Achieving the real performance

Everyone in the business knows that benchmarking is difficult. Even a seemingly easy case, such as measuring lmbench performance on a native Linux system using a binary distribution (it doesn’t get easier than this), turns out to be far from trivial. We’ve seen way too many cases where this led to nonsensical or irreproducible results.

Things tend to be much more tricky in the embedded space. Companies tend to spend weeks and months benchmarking products. It ain’t easy.

Under such circumstances, what would you expect to see if a competitor benchmarks your product, without your involvement (or even knowledge)? Would you expect that competitor to be impartial and disinterested in the result? Do you think your customers should put any faith at all in the results?

I leave the answer to you.

Data completeness and reproducibility

One of the important rules of publishing experimental results is to present them in a way that makes independent validation possible, and to provide sufficient information for someone to carry it out. Not doing this is poor science that is not fit for publication.

Yet these rules have been thoroughly violated in that paper. The only data they present (and on which they build their case) is the relative performance of OKL4 and VLX; no absolute numbers are given.

This means that it is impossible to say whether the measured OKL4 performance makes sense. I don’t know what the baseline is. The results could be a factor of ten off, and you couldn’t tell. This is appalling science!
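To make the point concrete, here is a minimal sketch of why a ratio on its own cannot be sanity-checked. All latencies below are hypothetical values chosen for illustration; none of them come from the paper or from our own measurements, and the function name is likewise made up.

```python
# Hypothetical latencies in microseconds, for illustration only.
# None of these numbers come from the paper or from real measurements.

def relative_performance(okl4_us: float, vlx_us: float) -> float:
    """Return OKL4 latency divided by VLX latency (above 1.0 means OKL4 is slower)."""
    return okl4_us / vlx_us

# Scenario A: both systems were measured on a sane setup.
plausible = relative_performance(okl4_us=6.0, vlx_us=4.0)

# Scenario B: every measurement is a factor of ten worse, e.g. because of a
# misconfigured board. The published ratio is exactly the same.
implausible = relative_performance(okl4_us=60.0, vlx_us=40.0)

print(plausible, implausible)
```

Both scenarios report the same ratio, so a reader who is shown only the ratio has no way of telling them apart. Publishing the absolute numbers alongside the ratios would remove the ambiguity.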

You may wonder how this could pass through a scientific peer-review process. In my experience it clearly shouldn’t have. The (anonymous) reviewers of that paper were obviously out of their depth.

Apples vs oranges

Finally, the paper compares apples with oranges, without saying so (but in this case the reviewers were given no chance to find out, as the details were not revealed in the paper).

The story behind this is that VLX can be run in two ways. One is proper virtualization (this and only this is what they describe in the paper), where the hypervisor is in full control of hardware resources and the guest operating systems are de-privileged. The other, which is not mentioned in the paper at all and thus not revealed to the reader or the reviewers (although the presenter made a lot of fuss about it in the talk at CCNC), is what I call pseudo-virtualization: the guest OS is not de-privileged but co-located with the “hypervisor” in kernel mode. Obviously, this means that the hypervisor no longer has exclusive control over resources, which is why this completely fails to qualify as virtualization, even according to the definition they give in their own paper! (You don’t believe me? Check for their description of “optional” isolation on their website. Isolation of guests is implicit in a virtual-machine environment, not optional. Why would they do this? Could it be because the performance of their “isolated” execution mode is unimpressive?)

[Note added 2012-01-08: VirtualLogix has since been acquired by Red Bend Software, and the original content is no longer available. However, the description of Red Bend’s “Mobile Virtualization” technology refers to an optional “Isolator” module, which provides “even stronger isolation”. Given that virtualization, by definition, provides full isolation, one can conclude that the above arguments still apply.]

What does this have to do with that paper? Well, in front of me I have our lmbench performance data for the 2.1 release measured on a Freescale iMX31 processor (the same used by VirtualLogix in their paper). The interesting observation is that for most measures, what they claim as the performance ratio between VLX- and OKL4-based virtualization is close to the ratio between the performance of native Linux and OK Linux (remember, this is an un-optimised OKL4 version on ARM11, much improved in the meantime). So, what are we to conclude? The virtualization overhead of VLX is essentially zero? If you believe that, then I’ve got a great deal for you in snake oil with really amazing healing properties!
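The arithmetic behind that conclusion can be sketched as follows. The function and its inputs are hypothetical and only illustrate the reasoning; the ratios used are made-up values, not figures from the paper or from our lmbench data.

```python
def implied_vlx_overhead(vlx_over_oklinux: float, native_over_oklinux: float) -> float:
    """Implied slowdown of Linux-on-VLX relative to native Linux.

    Both arguments are time ratios for the same benchmark:
      vlx_over_oklinux    = time(Linux on VLX) / time(OK Linux)
      native_over_oklinux = time(native Linux) / time(OK Linux)
    Dividing them cancels the OK Linux time and leaves
    time(Linux on VLX) / time(native Linux). Subtracting 1 gives the
    overhead as a fraction, where 0.0 means no overhead at all.
    """
    return vlx_over_oklinux / native_over_oklinux - 1.0

# Hypothetical ratios, for illustration only. If the claimed VLX-to-OKL4
# ratio equals the native-to-OK-Linux ratio, the implied overhead is zero.
overhead = implied_vlx_overhead(vlx_over_oklinux=0.5, native_over_oklinux=0.5)
print(f"implied VLX virtualization overhead: {overhead:.0%}")
```

In other words, whenever the two ratios coincide, the numbers imply that a virtualized Linux runs exactly as fast as native Linux, which is not a credible result for a de-privileged guest.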

What’s really behind this is apples and oranges. As became clear during the talk at CCNC, they compared pseudo-virtualized Linux on VLX (i.e., the Linux guest running privileged) with properly-virtualized (we don’t do anything else) OK Linux (running in user mode) on OKL4. I leave it to you to judge whether that is a fair comparison.

Summary

So, in summary, the benchmark data presented in the paper is worthless: it uses an unoptimised OKL4 platform where a well-optimised one is available, even then it may not be showing the real performance of the OKL4 system they used (because of the challenges involved in benchmarking), and it compares apples with oranges.

We aren’t afraid of comparing our performance with VLX. But it has to be a fair, transparent, apples-with-apples comparison.

I’ll discuss some of the other main faults of the paper in future blogs; stay tuned.

