I'm Brett Slatkin and this is where I write about programming and related topics.

27 December 2013

Dyeing the cheese orange – Beware of benchmarks

In the 1500s cheesemakers began using annatto (seeds from the tropical achiote tree) to color their cheese orange. Why?

At the time the best cheeses had an orangey hue. This was caused by carotene (a precursor of vitamin A) in cow's milk. The carotene came from good grass. Good grass made fat cows. Fat cows made fat milk. Fat milk made good cheese. Thus, eaters of cheese could identify the good stuff by an orange rind.

As a consumer of cheese, looking for the orange cue was a lot easier than trying to find out what the cows ate. Cheesemakers using annatto took advantage of this by faking the cue. Regardless of what their cows ate, their cheese would seem high quality because of the orange annatto dye they added. Consumers didn't know the difference and ended up buying inferior cheese.

The practice of using annatto continues to this day. Unsurprisingly, similar strategies are used elsewhere. The analog of cheese's orange color in software development is the performance benchmark. Benchmarks are cues that let us compare many types of systems and make good decisions. They also provide an opportunity to give the wrong impression.


PayPal's engineering team recently posted details of how reimplementing their application platform in Node.js led to vast improvements. It's an interesting write-up, with charts and statistics to make the case. Their conclusion is they'll use Node.js for all new consumer-facing web apps.

With general statements like "double the requests per second" and "35% decrease in the average response time" the post implies that you should be using Node.js for all your web apps, too. But how do you know they aren't just dyeing the cheese orange?

To make an informed decision I need to know what the cows ate:

  • Why was their old system slow? What were the sources of latency? Why couldn't they be remedied?
  • On their application servers, how slow is a null request? How much of the overhead is out of their control?
  • If they had reimplemented the same system from scratch in Java, as before, how much faster would it have been?
  • If they reimplemented in a third environment like Python or Go or C, how would it compare?
  • and many more...
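The null-request question is one of the few on this list you can probe directly. As a minimal sketch (assuming a plain Python function stands in for a request handler; a real measurement would go through the full HTTP stack, parsing, routing, and serialization), you can time a handler that does no work to estimate the floor your framework imposes on every request:

```python
import time

def null_handler(request):
    # A handler that does no work: any time spent here is pure
    # dispatch overhead, not application logic.
    return "ok"

def measure(handler, iterations=100_000):
    # Call the handler many times and return the average
    # per-call latency in microseconds.
    start = time.perf_counter()
    for _ in range(iterations):
        handler(None)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6

baseline_us = measure(null_handler)
print(f"null request overhead: {baseline_us:.2f} us per call")
```

If the null-request baseline already accounts for most of your average response time, switching languages may change the numbers dramatically while saying little about how the platforms handle your actual workload.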

Questions like these are difficult to answer. Rigorous experiments to test Node.js versus Java for the full system would be expensive. This is why we, as software engineers, seek out benchmarks. Unfortunately, benchmarks are easily gamed and often not comparable. I'm not saying PayPal's team is deliberately trying to mislead people. I'm wary because it's easy to run benchmarks incorrectly (and a sample size of one is not representative).


This is why I'm conservative about adopting new tools. I'll read about new stuff. I'll try to gain some familiarity through first-hand experience. I'll listen to the opinions of engineers I know and respect about what's new. But I'm skeptical of sources I don't know and success stories that seem too good to be true. I prefer to move slow. I'd rather know what the cows ate than risk buying dyed cheese.


(PS: Butter is dyed yellow for exactly the same reason.)
© 2009-2016 Brett Slatkin