I'm Brett Slatkin and this is where I write about programming and related topics.

19 November 2012

Distillation: Small Data is the Purpose of Big Data

I was explaining to someone this weekend how most of the work I do these days with "big data" is to extract "small data" that I can reasonably visualize or interact with. This is how I try to make sense of loads of information. The process is called distillation.

In general "big data" is a pain in the ass. Way too many small records sitting around, doing nothing, eating up disk. And every day it gets worse. There are many opportunities to turn big datasets into products, but these require lots of planning and infrastructure building. The more common interaction I hear of these days is extracting just a few facts with clarity.

Today's example: I ran a MapReduce job to extract cohort data. The input dataset is ~1B records*. The output dataset is 10,000 rows of CSV. This job goes from ~1TB of source data to ~1MB of distilled data, so the output is about 1,000,000 times smaller than the input. The source data is a terrifying mess. The result data is also a mess, but at least it's something I can massage.
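To make the shape of that kind of job concrete, here is a minimal sketch in Python. It is not the actual job: it runs in memory rather than on a MapReduce framework, and the record fields (`user_id`, `signup_date`, `event_date`), the weekly cohort definition, and the function names are all assumptions made up for illustration. The map step keys each event by signup week and weeks since signup; the reduce step counts distinct users per key; the result is a handful of CSV rows.

```python
import csv
import io
from collections import defaultdict
from datetime import date


def map_record(record):
    """Map step: emit ((cohort, weeks_since_signup), user_id) for one event.

    The field names here are hypothetical, not from the real dataset.
    """
    signup = date.fromisoformat(record["signup_date"])
    event = date.fromisoformat(record["event_date"])
    iso_year, iso_week = signup.isocalendar()[:2]
    cohort = f"{iso_year}-W{iso_week:02d}"  # assumed: cohort = ISO week of signup
    weeks_later = (event - signup).days // 7
    yield (cohort, weeks_later), record["user_id"]


def distill(records):
    """Shuffle + reduce: count distinct active users per (cohort, week) key."""
    groups = defaultdict(set)
    for record in records:
        for key, user_id in map_record(record):
            groups[key].add(user_id)
    return [
        (cohort, week, len(users))
        for (cohort, week), users in sorted(groups.items())
    ]


def to_csv(rows):
    """Write the distilled rows as small, spreadsheet-friendly CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["cohort", "weeks_since_signup", "active_users"])
    writer.writerows(rows)
    return out.getvalue()


# Tiny made-up input standing in for the ~1B source records.
sample = [
    {"user_id": "a", "signup_date": "2012-11-05", "event_date": "2012-11-05"},
    {"user_id": "a", "signup_date": "2012-11-05", "event_date": "2012-11-06"},
    {"user_id": "b", "signup_date": "2012-11-06", "event_date": "2012-11-14"},
    {"user_id": "a", "signup_date": "2012-11-05", "event_date": "2012-11-13"},
]

if __name__ == "__main__":
    print(to_csv(distill(sample)))
```

The point of the sketch is the ratio, not the code: however many events go in, the output has one row per cohort per week, which is small enough to open in a spreadsheet and poke at.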

The ultimate goal of distillation is to make a dataset that's actionable to my team. We want to understand if our product is working well, if users are happy, if our results are high quality, etc. Just as much, we also want to identify problem areas where things could use improvement. I try to generate small datasets that highlight the contrasts where we can actually do something.

One of the most impressive tools for distillation is Dremel. Its external counterpart is BigQuery, which now supports up to 1TB of source data (update: per import). The only limitation with BigQuery is that it still doesn't support big joins like the one I need for my cohort analysis; you're limited to 8MB of compressed data on the left side of the join. But for many cases it's all you'd need.

* This is really "medium data"
© 2009-2016 Brett Slatkin