
02 July 2010

Biggest Map() of my own

I recently ran the biggest Map() job of my life on my own data using the new App Engine Mapper framework: 1.5 billion rows. The data was from the PubSubHubbub reference hub I built and run for Google (it uses App Engine, my day job). The Map job's goal was to update all Datastore entities of a certain type to use less space (700GB less, a ~60% reduction; fun graphs below). These entities are used to track changes to feeds over time, which means I have to store one "row" for each feed item for a long time (perhaps my own version of Yucca Mountain).

The Map() itself is very simple. Here's the Python code:
from mapreduce import operation as op  # Mapper framework operations

def RemoveOldFeedEntryRecordPropertiesMapper(feed_entry_record):
  # Properties that FeedEntryRecord entities no longer need to carry.
  OLD_PROPERTIES = ('entry_id_hash', 'entry_id')
  for name in OLD_PROPERTIES:
    if hasattr(feed_entry_record, name):
      delattr(feed_entry_record, name)
  # Queue a Datastore write of the slimmed-down entity.
  yield op.db.Put(feed_entry_record)
This removes some wasteful properties, writes the entity back to the Datastore, and updates its indexes, many of which I've removed (a big part of why I get the storage savings; more detail here). Importantly, each map call results in a Datastore write; anyone who has run big MapReduce jobs (on Hadoop or otherwise) will tell you that writes (and their poor throughput) are often what kill offline processing performance.
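For context, kicking off a job like this with the Mapper framework only takes a few more lines. Here's a rough sketch using the framework's control API; the module path and entity kind are placeholders rather than the Hub's real code (you can also register the job declaratively in mapreduce.yaml):

from mapreduce import control

# Sketch only: 'myapp' is a placeholder for the module that defines the
# mapper function and the FeedEntryRecord model.
control.start_map(
    name='Remove old FeedEntryRecord properties',
    handler_spec='myapp.RemoveOldFeedEntryRecordPropertiesMapper',
    reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_parameters={'entity_kind': 'myapp.FeedEntryRecord'},
    shard_count=16)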

I ran my job on 16 shards, meaning 16 CPU cores were working in parallel at any given time. My results surprised me! Here are the important stats:
  • Elapsed time: 8 days, 12:59:25
  • mapper_calls: 1533270409 (2077.7/sec avg.)
That's right, 2000+ entity writes per second sustained for 8+ days straight. That's 125 writes/sec/shard. Yes, this was on App Engine. Did you know it could do that? And I saw it peak at 3000+. Wow! What impresses me further is that part of the job ran during this Datastore maintenance period, when all Datastore writes were temporarily disabled. The Mapper framework handled the outage without a problem and continued without my intervention once Datastore writes were re-enabled. Running a job like this through datacenter maintenance is unheard of in my experience.
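For the curious, the averages above fall straight out of the elapsed time and call count:

# Back-of-the-envelope check of the reported throughput.
elapsed_secs = 8 * 24 * 3600 + 12 * 3600 + 59 * 60 + 25  # 8 days, 12:59:25
mapper_calls = 1533270409
print(mapper_calls / float(elapsed_secs))       # ~2077.7 calls/sec overall
print(mapper_calls / float(elapsed_secs) / 16)  # ~130/sec/shard (125 above is the rounder 2000/16)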

In terms of cost, the map job only doubled the daily price of running Google's Hub on App Engine. That's awesome, especially considering the amount of low-hanging optimization fruit (mmm... delicious) still present in the Mapper framework. I expect us to deliver an order of magnitude improvement in the framework's efficiency and performance.

After three-plus years of working on App Engine, it's kind of surprising to me that I'm now more excited about its future than ever before. I've known App Engine was special since the beginning (e.g., no config, easiest-ever deployment, auto-scaling). But where it's headed now is more advanced than any other cluster or batch-data system I have used or read about.

Here's the Datastore statistics page before I ran my job:

Total Number of FeedEntryRecord Entities: 1,433,233,338
Average Size of Entity: 411 Bytes

[Pie charts: Storage Space by Property Type; Storage Space by Entity Kind]

Property             Type        Size
entry_content_hash   String      88 GBytes (16% of the entity)
entry_id_hash        String      81 GBytes (15% of the entity)
update_time          Date/Time   37 GBytes (7% of the entity)
entry_id             Text        31 GBytes (6% of the entity)
topic                Text        1 KBytes (0% of the entity)
topic_hash           String      1 KBytes (0% of the entity)
Metadata                         312 GBytes (57% of the entity)

Here it is after (modulo time and data growth):

Total Number of FeedEntryRecord Entities: 1,756,534,609
Average Size of Entity: 325 Bytes

[Pie charts: Storage Space by Property Type; Storage Space by Entity Kind]

Property             Type        Size
entry_content_hash   String      108 GBytes (20% of the entity)
update_time          Date/Time   46 GBytes (9% of the entity)
entry_id_hash        String      14 MBytes (0% of the entity)
entry_id             Text        2 MBytes (0% of the entity)
Metadata                         378 GBytes (71% of the entity)