02 July 2010

Biggest Map() of my own.

I recently ran the biggest Map() job of my life on my own data using the new App Engine Mapper framework: 1.5 billion rows. The data was from the PubSubHubbub reference hub I built and run for Google (it uses App Engine, my day job). The Map job's goal was to update all Datastore entities of a certain type to use less space (700GB less, ~60% reduction, fun graphs below). These entities are used to track changes to feeds over time, which means I have to store one "row" for each feed item for a long time (perhaps my own version of Yucca Mountain).

The Map() itself is very simple. Here's the Python code:
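from mapreduce import operation as op

def RemoveOldFeedEntryRecordPropertiesMapper(feed_entry_record):
  """Strips obsolete properties from one entity and writes it back."""
  OLD_PROPERTIES = ('entry_id_hash', 'entry_id')
  for name in OLD_PROPERTIES:
    if hasattr(feed_entry_record, name):
      delattr(feed_entry_record, name)
  # Yield a Put operation so the Mapper framework performs the Datastore write.
  yield op.db.Put(feed_entry_record)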
This removes some wasteful properties, writes the entity back to the Datastore, and updates its indexes, many of which I've removed (a big part of the storage savings; more detail here). Importantly, each map call results in a Datastore write; anyone who has run big MapReduce jobs (on Hadoop or otherwise) will tell you that writes, and their poor throughput, are often what kill offline processing performance.
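To try the property-stripping logic without App Engine, here is a minimal sketch that runs the same hasattr/delattr loop on a plain Python object. `FakeRecord` and `strip_old_properties` are hypothetical names made up for this illustration; they are not part of the hub's code or the Mapper framework, and no Datastore write happens here.

```python
class FakeRecord(object):
    """Hypothetical stand-in for a FeedEntryRecord entity (not the real model)."""

    def __init__(self, **properties):
        for name, value in properties.items():
            setattr(self, name, value)


OLD_PROPERTIES = ('entry_id_hash', 'entry_id')


def strip_old_properties(record):
    """Deletes each obsolete attribute if present; returns the names removed."""
    removed = []
    for name in OLD_PROPERTIES:
        if hasattr(record, name):
            delattr(record, name)
            removed.append(name)
    return removed


record = FakeRecord(entry_id_hash='abc', entry_id='tag:1',
                    entry_content_hash='def')
removed = strip_old_properties(record)
print(removed)
```

Running it a second time on the same record removes nothing, which is why the mapper is safe to re-run over entities that were already cleaned up.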

I ran my job on 16 shards, meaning 16 CPU cores were in use in parallel. My results surprised me! Here are the important stats:
  • Elapsed time: 8 days, 12:59:25
  • mapper_calls: 1533270409 (2077.7/sec avg.)
That's right, 2000+ entity writes per second sustained for 8+ days straight. That's about 130 writes/sec/shard. Yes, this was on App Engine. Did you know it could do that? And I saw it peak at 3000+. Wow! What impresses me further is that part of the job ran during this Datastore maintenance period, when all Datastore writes were temporarily disabled. The Mapper framework handled the outage without a problem and continued without my intervention once Datastore writes were re-enabled. Running a job like this through datacenter maintenance is unheard of in my experience.
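The averages can be re-derived from the two reported stats. This is just a quick arithmetic check using the elapsed time, the mapper_calls counter, and the 16 shards mentioned above; the variable names are mine.

```python
from datetime import timedelta

# Values copied from the job stats in the post.
elapsed = timedelta(days=8, hours=12, minutes=59, seconds=25)
mapper_calls = 1533270409
shards = 16

elapsed_seconds = elapsed.total_seconds()
calls_per_second = mapper_calls / elapsed_seconds
calls_per_second_per_shard = calls_per_second / shards

print("%.1f calls/sec overall" % calls_per_second)
print("%.1f calls/sec per shard" % calls_per_second_per_shard)
```

The overall rate matches the reported 2077.7/sec average, and dividing by 16 shards gives roughly 130 calls per second per shard.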

In terms of cost, the map job only doubled the daily price of running Google's Hub on App Engine. That's awesome, especially considering the amount of low-hanging-optimization fruit (mmm... delicious) still present in the Mapper framework. I expect us to deliver an order of magnitude of efficiency and performance improvements for the framework.

After three+ years of working on App Engine, it's kind of surprising to me that I'm now more excited about its future than ever before. I've known App Engine is special since the beginning (e.g., no config, easiest-ever deployment, auto-scaling). But where it's headed now is more advanced than any other cluster or batch-data system I have used or read about.

Here's the Datastore statistics page before I ran my job:

Total Number of FeedEntryRecord Entities: 1,433,233,338
Average Size of Entity: 411 Bytes

[Pie charts: Storage Space by Property Type; Storage Space by Entity Kind]

Property            Type       Size
entry_content_hash  String     88 GBytes (16% of the entity)
entry_id_hash       String     81 GBytes (15% of the entity)
update_time         Date/Time  37 GBytes (7% of the entity)
entry_id            Text       31 GBytes (6% of the entity)
topic               Text       1 KBytes (0% of the entity)
topic_hash          String     1 KBytes (0% of the entity)
Metadata                       312 GBytes (57% of the entity)

Here it is after (modulo time and data growth):

Total Number of FeedEntryRecord Entities: 1,756,534,609
Average Size of Entity: 325 Bytes

[Pie charts: Storage Space by Property Type; Storage Space by Entity Kind]

Property            Type       Size
entry_content_hash  String     108 GBytes (20% of the entity)
update_time         Date/Time  46 GBytes (9% of the entity)
entry_id_hash       String     14 MBytes (0% of the entity)
entry_id            Text       2 MBytes (0% of the entity)
Metadata                       378 GBytes (71% of the entity)

4 comments:

Alfred Westerveld said...

Could you please change your blogger theme. I find this hard to read this way :$? Interesting post nonetheless.

George Moschovitis said...

great post, the theme indeed sucks...

S. Sriram said...

Since each mapper call generates a task in the task queue, with billing enabled the task q quota being 1,000,000 per day would be about ~11 calls per second sustained over 24hrs. (11*60*60*24).

If ordinary usage does not exceed quota, then using mapreduce would consume the quota, blocking other tasks for the day unless one requested an upgrade. Or is there some other proposed way to deal with this?

re: theme- me too... I need a zap colors bookmarklet to read this site

brett said...

No, each mapper call does not generate a task. The mapper framework does as many map calls as it can in a 30 second period. Thus it can achieve way more work with fewer tasks. See http://mapreduce.appspot.com
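To illustrate the batching described in the reply above, here is a minimal sketch of a time-budgeted loop: each "task" keeps calling the map function until its time slice runs out, then hands off to the next task. This is not the Mapper framework's actual implementation; `run_one_task`, `run_job`, and the injectable `clock` parameter are hypothetical names for illustration only.

```python
import time


def run_one_task(items, start_index, map_fn, budget_seconds,
                 clock=time.monotonic):
    """Maps items from start_index until the time budget is spent.

    Always processes at least one item so the job makes progress.
    Returns the index at which the next task should resume.
    """
    deadline = clock() + budget_seconds
    index = start_index
    while index < len(items):
        map_fn(items[index])
        index += 1
        if clock() >= deadline:
            break
    return index


def run_job(items, map_fn, budget_seconds=30, clock=time.monotonic):
    """Runs tasks back to back until every item is mapped; returns task count."""
    tasks = 0
    index = 0
    while index < len(items):
        index = run_one_task(items, index, map_fn, budget_seconds, clock)
        tasks += 1
    return tasks


if __name__ == "__main__":
    processed = []
    task_count = run_job(list(range(1000)), processed.append)
    print("%d map calls in %d task(s)" % (len(processed), task_count))
```

With this shape, the number of tasks grows with how long the job runs rather than with the number of rows, which is why the task quota is not consumed one task per map call.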
