I'm Brett Slatkin and this is where I write about programming and related topics. You can contact me here or view my projects.

16 February 2015

Python's asyncio is for composition, not raw performance

Mike Bayer (of SQLAlchemy and Mako) wrote a post entitled "Asynchronous Python and Databases". I enjoyed it and I think it's worth your time to read. He put Python's new asyncio library to the test and concludes:

My point is that when it comes to stereotypical database logic, there are no advantages to using [asyncio] versus a traditional threaded approach, and you can likely expect a small to moderate decrease in performance, not an increase.

The benchmarks he ran are interesting, but initially I thought they missed the biggest advantage of asyncio: trivial concurrency through coroutines. I gave a talk about that at PyCon 2014 (slides here). I think what asyncio enables is remarkable. Later on, Mike linked me to another reply of his and now I think I understand where he's coming from.

Basically: If you're dealing with a transactional database, why would you care about the type of concurrency that asyncio enables? When you're using a database with strong consistency, you need to wrap all operations in a transaction, you need to provide a clear ordering of execution, locking, etc. He's right that asynchronous Python doesn't help in this situation alone.

But what if you want to mix queries through SQLAlchemy with lookups in Memcache and Redis, and simultaneously enqueue some background tasks with Celery? This is when asyncio shines: When you want to use many disparate systems together with the same asynchronous programming model. Asyncio makes it trivial to compose these types of infrastructure.

Asyncio won't win in benchmarks that focus on raw performance, as Mike showed. But asyncio will be faster in practice when there are parallel RPCs to distributed systems. It's Amdahl's law in action. What I mean specifically is cases where you issue N coroutines and wait for them all later:

def do_work(ip_address, session_id):
  # Issue two RPCs
  session_future = memcache.get(session_id)
  location_future = geocoder.lookup(ip_address)

  # Wait for both
  done, _ = yield from asyncio.wait([session_future, location_future])
  session, location = done

  # Now do something with both results ...

The majority of my experience in asynchronous Python comes from the NDB library for App Engine, which was a precursor to asyncio and is very similar. In that environment, you can access all of the APIs (Database, memcache, URLFetch, Task queues, RPCs, etc) with a unified asynchronous model. Our codebase that uses NDB employs asynchronous coroutines almost everywhere. That makes it simple to combine many parallel RPCs into workflows and larger pipelines.

Here's a simplified example of one pipeline from my day job. You can think of this as a very basic search engine. Note how many parallel coroutines are executed.

  1. Receive an HTTP request
  2. In parallel:
    1. Send RPC to geocode the IP address
    2. Send RPC to lookup inbound IP in a remote database
      1. After receiving response, in parallel:
        1. Lookup N rate limiters in N memcache shards
      2. Return whether inbound IP is over rate limits
    3. Lookup the user's session in memcache
      1. If it's missing, create the new session object, then in parallel:
        1. Enqueue a task to save the session to the DB
        2. Populate the session into memcache
        3. Set the user session response header
      2. Return the session (new or existing)
  3. Wait for geocode and session RPCs to finish
  4. In parallel:
    1. Do N separate queries based on the user's attributes
      1. First, look in memcache for cached data by attribute
      2. If memcache is missing or empty, do a database query
      3. Look up query results in rate limiting cache
  5. As queries finish (i.e., asyncio.as_completed)
    1. Rank results by relevance
    2. Return best result after all queries finish
  6. Wait for rate limit check from #2b above
    1. If the rate limits are over, return the 503 response and abort
  7. In parallel:
    1. Update result rate limiting caches
    2. Enqueue task to log ranking decision
    3. Enqueue task to update user's session in database
    4. Update user's session in memcache
    5. Start writing response
  8. Wait for all coroutines to finish

This pipeline has grown a lot over time. It began as a simple linear process. Now it's 5 layers "deep" and 10 parallel coroutines "wide" in some places. But it's still straightforward to test and expand because coroutines make the asynchronous boundaries clear to new readers of the code.

I can't wait to have this kind of composability throughout the Python ecosystem! Such a unified asynchronous programming model has been a secret weapon for our team. I hope that Mike enables asyncio for SQLAlchemy because I want to use it along with other tools, asynchronously. My goal isn't to speed up the use of SQLAlchemy alone.
© 2009-2020 Brett Slatkin