Building robust software with rigorous design documents

My work is centered around building software. In the past, I've been the primary designer and implementor of large software systems, collaborating with many engineers to launch programs into production. Lately, I've been spending much more of my time guiding others in their software design efforts.

Why design software at all? Why not just start writing code and see where it leads? For many problems, skipping design totally works and is the fastest path. In these cases, you can usually refactor your way into a reasonable implementation. I think you only need to design in advance once the scope of the software system you're building is large enough that it won't all fit in your head, or complex enough that it's difficult to explain in a short conversation.

Writing a design document is how software engineers use simple language to capture their investigations into a problem. Once someone has written a design document, a technical lead — often the same person who was the author of the document — can use it to set target milestones and drive an implementation project to completion.

I realized that I've never actually written down what I look for in a design document. So my goal in this post is to give you a sense of what I think it takes to write a design document, what the spectrum of design documentation looks like in my experience, and what I consider to be the crucial elements of good design.

How to get started

First, before writing any kind of formal documentation, you need to prototype. You need to gain experience working in the problem domain before you can establish any legitimate opinions.

The goal of making a prototype is to investigate the unknown. Before you start prototyping, you may have some sense of the "known unknowns", but understand little about them in practice. By prototyping, you'll improve your intuition so you can better anticipate future problems. You may even get lucky and discover some unknown unknowns that you couldn't have imagined.

Concretely, prototyping is getting the system to work end-to-end in one dimension (e.g., a tracer bullet implementation). It's working out and proving that the most confusing or risky part of the system is possible (e.g., the core algorithm). Or prototyping is dry-fitting all of the moving parts together, but without handling any of the complex edge cases. How you go about prototyping reflects the kind of problem you're trying to solve.

How to write a design document

The first draft of your "design document" is the code for your first working prototype. The second draft is a rough document that explains everything you learned from building that prototype. The third draft includes a proposal for a better design that addresses all of the difficulties you discovered while prototyping. You should share the third draft with the rest of your team to get their feedback. Then the final draft is a revision of the document that addresses all of the questions and concerns raised by your peers.

Design documents should be as short as possible. They should include enough detail to explain what you need to know and nothing more. Your design doc shouldn't include any code unless that code is critical for the reader to understand how the whole design fits together (e.g., an especially difficult algorithm that relies on a specific programming language's constructs).

There are five major sections that I recommend you have in a design document, and in this order:

1. Background

This is information that puts the design in context. You should assume that your reader knows very little about the subject matter. Here you should include everything they'll need to know about the problem domain to understand your design. Links to other design documents, product requirements, and existing code references are extremely useful.

When writing the background section, you should assume that it will be read by someone with no context. A couple of years from now, all of the knowledge that led to your implementation will likely be forgotten. You should treat the background section like it's a letter to the future about what you understood at this time in the past.

2. Goals and non-goals

These are the motivations for your project. Here you summarize the intentions of your proposed implementation. This section should explain the measurable impact of your design. You should provide estimates of how much you're going to help or hurt each quantifiable metric that you care about.

This section should also explicitly list the outcomes that you're not trying to achieve. This includes metrics you won't track, features you won't support, future functionality that isn't being considered yet, etc. Tracking non-goals is the primary way you can prevent scope creep. As your peers review your design document and bring up questions and comments, the non-goals section should grow accordingly to rule out entire areas of irrelevant investigation.

3. Overview

This section is a coarse explanation of what the software system is going to do. Engineers familiar with the problem and context should be able to read the overview and get a general sense of what the major moving parts of the design are. By reading the overview section, a fellow engineer should be able to formulate a set of important questions about the design. The purpose of the rest of the design document is to answer those questions in advance.

4. Detail

This section goes through each major component from the design overview and explains it in precise language. You should answer every reasonable question you can think of from the design overview. This is where you put things like sequence diagrams. You may also list step-by-step recipes that you'll employ in the software to solve the primary problem and various subproblems.

5. Risks

After reading the detailed design, your readers should have a sense of where your design may go wrong. It's a given that your system will fail to work in certain ways. You're making time vs. space trade-offs that are incorrect. Tolerances for the resources you need, or the wiggle room you'll have to accommodate changes, will be insufficient. Edge cases you ignored will turn out to be the most important functionality. In this section you should list how you anticipate your system will break, how likely you think those failures will be, and what you'll do to mitigate those problems before or when they occur.

What is the scope of a design document

After going through the distinct sections of a design document, there are still many open questions: How much detail should a design document include? How big of a scope should you address in a single design document? What should you expect a software engineer to produce on their own?

I'll answer these questions by trying to characterize the nature of problems that are solved by software engineers. You can identify distinct levels of software complexity by considering the size and shape of what's being confronted. Here's a conceptual diagram of what I consider to be the hierarchy of scope that software engineers handle:

The breakdown of this hierarchy is:

Market need: Broad category of goods and services that people and organizations desire.
Opportunity: Related ways of addressing those market needs.
Problem domain: Vast areas of complexity that must be understood and addressed to seize such opportunities.
Problem: Distinct issues in the problem domain that must be solved in order to advance towards the opportunity.
Subproblem: The many aspects of a larger issue that must be handled in order to solve the whole problem.

Here's a concrete example of what I mean with this hierarchy:

Market need: Hosting websites
Opportunity: Cloud computing
Problem domain: Virtual machines
Problem: CPU performance
Subproblems: context switching; cache invalidation

Some problems contain dozens of subproblems. Some problem domains contain hundreds of problems. Some opportunities contain vast numbers of related problem domains. And so on. This hierarchy diagram isn't meant to quantify the size ratios between these concepts. I'm trying to express their unique nature and the relationships between them.

The detail I expect to see in a design document, and thus the document's length, varies based on the scope of the project. Here's roughly the breakdown of design document length that I've seen in my career:

Scope	Design document length
Subproblem	500 words
Problem	2,000 words
Problem domain	8,000 words
Opportunity	1,500 words
Market need	1,000 words

What's surprising about this table is that designs addressing a problem domain are the most rigorous. The design detail required to handle a problem domain far exceeds that of any other scope. Design detail doesn't continue to increase as scope grows. Instead, as an engineer's scope expands to include multiple problem domains, multiple opportunities, and entire market needs, the level of detail I've seen in design documents plummets.

I think the reason these documents are so rigorous is that understanding a problem domain is the most difficult task an engineer can handle on their own. These design documents aren't immense because they're overly verbose; they're extremely detailed because that's what it takes to become an expert in a problem domain.

But there aren't enough hours in the day to become an expert in multiple areas. Once someone's scope gets large enough, they must start handing off problem domains to other engineers who can devote all of their time to each area. It's unrealistic for one person to write design documents for many problem domains. Thus, the design documents for larger scopes actually get smaller.

How to design for a whole problem domain

So what, exactly, goes into a design document for a problem domain? What makes these docs so detailed and rigorous? I believe that the hallmark of these designs is an extremely thorough assessment of risk.

As the owner of a problem domain, you need to look into the future and anticipate everything that could go wrong. Your goal is to identify all of the possible problems that will need to be addressed by your design and implementation. You investigate each of these problems deeply enough to provide a useful explanation of what they mean in your design document. Then you rank these problems as risks based on a combination of severity (low, medium, high) and likelihood (doubtful, potential, definite).
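One way to make this ranking concrete is to map both scales to numbers and sort by their product. The sketch below is only an illustration of that idea: the numeric weights, the `Risk` class, the `rank_risks` helper, and the sample risks with their ratings are all assumptions made up for the example, and any other monotonic scoring scheme would work just as well.

```python
from dataclasses import dataclass

# Assumed numeric weights for the two scales; the exact values are arbitrary.
SEVERITY = {"low": 1, "medium": 2, "high": 3}
LIKELIHOOD = {"doubtful": 1, "potential": 2, "definite": 3}


@dataclass(frozen=True)
class Risk:
    name: str
    severity: str    # one of the SEVERITY keys
    likelihood: str  # one of the LIKELIHOOD keys

    @property
    def score(self) -> int:
        # Combine the two ratings into a single sortable number.
        return SEVERITY[self.severity] * LIKELIHOOD[self.likelihood]


def rank_risks(risks):
    """Return the risks ordered from highest to lowest combined score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


# Hypothetical risks and ratings, purely for illustration.
risks = [
    Risk("Context switching overhead", "medium", "potential"),
    Risk("Cache invalidation cost", "high", "definite"),
    Risk("Unsupported hardware", "low", "doubtful"),
]

for risk in rank_risks(risks):
    print(f"{risk.score}  {risk.name} ({risk.severity}, {risk.likelihood})")
```

The single score is only a starting point for discussion. Two risks with the same score can still deserve very different treatment, which is why the questions that follow matter more than the number.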

For each risk, you must decide:

Does it need to be mitigated in order to ship the minimum viable product?
Can it be addressed after shipping without hindering the product initially?
How will it be addressed if certain behaviors become worse over time?

You should not plan to mitigate every risk in advance; this is impossible because the scope of the problem domain you've taken on is too large and the unknowns are too complex. Instead, your design document should identify the most likely scenarios and outline potential mitigations for them. These mitigations can end up being large projects in themselves, and often need to be designed and implemented by dedicated teams of engineers.

To be more concrete about what a problem domain design document looks like, let's assume that you've taken on the cloud computing example from before. The problem domain you're addressing is "virtual machines". Here's what you'd cover in your risk assessment.

First, you'd enumerate the major concerns within this problem domain:

CPU performance
Security
Memory performance
I/O performance
and so on ...

Then you'd identify expected subproblems:

CPU Performance
- Instruction cache thrashing due to context switching
- Branch prediction failures
- Kernel lock contention
Security
- CPU instructions that can exploit ring 0
- Multi-threading exploits to system calls
- Shared address space between guest and host OS
- DMA vulnerabilities
Memory performance
- Data cache performance because of lack of CPU pinning and NUMA architectures
- Page alignment conflicts between virtual machine address space and host OS address space
- Endianness discrepancies
- Oversubscribing memory to increase multi-tenancy
- Memory compression and duplicate page merging
I/O performance
- System call context switching overhead
- Avoiding copies for network sends
- Local disk access fairness
- Latency vs. throughput interplay when the VM workloads on a single host machine are wildly different
and so on...

For each problem and subproblem you'd flesh out the potential solutions in the design document. You may write small programs to verify assumptions, do load tests to find the realistic limits of infrastructure, forecast reasonable estimates for capacity, etc. What you hope to learn during the design phase is whether there are any major dealbreakers that could undermine the viability of your design.

For example, when digging into the I/O performance problem, you may find through experimentation that all VM guest operating systems will need to do a number of disk reads while idling. You may then measure how many reads each VM will need on average, and use that result to estimate the maximum number of VMs per physical machine and local disk. You may discover that local disk performance will severely limit your system's overall scalability. At this point you should document the reasoning that led you to this conclusion and show your work.
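The back-of-the-envelope arithmetic behind an estimate like this might look like the following sketch. Every number here is made up for illustration, and the function name, the IOPS figures, and the `headroom` fraction are assumptions; real values would come from your own measurements and load tests.

```python
import math


def max_vms_per_disk(disk_read_iops: float,
                     idle_reads_per_vm_per_sec: float,
                     headroom: float = 0.5) -> int:
    """Estimate how many idle VMs a single local disk can sustain.

    headroom is the fraction of the disk's read capacity held back
    for real guest workloads (an assumed safety margin).
    """
    if idle_reads_per_vm_per_sec <= 0:
        raise ValueError("idle_reads_per_vm_per_sec must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_iops = disk_read_iops * (1 - headroom)
    return math.floor(usable_iops / idle_reads_per_vm_per_sec)


# Hypothetical measurements: a disk that sustains 200 reads/sec and
# guests that each issue 4 reads/sec while idle.
estimate = max_vms_per_disk(disk_read_iops=200,
                            idle_reads_per_vm_per_sec=4,
                            headroom=0.5)
print(estimate)  # prints 25
```

Keeping a calculation like this in the design document, along with where each input came from, lets reviewers challenge the assumptions instead of just the conclusion.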

Once you've identified the potential dealbreaker, you should figure out if it's viable to launch your virtual machine product without first solving the local disk issue. Your proposed design should list what you'd do if demand grew too quickly, such as requiring a wait-list for new users, limiting each user to a maximum quota of allowed VMs, throwing money at the problem with more physical hardware, etc. Your document should explain the whole range of alternatives and settle on which ones are the most prudent to implement for launch.

By recognizing such a large problem in advance, you may also reach the conclusion that you need to build a virtual local disk system in order to release your product at all. That may severely delay your timeline because the problems you need to address for launch have become much larger than you originally anticipated. Or maybe you decide to launch anyway.

The point is that it's always much better to consider all the risks before you launch. It's acceptable to take risks as long as you're well informed. It's a disaster to learn about large risks once you're already on the path to failure.

Conclusion

Even though it's full of information, what's most impressive about a problem domain design document is that it doesn't feel like overdesign. The risk mitigations it includes are not overspecified. There's just enough detail to get a handle on the problem domain. When the engineering team begins implementing such a design, there's still a lot of flexibility to change how the problems are solved and how the implementation actually works.

In general, you're kidding yourself if you think that software will be built the way it was originally designed. The goal of these design documents isn't to provide the blueprints for software systems. The goal is to prepare your team for a journey into the unknown. The definitive yet incomplete nature of a problem domain design document is what makes it the pinnacle of good software design.

One Big Fluke

12 November 2016
