Our product generates histograms from individual data points. We're working on improving our analysis system so it can answer more questions using the same raw input data. We do this by constructing an intermediate data format that makes computations fast. This lets us do aggregations on the fly instead of in batch, which also enables advanced use-cases like interactive parameter tuning and real-time updates.
Each time we'll add a new analysis feature, we'll also have to regenerate the intermediate data from the raw data. It's analogous to rebuilding a database index to support new types of queries. The problem is that this data transformation is non-deterministic and may produce slightly different results each time it runs. This varying behavior was never a problem before because the old system only did aggregations once. Now we're stuck with no stable way to migrate to the new system. It's a crucial mistake in the original design of our raw data format. It's my fault.
The functional programmers out there, who naturally leverage the power of immutability, may be chuckling to themselves now. Isn't reproducibility an obvious goal of any well-designed system? Yes. But, looking back to 4 years ago when we started this project, we thought it was impossible to do this type of analysis on the fly. The non-determinism made sense. We were blind to this potential before. Since then we've learned a lot about the problem domain. We've gained intuition about statistics. We're better programmers.
In hindsight, we should have questioned our assumptions more. For example, we should have asked: What if it becomes possible someday to do X? What design decisions would we change? That would have been enough to inform the original design and prevent this problem with little cost the first time around. It's true that too much future-proofing leads to over-design and over-abstraction. But a little bit of future-proofing goes a long way. It's another variation of the adage, "an ounce of prevention is worth a pound of cure".
1. The non-determinism is caused by machine learning classifiers, which are retrained over time. This results in slightly different judgements between different builds of the classifier.
(PS: There's some additional discussion about this on Lobsters)