It might seem hard to believe, but not so long ago, continuous integration was new and novel. Software engineering as a field wasn’t birthed with the best practices we now consider common sense; these practices accrued over time as hard-earned lessons on how to build reliable systems and teams.
Machine learning (ML) is currently undergoing the same evolution, moving out of the research lab and becoming a common engineering discipline in its own right. But more so than any single tool or practice, investing in a culture of quality is crucial to successfully developing ML models that users and businesses alike can rely on.
You’ve probably heard the phrase “culture of code quality” before. It typically manifests as things like code review, standardized formatting, automated builds and testing, comprehensive test coverage, five whys, and more. All of these processes represent short-term barriers to forward progress that in the long run contribute to healthier, faster, and more effective software development teams.
ML’s processes are different, but the underlying principle is the same: a culture of quality is an organization-level ethos of waiting to eat the marshmallow. Taking the time to set up processes with compounding returns differentiates engineering from prototyping and enables ML teams to build successful products.
Quality models are produced iteratively by a culture that values quality—not through a heroic one-time effort, a groundbreaking new architecture, or from a perfect training or test set.
Looking to foster a culture of quality in your ML shop? Here’s how you do it:
Fortunately, there’s a lot of familiarity to be found in ML best practices. The development of ML models, for example, maps cleanly to the agile engineering process: iteratively improving according to a feature roadmap.
For a car detector, the capability to detect a car in daylight is a feature. Every new scenario—occlusion, weather, camera conditions—is a feature. Training a model to detect cars does not mean optimizing a single F1 score on “cars” but breaking the domain down into subclasses and prioritizing these scenarios as discrete features.
ML products need to be managed and tested by specifications, not metrics. Doing this allows you to create a clear product roadmap where bugs, regressions, and new features can be prioritized according to their importance to product outcomes—and new models can be assessed by their improvements on the specifically scoped features within a given release. Building for a specification removes the magic and makes the development process predictable.
Test data is your only proxy for the production world. In spite of this, testing datasets are often arbitrarily created, and data collection efforts tend to focus primarily on training data. While training data is important, testing data is more important. Test data defines your ability to assess your model’s behaviors. It quantifies your product objectives—test data is your specification.
Good test data is comprehensive, covering the range of scenarios you expect to encounter in production. More importantly, good test data is stratified into specific scenarios such that metrics are computed at the scenario-level.
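To make the idea concrete, here is a minimal sketch of scenario-level evaluation. The data, scenario names, and `scenario_accuracies` helper are all illustrative, not a real API—the point is simply that stratifying one aggregate metric by scenario can surface failures the aggregate hides.

```python
from collections import defaultdict

# Hypothetical evaluation results: each test sample is tagged with its
# scenario stratum and whether the model's prediction was correct.
results = [
    {"scenario": "daylight", "correct": True},
    {"scenario": "daylight", "correct": True},
    {"scenario": "daylight", "correct": True},
    {"scenario": "night", "correct": True},
    {"scenario": "night", "correct": False},
    {"scenario": "occlusion", "correct": False},
]

def scenario_accuracies(results):
    """Compute accuracy per scenario rather than one aggregate number."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["scenario"]] += 1
        hits[r["scenario"]] += r["correct"]
    return {s: hits[s] / totals[s] for s in totals}

print(scenario_accuracies(results))
```

A single aggregate accuracy here (4/6, roughly 0.67) looks tolerable, while the stratified view shows that the occlusion scenario fails entirely.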
Ideally, every new scenario—new feature—for the model to tackle is represented in test data before it is considered for training data. Set the target before you take a swing. Test-driven development is often optional in software engineering but is a powerful tool in ML when manual verification of gained behaviors is infeasible or outright impossible.
In an unknown environment, the only way to know where you are is to remember where you’ve been. The act of training an ML model to operate in a complex domain is exploring an unknown environment.
Your understanding of your domain and the testing data you use to represent it evolve constantly. In a shifting data environment, it’s easy to lose the ability to meaningfully compare new models against past models. Maybe this conversation sounds familiar:
PM: “We’ve had 98% accuracy on class X for the last two years!”
ML engineer: “Well, over that time, the benchmark we use to evaluate class X has grown by an order of magnitude, and we’ve learned that it isn’t actually class X but classes X₁₂, X₁₃, X₂₁…”
In a culture of quality, you don’t have to sacrifice reproducibility to refine your test set. Model results are stored in an evaluation store where previous iterations of your testing datasets are easily accessible. Comparing apples to apples no longer involves dredging up old code or collating PDF reports haphazardly assembled with differing methodologies.
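As a sketch of the idea (the class and method names here are illustrative, not Kolena's actual interface), an evaluation store can be as simple as results keyed by both model and test-set version, so any two models can always be compared on the exact same snapshot of the test data:

```python
class EvaluationStore:
    """Toy evaluation store: metrics keyed by (model, test-set version)."""

    def __init__(self):
        self._results = {}  # (model_id, dataset_version) -> metrics dict

    def record(self, model_id, dataset_version, metrics):
        self._results[(model_id, dataset_version)] = metrics

    def compare(self, model_a, model_b, dataset_version):
        """Apples to apples: both models evaluated on the same version."""
        a = self._results[(model_a, dataset_version)]
        b = self._results[(model_b, dataset_version)]
        return {metric: b[metric] - a[metric] for metric in a}

store = EvaluationStore()
store.record("model-v1", "test-set-2023-01", {"recall": 0.91})
store.record("model-v2", "test-set-2023-01", {"recall": 0.94})
print(store.compare("model-v1", "model-v2", "test-set-2023-01"))
```

Because the dataset version is part of the key, refining the test set creates a new version rather than silently invalidating every past result.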
Quality requires alignment between stakeholders on goals, processes, and results. Unfortunately, in many organizations, ML teams are one degree further removed from the business and product sides of the org than vanilla software engineering teams. Unlike software development, it’s common for the model development process to be opaque and produce unpredictable results. This is typically because only the ML team has visibility into the underlying data.
Model behaviors and capabilities are communicated as functions of data. Only when internal stakeholders—product managers, sales and customer success representatives, business leaders—have the ability to understand that data will derived metrics like precision and recall be meaningful.
To communicate clearly, metrics must be computed on multiple well-scoped subsets of your test data that are grouped in alignment with product objectives. Behaviors are what matter outside of the context of the development process. Instead of saying, “This model has 2% higher recall than the previous model,” say, “This model significantly improves recall in low lighting without sacrificing performance in other lighting conditions.”
Bugs are inevitable, but their recurrence is not. In a culture of quality, there is a right and a wrong answer to the question, “Why did this bug occur in production?”
Wrong answer: “Because the implementation was incorrect.”
Right answer: “Because our tests didn’t catch it before landing on prod.”
The same holds for ML. Investigations into the root cause of an observed model bug often uncover gaps or imbalances in the training dataset. More training data can patch such a bug, but only process changes ensure it cannot pop up again in the future.
Each time a bug is encountered, a team invested in quality defines a new test case—a small, well-scoped set of examples of the failure scenario—and tests against this test case every time a new model is trained. Over time, these “regression tests” aggregate into a test suite that can immediately surface loss of behavior and stop bad models from reaching users.
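The mechanics can be sketched in a few lines. Everything below is hypothetical—the case names, the toy model, and the `run_suite` helper—but it shows the shape of the practice: every production bug becomes a named, permanent regression case, and each new model candidate is run against the entire accumulated suite.

```python
# Each production bug becomes a named regression case; the suite only grows.
regression_suite = {}  # case name -> list of (input, expected) pairs

def add_regression_case(name, examples):
    regression_suite[name] = examples

def run_suite(model):
    """Return the names of the regression cases the model fails."""
    return [
        name
        for name, examples in regression_suite.items()
        if any(model(x) != expected for x, expected in examples)
    ]

# A toy "model" that still misses cars in glare (a hypothetical scenario).
def model(x):
    return "car" if x != "car_in_glare" else "background"

add_regression_case("glare-2023-04", [("car_in_glare", "car")])
add_regression_case("daylight-basic", [("car_daylight", "car")])
print(run_suite(model))  # → ['glare-2023-04']
```

A nonempty result is a release blocker: the candidate model has lost a behavior that a previous bug already taught the team to guard.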
The thread that runs through each of the elements in an ML culture of quality is testing.
Quality models are produced iteratively by a culture that values quality—not through a heroic one-time effort, a groundbreaking new architecture, or from a perfect training or test set. When only the inputs are controlled, testing becomes the most important tool in the chest to meaningfully guide development.
At Kolena, we’re building an ML testing platform that can systematize your evaluations and serve as the cornerstone of your ML culture of quality. Reach out!