Machine learning (ML) models should not be treated like magic; they should be treated like software. No one ships a traditional software service to production without rigorous testing, yet that's exactly what happens with ML products. Most ML models are deemed production-ready after meeting minimal evaluation criteria, often after looking at just a few aggregate metrics! This testing philosophy leaves blind spots in model behavior and can lead to costly real-world consequences.
So why should you care about testing? Simply put, your customers rely on your models to perform as they expect, and failing to deliver will cause you to lose their trust. Their use of your product hinges on the answer to a fundamental question: Does the model behave as expected? As Katherine Heller, who works at Google on AI for healthcare, put it in 2020:
“[When AI underperforms in the real world], it makes people less willing to want to use it. We've lost a lot of trust when it comes to the killer applications, that’s important trust that we want to regain.”
Google Health had a very public and costly failure with a computer vision model that checked for diabetes indicators. The company gave customers the model to run in production, where it rejected more than 20% of images because their quality was lower than that of the images used in the lab. If Google had instead broken its test data down by subclass, with test cases covering production scenarios such as low-resolution or low-light images, it could have caught the performance deficit in the lab. Instead, customers were left frustrated and unhappy with the product.
Ideally, ML teams would spend 60-80% of their time improving models, but in reality, they lose weeks of productivity playing whack-a-mole, fixing issues and regressions, only for new ones to pop up elsewhere. The simple fact is that if model testing criteria are not continually improved by adding new testing scenarios, there is no guarantee that issues, new or old, won't resurface.
To understand the cost of model failures, consider the following case study. Let's say you have an ML team of 5 engineers who each cost you an average of $150,000 per year in salary ($750,000 in annual base salaries, or roughly $14.4k per week) and 10 customers each paying an average of $200,000 per year. Let's also assume that your team spends about 2 weeks a year per customer fixing issues that could have been caught earlier if you had a solid test framework in place.
Translation? You're looking at approximately 20 weeks of lost development time spent on fixing issues. All told, that's roughly $288,000 a year in lost productivity (20 weeks at about $14.4k per week), and that's not even factoring in the costs of SLA breaches, loss of future revenue streams, or the cost of offering discounts and credits on renewal due to earlier failures. In fact, if you were to factor in a 10% revenue loss per customer ($20,000 each) for failure to meet contractual targets, you're easily looking at an additional loss of $200,000. Summing up the losses in both productivity and revenue, you're now at roughly $488,000.
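The arithmetic above can be sketched as a small back-of-the-envelope model. All figures are the illustrative assumptions from the case study, not real data:

```python
# Back-of-the-envelope cost of production model failures,
# using the hypothetical numbers from the case study above.

ENGINEERS = 5
AVG_SALARY = 150_000            # per engineer, per year (assumed)
CUSTOMERS = 10
AVG_CONTRACT = 200_000          # per customer, per year (assumed)
WEEKS_LOST_PER_CUSTOMER = 2     # firefighting a test framework would prevent
SLA_PENALTY_RATE = 0.10         # contract value lost to missed targets

weekly_payroll = ENGINEERS * AVG_SALARY / 52               # ~ $14.4k/week
weeks_lost = CUSTOMERS * WEEKS_LOST_PER_CUSTOMER           # 20 weeks
lost_productivity = weeks_lost * weekly_payroll            # ~ $288k
lost_revenue = CUSTOMERS * AVG_CONTRACT * SLA_PENALTY_RATE  # $200k

print(f"Lost productivity: ${lost_productivity:,.0f}")
print(f"Lost revenue:      ${lost_revenue:,.0f}")
print(f"Total:             ${lost_productivity + lost_revenue:,.0f}")
```

Plugging in your own headcount, contract sizes, and firefighting time gives a quick estimate of what untested models cost your shop.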
When you start crunching your own shop’s numbers, all of this feels much more real and a whole lot scarier, but there is a solution! A comprehensive testing framework makes regression testing a very low lift for your team while offering immense value, ultimately sparing you from the damage that can be caused by production model failures. In our experience, the right testing tool integrated into a team’s MLOps pipeline accelerates the productivity and iteration speed of an ML team by an order of magnitude, allowing models to be shipped regularly and with an assurance of their quality.
An ML unit test is one of the most powerful ways to ensure model quality. A unit test defines a specific scenario, such as low light, low resolution, or occlusion, and computes metrics for that specific subset of data. With clearly defined unit tests, you know your model's performance not just at the aggregate level but down at the subclass level, too, ensuring that you aren't silently regressing in areas that are important to your customer. Using the aforementioned Google Health model as an example, test cases built around the customer's deployment environment, including low-light conditions and low-resolution cameras, would have flagged the problem before the model ever reached production.
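As a minimal sketch of the idea (the scenario names, sample fields, and thresholds here are hypothetical, not Kolena's API), an ML unit test pairs a named data subset with a pass/fail metric threshold:

```python
# Sketch of an ML "unit test": a named scenario (a filter over sample
# metadata) plus a metric threshold evaluated on that subset only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class UnitTest:
    name: str                           # scenario, e.g. "low-light"
    selector: Callable[[dict], bool]    # picks samples in the scenario
    min_accuracy: float                 # pass/fail threshold

def run_unit_test(test: UnitTest, samples: list, predict: Callable):
    """Compute accuracy on the scenario's subset and compare to threshold."""
    subset = [s for s in samples if test.selector(s)]
    correct = sum(predict(s) == s["label"] for s in subset)
    accuracy = correct / len(subset) if subset else 0.0
    return accuracy >= test.min_accuracy, accuracy

# Example: require 90% accuracy specifically on low-light images.
low_light = UnitTest("low-light", lambda s: s["brightness"] < 0.2, 0.90)
```

An aggregate metric over all samples can look healthy while the low-light subset fails badly; evaluating each scenario separately is what surfaces that gap.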
By applying the same principles to the ML world that the software development world has applied through regression testing, we can also eliminate the whack-a-mole situation. If you make a bug fix, for example, you can simply update your test suite with the new test case, and you’ve solved the problem once and for all. If a regression happens, your test will now flag it, and your team won’t waste time fixing the same issues again and again. Instead, they can focus on the work you pay them to do—making forward progress with the model.
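The "fix it once and for all" pattern can be sketched as a regression suite that only ever grows: every bug fix adds a named scenario so the same failure can never ship silently again. The scenarios, metadata fields, and thresholds below are illustrative assumptions:

```python
# Minimal regression-suite pattern: each entry maps a scenario name to
# (a filter over sample metadata, an accuracy threshold). New entries
# are appended whenever a bug is fixed, so old failures stay covered.

REGRESSION_SUITE = {
    "baseline":       (lambda s: True,             0.95),
    "low-resolution": (lambda s: s["width"] < 640, 0.90),
    # Added after a field-reported low-light failure was fixed:
    "low-light":      (lambda s: s["brightness"] < 0.2, 0.90),
}

def run_suite(samples: list, predict) -> list:
    """Return (scenario, accuracy) pairs that fall below threshold."""
    failures = []
    for name, (selector, threshold) in REGRESSION_SUITE.items():
        subset = [s for s in samples if selector(s)]
        if not subset:
            continue  # no data for this scenario in this run
        acc = sum(predict(s) == s["label"] for s in subset) / len(subset)
        if acc < threshold:
            failures.append((name, acc))
    return failures  # an empty list means every scenario passed
```

Run against every candidate model before release, a suite like this turns "did we break anything?" from a guess into a checklist.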
Built by ML practitioners who know firsthand the pain of model failures in production, Kolena gives you and your team a versatile platform that fits your ML problem and gives you everything you need to ship high-quality models with confidence. Whether your goal is to ensure model performance for your customers, improve model quality by testing fine-grained subclasses, or streamline your team's MLOps testing so they can focus on their work, Kolena is the platform that can do it all.