When President Biden signed his new executive order on AI this week, we entered a new phase in AI’s remarkable and rapid evolution. As with the formation of the securities regulatory framework in the ‘30s or the creation of environmental protection safeguards in the ‘70s, this change will have profound impacts for both technology and society as a whole in the years ahead.
As many observers have noted, however, the executive order (EO) currently lacks ‘teeth’; in its initial form, it’s more a set of guidelines than tactical, actionable directives. Transforming this landmark order into something functional will take a concerted, collaborative effort from companies, regulators and legislators alike.
More than anything, the order’s success will depend on empowering the right voices to be part of that crucial dialogue.
To date, legislative and regulatory discussions around AI have centered on a handful of large companies – OpenAI, Google, Anthropic, Meta and others. To advance an effective oversight framework for AI, regulators and lawmakers also need to hear from the startups that are advancing the state of the art in AI / ML testing and validation.
What specific best practices for AI / ML model testing do startups have to share? Some critical points are below:
- Post-deployment performance monitoring of AI models is necessary, but it isn’t sufficient on its own; it must be balanced with robust pre-deployment testing. No other major industry skews as heavily toward post-deployment monitoring as AI does today – it’s like testing cars mainly by monitoring traffic data.
- The ‘red-teaming’ (or ‘ethical hacking’) exercises outlined in the White House’s executive order for pre-deployment testing must deliver more detailed outputs than simple aggregate test results or scores.
  - Aggregated test scores are not enough to ensure reliable, trustworthy AI. In fact, this approach often obscures a model’s strengths and weaknesses in the most important scenarios – even the ones it’s specifically designed to handle. This problem is known in AI research as the hidden stratification phenomenon.
  - The White House and others should be aware that ‘red-teaming’ is only one method of testing AI / ML models before deployment. Kolena is developing more effective solutions based on unit testing (see below).
  - Where possible, robust testing data should be made available to regulators, researchers and other stakeholders so that they can perform their own analyses.
- Pre-deployment validation processes must be based on unit testing practices that identify risks and gauge performance in key scenarios and track them across releases for each AI / ML model.
  - Rapid, scalable unit testing (or scenario-level testing) is the only approach that gives AI builders, customers, users, regulators and legislators assurance that a model will behave as expected in critical scenarios and applications.
  - This technique also addresses the hidden stratification phenomenon.
- Teams and organizations must consistently implement and follow best practices for a) curating high-fidelity tests that b) use a standardized quality rubric to c) assess and validate AI systems at the overall product level, not just the AI model.
  - This approach follows the AI Risk Management Framework published by the National Institute of Standards and Technology (NIST) in January, which calls on AI builders to Map, Manage and Measure potential risks in their AI development processes.
  - Product-level testing (or ‘end-to-end system testing’) is critical to understanding whether an AI system actually does what it’s supposed to do. A model for a self-driving car, for example, cannot simply be tested on its ability to identify pedestrians; it must also identify and adjust for collision risk.
  - As Google Brain co-founder Andrew Ng said on X, “The right place to regulate AI is at the application layer. Requiring AI applications such as underwriting software, healthcare applications, self-driving, chat applications, etc. to meet stringent requirements, even pass audits, can ensure safety.”
- Regulators and lawmakers must realize that a model’s test data is just as important – if not more so – than the training data used to develop it.
  - AI builders cannot understand how to make a model safer or more trustworthy simply by assessing or adjusting training data. Only a transparent, scalable method of managing and evaluating test data can help identify and fix deficiencies or imbalances in a model’s training data.
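To make the scenario-level idea concrete, here is a minimal Python sketch of unit testing a model by named scenario rather than by a single aggregate score. The toy model, scenario slices, and thresholds are illustrative assumptions for this article, not Kolena’s actual API:

```python
# Minimal sketch of scenario-level ("unit") testing for an ML model.
# The model, scenarios, and thresholds are illustrative assumptions.

def accuracy(model, cases):
    """Fraction of (input, expected_label) cases the model gets right."""
    return sum(1 for x, y in cases if model(x) == y) / len(cases)

# Test data is partitioned into named scenarios (strata), not one big pool.
SCENARIOS = {
    "pedestrian_daylight": [("img1", "pedestrian"), ("img2", "pedestrian")],
    "pedestrian_night":    [("img3", "pedestrian"), ("img4", "no_pedestrian")],
}

# Per-scenario pass thresholds, tracked across releases.
THRESHOLDS = {"pedestrian_daylight": 0.95, "pedestrian_night": 0.90}

def run_scenario_tests(model):
    """Return {scenario: (score, passed)} instead of one aggregate number."""
    results = {}
    for name, cases in SCENARIOS.items():
        score = accuracy(model, cases)
        results[name] = (score, score >= THRESHOLDS[name])
    return results

# A toy model that always predicts "pedestrian": its aggregate accuracy
# across all four cases is 0.75, which hides the failing night-time
# scenario -- exactly the hidden stratification problem described above.
toy_model = lambda _: "pedestrian"

results = run_scenario_tests(toy_model)
```

Here the daylight scenario passes while the night scenario fails its threshold, even though a single aggregate score might look acceptable; reporting per-scenario results is what makes the failure visible.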
As a helpful resource, Kolena's research team has developed an end-to-end best practices guide to comply with the EO and build AI that the community can trust. We will be publishing it in the next couple of weeks. Sign up below to be notified once the guide is out.
Kolena applauds the Biden Administration for its proactive measures to identify and address the key challenges involved in ensuring that AI can achieve its full potential in safe and trustworthy ways.
We look forward to working alongside our colleagues, customers, researchers, legislators and regulators alike to develop rigorous, nuanced and scalable testing practices that can translate this shared vision into action – and we urge the Biden Administration and others to seek input from the startups that are at the forefront of this effort.
Mohamed Elgendy is the CEO and co-founder of Kolena, a startup that provides robust, granular AI / ML model testing for computer vision, generative AI, natural language processing, LLMs, and multi-modal models, with the goal of ensuring trustworthiness and reliability in an increasingly AI-driven world.