There is no single answer to this question. It grew out of many outside inspirations that stacked up over the years.
One of our inspirations is a 2017 research paper from 5 engineers at Google.
“What’s your ML Test Score? A rubric for ML production readiness and technical debt reduction”
It is a straightforward, easy-to-read paper that packs a lot of good ideas into 4 pages. It starts by acknowledging something that every AI practitioner knows:
Using machine learning in real-world production systems is complicated. 😓
The paper breaks down potential solutions to this challenge into 2 buckets: testing and monitoring. But how much testing and monitoring is enough?
It then introduces the concept of an ML Test Score, with points awarded for each verified test across 4 categories (a small scoring sketch follows the list):
1. Features and Data
2. Model Development
3. Infrastructure
4. Monitoring Tests
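
To make the scoring concrete, here is a minimal Python sketch of the rubric roughly as the paper describes it: a test earns 0.5 points when it is run manually with documented results, 1 point when it is automated and run regularly, and the final score is the minimum of the 4 category totals, so the weakest category caps the whole system's readiness. The test lists in the example are purely hypothetical.

```python
from enum import Enum

class TestStatus(Enum):
    NOT_DONE = 0.0   # test not performed at all
    MANUAL = 0.5     # test run manually, with results documented
    AUTOMATED = 1.0  # test automated and run regularly

def ml_test_score(categories):
    """Overall score = minimum of the per-category point totals,
    so one neglected category drags down the whole score."""
    return min(sum(t.value for t in tests) for tests in categories.values())

# Hypothetical example: a system with uneven coverage across the 4 categories.
score = ml_test_score({
    "Features and Data": [TestStatus.AUTOMATED, TestStatus.MANUAL, TestStatus.MANUAL],
    "Model Development": [TestStatus.AUTOMATED, TestStatus.AUTOMATED],
    "Infrastructure":    [TestStatus.MANUAL, TestStatus.NOT_DONE],
    "Monitoring Tests":  [TestStatus.MANUAL],
})
print(score)  # 0.5 -- Infrastructure and Monitoring cap the overall score
```

Taking the minimum rather than the sum is the interesting design choice here: you can't compensate for missing monitoring by writing more data tests.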
Fast forward to 2021, and this paper still feels surprisingly modern, even forward-thinking. My one regret: this approach is quite difficult to apply for companies that, unlike Google, don't have highly trained software engineers and data scientists. 👩‍💻🧑‍💻
With Giskard, we want to make Test-Driven Data Science (TDDS) accessible to everyone. 🤓