Eric Korman Explains Valor and Its Step Change for Model Evaluation

Eric Korman is the Chief Science Officer at Striveworks. He leads our Research and Development Team, which recently released Valor—our first-of-its-kind evaluation service for machine learning (ML) models. 

We caught up with Eric to learn more about Valor, ML model evaluation, and why this open-source tool is a game changer for maintaining the reliability of ML models in production.

***

How did your machine learning research ultimately lead to Valor?

‘We haven’t seen an evaluation service before, because we’re just now getting to where data science is not a science experiment, but it’s done at scale and it’s in production.’
— Eric Korman, Striveworks

In the MLOps space now, there are a lot of point solutions around model deployment, data management, and experiment tracking. But what was really lacking, before we launched Valor, was a modern evaluation service. This is a service that will compute evaluations for you, store them centrally, make them shareable and queryable, and also provide more fine-grained evaluation metrics than just a single, all-encompassing number. It lets you really understand how your model performs on different segments of your data, on different properties, those kinds of things. That’s the need we saw, so we built Valor to address it.
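
To make that contrast concrete, here’s a minimal sketch in Python, using scikit-learn rather than Valor itself, of the difference between a single, all-encompassing number and the finer-grained, per-label metrics an evaluation service can compute and store:

```python
# A minimal sketch (scikit-learn, not Valor) contrasting one overall score
# with finer-grained, per-label metrics. Labels and predictions are made up.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["car", "car", "truck", "bus", "truck", "car"]
y_pred = ["car", "truck", "truck", "bus", "car", "car"]

# One number hides where the model actually struggles.
print("overall accuracy:", accuracy_score(y_true, y_pred))

# Per-label precision/recall/F1 shows which classes are the problem.
labels = ["car", "truck", "bus"]
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
for name, p, r, f, n in zip(labels, precision, recall, f1, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} (n={n})")
```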

Can you explain why that’s valuable? Couldn’t I just put a model in production and see how it performs for myself?

You can definitely do an eye test, but that’s not always reliable or quantitative. Plus, we’re seeing this explosion of AI and ML, so there’s a proliferation of models available to use. Teams are deploying multiple models at once. Deploying models and spot-checking how they perform is not a very scalable way to evaluate them. You want something systematic and also something you can trust. Valor is open-source. You can see exactly how it computes metrics—so, the number it spits out, you know exactly where it came from. 

In my experience, on teams I’ve been on and in speaking to other data scientists, evaluation may be done programmatically, but then it’s stuck in some spreadsheet somewhere, or some report, or some Confluence page. There’s a lack of auditability and trustworthiness. So, Valor handles that, not just by computing the metrics for you but also by storing them for you, so you can trust your model evaluations and do so at scale.

Valor is the first of its kind, in terms of an open-source solution. What other approaches were people using for model evaluation before we launched it?

‘You see a lot of tools where the end output is a report that goes to someone. That’s cool, but when you start deploying these things, you want systems and processes that tie into each other.’
— Eric Korman, Striveworks

Valor is pretty unique. People have built their own internal solutions. We see a lot of that in general in the MLOps space—a mix of taking something that’s open-source, expanding upon it, integrating it with something you build internally. A lot of stuff is still done in Jupyter Notebooks, which are great for exploratory data analysis. You can do some model evaluation there. But really, we haven’t seen an evaluation service before, because we’re just now getting to where data science is not a science experiment, but it’s done at scale and it’s in production. 

You see a lot of tools where the end output is a report that goes to someone. That’s cool, but when you start deploying these things, you want systems and processes that tie into each other. You don’t want a report, you want some service that you can query to get metrics, and then you might want to act on that information in an automated way. So far, people have had to build that in-house. There are not many solutions that are open-source and general-purpose the way Valor is. 

Would you say that being open-source is a big part of what Valor brings to the table?

I don’t think being open-source is what makes it unique. Its functionality is what makes it unique. Valor is able to encompass a lot of use cases with the right kind of abstractions and generalities. Part of its uniqueness is that it both computes and stores the metrics. Maybe most important is the flexibility it gives you to attach different metadata and information to your models or datasets or data points and, really, evaluate model performance against that data and stratify evaluations with respect to different metadata. It gives you ways of detecting bias, and things like that.

We see lots of ML research and ML tools that came up from these benchmark datasets: your ImageNets, your COCO. They’re simple datasets compared to what you see in the real world, where an imagery dataset will have a host of metadata attached to it. “What sensor took this picture?” “What was the location of this picture?” “What time was it taken?” All that rich metadata can be useful for understanding model performance. If you’re running the same model on images that come from a bunch of different cameras, you might want to know, “Does my model perform better on one camera versus another?” “What about nighttime versus daytime?”

For all those more in-depth questions, there were not a lot of tools, until Valor, that helped you get that kind of insight into model performance.
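
As a rough illustration of that kind of stratified evaluation, here’s a minimal sketch using pandas and scikit-learn rather than Valor’s own API; the columns and values are made up, but the idea is to slice the same metric by metadata like camera and time of day:

```python
# A minimal sketch of stratifying evaluation by metadata (pandas +
# scikit-learn, not Valor). Column names and values are illustrative.
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "camera":    ["cam_a", "cam_a", "cam_b", "cam_b", "cam_b", "cam_a"],
    "daytime":   [True, True, False, False, True, False],
    "label":     ["car", "truck", "car", "bus", "car", "truck"],
    "predicted": ["car", "truck", "bus", "bus", "car", "car"],
})

# Evaluate per metadata slice: does the model do worse on one camera,
# or at night versus during the day?
for (camera, daytime), group in results.groupby(["camera", "daytime"]):
    acc = accuracy_score(group["label"], group["predicted"])
    print(f"camera={camera} daytime={daytime} n={len(group)} accuracy={acc:.2f}")
```

An evaluation service does the same kind of slicing, but computes and stores the results centrally so the whole team can query and compare them.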

How do you see Valor getting integrated into ML workflows? How are people using it on a day-to-day basis?

With our open-source offering of Valor, we’re hoping that it’ll get community adoption and integrate with an organization’s MLOps tech stack. It fills a gap. If an organization has its setup for doing model training and deployment, it’s probably missing this evaluation piece, and Valor was engineered so that it can easily integrate with everything upstream of model evaluation.

What motivated the decision to open-source Valor? How do you think it will shape the evolution of the product?

‘We think we have a unique and emerging view of MLOps, where we really want every piece to be not just scalable but also auditable.’
— Eric Korman, Striveworks

There were a few reasons for wanting to open-source it. The virtuous reason is that the ML community does a really good job of open-sourcing things. Everything in ML is built on top of open-source foundational pieces. All the deep learning libraries that are used are open-source. Underlying databases are open-source. So, we wanted to give back. 

Then, obviously, the business case: It’s a piece of our brand. It’s showing off what we can do. It’s also showing off our worldview. We think we have a unique and emerging view of MLOps, where we really want every piece to be not just scalable but also auditable—and, again, where the output is not a report but something that you can take action upon in an automated way. Another phrase we like to use: People are used to “infrastructure as code,” but we want to take that further and really do “process as code.” That gives you auditability of processes.

Where do you think Striveworks fits within the broader ML community?

Valor is a good example of one of the ways that we are trying to reach out and be an axon to connect to the rest of the MLOps community. We’ve done a really good job of building partnerships on the application layer—companies and industries we can partner with to enable their products or applications to be intelligent by putting Striveworks machine learning under the hood. Now, this is a way to do that sort of thing, but at the lower level—at the platform MLOps level.

How do you see the future of model evaluation? Where do you see Valor fitting into that future?

At Striveworks, we talk about the Day 3 Problem. On Day 1, you build a model. On Day 2, you deploy the model. Day 3 is when the model is in production, things change, and the model fails. Talking about Day 1, Day 2, and Day 3 implies a line, but really, at failure, you go back to Day 1. You want to refine and retrain your model to make it more performant. Striveworks has a lot of IP around that Day 3 Problem, not just through Valor but also through our model monitoring techniques and other tools.

How do we detect model failure? When we say failure, we don’t mean the network going down or some computer crashing so that your model’s not deployed; you can detect that straightforwardly. Rather, failures due to data drift are what we’re really trying to build technologies to detect. Valor does that by doing model evaluation against human ground truth. Our monitoring capability does that in an unsupervised way, where it detects changing input data and flags it.
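
As a rough sketch of that unsupervised idea, here’s what detecting a shift in one input feature could look like with a standard two-sample Kolmogorov-Smirnov test from SciPy; this illustrates the general technique, not Striveworks’ actual monitoring implementation:

```python
# A minimal sketch of unsupervised drift detection on a single feature,
# using a two-sample Kolmogorov-Smirnov test. Synthetic data for illustration.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # data like what the model was trained on
production = rng.normal(loc=0.4, scale=1.0, size=5_000)  # incoming data has shifted

statistic, p_value = ks_2samp(reference, production)
if p_value < 0.01:
    print(f"drift detected (KS statistic={statistic:.3f}, p={p_value:.2e})")
else:
    print("no significant drift detected")
```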

Going back to our viewpoint that everything should be a process and easily integrate with the rest of the pieces of the pipeline, that’s where we’re going with our offerings, including Valor. So, if model drift is detected or evaluation numbers go down, that’s not just an email that gets sent to someone; it can be fed into a retraining pipeline.

For example:

“Through monitoring, determine the data points where the model performs poorly.” 

“From that, automatically create an annotation job to get a human to annotate that data and get it retrained.” 

“Via Valor, compute metrics and see if the model actually improved on that newly annotated data compared to the previous model.”

It’s just really building this pipeline. It’s not going to automate everything—at least right now—but it’s going to automate the parts that are automatable, have a human do what humans excel at, and let computers do what computers excel at. 
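
As a minimal, hypothetical sketch of wiring those steps together, the pipeline could look something like the function below; the monitor, annotation queue, trainer, and evaluator objects are placeholders for an organization’s own tooling, not Striveworks components:

```python
def remediation_pipeline(model, monitor, annotation_queue, trainer, evaluator):
    """Hypothetical glue code for the monitor -> annotate -> retrain -> evaluate loop.
    All objects and method names are placeholders, not a real Striveworks API."""
    # 1. Monitoring flags the data points where the model performs poorly.
    hard_examples = monitor.find_poorly_performing(model)
    if not hard_examples:
        return model  # nothing to remediate

    # 2. Humans annotate the flagged data points.
    labeled = annotation_queue.request_labels(hard_examples)

    # 3. Retrain or fine-tune on the newly annotated data.
    candidate = trainer.retrain(model, labeled)

    # 4. Evaluate old and new models on the same annotated data and only
    #    promote the candidate if it actually improved (this assumes
    #    evaluate() returns a single comparable summary score).
    old_score = evaluator.evaluate(model, labeled)
    new_score = evaluator.evaluate(candidate, labeled)
    return candidate if new_score > old_score else model
```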

Looking ahead, what’s your vision for Striveworks? How does Valor align with these long-term goals? 

The lofty goal is that we’re the premier company for Day 3 technology: for model monitoring, detecting model failure, and model evaluation. Then, being able to do that in a way that makes models easy to remediate through retraining and fine-tuning. Valor fits into that as one of the critical components for identifying whether your models have problems.


Interested in model evaluation with Valor? Try it yourself. Get Valor from the Striveworks GitHub repository to start using it in your machine learning workflows today, and read Valor’s official documentation to learn more.