Understanding Performance Bias With the Valor Model Evaluation Service

Machine learning benchmarks like ImageNet, COCO, and the LLM Leaderboard typically target a single metric, such as accuracy for classification tasks or mean average precision for object detection. But for real-world problems, judging performance by a single metric is rarely a good idea and can even be misleading. Consider a fraud detection model: If 0.1% of transactions are fraudulent, then a model that predicts that no transaction is fraudulent will be 99.9% accurate, yet completely useless.
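To make the arithmetic concrete, here is a minimal sketch in plain Python (with hypothetical numbers) of that fraud example: a "model" that never flags fraud scores 99.9% accuracy while catching none of the fraudulent transactions.

# Hypothetical labels: 100 fraudulent transactions out of 100,000 (0.1%).
y_true = [1] * 100 + [0] * 99_900

# A "model" that predicts "not fraudulent" for every transaction.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(f"accuracy: {accuracy:.3f}")  # 0.999 -- looks excellent
print(f"recall:   {recall:.3f}")    # 0.000 -- catches zero fraud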

Even considering a host of other metrics, such as class-wise precision and recall, confusion matrices, or receiver operating characteristic (ROC) curves, will not give a complete picture. What these metrics lack is an understanding of performance bias: when a model performs worse on a particular segment of the data than on the data as a whole. The history of machine learning has plenty of newsworthy examples of performance bias, including healthcare, lending, and facial recognition models that are biased against people of color, and LLMs that exhibit geographic bias.

Striveworks now has an open-source tool, Valor, for understanding these different types of bias. This model evaluation service exposes performance bias by letting users define subsets of data through filters on the attributes and arbitrary metadata attached to Valor objects. It has first-class support for:

  • Simple data types (numeric data, strings, booleans)
  • Dates and times
  • Geospatial data (via GeoJSON)
  • Geometric data

Below, we explore how machine learning teams can use Valor to gauge these sorts of model biases.

What Is Valor?

Valor is an open-source model evaluation service created to assist machine learning practitioners and teams in understanding and comparing model performance. It’s designed to fit into a modern MLOps tech stack; in particular, Valor is the model evaluation service for the Striveworks end-to-end MLOps platform.

Valor does the following:

  • Computes various metrics for different task types, including classification (for arbitrary data modalities), object detection, and semantic segmentation
  • Stores them centrally for discoverability, shareability, and query-ability
  • Supports defining data subsets using metadata to enable analyses, such as bias detection
  • Maintains model lineage so that metrics can be trusted, allowing users to see exactly what went into the metrics and how they were computed

Valor runs as a back-end service that users interact with via a Python client. For detailed information on setting up and using Valor, see the official documentation.
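For orientation, pointing the Python client at a running Valor instance looks roughly like the following. This is a minimal sketch: the URL is a placeholder for wherever your deployment lives, and the connect() helper is the one described in the official documentation.

from valor import connect

# Connect the client to a running Valor back end (URL is a placeholder).
connect("http://localhost:8000")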

How Do I Use Valor to Understand Model Performance Bias?

Valor identifies model performance bias through its robust metadata and attribute filtering.

Metadata and Evaluation Filtering

To represent datasets, models, predictions, and ground truth data, the Valor Python client has the following fundamental classes:

  • valor.Dataset: Represents a dataset
  • valor.Datum: Represents a single element in a dataset, such as an image in a computer vision dataset, a row in a tabular dataset, or a chunk of text in a natural language processing dataset
  • valor.Model: Represents a predictive model
  • valor.GroundTruth: Represents a ground truth, linking an annotation with a dataset
  • valor.Prediction: Represents a prediction, linking an annotation with a dataset and model
  • valor.Annotation: Used to store ground truth and prediction class labels, bounding boxes, etc.

Using Valor, the basic workflow is as follows.

  1. Create a valor.Dataset, which we will call dataset in the examples below.
  2. Add valor.GroundTruth objects to it.
  3. Create a valor.Model, which we will call model in the examples below.
  4. Add valor.Prediction objects to it.
  5. Call one of the task-specific evaluation methods on the model, such as evaluate_classification or evaluate_detection. In rough code, this looks something like:

from valor import Dataset, Datum, GroundTruth, Model, Prediction, Annotation

# 1-2. Create a dataset and add ground truths to it.
dataset = Dataset.create("dataset name")
dataset.add_groundtruth(
    GroundTruth(datum=Datum(...), annotations=[Annotation(...), ...])
)

# 3-4. Create a model and add predictions to it.
model = Model.create("model name")
model.add_prediction(
    dataset, Prediction(datum=Datum(...), annotations=[Annotation(...), ...])
)

# 5. Run a task-specific evaluation.
model.evaluate_classification(dataset)

One of the powers of Valor is that it allows all of the above objects to have arbitrary metadata associated with them. Users can filter metadata and attributes (such as class label or bounding-box size) to define subsets of data and then use those subsets for evaluation. This provides a means for quantifying model performance on different segments of the data.

Based on these metadata and attributes, Valor users can pass different types of filters to evaluations.

Date and Time Filtering

Dates and times can be added as metadata, using Python's datetime library. For example:

from datetime import datetime, time

from valor import Datum

Datum(
    uid=<UID>,
    metadata={
        "date": datetime(year=2024, month=2, day=12),
        "time": time(hour=17, minute=49, second=25),
    },
)

Then, if we want to evaluate the performance of an object detection model on images taken during the day, we would do something like:

model.evaluate_detection(
    datasets=dataset,
    filter_by=[
        Datum.metadata["time"] >= time(hour=8),
        Datum.metadata["time"] <= time(hour=17),
    ],
)

Or, to know how a classification model performs for data since the year 2023, we would do:

model.evaluate_classification(
    datasets=dataset,
    filter_by=[Datum.metadata["date"] >= datetime(year=2023, month=1, day=1)],
)
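To turn a filtered evaluation into a bias check, one simple pattern is to run the same evaluation with and without the filter and compare the results. A rough sketch of what that might look like, assuming the returned evaluation object exposes the wait_for_completion() method and metrics attribute shown in Valor's documented examples:

# Baseline: evaluate on the full dataset.
baseline = model.evaluate_classification(datasets=dataset)
baseline.wait_for_completion()

# Filtered: evaluate only on data collected since 2023.
recent = model.evaluate_classification(
    datasets=dataset,
    filter_by=[Datum.metadata["date"] >= datetime(year=2023, month=1, day=1)],
)
recent.wait_for_completion()

# A large gap between the two sets of metrics points to performance bias
# on the recent segment of the data.
print(baseline.metrics)
print(recent.metrics)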

Simple Data Type Filtering

Valor supports the standard data types (int, float, str, bool) as metadata values and allows filtering on all of them.

For example, demographic information may be attached as:

Datum(uid=<UID>, metadata={"sex": "Female", "age": 62, "race": "Pacific Islander", "hispanic_origin": False})

Then, to evaluate how a model performs on all female- and Hispanic-identifying people under the age of 50:

model.evaluate_classification(
    datasets=dataset,
    filter_by=[
        Datum.metadata["sex"] == "Female",
        Datum.metadata["age"] < 50,
        Datum.metadata["hispanic_origin"] == True,
    ],
)
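A natural extension is to sweep over several segments and collect metrics for each, which makes gaps between segments easy to spot. The following is a rough sketch under the same assumptions as before (the segment definitions are hypothetical, and the evaluation object is assumed to expose wait_for_completion() and metrics):

# Hypothetical segments to compare against one another.
segments = {
    "female": [Datum.metadata["sex"] == "Female"],
    "male": [Datum.metadata["sex"] == "Male"],
    "under_50": [Datum.metadata["age"] < 50],
    "50_and_over": [Datum.metadata["age"] >= 50],
}

metrics_by_segment = {}
for name, filters in segments.items():
    evaluation = model.evaluate_classification(datasets=dataset, filter_by=filters)
    evaluation.wait_for_completion()
    metrics_by_segment[name] = evaluation.metrics

# Comparing metrics_by_segment across segments surfaces performance bias.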

Metadata can be attached to objects besides Datums. For example, suppose we’re evaluating an object detection model for a self-driving vehicle, and we want to know how well the model performs on pedestrians in the road versus not in the road. In this case, we can attach a boolean metadata field to every person-bounding-box annotation and use this to filter object detection evaluation:

dataset.add_groundtruth(
    GroundTruth(
        datum=Datum(...),
        annotations=[
            Annotation(
                task_type=TaskType.OBJECT_DETECTION,
                bounding_box=person_bbox,
                labels=[Label(key="class", value="person")],
                metadata={"in_road": True},
            ),
            ...,
        ],
    )
)

model.evaluate_detection(dataset, filter_by=[Annotation.metadata["in_road"] == True])

We explore this particular example in end-to-end detail in one of our sample notebooks.

Filtering on Geospatial Metadata

Valor supports GeoJSON dicts as metadata, which can then be filtered by geometric operations, such as checking if a point is inside a region or if two regions intersect. For example, suppose every piece of data has a location of collection. We can add this as metadata to the datum:

Datum(uid=<UID>, metadata={"location": {"type": "Point", "coordinates": [-97.7431, 30.2672]}})

Now, if we want to see how a model performs on data that was collected from a certain city, we can do the following (where city_geojson is a GeoJSON dict specifying the city):

model.evaluate_classification(
    datasets=dataset,
    filter_by=[Datum.metadata["location"].inside(city_geojson)],
)
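For reference, city_geojson is just a standard GeoJSON geometry dict. A polygon roughly outlining the city might look like the following (the coordinates here are purely illustrative):

# A hypothetical GeoJSON Polygon covering the area of interest
# (longitude/latitude pairs; the first and last points must match).
city_geojson = {
    "type": "Polygon",
    "coordinates": [[
        [-97.94, 30.13],
        [-97.56, 30.13],
        [-97.56, 30.52],
        [-97.94, 30.52],
        [-97.94, 30.13],
    ]],
}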

Filtering on Geometric Metadata

Finally, for geometric tasks (such as object detection and segmentation), we can filter regions by geometric properties (such as area). For example, to evaluate an object detection model on bounding boxes with an area of less than 100,000 square pixels, we can use:

model.evaluate_detection(
    dataset,
    filter_by=[Annotation.bounding_box.area < 100000],
)

A Tool for Understanding Model Performance in the Real World

Valor is a game changer when it comes to understanding model performance bias. By filtering model evaluations based on metadata and attributes, machine learning practitioners gain a world of insight into how their models perform on whole datasets and, crucially, on different segments within a single dataset. Most importantly, this information is essential to understanding model performance in the real world.


We encourage you to experiment with Valor and let us know how you use it to evaluate your ML models. Check out the Valor GitHub repository to start using it in your machine learning workflows today, and read Valor’s official documentation to learn more.