What Is Data Lineage?

Data lineage refers to the full history of data points and actions taken on them throughout a machine learning (ML) workflow. By inspecting a workflow’s data lineage, ML practitioners can observe the factors contributing to an ML model’s output and understand how that model arrived at its decisions. Exploring data lineage also reveals factors that contribute to errors and anomalies in ML workflows, including model drift.

Striveworks holds a patent for a unique data lineage process that enhances transparency and auditability for ML models. This process automatically captures activity involving data throughout workflows—including calls to external services—saving time and resources for data science teams.

To learn more about data lineage and why it matters for AI and ML, we sat down with Matthew Griffin, the Striveworks software engineer who was awarded the patent. Here, he explains how to think about data lineage and why organizations pursuing AI need to pay attention.

***

Matthew Griffin, the Striveworks software engineer who was awarded the patent for a new data lineage process.

Striveworks: Let’s start at a super high level. What is data lineage, exactly?

Matthew Griffin: It’s easiest to think about data lineage as the family tree of data. It’s the path that data took from its source to its current destination. Looking at data lineage gives you answers to questions, like “What is that data?”, “Where did you get it?”, “What has it been through?”, “How was it used?”…

You can think about that one step at a time. “We had Data X and we processed it through Pipeline X—that produced Data Y.” Now, think about that process for Data Y: “We got Data Y from Pipeline X, but now we process it through Pipeline Y, so we have Data Z.” 

You can build a chain and look at it across the entire system to get this unified picture of lineage. That’s where it’s most useful—where it’s unified across multiple levels and where you can see: “We trained a model on Dataset X, but that came from seven years ago and the dataset we used was processed 20 times before it got to us.”
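The chain described above can be sketched as a small graph that records one step at a time and then walks backward from any data point. This is a minimal illustration only, not the Striveworks implementation: the `LineageGraph` class and its `record` and `history` methods are hypothetical names invented for this example, and the dataset and pipeline labels are taken from the explanation above.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    # Maps each output dataset to (pipeline, list of input datasets).
    # Hypothetical structure for illustration only.
    produced_by: dict = field(default_factory=dict)

    def record(self, inputs, pipeline, output):
        """Record that `pipeline` turned `inputs` into `output`."""
        self.produced_by[output] = (pipeline, list(inputs))

    def history(self, dataset, _seen=None):
        """Return the recorded steps that led to `dataset`, oldest first."""
        seen = set() if _seen is None else _seen
        if dataset in seen or dataset not in self.produced_by:
            return []  # already visited, or a source with no recorded parents
        seen.add(dataset)
        pipeline, inputs = self.produced_by[dataset]
        steps = []
        for parent in inputs:
            steps.extend(self.history(parent, seen))
        steps.append((tuple(inputs), pipeline, dataset))
        return steps


graph = LineageGraph()
graph.record(["Data X"], "Pipeline X", "Data Y")
graph.record(["Data Y"], "Pipeline Y", "Data Z")

chain = graph.history("Data Z")
for inputs, pipeline, output in chain:
    print(f"{', '.join(inputs)} -> {pipeline} -> {output}")
```

Because every step is stored against its output, asking for the history of any data point returns the whole path back to its sources, which is the unified picture described above.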

So, data goes through steps and manipulations to produce an insight. Data lineage is a record of those steps until you get to this end point?

Exactly—but it’s also continuously evolving as processes are taking place in the system. Any end point is just the specific point in time when you decide to take a look at it. A healthy system that’s looking at lineage should be constantly evolving. 

Is data lineage the same thing as data provenance?

It depends on how you choose to define these concepts. When people talk about data lineage, they tend to focus more on the steps and transformations that your data is processed through, the family tree component. When people talk about provenance, they are usually talking about data governance. They’re thinking about it from the perspective of the metadata: “Who’s the author?”, “Were they authorized to do that at the time?”, “How was this data generated?”

They might ask questions like, “Is this aerial drone footage?” or “Is it an underwater security camera?” 

I feel like there’s not necessarily a reason to create a big separation because you could inject some of that same authorship information in a family tree structure and then examine that with the same sort of process questions as data lineage. If you combine some of that information, I think you can get a more cohesive view and you can answer richer questions.

Why is data lineage important for data science and machine learning?

It’s important to be able to explain why you’re making a decision. If all your decisions are arbitrary then, I mean, cool, but how are you successful? You’d have no explanation. Regardless of the business and regardless of what you’re doing, if you can’t explain what it is and why you’re doing it, maybe there’s not really value there. Maybe you just happen to be lucky.

If that process is neatly contained in one person’s job, it becomes very simple. You can simply ask the analyst, “Hey, why are you making this decision?” When you think about a machine learning process, there’s not necessarily an individual who is going to have that answer, especially as you move toward more modern, deep learning techniques. The models that people use become increasingly difficult to understand because we’re talking about models that have billions of parameters. If you don’t record the lineage, it’s going to be really hard to explain why a model made a decision. 

With certain explainability tools, you may be able to explain it. For example, “It’s using this metric that we’re feeding it, and that’s the primary driver of this decision.” But then you have to ask the next “Why?” Maybe Parameter A is what’s really deciding “make this decision” or “don’t make this decision.” But is there something in the training data that would suggest that’s a meaningful parameter or not? 

If you don’t record the lineage—even that first step of lineage to just say this model was trained with this training data—you won’t be able to go back and say, “Hey, we looked at the training data, and the reason why this parameter is used a lot is that it corresponds almost directly with the decision we’re trying to make.”

What are the challenges and risks of not tracking data lineage in machine learning?

Let me give you an example. There was an organization that was using a particular analytic, and they were distributing it widely to departments within that organization so they could make decisions based on it. An ML consultant then discovered the way they were running that analytic was just wrong. The mathematical formula they were applying was not at all correct and would lead to inflated numbers for this metric. The error was pointed out to the organization, but they didn’t have any system for tracking where their data was going. There was no way for them to go to all the people they had distributed it to and say, “Hold on! That number was very wrong.”

Thinking about it from the lineage perspective, if they had distributed that analytic in a manner that captured the data lineage, they would know the exact people they would need to go to. They would know where all those further derivations of that data went, onward and outward until the ultimate decision points in those processes. They could potentially unwind everything. 
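To make the "onward and outward" idea concrete, here is a rough sketch of walking recorded lineage forward from a flawed analytic to everything derived from it. All of the artifact names and the `downstream` helper are hypothetical, made up for this illustration; a real system would read these edges from its lineage store.

```python
from collections import deque

# Hypothetical edges: each artifact maps to the things derived from it.
derived = {
    "flawed_metric": ["finance_report", "ops_dashboard"],
    "finance_report": ["budget_decision"],
    "ops_dashboard": ["staffing_forecast"],
    "staffing_forecast": ["hiring_decision"],
}


def downstream(artifact, edges):
    """Breadth-first walk over everything derived from `artifact`."""
    seen = {artifact}
    queue = deque([artifact])
    order = []
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:  # guard against visiting a node twice
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = downstream("flawed_metric", derived)
print(affected)
```

The returned list is the set of reports and decisions that would need to be revisited, nearest derivations first.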

You were awarded the patent for Striveworks’ data lineage process. How did you come up with the idea for it?

We had in our imagination a picture of what the overall Striveworks system could do. Part of that was, “Hey, I’ve got some random data point. How did I get here?” That led to this idea of, “Hey, if we need to know where any one data point came from, then we need to record all of the data points that are being used to make that data point.”

There were some other components in the platform that suggested the architecture we ultimately developed, but one of those things was replication. Not like, “I want to do this kind of the same way,” but “I want to run exactly the same workflow that was run yesterday.” There are some components that are somewhat beyond your control—maybe there’s random seeding that happens. But if we captured all of the network interactions, we could pretty much just give them back, which led to the approach of “Hey, we’ll just stick a proxy on all of the workflow tasks.”

Obviously, we want to combine that with what’s being done in the workflow itself and the steps that the workflow is taking. So, typically, when you’re running a workflow, there’s some component that’s running it. If it exposes some stream of events or some database that we can watch, then we can combine those events with what we see happening in the proxy to produce this combined picture. So, that’s how we get to where we are, with: “We’ll stick a proxy on things. We’ll listen to the workflow engine. Then we’ll combine those results and have a unified picture of lineage across the system.”
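As a loose illustration of that combining step, the sketch below joins two streams on a shared task identifier: events from a workflow engine and network calls seen by a proxy. It is not the patented process. The record shapes, field names, URLs, and the `unify` function are all assumptions invented for this example.

```python
from collections import defaultdict

# Hypothetical workflow-engine events: which task ran as which step.
engine_events = [
    {"task_id": "t1", "step": "fetch"},
    {"task_id": "t2", "step": "train"},
]

# Hypothetical proxy records: network calls observed for each task.
proxy_records = [
    {"task_id": "t1", "method": "GET", "url": "https://example.com/raw.csv"},
    {"task_id": "t2", "method": "GET", "url": "https://example.com/clean.csv"},
    {"task_id": "t2", "method": "PUT", "url": "https://example.com/model.bin"},
]


def unify(events, records):
    """Attach the proxy's observed calls to each workflow step."""
    calls = defaultdict(list)
    for record in records:
        calls[record["task_id"]].append((record["method"], record["url"]))
    return [
        {
            "step": event["step"],
            "task_id": event["task_id"],
            "calls": calls.get(event["task_id"], []),
        }
        for event in events
    ]


unified = unify(engine_events, proxy_records)
for entry in unified:
    print(entry["step"], entry["calls"])
```

The workflow engine says which step ran and the proxy says what data that step touched, so joining the two gives one record per step with its inputs and outputs.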

How would you track data lineage without having a standard process like that?

It ranges from difficult to literally impossible. It may be that somebody did something on a Friday night before going home, didn’t write down any of what they were doing, shipped a model, and then went on a yearlong sabbatical into the wilderness without cell coverage. At that point, you’re never going to know how you got that model—even if it’s making you millions of dollars. It’s just: “It showed up. We knew it came from him. He’s gone. How do you find out its details? We’ll never know.”

Does it get even more complex when you have an embedding model that is transforming data and feeding it to downstream models?

Substantially, yes.

How does data lineage help with remediating machine learning models that have drifted?

Lineage isn’t doing anything directly. It’s more informative. It’s more like auditability. But if you also have some system that’s able to help you identify drift. … Without lineage, you know you need to remediate that model, but you might be starting from zero. You obviously have the model and you can fine-tune that model on additional data to hopefully alleviate the problem, but if you had been recording the lineage data, then you know the process that was used to train the model and you know what input data was used as the training set. It’s easier to retrain if you know that stuff. 

You could even look at that training set and see if it’s part of the problem. If that’s the case, you could find out other models that used that training set and consider them for remediation as well—before they run into a problem in production and need remediation. 
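A simple sketch of that lookup, assuming the lineage system has recorded which training sets each model used. The model names, dataset names, and the `models_sharing_data` helper are hypothetical and exist only for this example.

```python
# Hypothetical lineage records: model name -> training sets it was trained on.
training_sets = {
    "fraud_v1": ["transactions_2021"],
    "fraud_v2": ["transactions_2021", "transactions_2022"],
    "churn_v1": ["customers_2022"],
}


def models_sharing_data(drifted_model, records):
    """List other models trained on any dataset the drifted model used."""
    suspect = set(records[drifted_model])
    return sorted(
        name
        for name, datasets in records.items()
        if name != drifted_model and suspect & set(datasets)
    )


candidates = models_sharing_data("fraud_v1", training_sets)
print(candidates)
```

Here the second fraud model shares a training set with the drifted one, so it becomes a candidate for review before it shows problems in production.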

What’s the one thing that everybody should understand about data lineage?

There’s probably a better analogy, but data lineage is a bit like insurance. If you never have a problem, you never need it. But if you have something that’s going on in your system that’s unexplained, then having the lineage is going to be an immense help to you. It’s going to be a lot smoother figuring out where things went wrong.

Want to learn more about Striveworks’ data lineage process and other ways we’re making MLOps disappear? Connect with us at info@striveworks.com.