Kiran Karkera
3 min read · Jul 10, 2020
Photo by Brett Jordan on Unsplash

Wikipedia on Provenance:

Provenance (from the French provenir, ‘to come from/forth’) is the chronology of the ownership, custody or location of a historical object. The term was originally mostly used in relation to works of art but is now used in similar senses in a wide range of fields such as wine (elided for clarity).

An example of art provenance records for the curious.

Why is Provenance important for AI?

Project AGI, in a Medium post, describes the following issues with AI Safety:

Aspects of AI Safety

Consider the following scenarios:

Models for self driving cars

Self-driving cars have models that are trained on data-sets representative of the real world. The cars are running successfully (safely) in the real world. Next, some fresh data is available and the model is trained and shows improvement (on overall metrics) on this data-set. However, in testing, the cars seem to ignore some stop signs.

It turns out the additions to the data-set contain perturbations to stop signs, which the model is unable to recognize, causing the self-driving car to drive past stop signs without stopping. (In this case, the real stop signs were perturbed, not images of them.)

Models for predicting Alzheimer’s disease

A startup built a technology to predict neurological conditions such as Alzheimer’s disease from linguistic features in the speech of patients under test. It worked well in the lab and had potential for wider roll-out, until it was discovered that it worked better on English speakers of a particular dialect than on the general population.

The examples above show that data can be subject to biases, perturbations and poor target distributions. Any view of an AI system must include the other actors in the system; for simplicity, let us restrict it to the triumvirate of

  1. Data
  2. Data scientist
  3. Algorithms.

For an AI system to work safely, all three entities must work together. Without data, no AI models can be trained. Once data becomes available, the data scientist must choose models appropriate for the task and decide, based on multiple metrics, whether the model is fit for real-world deployment.

It is not unthinkable for a company to find a data-set, train a model and put it into production, especially if it is not a safety critical situation. As #AISafety kicks in, legal, environmental and ethical perspectives will ask questions such as:

What is the provenance of the data-set?

  • Is the data representative or biased?

What is the provenance of the algorithms used?

  • Have measures been taken to prevent over-fitting?
  • What are the metrics used to evaluate this data-set?

What is the provenance of the data scientist?

  • Have they been trained in AI Safety?

Questions provenance could help answer

As we start to rely on AI in safety-critical systems (such as self-driving cars) and as AI systems become more robust, failure analysis will look for answers to questions such as:

  • What caused this model to have lower accuracy than expected?
  • Was the data-set updated and by whom?
  • Was the updated data-set biased in a certain way?
  • Was a new/improved algorithm used?
  • Was the test set sufficiently representative?
  • Can we reproduce the results by retraining the model on the same data?
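Questions like these become tractable only if every training run records immutable fingerprints of its inputs. As a minimal sketch (the function names and record fields here are illustrative, not part of any standard), a content hash can tie a model to the exact data it was trained on, so "was the data-set updated?" has a checkable answer:

```python
import hashlib
from datetime import datetime, timezone

def fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 content hash of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def record_training_run(train_path: str, test_path: str, author: str) -> dict:
    """Capture the provenance facts a failure analysis would ask for."""
    return {
        "author": author,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "train_data_sha256": fingerprint(train_path),
        "test_data_sha256": fingerprint(test_path),
    }
```

If a retrained model behaves differently, comparing the stored `train_data_sha256` against the current data-set's hash immediately tells us whether the data changed underneath it.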

Existing solutions

Products already exist that tie together the triad of {data, data scientist and algorithms} in a provenance chain. Kaggle scripts and kernels enable one to discern

a) Popularity of the data scientist who authored a script

b) Popularity of the algorithm/script

c) Version of the algorithm used

However, that alone is insufficient. We also need to tie in the provenance of data-sets, and hold that information in a public database.

Provenance capability will enable us to answer difficult questions about an AI model, such as:

  • Who trained the model, and when was it trained?
  • What was its performance post training?
  • What versions of training and test data were used?
  • Which environment was it trained in?
  • What algorithms were used in training?
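One way to make those answers machine-readable is a structured provenance record stored alongside the model artifact. The sketch below is a hypothetical shape for such a record (the field names are my own, not a published schema):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ModelProvenance:
    """Illustrative provenance record for one trained model."""
    trained_by: str          # who trained the model
    trained_at: str          # when it was trained (ISO 8601 timestamp)
    metrics: dict            # performance post training, e.g. {"accuracy": 0.93}
    train_data_version: str  # version or content hash of the training data
    test_data_version: str   # version or content hash of the test data
    environment: str         # where it was trained, e.g. "python3.8-cuda10"
    algorithm: str           # algorithm and version, e.g. "xgboost==1.1.0"

    def to_json(self) -> str:
        """Serialize the record for storage in a shared registry."""
        return json.dumps(asdict(self), indent=2)
```

A record like this, published to the kind of public database suggested above, would let an auditor answer every question in the list without access to the team that trained the model.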

In the next article, we consider if provenance standards can help us answer the questions posed above.
