Production ML Papers to Know

Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have already covered a few papers in our newsletter, Continual Learnings, and on Twitter. Thanks to the positive reception, we decided to turn these write-ups into blog posts.

Putting Responsible AI into Practice

The challenge

To prevent, or at least mitigate, the unwanted consequences of ML systems, we need a good understanding of how and where they can introduce harm.

This paper frames harm as biases that appear at different stages of the ML process.

Datasets we use as ML practitioners may have historical, representation, or measurement biases due to their creation processes. Turning data into model outputs can introduce aggregation, learning, and evaluation biases. Finally, productionizing the model could create deployment biases.

So what are these biases, how do we identify them, and how can we deal with them?

The biases

Historical bias is present in collected data when it reflects real-world biases that then get encoded in an algorithm. For example, word embeddings trained on text from a particular era can reflect the harmful stereotypes of that era. To mitigate it, you can try over- or under-sampling parts of the data to generate a dataset that counteracts the bias.
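Identifying historical bias can start with probing what a learned representation associates with sensitive terms. Below is a minimal sketch in the spirit of embedding-association tests; the random `embeddings` dict is only a stand-in for vectors you would load from a pre-trained model, and the word lists are illustrative rather than taken from the paper.

```python
import numpy as np

# Stand-in embeddings: in practice you would load pre-trained vectors here.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in
              ["he", "man", "she", "woman", "nurse", "engineer", "doctor"]}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_association(word, male=("he", "man"), female=("she", "woman")):
    """Positive values lean 'male', negative lean 'female' in embedding space."""
    vec = embeddings[word]
    to_male = np.mean([cosine(vec, embeddings[t]) for t in male])
    to_female = np.mean([cosine(vec, embeddings[t]) for t in female])
    return to_male - to_female

for occupation in ["nurse", "engineer", "doctor"]:
    print(occupation, round(gender_association(occupation), 3))
```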

Representation bias occurs when the data used for modeling under-represents part of the population, resulting in a model that will not generalize well for that subpopulation. Many well-known failures of image recognition systems stem from this bias. A possible mitigation is to adjust the sampling approach so that groups are more appropriately represented.
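One way to adjust the sampling is to resample so that each group contributes equally to the training data. Here is a minimal pandas sketch, assuming a hypothetical `group` column that identifies each example's subpopulation; equal representation is only one possible target.

```python
import pandas as pd

# Toy dataset: the hypothetical `group` column marks the subpopulation each
# example belongs to, and group "b" is under-represented.
df = pd.DataFrame({
    "feature": range(10),
    "group": ["a"] * 8 + ["b"] * 2,
})

# Oversample every group (with replacement) up to the size of the largest one.
target_size = df["group"].value_counts().max()
balanced = (
    df.groupby("group", group_keys=False)
      .apply(lambda g: g.sample(n=target_size, replace=True, random_state=0))
      .reset_index(drop=True)
)

print(balanced["group"].value_counts())
```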

Measurement bias occurs during feature and target selection. Features and labels can be poor proxies for the outcomes you actually care about, especially if they oversimplify what they are supposed to measure (e.g. GPA as a proxy for success) or if they encode human decisions (e.g. human-assigned ratings).

Aggregation bias happens when a “one-size-fits-all” model is used for data in which there are underlying groups or types of examples that should be treated differently. This can lead to a model that is not optimal for any group, or one that is fit only to the dominant population.
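To see why a single pooled model can fail every group, here is a minimal synthetic sketch: two subgroups with opposite trends, fit once with one model and once per group. The data and the per-group-model fix are illustrative only; the right remedy depends on the application.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data in which two latent subgroups follow opposite trends.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 1))
group = rng.integers(0, 2, size=200)
y = np.where(group == 0, 2.0 * x[:, 0], -2.0 * x[:, 0]) + rng.normal(0, 0.1, 200)

# A single pooled model fits neither group well...
pooled = LinearRegression().fit(x, y)
print("pooled R^2:", round(pooled.score(x, y), 3))

# ...while a model per group (or a group-aware model) recovers both trends.
for g in (0, 1):
    mask = group == g
    per_group = LinearRegression().fit(x[mask], y[mask])
    print(f"group {g} R^2:", round(per_group.score(x[mask], y[mask]), 3))
```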

Learning bias appears when modeling choices compound performance disparities across subpopulations. For example, the choice of objective function might skew performance, or the choice of a smaller model might amplify poor performance on under-represented data, because its limited capacity leads it to preserve information from frequent features only.
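One lever here is the objective itself: re-weighting examples changes which errors the model is willing to make. Below is a minimal scikit-learn sketch on synthetic data, where the rare class stands in for an under-represented subpopulation; this is one common technique, not the paper's prescribed fix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic, heavily imbalanced data: the rare class stands in for an
# under-represented subpopulation.
rng = np.random.default_rng(0)
n_major, n_minor = 950, 50
X = np.vstack([rng.normal(0.0, 1.0, (n_major, 2)),
               rng.normal(1.0, 1.0, (n_minor, 2))])
y = np.array([0] * n_major + [1] * n_minor)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Recall on the rare class shows how the choice of objective shifts the errors.
print("unweighted recall:", recall_score(y, plain.predict(X)))
print("re-weighted recall:", recall_score(y, weighted.predict(X)))
```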

Evaluation bias arises from the need to compare models against each other on established benchmark data. That data may not represent the population the model will actually be used on, and it may itself suffer from historical, representation, or measurement bias. Mitigations include evaluating a broader range of metrics on more granular subsets of the data, or developing new benchmarks.
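Evaluating on more granular subsets can be as simple as reporting every metric per group rather than only in aggregate. A minimal sketch, with hypothetical `group`, `y_true`, and `y_pred` columns standing in for real evaluation data:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Placeholder predictions: the `group` column marks the subpopulation.
results = pd.DataFrame({
    "group":  ["a", "a", "a", "b", "b", "b"],
    "y_true": [1, 0, 1, 1, 1, 0],
    "y_pred": [1, 0, 1, 0, 0, 0],
})

# The aggregate number hides the fact that group "b" is served poorly.
print("overall accuracy:",
      round(accuracy_score(results["y_true"], results["y_pred"]), 2))

for name, s in results.groupby("group"):
    acc = accuracy_score(s["y_true"], s["y_pred"])
    rec = recall_score(s["y_true"], s["y_pred"], zero_division=0)
    print(f"group {name}: accuracy={acc:.2f} recall={rec:.2f}")
```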

Deployment bias occurs when a model is used in production in a way that differs from how it was intended to be used when it was built. For example, algorithms intended to predict a person’s likelihood of committing a future crime can be used “off-label” to determine the length of a sentence. Mitigations can be challenging, but include contextualizing the model’s output with other sources of information and judgement.

The upshot

As AI becomes more widely adopted, we as ML practitioners need to understand the broader impact of the systems we build and the unintended harm they may cause.

This paper is a good starting point for thinking about these consequences more systematically. You can find the paper here.