Artificial Intelligence is powerful, but it can be misused. Machine Learning, the most widely used family of AI techniques, relies heavily on data. It is a common misconception that AI is absolutely objective: AI is objective only in the sense that it learns what humans teach it, and the data humans provide can be highly biased. In 2016, COMPAS, an algorithm used for recidivism prediction, was found to produce a much higher false positive rate for Black people than for white people. XING, a job platform similar to LinkedIn, was found to rank less qualified male candidates higher than more qualified female candidates. Publicly available commercial face recognition services from Microsoft, Face++, and IBM were found to achieve much lower accuracy on women with darker skin. Bias in ML is almost ubiquitous in applications that involve people, and it has already harmed people in minority or historically disadvantaged groups. Everyone, not only people in minority groups, should care about bias in AI. If no one does, the next person to suffer from biased treatment may well be one of us.

Google has a great little video explaining bias and fairness in plain terms.

Below is a slightly more technical exploration of fairness using TensorFlow and the Adult Census Income dataset.

The dataset

The Adult Census Income dataset is commonly used in machine learning literature. The data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker.

Each example in the dataset contains the following demographic data for a set of individuals who took part in the 1994 Census:

Numeric Features

  • capital_gain: Capital gain made by the individual, represented in US Dollars.
  • hours_per_week: Hours worked per week.
  • fnlwgt: The number of individuals the Census Organization believes that set of observations represents.
  • education_num: An enumeration of the categorical representation of education. The higher the number, the higher the education that individual achieved. For example, an education_num of 11 represents Assoc_voc (associate degree at a vocational school), an education_num of 13 represents Bachelors, and an education_num of 9 represents HS-grad (high school graduate).
  • age: The age of the individual in years.
  • capital_loss: Capital loss made by the individual, represented in US Dollars.

Categorical Features

  • gender: Gender of the individual available only in binary choices: Female or Male.
  • education: The highest level of education achieved for that individual.
  • occupation: The occupation of the individual. Examples include: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, and more.
  • relationship: The relationship of each individual in a household. Examples include: Wife, Own-child, Husband, Not-in-family, Other-relative, and Unmarried.
  • workclass: The individual’s type of employer. Examples include: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, and Never-worked.
  • marital_status: Marital status of the individual. Examples include: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, and Married-AF-spouse.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Black, and Other.
  • native_country: Country of origin of the individual. Examples include: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, and more.

Prediction Task

The prediction task is to determine whether a person makes over $50,000 US Dollars a year.


  • income_bracket: Whether the person makes more than $50,000 US Dollars annually.

All the examples extracted for this dataset meet the following conditions:
  • age is 16 years or older.
  • The adjusted gross income (used to calculate income_bracket) is greater than $100 USD annually.
  • fnlwgt is greater than 0.
  • hours_per_week is greater than 0.


Let’s import some modules that will be used throughout the notebook.

With the modules now imported, we can load the Adult dataset into a pandas DataFrame data structure.
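The import and loading cells are not reproduced in this text, so the following is only a minimal sketch of what they might look like. It assumes the raw file is headerless and comma-separated, with columns in the order listed in COLUMNS (the names follow the feature descriptions above). The helper name load_adult and the two inline sample rows are illustrative stand-ins for a real file path or URL.

```python
import io

import pandas as pd

# Assumed column order of the raw file; names follow the feature list above.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket",
]


def load_adult(source):
    """Read a headerless, comma-separated Adult file into a DataFrame."""
    # skipinitialspace drops the blank that follows each comma in the raw file.
    return pd.read_csv(source, names=COLUMNS, skipinitialspace=True)


# Two illustrative rows in the raw file's layout. In practice, `source`
# would be the path or URL of the downloaded data file.
SAMPLE = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
train_df = load_adult(SAMPLE)
print(train_df[["age", "education", "gender", "income_bracket"]])
```

Passing names= explicitly matters here because the file is assumed to have no header row; without it, pandas would treat the first record as column labels.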

Analyzing the Adult Dataset with Facets

Exploratory data analysis (EDA) is used to understand, summarize, and analyze the contents of a dataset, usually to investigate a specific question or to prepare for more advanced modeling. It’s important to understand your dataset before diving straight into the prediction task.

Some important questions to investigate when auditing a dataset for fairness:

  • Are there missing feature values for a large number of observations?
  • Are there features that are missing that might affect other features?
  • Are there any unexpected feature values?
  • What signs of data skew do you see?

Facets contains two robust visualizations to aid in understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive. To start, we can use Facets Overview to quickly analyze the distribution of values across the Adult dataset.

We can see from reviewing the missing columns for both numeric and categorical features that there are no missing feature values, so that is not a concern here.

By looking at the min/max values and histograms for each numeric feature, we can pinpoint any extreme outliers in our data set. For hours_per_week, we can see that the minimum is 1, which might be a bit surprising, given that most jobs typically require multiple hours of work per week. For capital_gain and capital_loss, we can see that over 90% of values are 0. Given that capital gains/losses are only registered by individuals who make investments, it’s certainly plausible that fewer than 10% of examples would have nonzero values for these features, but we may want to take a closer look to verify that the values for these features are valid.

In looking at the histogram for gender, we see that over two-thirds (approximately 67%) of examples represent males. This strongly suggests data skew, as we would expect the breakdown between genders to be closer to 50/50.
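The same checks that Facets Overview surfaces visually (missing values, numeric ranges, share of zeros, category balance) can also be approximated directly in pandas. The sketch below is one possible way to do it; the audit helper and the toy rows are hypothetical and are not taken from the Adult dataset.

```python
import pandas as pd


def audit(df, numeric_cols, categorical_cols):
    """Return missing counts, numeric ranges with zero shares, and category shares."""
    missing = df.isna().sum()
    numeric = pd.DataFrame({
        "min": df[numeric_cols].min(),
        "max": df[numeric_cols].max(),
        # Fraction of rows where the value is exactly 0.
        "zero_share": (df[numeric_cols] == 0).mean(),
    })
    shares = {c: df[c].value_counts(normalize=True) for c in categorical_cols}
    return missing, numeric, shares


# Hypothetical toy rows, used only to exercise the helper.
toy = pd.DataFrame({
    "hours_per_week": [40, 1, 50, 38],
    "capital_gain": [0, 0, 0, 2174],
    "gender": ["Male", "Male", "Female", "Male"],
})
missing, numeric, shares = audit(
    toy, ["hours_per_week", "capital_gain"], ["gender"]
)
print(missing)
print(numeric)
print(shares["gender"])
```

On the real DataFrame, a category share far from what you would expect in the population (such as a gender split far from 50/50) is the kind of skew worth investigating further.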

A Deeper Dive

To further explore the dataset, we can use Facets Dive, a tool that provides an interactive interface where each individual item in the visualization represents a data point. But to use Facets Dive, we need to convert our data to a JSON array. Thankfully, the DataFrame method to_json() takes care of this for us.

Run the cell below to perform the data transform to JSON and also load Facets Dive.
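The cell itself is not included in this text. As a rough sketch of the JSON step only, the snippet below converts a DataFrame to a JSON array of records with to_json(orient="records"). The to_dive_json helper, its optional sampling, and the toy rows are assumptions for illustration; embedding the result in the Facets Dive HTML element is left out.

```python
import json

import pandas as pd


def to_dive_json(df, sample_size=None, seed=0):
    """Serialize rows as a JSON array of objects, optionally sampling first."""
    if sample_size is not None and sample_size < len(df):
        # Sampling keeps the interactive view responsive on large datasets.
        df = df.sample(sample_size, random_state=seed)
    return df.to_json(orient="records")


# Hypothetical toy rows.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad"],
    "income_bracket": [">50K", "<=50K"],
})
jsonstr = to_dive_json(toy)
records = json.loads(jsonstr)
print(records)
```

The orient="records" setting matters: the default orientation produces a column-keyed object rather than one object per data point.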

Use the menus on the left panel of the visualization to change how the data is organized:

  1. In the Faceting | X-Axis menu, select education, and in the Display | Color and Display | Type menus, select income_bracket.

  2. Next, in the Faceting | X-Axis menu, select marital_status, and in the Display | Color and Display | Type menus, select gender.

Some fairness questions to ask using Facets:

  • What’s missing?
  • What’s being overgeneralized?
  • What’s being underrepresented?
  • How do the variables, and their values, reflect the real world?
  • What might we be leaving out?

Higher education levels generally tend to correlate with a higher income bracket. An income level of greater than $50,000 is more heavily represented in examples where the education level is a Bachelor’s degree or higher. In most marital_status categories, the distribution of male vs. female values is close to 1:1. The one notable exception is Married-civ-spouse, where males outnumber females by more than 5:1. Given that the Facets Overview histogram already showed a disproportionately high representation of men in our data set, we can now infer that it’s married women specifically who are underrepresented in our data.
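A cross-tabulation is one way to confirm such a ratio numerically rather than by eye. The following is a minimal sketch using pd.crosstab; the toy rows are hypothetical and were constructed only to mirror the pattern described above, not drawn from the Adult dataset.

```python
import pandas as pd

# Hypothetical rows: 5 married men, 1 married woman, and an even split
# among the never-married.
toy = pd.DataFrame({
    "marital_status": ["Married-civ-spouse"] * 6 + ["Never-married"] * 4,
    "gender": ["Male"] * 5 + ["Female"] + ["Male", "Female", "Male", "Female"],
})

# Rows are marital_status values, columns are gender values, cells are counts.
counts = pd.crosstab(toy["marital_status"], toy["gender"])
ratio = counts["Male"] / counts["Female"]
print(counts)
print(ratio)
```

Running the same two lines on the full DataFrame would give the male-to-female ratio for every marital_status category at once.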

The better you know what’s going on in your data, the more insight you’ll have as to where unfairness might creep in!

Prediction Using TensorFlow Estimators

Now that we have a better sense of the Adult dataset, we can begin creating a neural network to predict income. In this section, we will use TensorFlow’s Estimator API to access the DNNClassifier class.

We first have to define our input function, which takes the Adult dataset stored in a pandas DataFrame and converts it into tensors using the tf.estimator.inputs.pandas_input_fn() function.
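To make the role of an input function more concrete, here is a TensorFlow-free stand-in written with pandas and NumPy. It is not the pandas_input_fn API. It only illustrates the general idea of turning a DataFrame into batches of (features, labels), where features is a mapping from column name to an array. The batches helper, its parameters, and the toy rows are all hypothetical.

```python
import numpy as np
import pandas as pd


def batches(df, label_col, positive=">50K", batch_size=2, shuffle=False, seed=0):
    """Yield (features, labels) pairs; features maps column name -> array."""
    order = np.arange(len(df))
    if shuffle:
        np.random.default_rng(seed).shuffle(order)
    feature_cols = [c for c in df.columns if c != label_col]
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[order[start:start + batch_size]]
        features = {c: chunk[c].to_numpy() for c in feature_cols}
        # Encode the label as 1 for the positive class and 0 otherwise.
        labels = (chunk[label_col] == positive).to_numpy().astype(int)
        yield features, labels


# Hypothetical toy rows.
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "gender": ["Male", "Male", "Female"],
    "income_bracket": ["<=50K", ">50K", "<=50K"],
})
out = list(batches(toy, "income_bracket", batch_size=2))
for features, labels in out:
    print(features, labels)
```

Keeping the label out of the features dictionary is the important design point: the model should never see income_bracket as an input.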

TensorFlow requires that data maps to a model. To accomplish this, you have to use the tf.feature_column module to ingest and represent features in TensorFlow.

Age is one feature that might benefit from bucketing (also known as binning), which groups similar ages together. This might help the model generalize better across age. As such, we will convert age from a numeric feature (technically, an ordinal feature) to a categorical feature.
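Bucketing itself is easy to illustrate in plain Python. In the sketch below, the cut points in AGE_BOUNDARIES are an assumption chosen for illustration, and age_bucket is a hypothetical helper, not a TensorFlow function. Each boundary starts a new bucket, so n boundaries produce n + 1 buckets.

```python
from bisect import bisect_right

# Assumed cut points; bucket 0 is "under 18", bucket 10 is "65 and over".
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def age_bucket(age, boundaries=AGE_BOUNDARIES):
    """Return the index of the bucket that contains `age`.

    A value equal to a boundary falls into the bucket that starts there,
    so 25 lands in [25, 30) rather than [18, 25).
    """
    return bisect_right(boundaries, age)


for age in (17, 18, 24, 25, 39, 65, 90):
    print(age, "->", age_bucket(age))
```

After bucketing, the model learns one weight pattern per age range instead of treating age as a single linear quantity.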

Consider Key Subgroups

When performing feature engineering, it’s important to keep in mind that you may be working with data drawn from individuals belonging to subgroups, for which you’ll want to evaluate model performance separately.

NOTE: In this context, a subgroup is defined as a group of individuals who share a given characteristic—such as race, gender, or sexual orientation—that merits special consideration when evaluating a model with fairness in mind.

When we want our models to mitigate, or leverage, the learned signal of a characteristic pertaining to a subgroup, we will want to use different kinds of tools and techniques—most of which are still open research at this point.

As you work with different variables and define tasks for them, it can be useful to think about what comes next. For example, where are the places where the interaction of the variable and the task could be a concern?

Now we can explicitly define which features we will include in our model.

We’ll consider gender a subgroup and save it in a separate subgroup_variables list, so we can add special handling for it as needed.

With the features now ready to go, we can try predicting income using deep learning.

For the sake of simplicity, we are going to keep the neural network architecture light by simply defining a feed-forward neural network with two hidden layers.

But first, we have to convert our high-dimensional categorical features into a low-dimensional and dense real-valued vector, which we call an embedding vector. Luckily, indicator_column (think of it as one-hot encoding) and embedding_column (which converts sparse features into dense features) help us streamline the process.
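The difference between the two encodings can be sketched with NumPy alone. This is a conceptual illustration, not the TensorFlow implementation: the four-item vocabulary, the embedding width of 2, and the random table are assumptions. In a real model the table's values are learned during training.

```python
import numpy as np

# Assumed toy vocabulary for the workclass feature.
VOCAB = ["Private", "Self-emp-inc", "Federal-gov", "Local-gov"]
INDEX = {value: i for i, value in enumerate(VOCAB)}


def one_hot(value):
    """Sparse-style encoding: one slot per vocabulary item, a single 1."""
    vec = np.zeros(len(VOCAB))
    vec[INDEX[value]] = 1.0
    return vec


# One row per vocabulary item, two columns per row. Random here; a trained
# model would learn these values.
rng = np.random.default_rng(0)
EMBEDDING_TABLE = rng.normal(size=(len(VOCAB), 2))


def embed(value):
    """Dense encoding: look up the row that belongs to this value."""
    return EMBEDDING_TABLE[INDEX[value]]


print(one_hot("Federal-gov"))
print(embed("Federal-gov"))
```

An embedding lookup is equivalent to multiplying the one-hot vector by the table, which is why it can be seen as a compressed, learned version of one-hot encoding.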

The following cell creates the deep columns needed to move forward with defining the model.

With all our data preprocessing taken care of, we can now define the deep neural net model. Start by using the parameters defined below. (Later on, after you’ve defined evaluation metrics and evaluated the model, you can come back and tweak these parameters to compare results.)

To keep things simple, we will train for 1000 steps—but feel free to play around with this parameter.

We can now evaluate the overall model’s performance using the held-out test set.

You can try retraining the model using different parameters. In the end, you will find that a deep neural net does a decent job in predicting income.

What is missing here, however, are evaluation metrics with respect to subgroups.

Evaluating for Fairness Using a Confusion Matrix

While evaluating the overall performance of the model gives us some insight into its quality, it doesn’t give us much insight into how well our model performs for different subgroups.

When evaluating a model for fairness, it’s important to determine whether prediction errors are uniform across subgroups or whether certain subgroups are more susceptible to certain prediction errors than others.

A key tool for comparing the prevalence of different types of model errors is a confusion matrix. Recall from the Classification module of Machine Learning Crash Course that a confusion matrix is a grid that plots predictions vs. ground truth for your model, and tabulates statistics summarizing how often your model made the correct prediction and how often it made the wrong prediction.

Let’s start by creating a binary confusion matrix for our income-prediction model. It is binary because our label (income_bracket) has only two possible values (<=50K or >50K). We’ll define an income of >50K as our positive label, and an income of <=50K as our negative label.

NOTE: Positive and negative in this context should not be interpreted as value judgments (we are not suggesting that someone who earns more than 50K a year is a better person than someone who earns less than 50K). They are just standard terms used to distinguish between the two possible predictions the model can make.

Cases where the model makes the correct prediction (the prediction matches the ground truth) are classified as true, and cases where the model makes the wrong prediction are classified as false.

Our confusion matrix thus represents four possible states:

  • true positive: Model predicts >50K, and that is the ground truth.
  • true negative: Model predicts <=50K, and that is the ground truth.
  • false positive: Model predicts >50K, and that contradicts reality.
  • false negative: Model predicts <=50K, and that contradicts reality.

NOTE: If desired, we can use the number of outcomes in each of these states to calculate secondary evaluation metrics, such as precision and recall.
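As a small worked example of how the four states turn into metrics, the sketch below counts them in plain Python and derives precision, recall, false positive rate, and false omission rate. The function names and the toy labels are hypothetical; 1 stands for the positive label (>50K) and 0 for the negative label.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix states for binary 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["tn" if truth == 0 else "fn"] += 1
    return counts


def rates(c):
    """Derive secondary metrics; a zero denominator yields NaN."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "precision": ratio(c["tp"], c["tp"] + c["fp"]),
        "recall": ratio(c["tp"], c["tp"] + c["fn"]),
        "false_positive_rate": ratio(c["fp"], c["fp"] + c["tn"]),
        "false_omission_rate": ratio(c["fn"], c["fn"] + c["tn"]),
    }


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = rates(counts)
print(counts)
print(metrics)
```

Here the false omission rate is the share of negative predictions that were actually positive, FN / (FN + TN), which complements precision on the negative side.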

Plot the Confusion Matrix

The following cell defines a function that uses sklearn.metrics.confusion_matrix to calculate all the counts (true positive, true negative, false positive, and false negative) needed to compute our binary confusion matrix and evaluation metrics.

We will also need help plotting the binary confusion matrix. The function below combines various third-party modules (pandas DataFrame, Matplotlib, Seaborn) to draw the confusion matrix.

Now that we have all the necessary functions defined, we can compute the binary confusion matrix and evaluation metrics using the outcomes from our deep neural net model. The output of this cell is a tabbed view, which allows us to toggle between the confusion matrix and evaluation metrics table.

Are there any significant disparities in error rates that suggest the model performs better for one subgroup than another?


Precision: 0.6904
Recall: 0.6131
False Positive Rate: 0.1234
False Omission Rate: 0.1653
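To look for disparities, the same error rates can be computed separately for each subgroup value and compared side by side. The following is a minimal plain-Python sketch; rates_by_group and the toy data are hypothetical, and the numbers it prints describe only that toy data, not the model above.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Return false positive and false negative rates per subgroup value."""
    counts = defaultdict(lambda: {"tp": 0, "tn": 0, "fp": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if pred == 1:
            counts[group]["tp" if truth == 1 else "fp"] += 1
        else:
            counts[group]["tn" if truth == 0 else "fn"] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        group: {
            "false_positive_rate": ratio(c["fp"], c["fp"] + c["tn"]),
            "false_negative_rate": ratio(c["fn"], c["fn"] + c["tp"]),
        }
        for group, c in counts.items()
    }


# Hypothetical labels, predictions, and subgroup membership.
groups = ["Female"] * 4 + ["Male"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]
by_group = rates_by_group(y_true, y_pred, groups)
for group, group_rates in by_group.items():
    print(group, group_rates)
```

In this toy data the model misses high earners among women (false negatives) and over-predicts high income among men (false positives), which is the kind of asymmetry a single overall metric would hide.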

In your work, decide deliberately how to trade off false positives, false negatives, true positives, and true negatives. For example, you may want a very low false positive rate but a high true positive rate. Or you may want high precision and be willing to accept low recall.

Choose your evaluation metrics in light of these desired tradeoffs.
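One common lever for these tradeoffs, assuming the classifier outputs a probability-like score, is the decision threshold applied to that score. The sketch below uses hypothetical scores and a hypothetical rates_at_threshold helper to show how raising the threshold lowers the false positive rate at the cost of a lower true positive rate.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (true positive rate, false positive rate) at a score cutoff."""
    preds = [1 if score >= threshold else 0 for score in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


# Hypothetical ground truth and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]

for threshold in (0.2, 0.5, 0.8):
    tpr, fpr = rates_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Repeating this sweep per subgroup would show whether a single threshold produces comparable error rates for each group.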