FairCal
Fairness Calibration for Face Verification
Despite being widely used, face recognition models suffer from bias: the probability of a false positive (incorrect face match) strongly depends on sensitive attributes such as the ethnicity of the face. As a result, these models can disproportionately and negatively impact minority groups, particularly when used by law enforcement.
Most bias-mitigation methods have several drawbacks: they rely on end-to-end retraining, may not be feasible due to privacy issues, and often reduce accuracy. An alternative is post-processing methods, which build fairer decision classifiers on top of the features of pre-trained models and thus avoid the cost of retraining. However, these still have drawbacks: they reduce accuracy (AGENDA, PASS, FTC) or require retuning for different false positive rates (FSN).
In this work, we introduce the Fairness Calibration method, FairCal, a post-training approach that simultaneously:
- increases model accuracy (improving the state-of-the-art);
- produces fairly-calibrated probabilities;
- significantly reduces the gap in the false positive rates;
- does not require knowledge of the sensitive attribute (group identity such as race, ethnicity, etc.);
- does not require retraining, training an additional model, or retuning.
We apply it to the task of Face Verification, and obtain state-of-the-art results with all the above advantages. We do so by applying a post-hoc calibration method to pseudo-groups formed by unsupervised clustering.
Fairness and Bias in Face Verification
The Face Verification problem consists of deciding, given two face images, whether they form a genuine pair (same identity) or an imposter pair (different identities).
Chouldechova (2017) showed that at most two of the following three conditions can be satisfied simultaneously:
- Fairness Calibration, i.e., calibrated fairly for different subgroups: $$ \mathbb{P}_{(\boldsymbol{x}_1,\boldsymbol{x}_2) \sim \mathcal{G}_1}(Y=1\mid \widehat{C}=c) = \mathbb{P}_{(\boldsymbol{x}_1,\boldsymbol{x}_2) \sim \mathcal{G}_2}(Y=1\mid \widehat{C}=c) = c $$
- Predictive Equality, i.e., equal False Positive Rates (FPRs) across different subgroups: $$ \mathbb{P}_{(\boldsymbol{x}_1,\boldsymbol{x}_2) \sim \mathcal{G}_1}(\widehat{Y}=1\mid Y=0) = \mathbb{P}_{(\boldsymbol{x}_1,\boldsymbol{x}_2) \sim\mathcal{G}_2}(\widehat{Y}=1\mid Y=0) $$
- Equal Opportunity, i.e., equal False Negative Rates across different subgroups: $$ \mathbb{P}_{(\boldsymbol{x}_1,\boldsymbol{x}_2) \sim \mathcal{G}_1}(\widehat{Y}=0\mid Y=1) = \mathbb{P}_{(\boldsymbol{x}_1,\boldsymbol{x}_2) \sim \mathcal{G}_2}(\widehat{Y}=0\mid Y=1) $$
In the particular context of policing, predictive equality is considered more important than equal opportunity, since false positive errors (false arrests) risk causing significant harm, especially to members of subgroups already at disproportionate risk of police scrutiny or violence. Hence we omit equal opportunity from our goals, and we note that no prior method has targeted Fairness Calibration. Predictive equality is measured by comparing the FPR of each subgroup at a threshold set to achieve a single global FPR, as in the sketch below.
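To make that measurement concrete, here is a minimal sketch (not from the paper; `scores`, `labels`, and `groups` are assumed placeholder NumPy arrays holding the pair similarity scores, genuine/imposter labels, and subgroup identities): it picks the threshold that achieves a chosen global FPR and reports each subgroup's FPR at that same threshold.

```python
import numpy as np

def fpr_at_threshold(scores, labels, thr):
    """FPR: fraction of imposter pairs (label 0) whose score exceeds the threshold."""
    imposter = labels == 0
    return float(np.mean(scores[imposter] > thr))

def subgroup_fprs(scores, labels, groups, global_fpr=1e-3):
    """Set the threshold that yields `global_fpr` over all imposter pairs,
    then report each subgroup's FPR at that same threshold."""
    thr = np.quantile(scores[labels == 0], 1.0 - global_fpr)
    return {g: fpr_at_threshold(scores[groups == g], labels[groups == g], thr)
            for g in np.unique(groups)}
```

Predictive equality holds when the per-subgroup FPRs returned by `subgroup_fprs` coincide.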
Goals and Related Work
Work on bias mitigation for deep Face Verification models can be divided into two main camps:
- (i) methods that let a model learn less-biased representations during training, and
- (ii) post-processing approaches that attempt to remove bias after a model is trained.
Our work focuses on (ii) post-hoc methods.
Baseline Approach
Given a trained neural network \(f\) that encodes an image \(\boldsymbol{x}\) into an embedding \(\boldsymbol{z} = f(\boldsymbol{x})\), the baseline classifier for the face verification problem proceeds as follows (a minimal code sketch is given after the steps).
- 1) Given an image pair \((\boldsymbol{x}_1,\boldsymbol{x}_2)\): compute the feature embedding pair \((\boldsymbol{z}_1, \boldsymbol{z}_2)\).
- 2) Compute the cosine similarity score \(s(\boldsymbol{x}_1,\boldsymbol{x}_2)=\frac{\boldsymbol{z}_1^T \boldsymbol{z}_2}{\|\boldsymbol{z}_1\| \|\boldsymbol{z}_2\|}\).
- 3) Given a predefined threshold \(s_{\rm{thr}}: s(\boldsymbol{x}_1,\boldsymbol{x}_2) > s_{\rm{thr}} \implies\) genuine pair!
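A minimal sketch of this baseline, assuming a hypothetical `embed` function standing in for the pre-trained network \(f\) and returning NumPy embeddings:

```python
import numpy as np

def cosine_similarity(z1, z2):
    """Cosine similarity between two feature embeddings."""
    return float(np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2)))

def baseline_verify(x1, x2, embed, s_thr):
    """Baseline face verification: embed both images and compare the
    cosine similarity of the embeddings against a fixed threshold."""
    z1, z2 = embed(x1), embed(x2)
    return cosine_similarity(z1, z2) > s_thr  # True => predicted genuine pair
```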
FairCal
We build our proposed method FairCal based on two main ideas:
- 1) Use the feature vector to define population subgroups;
- 2) Use post-hoc calibration methods that convert cosine similarity scores into probabilities of genuine (or imposter) pair.
Calibration stage
Let \(\mathcal{Z}^{\rm{cal}}\) denote the feature embeddings of the face images in a held-out calibration set.
- 1) Apply the \(K\)-means algorithm to \(\mathcal{Z}^{\rm{cal}}\), partitioning the embedding space into \(K\) clusters \(\mathcal{Z}_1,\ldots,\mathcal{Z}_K\).
- 2) Form the \(K\) calibration sets of cosine similarity scores, one per cluster, from the calibration pairs with at least one image in that cluster: $$ S^{\rm{cal}}_k = \left\{ s(\boldsymbol{x}_1,\boldsymbol{x}_2) : (\boldsymbol{x}_1,\boldsymbol{x}_2) \text{ is a calibration pair with } \boldsymbol{z}_1 \in \mathcal{Z}_k \text{ or } \boldsymbol{z}_2 \in \mathcal{Z}_k \right\} $$
- 3) For \(k=1,\ldots,K\), estimate the calibration map \(\mu_k\) that maps the scores in \(S^{\rm{cal}}_k\) to calibrated probabilities of being a genuine pair.
For FairCal we chose Beta Calibration (Kull et al., 2017) as the post-hoc calibration method, but our experiments show similar performance with other calibration methods. A sketch of this stage is given below.
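Here is a minimal sketch of the calibration stage under stated assumptions: the calibration data are passed in as NumPy arrays (`cal_embeddings`, `cal_pairs`, `cal_scores`, `cal_labels` are placeholder names, not from the paper), and Platt scaling via scikit-learn's `LogisticRegression` stands in for the beta calibration used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def fit_faircal(cal_embeddings, cal_pairs, cal_scores, cal_labels, K=100):
    """Calibration stage of the FairCal sketch.
    cal_embeddings: (N, d) embeddings of the calibration images
    cal_pairs:      (M, 2) image indices forming the calibration pairs
    cal_scores:     (M,)   cosine similarity score of each pair
    cal_labels:     (M,)   1 for genuine, 0 for imposter
    """
    # 1) partition the embedding space into K clusters
    kmeans = KMeans(n_clusters=K, n_init=10).fit(cal_embeddings)
    pair_clusters = kmeans.labels_[cal_pairs]  # (M, 2) cluster of each image in each pair

    calibrators, set_sizes = {}, {}
    for k in range(K):
        # 2) scores of the calibration pairs with at least one image in cluster k
        in_k = np.any(pair_clusters == k, axis=1)
        # 3) per-cluster calibration map (Platt scaling as a stand-in for beta
        #    calibration; assumes both genuine and imposter pairs land in each cluster)
        calibrators[k] = LogisticRegression().fit(cal_scores[in_k, None], cal_labels[in_k])
        set_sizes[k] = int(in_k.sum())
    return kmeans, calibrators, set_sizes
```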
Test stage
- 1) Given an image pair (\(\boldsymbol{x}_1\), \(\boldsymbol{x}_2\)), compute the embeddings (\(\boldsymbol{z}_1\), \(\boldsymbol{z}_2\)) and the cluster assignment of each: \(k_1\) and \(k_2\) (a code sketch follows this list).
- 2) The model's confidence \(c\) in the pair being genuine is the population-weighted blend of the two clusters' calibrated scores: $$ c(\boldsymbol{x}_1,\boldsymbol{x}_2) = \theta\, \mu_{k_1}\!\big(s(\boldsymbol{x}_1,\boldsymbol{x}_2)\big) + (1-\theta)\, \mu_{k_2}\!\big(s(\boldsymbol{x}_1,\boldsymbol{x}_2)\big), $$ where \(\theta = \frac{|S^{\rm{cal}}_{k_1}|}{|S^{\rm{cal}}_{k_1}|+|S^{\rm{cal}}_{k_2}|}\) is the relative population fraction of the two clusters.
- 3) Given a predefined threshold \(c_{\rm{thr}}: c(\boldsymbol{x}_1,\boldsymbol{x}_2) > c_{\rm{thr}} \implies\) genuine pair!
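Continuing the sketches above (reusing the hypothetical `embed` and the objects returned by `fit_faircal`), the test stage could look like:

```python
import numpy as np

def faircal_verify(x1, x2, embed, kmeans, calibrators, set_sizes, c_thr=0.5):
    """Test stage of the FairCal sketch: calibrated confidence that
    (x1, x2) is a genuine pair, blended across the two images' clusters."""
    z1, z2 = embed(x1), embed(x2)
    s = float(np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2)))
    k1, k2 = kmeans.predict(np.stack([z1, z2]))  # cluster of each embedding
    # relative population fraction of the two clusters' calibration sets
    theta = set_sizes[k1] / (set_sizes[k1] + set_sizes[k2])
    p1 = calibrators[k1].predict_proba([[s]])[0, 1]
    p2 = calibrators[k2].predict_proba([[s]])[0, 1]
    c = theta * p1 + (1 - theta) * p2
    return c > c_thr  # True => predicted genuine pair
```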
Results
Our results show that, among post-hoc calibration methods:
- 1) FairCal has the best Fairness Calibration;
- 2) FairCal has the best Predictive Equality, i.e., the most equal FPRs across subgroups;
- 3) FairCal has the best global accuracy;
- 4) FairCal does not require knowledge of the sensitive attribute, yet outperforms methods that use this knowledge;
- 5) FairCal does not require retraining the classifier or training any additional model.
Unsupervised Clusters
Unlike the Oracle method, FairCal does not rely on the sensitive attribute; instead it uses unsupervised clusters computed with the K-means algorithm on the feature embeddings of the images. We found these clusters to be semantically meaningful.
Citation
You can see the full paper at https://openreview.net/forum?id=nRj0NcmSuxb. Please cite it as
@inproceedings{salvador2022faircal,
title={FairCal: Fairness Calibration for Face Verification},
author={Tiago Salvador and Stephanie Cairns and Vikram Voleti and Noah Marshall and Adam M Oberman},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=nRj0NcmSuxb}
}