Authors: Jacqueline He, Mengzhou Xia, Christiane Fellbaum, Danqi Chen
This repository contains the code for our EMNLP 2022 paper, "MABEL: Attenuating Gender Bias using Textual Entailment Data".
MABEL (a Method for Attenuating Bias using Entailment Labels) is a task-agnostic intermediate pre-training technique that leverages entailment pairs from NLI data to produce representations that are both semantically capable and fair. This approach exhibits a good fairness-performance tradeoff across intrinsic and extrinsic gender bias diagnostics, with minimal damage to natural language understanding performance.
With the transformers package installed, you can import the off-the-shelf model like so:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/mabel-bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("princeton-nlp/mabel-bert-base-uncased")
MABEL Models | ICAT ↑ |
---|---|
princeton-nlp/mabel-bert-base-uncased | 73.98 |
princeton-nlp/mabel-bert-large-uncased | 73.45 |
princeton-nlp/mabel-roberta-base | 69.68 |
princeton-nlp/mabel-roberta-large | 69.49 |
Note: The ICAT score is a bias metric that consolidates a model's capacity for language modeling and stereotypical association into a single numerical indicator. More information can be found in the StereoSet (Nadeem et al., 2021) paper.
Before training, make sure that the counterfactually-augmented NLI data, processed from SNLI and MNLI, is downloaded and stored under the training directory as entailment_data.csv.
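A quick pre-flight check can catch a missing data file before a training run starts. The sketch below only verifies that the file exists and is non-empty; `locate_entailment_data` is a hypothetical helper name, not part of this repository, and it assumes the script is run from the repository root.

```python
from pathlib import Path
from typing import Optional


def locate_entailment_data(repo_root: str = ".") -> Optional[Path]:
    """Return the path to training/entailment_data.csv if it exists and is non-empty.

    Hypothetical helper for illustration; repo_root is assumed to be the
    top-level directory of this repository.
    """
    candidate = Path(repo_root) / "training" / "entailment_data.csv"
    if candidate.is_file() and candidate.stat().st_size > 0:
        return candidate
    return None


found = locate_entailment_data()
if found:
    print(f"Found training data at {found}")
else:
    print("training/entailment_data.csv is missing or empty")
```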
1. Install package dependencies
pip install -r requirements.txt
2. Run training script
cd training
chmod +x run.sh
./run.sh
You can configure the hyper-parameters in run.sh as needed. Models are saved to out/. The optimal set of hyper-parameters varies depending on the choice of backbone encoder, and the full training details can be found in the paper.
If you use your own trained model instead of our provided HF checkpoint, you must first run python -m training.convert_to_hf --path /path/to/your/checkpoint --base_model bert (which converts the checkpoint to a standard BertForMaskedLM model; use --base_model roberta for RobertaForMaskedLM) prior to intrinsic evaluation.
Also, please note that we use Meade et al.'s method of computation and datasets for both StereoSet and CrowS-Pairs; this is why the metrics for the pre-trained models are not directly comparable to those reported in the original benchmark papers.
1. StereoSet (Nadeem et al., 2021)
Command:
python -m benchmark.intrinsic.stereoset.predict --model_name_or_path princeton-nlp/mabel-bert-base-uncased &&
python -m benchmark.intrinsic.stereoset.eval
Output:
intrasentence
gender
Count: 2313.0
LM Score: 84.5453251710623
SS Score: 56.248299466465376
ICAT Score: 73.98003496789251
Collective Results:
Models | LM ↑ | SS ◇ | ICAT ↑ |
---|---|---|---|
bert-base-uncased | 84.17 | 60.28 | 66.86 |
princeton-nlp/mabel-bert-base-uncased | 84.54 | 56.25 | 73.98 |
bert-large-uncased | 86.54 | 63.24 | 63.62 |
princeton-nlp/mabel-bert-large-uncased | 84.93 | 56.76 | 73.45 |
roberta-base | 88.93 | 66.32 | 59.90 |
princeton-nlp/mabel-roberta-base | 87.44 | 60.14 | 69.68 |
roberta-large | 88.81 | 66.82 | 58.92 |
princeton-nlp/mabel-roberta-large | 89.72 | 61.28 | 69.49 |
◇: The closer to 50, the better.
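As a sanity check on the table above, the ICAT column can be recomputed from the LM and SS columns. The sketch below assumes the ICAT definition from the StereoSet paper, in which the LM score is scaled by min(SS, 100 - SS) / 50; `icat_score` is an illustrative helper name, not a function from this repository.

```python
def icat_score(lm_score: float, ss_score: float) -> float:
    """Idealized CAT score, following the StereoSet definition (assumed here).

    lm_score: language modeling score in [0, 100], higher is better.
    ss_score: stereotype score in [0, 100], where 50 means no preference
              between stereotypical and anti-stereotypical completions.
    """
    return lm_score * min(ss_score, 100.0 - ss_score) / 50.0


# LM and SS values copied from the sample output for
# princeton-nlp/mabel-bert-base-uncased shown earlier in this section.
lm, ss = 84.5453251710623, 56.248299466465376
print(f"ICAT: {icat_score(lm, ss):.2f}")  # prints "ICAT: 73.98"
```

Because the penalty term is symmetric around 50, an SS score of 44 and an SS score of 56 reduce ICAT by the same amount.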
2. CrowS-Pairs (Nangia et al., 2020)
Command:
python -m benchmark.intrinsic.crows.eval --model_name_or_path princeton-nlp/mabel-bert-base-uncased
Output:
====================================================================================================
Total examples: 262
Metric score: 50.76
Stereotype score: 51.57
Anti-stereotype score: 49.51
Num. neutral: 0.0
====================================================================================================
Collective Results:
Models | Metric Score ◇ |
---|---|
bert-base-uncased | 57.25 |
princeton-nlp/mabel-bert-base-uncased | 50.76 |
bert-large-uncased | 55.73 |
princeton-nlp/mabel-bert-large-uncased | 51.15 |
roberta-base | 60.15 |
princeton-nlp/mabel-roberta-base | 49.04 |
roberta-large | 60.15 |
princeton-nlp/mabel-roberta-large | 54.41 |
◇: The closer to 50, the better.
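Since the target for this metric is 50 rather than a maximum or minimum, comparing models is easier after converting each score to its absolute distance from 50. The sketch below does this for the scores in the table above; `distance_from_parity` is an illustrative name, not part of the benchmark code.

```python
# Metric scores copied from the CrowS-Pairs table above.
crows_scores = {
    "bert-base-uncased": 57.25,
    "princeton-nlp/mabel-bert-base-uncased": 50.76,
    "bert-large-uncased": 55.73,
    "princeton-nlp/mabel-bert-large-uncased": 51.15,
    "roberta-base": 60.15,
    "princeton-nlp/mabel-roberta-base": 49.04,
    "roberta-large": 60.15,
    "princeton-nlp/mabel-roberta-large": 54.41,
}


def distance_from_parity(score: float) -> float:
    """Absolute gap between a metric score and the ideal value of 50."""
    return round(abs(score - 50.0), 2)


# Sort models from closest to 50 to farthest from 50.
ranked = sorted(crows_scores, key=lambda name: distance_from_parity(crows_scores[name]))
for name in ranked:
    print(f"{name}: {distance_from_parity(crows_scores[name]):.2f}")
```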
- Occupation Classification
See benchmark/extrinsic/occ_cls/README.md for full training instructions and results.
- Natural Language Inference
See benchmark/extrinsic/nli/README.md for full training instructions and results.
- Coreference Resolution
See benchmark/extrinsic/coref/README.md for full training instructions and results.
1. GLUE (Wang et al., 2018)
We fine-tune on GLUE through the transformers library, following the default hyper-parameters.
A straightforward way is to download the current transformers repository:
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
Then set up the environment dependencies:
cd ./examples/pytorch/text-classification
pip install -r requirements.txt
Here is a sample script for one of the GLUE tasks, MRPC:
# task options: cola, sst2, mrpc, stsb, qqp, mnli, qnli, rte
export TASK_NAME=mrpc
export OUTPUT_DIR=out/
CUDA_VISIBLE_DEVICES=0 python run_glue.py \
--model_name_or_path princeton-nlp/mabel-bert-base-uncased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir $OUTPUT_DIR
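To run more than one task, the same command can be generated per task from a small Python driver. The sketch below only builds and prints the commands, using the flags from the sample script above; `build_glue_command` is a hypothetical helper, and the per-task output directory (out/<task>) is an assumption added here so that runs do not share a directory.

```python
import shlex

# Task options listed in the sample script above.
GLUE_TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]


def build_glue_command(
    task: str,
    model: str = "princeton-nlp/mabel-bert-base-uncased",
    output_root: str = "out",
) -> list:
    """Build the run_glue.py argument list for one GLUE task (illustrative helper)."""
    if task not in GLUE_TASKS:
        raise ValueError(f"unknown GLUE task: {task}")
    return [
        "python", "run_glue.py",
        "--model_name_or_path", model,
        "--task_name", task,
        "--do_train",
        "--do_eval",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "3",
        # Assumption: one sub-directory per task instead of a shared out/.
        "--output_dir", f"{output_root}/{task}",
    ]


# Print one shell-ready command per task; nothing is executed here.
for task in GLUE_TASKS:
    print(shlex.join(build_glue_command(task)))
```

Each printed line can be pasted into a shell from examples/pytorch/text-classification, or passed to `subprocess.run` as a list.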
2. SentEval Transfer Tasks (Conneau et al., 2018)
Preprocess:
Make sure you have cloned the SentEval repo and added its contents into this repository's transfer folder, and run ./get_transfer_data.bash in data/downstream to download the evaluation data.
Command:
python -m benchmark.transfer.eval --model_name_or_path princeton-nlp/mabel-bert-base-uncased --task_set transfer
Output:
------- ------- ------- ------- ------- ------- ------- -------
| MR | CR | SUBJ | MPQA | SST2 | TREC | MRPC | Avg. |
------- ------- ------- ------- ------- ------- ------- -------
| 78.33 | 85.83 | 93.78 | 89.13 | 85.50 | 85.20 | 68.87 | 83.81 |
------- ------- ------- ------- ------- ------- ------- -------
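The Avg. column in the output above appears to be the unweighted mean of the seven task scores. A minimal check, with the values copied from that output:

```python
from statistics import mean

# Per-task scores copied from the sample output above.
transfer_scores = {
    "MR": 78.33,
    "CR": 85.83,
    "SUBJ": 93.78,
    "MPQA": 89.13,
    "SST2": 85.50,
    "TREC": 85.20,
    "MRPC": 68.87,
}

# Unweighted mean over the seven tasks, rounded to two decimals.
average = round(mean(transfer_scores.values()), 2)
print(f"Avg.: {average:.2f}")  # prints "Avg.: 83.81"
```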
Collective Results:
Models | Transfer Avg. ↑ |
---|---|
bert-base-uncased | 83.73 |
princeton-nlp/mabel-bert-base-uncased | 83.81 |
bert-large-uncased | 86.54 |
princeton-nlp/mabel-bert-large-uncased | 86.09 |
- Evaluation code for StereoSet and CrowS-Pairs is adapted from "An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models" (Meade et al., 2022).
- Model implementation code is adapted from SimCSE (Gao et al., 2021).
- Evaluation code for the transfer tasks relies on the SentEval package and is adapted from a script prepared by SimCSE (Gao et al., 2021).
- Evaluation code for GLUE relies on the Hugging Face transformers package (Wolf et al., 2019).
- Training and evaluation for e2e span-based coreference resolution follow the PyTorch implementation of Xu and Choi (2020).
@inproceedings{he2022mabel,
title={{MABEL}: Attenuating Gender Bias using Textual Entailment Data},
author={He, Jacqueline and Xia, Mengzhou and Fellbaum, Christiane and Chen, Danqi},
booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
year={2022}
}