This repository contains code and datasets for running the main experiments covered in "Analysis and Evaluation of Language Models for Word Sense Disambiguation" (Computational Linguistics, 2021).
The CoarseWSD-20 dataset is a coarse-grained sense disambiguation dataset built from Wikipedia (nouns only), targeting 2 to 5 senses for each of 20 ambiguous words. It was specifically designed to provide an ideal setting for evaluating WSD models (e.g. no senses in the test sets are missing from training), both quantitatively and qualitatively.
In this repository we share the following versions of the CoarseWSD-20 dataset used in our experiments:
- Full CoarseWSD-20
- Balanced
- N-shot (1, 3, 10 and 30 shots, with 3 sets for each)
- Fractional (1%, 5%, 10%, 50%, 100%)
- Out-of-domain
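All of these versions share the same per-word layout. As a quick orientation, the sketch below shows one way to read a single word's training split. Note that the directory structure, file names and the tab-separated "<target index>\t<sentence>" format are assumptions about the released files rather than an official loader, so check them against the data you download.

```python
# Minimal loading sketch (assumed layout: one directory per word, with
# parallel <split>.data.txt / <split>.gold.txt files; verify against the data).
from pathlib import Path

def load_split(word_dir, split="train"):
    """Return (target_index, sentence, sense_label) triples for one word."""
    data_path = Path(word_dir) / f"{split}.data.txt"  # "<idx>\t<sentence>" per line (assumed)
    gold_path = Path(word_dir) / f"{split}.gold.txt"  # one sense label per line (assumed)
    instances = []
    with open(data_path, encoding="utf-8") as f_data, open(gold_path, encoding="utf-8") as f_gold:
        for data_line, gold_line in zip(f_data, f_gold):
            idx, sentence = data_line.rstrip("\n").split("\t", 1)
            instances.append((int(idx), sentence, gold_line.strip()))
    return instances

# Hypothetical path, adjust to wherever the dataset lives in your checkout:
# train = load_split("data/CoarseWSD-20/apple", split="train")
```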
This project was developed with Python 3.6.5 from the Anaconda distribution v4.6.2. As such, the pip requirements assume you already have the packages included with Anaconda (numpy, etc.). After cloning the repository, we recommend creating and activating a new environment to avoid conflicts with existing installations on your system:
$ git clone https://github.com/danlou/bert-disambiguation.git
$ cd bert-disambiguation
$ conda create -n bert-disambiguation python=3.6.5
$ conda activate bert-disambiguation
# $ conda deactivate # to exit environment when done with project
To install the additional packages used by this project, run:
$ pip install -r requirements.txt
The WordNet data for NLTK isn't installed by pip, but it can be downloaded easily with:
$ python -c "import nltk; nltk.download('wordnet')"
Note: the WordNet data is only needed to replicate the WordNet experiments, not the rest of the experiments (e.g. on CoarseWSD-20 or any other dataset).
The feature extraction method used in the paper involves two steps: (1) computing sense embeddings from the training set, and (2) disambiguating test instances by comparing their contextual embeddings against those precomputed sense embeddings and choosing the most similar one (1NN). These two steps have separate scripts, which can be used as explained below.
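As a rough illustration of these two steps (this is a simplified sketch, not the repository's create_1nn_vecs.py or eval_1nn.py scripts), the code below averages the contextual embedding of the target word over the training instances of each sense, then assigns a test instance to the sense whose embedding is closest by cosine similarity. It also makes the simplifying assumption that the target word survives tokenization as a single WordPiece token.

```python
# Simplified 1NN sketch: sense embedding = mean contextual embedding of the
# target word over training instances; prediction = most similar sense (cosine).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_target(sentence, target_word):
    """Contextual embedding (last layer) of the target word's first token."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    idx = tokens.index(target_word)  # naive lookup: assumes a single-piece match
    return hidden[idx].numpy()

def build_sense_vectors(train_instances):
    """train_instances: iterable of (sentence, target_word, sense_label)."""
    by_sense = {}
    for sentence, word, sense in train_instances:
        by_sense.setdefault(sense, []).append(embed_target(sentence, word))
    return {sense: np.mean(vecs, axis=0) for sense, vecs in by_sense.items()}

def predict(sentence, target_word, sense_vectors):
    """Return the sense whose precomputed embedding is closest to the context."""
    v = embed_target(sentence, target_word)
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(sense_vectors, key=lambda s: cos(v, sense_vectors[s]))
```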
You may use the create_1nn_vecs.py script to create sense embeddings from a particular training set of our CoarseWSD-20 datasets:
$ python create_1nn_vecs.py -nlm_id bert-base-uncased -dataset_id CoarseWSD-20 -out_path vectors/CoarseWSD-20.bert-base-uncased.txt
If you want to train on a different training set, such as the balanced version of CoarseWSD-20, just replace '-dataset_id CoarseWSD-20' with '-dataset_id CoarseWSD-20_balanced'.
Precomputed sense embeddings for the full CoarseWSD-20 training set are also available in the vectors directory.
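If you want to inspect these files directly, the snippet below assumes a plain-text, word2vec-style format (one sense label followed by whitespace-separated floats per line); this format is an assumption, so check the file before relying on it.

```python
# Hedged sketch for reading a precomputed sense-vector file
# (assumed format: "label v1 v2 ... vN" per line).
import numpy as np

def load_sense_vectors(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, *values = line.rstrip().split(" ")
            vectors[label] = np.asarray(values, dtype=np.float32)
    return vectors

# vecs = load_sense_vectors("vectors/CoarseWSD-20.bert-base-uncased.txt")
```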
To evaluate the 1NN method, you may use the eval_1nn.py script, providing paths for the test set and precomputed sense embeddings.
$ python eval_1nn.py -nlm_id bert-base-uncased -dataset_id CoarseWSD-20 -sv_path vectors/CoarseWSD-20.bert-base-uncased.txt
[WIP] This is still being merged. In the meantime, you can check the code here if interested.
To run our fastText experiments, first follow fastText's installation instructions.
If you're interested in running the fastText baseline with pretrained embeddings, download them first:
$ cd external/fastText # from repo home
$ wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
$ unzip crawl-300d-2M-subword.zip
The ftx_baseline.py script handles both training the fastText classification models (FTX-Base and FTX-Crawl) and evaluating them.
To configure the script, edit the dataset_id and model_id variables starting at line 86.
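For orientation, here is a rough sketch of what the two baselines amount to with the official fastText Python bindings; the file names (train_ftx.txt, test_ftx.txt) and hyperparameters are illustrative assumptions, not the exact settings used by ftx_baseline.py.

```python
# Hedged sketch of the two fastText baselines using the official Python bindings.
# Training data is assumed to be in fastText's "__label__<sense> <sentence>" format;
# file names and hyperparameters are illustrative, not those of ftx_baseline.py.
import fasttext

# FTX-Base: supervised fastText classifier trained from scratch
base = fasttext.train_supervised(input="train_ftx.txt", epoch=25, wordNgrams=2)

# FTX-Crawl: same classifier, initialised with the pretrained crawl vectors
crawl = fasttext.train_supervised(
    input="train_ftx.txt",
    epoch=25,
    wordNgrams=2,
    dim=300,  # must match the dimensionality of the pretrained vectors
    pretrainedVectors="external/fastText/crawl-300d-2M-subword.vec",
)

# model.test returns (number of examples, precision@1, recall@1)
print(base.test("test_ftx.txt"))
print(crawl.test("test_ftx.txt"))
```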
Predictions from our experiments are available in the results directory.
The reference paper can be cited as follows:
@article{loureiro2021analysis,
title={Analysis and evaluation of language models for word sense disambiguation},
author={Loureiro, Daniel and Rezaee, Kiamehr and Pilehvar, Mohammad Taher and Camacho-Collados, Jose},
journal={Computational Linguistics},
volume={47},
number={2},
pages={387--443},
year={2021},
publisher={MIT Press}
}