Source code for the ACL 2023 paper: *Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction* (arXiv).
Document-level relation extraction (DocRE) has attracted increasing research interest in recent years. Although models achieve consistent performance gains on DocRE, their underlying decision rules remain understudied: do they make the right predictions according to rationales?
In this paper, we take the first step toward answering this question and introduce a new perspective for comprehensively evaluating a model. Specifically, we first collect human annotations of the rationales considered in DocRE. We then investigate and reveal that, in contrast to humans, representative state-of-the-art (SOTA) DocRE models exhibit different decision rules. Through our proposed RE-specific attacks, we further demonstrate that this significant discrepancy in decision rules between models and humans severely damages the robustness of models and renders them inapplicable to real-world RE scenarios. We then introduce mean average precision (MAP) to evaluate the understanding and reasoning capabilities of models.
Based on the extensive experimental results, we finally appeal to future work to evaluate both the performance and the understanding ability of models when developing their applications.
dataset/docred/dev_keys_new.json: DocRED with Human-annotated Word-level Evidence (HWE) dataset.
Statistics of the 699 documents (the same as DocRED_HWE's) from the original validation set of DocRED:
- documents: 699
- relational facts: 7,342
- evidence sentences: 12,000
annotation errors in DocRED.xls: Annotation errors corrected by our annotators on the validation set of DocRED.
MAP_metric.ipynb: Evaluating with the MAP metric.
plot.ipynb: Plotting MAP curves and TopK-F1 curves.
eval_attack_docunet.ipynb: Evaluating DocuNet's performance under two attacks.
MAP_metric.py: Evaluating a model with MAP (mean average precision).
IG_inference.py: Calculating integrated gradients (IG) to attribute ATLOP.
get_ds.py: Generating the datasets for evaluation.
run_attacks.py: All attacks on ATLOP.
Install in a conda virtual environment:

pip install -r requirements.pip.txt
conda install --file requirements.conda.txt
Step1. Prepare the original trained ATLOP model, save it to saved_dict/, and name it saved_dict/model_bert.ckpt or saved_dict/model_roberta.ckpt.
Step2. Use IG to generate the weight of every token for a specific relational fact.
python IG_inference.py --infer_method INFER_METHOD --load_path LOAD_PATH --model_name_or_path MODEL_NAME_OR_PATH --transformer_type TRANSFORMER_TYPE
INFER_METHOD is the attribution method (ig_infer or grad_infer), LOAD_PATH is your saved model checkpoint, and MODEL_NAME_OR_PATH and TRANSFORMER_TYPE are the pretrained Transformer's parameters, which you can set to roberta-large and roberta, respectively.
- The IG results will be saved to the dataset/ig_pkl folder.
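For readers unfamiliar with IG, the sketch below shows the core computation on token embeddings. It is only an illustration of the technique, not the actual IG_inference.py code; score_fn is a hypothetical stand-in for ATLOP's score of one relation for one entity pair.

```python
# Illustrative integrated-gradients attribution over token embeddings.
import torch

def integrated_gradients(embeds, score_fn, steps=20):
    """embeds: (seq_len, dim) input token embeddings; score_fn maps embeddings
    to a scalar relation score. Returns one attribution score per token."""
    baseline = torch.zeros_like(embeds)           # all-zero embedding baseline
    total_grads = torch.zeros_like(embeds)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Point on the straight path from the baseline to the actual input.
        point = (baseline + alpha * (embeds - baseline)).detach().requires_grad_(True)
        score = score_fn(point)
        score.backward()
        total_grads += point.grad
    # Riemann approximation of the path integral, scaled by (input - baseline).
    attributions = (embeds - baseline) * total_grads / steps
    return attributions.sum(dim=-1)               # one scalar per token
```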
Step3. Generate the ENP_TOPK dataset (entity pairs with the top-k attributed tokens) and the entity name attack datasets (the three types mentioned in the paper); a sketch of the top-k idea follows this step.
python get_ds.py --model_type MODEL_TYPE
MODEL_TYPE is your saved model type and should be roberta-large or bert-base-cased.
- The ENP_TOPK dataset and the entity name attack datasets will be stored in the dataset/docred/enp_topk/ and dataset/attack_pkl/ folders, respectively.
Step4. Evaluate the model with MAP.
python MAP_metric.py --model_type MODEL_TYPE
- Use the IG results to generate the new MAP evaluation; the output will be saved to dataset/keyword_pkl/, from which you can draw line charts as in plot.ipynb.
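Conceptually, the MAP metric ranks tokens by attribution score and treats the human word-level evidence (HWE) as the relevance set. Below is a hedged sketch of that computation; details such as token matching and tie-breaking may differ from MAP_metric.py.

```python
# Sketch of MAP over attribution rankings against human word-level evidence.
def average_precision(ranked_tokens, evidence):
    """ranked_tokens: tokens sorted by attribution score (descending);
    evidence: set of human-annotated evidence words for one relational fact."""
    hits, precision_sum = 0, 0.0
    for rank, tok in enumerate(ranked_tokens, start=1):
        if tok in evidence:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant hit
    return precision_sum / max(len(evidence), 1)

def mean_average_precision(rankings, evidence_sets):
    """Mean of the per-fact average precisions."""
    aps = [average_precision(r, e) for r, e in zip(rankings, evidence_sets)]
    return sum(aps) / len(aps)
```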
Step5. Run the attacks on ATLOP.
python run_attacks.py --model_type MODEL_TYPE
- Run the word-level evidence attack and the entity name attack; the output will be printed to STDOUT.
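As a toy illustration of the entity name attack, the sketch below rewrites every mention of an entity before re-running the model. The replacement strategy here (a fixed substitute name) is a placeholder; run_attacks.py implements the three attack types described in the paper.

```python
# Toy entity-name attack: substitute all mentions of one entity in a document.
def rename_entity(tokens, mention_spans, new_name):
    """tokens: list[str]; mention_spans: list of (start, end) token offsets
    for one entity; new_name: replacement surface form."""
    out = list(tokens)
    # Replace from the end so earlier span offsets stay valid after slicing.
    for start, end in sorted(mention_spans, reverse=True):
        out[start:end] = new_name.split()
    return out
```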
If you have any questions, please contact Haotian Chen; we will reply as soon as possible.
MIT