This repository contains the source code for our ACL 2021 paper:
Shuang Wu, Xiaoning Song, and Zhenhua Feng. 2021. MECT: Multi-metadata embedding based cross-transformer for Chinese named entity recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1529-1539, Online. Association for Computational Linguistics.
Models and results can be found in our ACL 2021 paper or on arXiv.
MECT has two streams, a lattice stream and a radical stream, so it not only retains FLAT's word-boundary and semantic learning ability but also incorporates the structural information of Chinese character radicals. By exploiting the structural characteristics of Chinese characters, MECT can better capture their semantic information for Chinese NER.
If you want to use our code in your research, please cite:
@inproceedings{wu-etal-2021-mect,
title = "{MECT}: {M}ulti-Metadata Embedding based Cross-Transformer for {C}hinese Named Entity Recognition",
author = "Wu, Shuang and
Song, Xiaoning and
Feng, Zhenhua",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.121",
doi = "10.18653/v1/2021.acl-long.121",
pages = "1529--1539",
}
The code has been tested under Python 3.7. The required packages are as follows:
torch==1.5.1
numpy==1.18.5
FastNLP==0.5.0
fitlog==0.3.2
You can refer to the FastNLP documentation and the fitlog documentation to learn more about these two packages.
- Download the pretrained character embeddings and word embeddings and put them in the data folder.
- Character embeddings (gigaword_chn.all.a2b.uni.ite50.vec): Google Drive or Baidu Pan
- Bi-gram embeddings (gigaword_chn.all.a2b.bi.ite50.vec): Baidu Pan
- Word(Lattice) embeddings (ctb.50d.vec): Baidu Pan
- Get the Chinese character structure components (radicals). The radicals used in the paper are from the online Xinhua dictionary. Due to copyright reasons, these data cannot be published. An alternative is 漢語拆字字典 (a Chinese character decomposition dictionary), but inconsistent character decomposition schemes mean that exact reproducibility cannot be guaranteed.
- Modify Utils/paths.py to add the paths of the pretrained embeddings and the datasets.
- Run the following commands:
- Weibo dataset
  python Utils/preprocess.py
  python main.py --dataset weibo
- Resume dataset
  python Utils/preprocess.py
  python main.py --dataset resume
- Ontonotes dataset
  python Utils/preprocess.py
  python main.py --dataset ontonotes
- MSRA dataset
  python Utils/preprocess.py --clip_msra
  python main.py --dataset msra
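The Utils/paths.py edit in the setup steps amounts to pointing the code at the downloaded files. A hypothetical sketch of what such entries might look like (the variable names and dataset directory names below are illustrative, not the repo's actual ones; adapt them to what Utils/paths.py really defines):

```python
# Hypothetical sketch of Utils/paths.py entries; the real file defines
# its own variable names -- adjust these to match.
unigram_embedding_path = 'data/gigaword_chn.all.a2b.uni.ite50.vec'  # character embeddings
bigram_embedding_path = 'data/gigaword_chn.all.a2b.bi.ite50.vec'    # bi-gram embeddings
word_embedding_path = 'data/ctb.50d.vec'                            # word (lattice) embeddings

# dataset roots, keyed by the --dataset argument (illustrative names)
dataset_paths = {
    'weibo': 'data/WeiboNER',
    'resume': 'data/ResumeNER',
    'ontonotes': 'data/OntoNotes',
    'msra': 'data/MSRA',
}
```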
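Whichever radical source is used in the step above, it ultimately provides a mapping from each character to its structure components. A minimal loader sketch, assuming a hypothetical tab-separated file format (one character per line, followed by its space-separated components); the actual preprocessing in this repo may differ:

```python
def load_radical_map(path):
    """Load a {character: [components]} mapping from a TSV file.

    Assumed (hypothetical) line format: <char>\t<comp1> <comp2> ...
    """
    mapping = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            char, _, components = line.partition('\t')
            mapping[char] = components.split()
    return mapping
```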
- Thanks to Dr. Li and his team for contributing the FLAT source code.
- Thanks to the author team and contributors of FastNLP.