Dynamic Feature Composition for Name Tagging

Code for our ACL 2019 paper, Reliability-aware Dynamic Feature Composition for Name Tagging.

Input Data Set Directory Structure

<input_dir>
- embed.vocab.tsv (embedding vocab file, 1st column: token, 2nd column: index)
- embed.count.tsv (embedding token frequency file, 1st column: token, 2nd column: frequency)
- bc
  - train.tsv (training set)
  - dev.tsv (development set)
  - test.tsv (test set)
  - token.vocab.tsv (token vocab file, 1st column: token, 2nd column: index)
  - char.vocab.tsv (character vocab file, 1st column: character, 2nd column: index)
  - label.vocab.tsv (label vocab file, 1st column: label, 2nd column: index)
- bn
- mz
- nw
- tc
- wb

Note:

The other subsets (bn, mz, nw, tc, wb) contain the same six files as bc: train.tsv, dev.tsv, test.tsv, token.vocab.tsv, char.vocab.tsv, and label.vocab.tsv.
In our experiments, we generated the *.vocab.tsv files from a merged data set of all subsets.
We used CoNLL format files generated from OntoNotes 5.0 with Pradhan et al.'s scripts, which can be found at https://cemantix.org/data/ontonotes.html.
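As an illustration of the two-column vocab layout described above, the following minimal sketch reads such a file into a dictionary. The function name load_vocab and the sample tokens are hypothetical and not part of this repository; the sketch only assumes tab-separated columns, as the .tsv extension suggests.

```python
import io


def load_vocab(tsv_file):
    """Read a two-column TSV (item, index) into a dict mapping item -> index.

    Hypothetical helper for illustration; assumes tab-separated columns.
    """
    vocab = {}
    for line in tsv_file:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        item, index = line.split("\t")[:2]
        vocab[item] = int(index)
    return vocab


# An in-memory stand-in for a file such as token.vocab.tsv (made-up content).
sample = io.StringIO("the\t0\nObama\t1\n")
token_vocab = load_vocab(sample)
print(token_vocab["Obama"])  # prints 1
```

The same helper would apply to char.vocab.tsv, label.vocab.tsv, and embed.vocab.tsv, since they share the item and index layout.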

Pre-processing

The following functions in preprocess.py can be used to create the vocab and frequency files.

build_all_vocabs takes a list of CoNLL format files as input and generates {token,char,label}.vocab.tsv in output_dir.
build_embed_vocab takes a pre-trained embedding file as input and returns the embedding vocab.
build_embed_token_count takes a pre-trained embedding file as input and generates an embedding token frequency file.
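To make the idea of building token, character, and label vocabs concrete, here is a standalone sketch. It is not the implementation in preprocess.py. It assumes, for illustration only, whitespace-separated lines with the token in the first column and the label in the last column, and blank lines between sentences. The names build_vocabs and format_vocab are hypothetical, and the actual column layout and index assignment in this repository may differ.

```python
from collections import Counter


def build_vocabs(lines, token_col=0, label_col=-1):
    """Collect token, character, and label vocabs from CoNLL-style lines.

    Hypothetical sketch: column positions are assumptions, not the repo's.
    """
    tokens, chars, labels = Counter(), Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank line = sentence boundary
        cols = line.split()
        token = cols[token_col]
        tokens[token] += 1
        chars.update(token)  # counts each character of the token
        labels[cols[label_col]] += 1

    def to_index(counter):
        # Assign indices in sorted order so the output is deterministic.
        return {item: i for i, item in enumerate(sorted(counter))}

    return to_index(tokens), to_index(chars), to_index(labels)


def format_vocab(vocab):
    """Render a vocab as TSV text: 1st column item, 2nd column index."""
    rows = sorted(vocab.items(), key=lambda kv: kv[1])
    return "".join("{}\t{}\n".format(item, idx) for item, idx in rows)


# Made-up example sentences, not taken from OntoNotes.
example = ["John B-PER", "runs O", "", "Mary B-PER"]
token_vocab, char_vocab, label_vocab = build_vocabs(example)
print(format_vocab(label_vocab), end="")
```

Indices are assigned in sorted order here purely so that repeated runs give the same files; special entries such as padding or unknown tokens are left out of this sketch.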

Train LSTM-CNN

python train_lstmcnn_all.py -d 0 -i <input_dir> -o <output_dir> -e <embedding_file>
  --embed_vocab <embedding_vocab_file> --char_dim 50 --seed <random_seed>

This script trains a model for each subset (the subsets can be specified with the --datasets argument) and reports within-subset (within-genre) and cross-subset (cross-genre) performance.
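The distinction between within-genre and cross-genre evaluation can be sketched as follows. This sketch assumes that a model trained on one subset is scored on the test set of every subset, which is our reading of the description above and not a confirmed detail of the script. The function name evaluation_pairs is hypothetical.

```python
from itertools import product

# The six subsets listed in the input directory structure.
GENRES = ["bc", "bn", "mz", "nw", "tc", "wb"]


def evaluation_pairs(genres):
    """Yield (train_genre, test_genre, setting) for every combination."""
    for train, test in product(genres, repeat=2):
        setting = "within-genre" if train == test else "cross-genre"
        yield train, test, setting


pairs = list(evaluation_pairs(GENRES))
for train, test, setting in pairs[:3]:
    print("train={} test={} ({})".format(train, test, setting))
```

Under this assumption, six subsets give 36 train and test combinations, of which 6 are within-genre and 30 are cross-genre.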

Train LSTM-CNN with Dynamic Feature Composition

python train_lstmcnn_dfc_all.py -d 0 -i <input_dir> -o <output_dir> -e <embedding_file>
  --embed_vocab <embedding_vocab_file> --embed_count <embedding_freq_file> --char_dim 50 --seed <random_seed>
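The two training commands differ only in the script name and the extra --embed_count argument, so runs over several random seeds can be prepared with a small helper. The sketch below only assembles and prints command lines; it uses the flags shown above, while the function name build_command, the paths, and the seed values are placeholders chosen for illustration.

```python
import shlex


def build_command(script, input_dir, output_dir, embed_file, embed_vocab,
                  seed, embed_count=None, device=0, char_dim=50):
    """Assemble the argument list for one training run (hypothetical helper)."""
    args = ["python", script,
            "-d", str(device),
            "-i", input_dir,
            "-o", output_dir,
            "-e", embed_file,
            "--embed_vocab", embed_vocab]
    if embed_count is not None:
        # Only the dynamic feature composition script takes --embed_count.
        args += ["--embed_count", embed_count]
    args += ["--char_dim", str(char_dim), "--seed", str(seed)]
    return args


# Placeholder paths and seeds; replace them with your own.
for seed in (1, 2, 3):
    cmd = build_command("train_lstmcnn_dfc_all.py", "data", "output",
                        "embed.txt", "data/embed.vocab.tsv", seed,
                        embed_count="data/embed.count.tsv")
    print(" ".join(shlex.quote(a) for a in cmd))
```

Each printed line can be run in a shell, or the argument list can be passed to subprocess.run to launch the run from Python.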

Requirements

Python 3.5
PyTorch 1.0

Resources

We use the 100d case-sensitive word embeddings from Pre-trained Word Embeddings.

Reference

Lin, Y., Liu, L., Ji, H., Yu, D., Han, J. (2019) Reliability-aware Dynamic Feature Composition for Name Tagging. Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics.

@inproceedings{lin2019reliability,
  title={Reliability-aware Dynamic Feature Composition for Name Tagging},
  author={Lin, Ying and Liu, Liyuan and Ji, Heng and Yu, Dong and Han, Jiawei},
  booktitle={Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019)},
  year={2019}
}
