This file gives instructions for training and evaluating the EasyCCG parser.

SETUP
=====

Install Torch and Lua:
- Use the instructions here: http://torch.ch/
- Set up an environment variable pointing at the torch-lua executable:
  export TORCH=path/to/torch-lua

Set up some pre-trained word embeddings (see the worked example at the end of this file):
- Download some embeddings, e.g. Turian's:
  http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-scaled.EMBEDDING_SIZE=50.txt.gz
- Create a subfolder (e.g. 'turian'), and put the embeddings in the folder with the name embeddings.raw
- The training program expects the first entry in the file to be a rare-word embedding - check that this is the case.
- Pre-process the embeddings with the command:
  ./splitEmbeddings.sh embeddings_folder

To get a dependency-based evaluation, you'll need some scripts from the C&C parser:
- Download the C&C parser
- Set up an environment variable pointing at the C&C folder:
  export CANDC=path/to/candc/
- Copy across some extra scripts, using:
  cp eval_scripts/* $CANDC/src/scripts/ccg/

Set up some training data:
- Make folders containing supertagged sentences in the format word|pos|category, in a file called gold.stagged (an example line is given at the end of this file). The POS field is ignored by the code, so it can be left empty.
- For evaluating the parser, you'll also need a gold.deps file containing test dependencies.
- CCGBank test/training data in the correct format can be extracted from CCGBank using the C&C parser's 'generate' program. Helpful notes on compiling and training the C&C parser are available here: http://aclweb.org/aclwiki/index.php?title=Training_the_C&C_Parser
- Steve Clark has released some very useful out-of-domain data for training/testing: https://sites.google.com/site/stephenclark609/resources
- Matt Honnibal also has an excellent annotated Wikipedia corpus.
- By default, the parser uses a 'seenRules' file, which tells it which category combinations have occurred in CCGBank. This needs to be in the training data folder. Normally, you can just copy it from this folder.
- If you've added new categories, the seenRules file needs to be updated, or you can let the parser use any valid CCG combination (using the "-s" flag). You may also want to update the unaryRules file.

TRAINING AND EVALUATING
=======================

Training should then be straightforward, with a command like the following:

  ./do_training.sh my_model turian experiments/wsj_train/ experiments/wsj00/

This creates a model in the folder "turian/my_model", using embeddings from the "turian" folder, training data from "experiments/wsj_train/gold.stagged" and development data from "experiments/wsj00/gold.stagged". All being well, it will output a labelled dependency score on the development data.

To use lexical features (not recommended), add a file called 'frequentwords' to the training folder before training, containing one word per line. These words will be used as additional features. I found that this didn't help accuracy, and made things much slower.

To evaluate on test data, use something like:

  ./do_experiments.sh turian/my_model experiments/wsj23/ my_model

Evaluating on biomedical text is a bit more work, because we need to convert to Stanford dependencies:
- Put the grs2sd-1.00 and markedup_sd-1.00 files in the same folder as gold.stagged
- You'll also need to put the gold-standard dependencies in that folder, in the file gold.deps
- Copy the script eval_test.py to that folder
- Use this script:
  ./do_experiments_genia.sh model_folder experiments/genia/ my_model
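
WORKED EXAMPLE
==============

A rough sketch of the embedding setup described above. This assumes wget and gunzip are available; the exact filename comes from the Turian download link, and the name of the rare-word token depends on the embeddings you use (check the first line of the file yourself):

  wget http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-scaled.EMBEDDING_SIZE=50.txt.gz
  mkdir turian
  gunzip -c embeddings-scaled.EMBEDDING_SIZE=50.txt.gz > turian/embeddings.raw
  head -n 1 turian/embeddings.raw    # the first entry should be the rare-word embedding
  ./splitEmbeddings.sh turian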
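
Each line of gold.stagged holds one sentence as space-separated word|pos|category triples. A made-up illustrative line (the categories are CCGBank-style but invented for this example; since the POS field is ignored, the middle field may be left empty):

  The|DT|NP[nb]/N dog|NN|N barks|VBZ|S[dcl]\NP .|.|.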
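
Putting the pieces together, a full WSJ run might then look like the following. The folder names are the examples used above, not fixed paths: each experiments/ subfolder needs its own gold.stagged, the test folder also needs gold.deps, and the seenRules file should be in the training folder:

  export TORCH=path/to/torch-lua
  export CANDC=path/to/candc/
  cp eval_scripts/* $CANDC/src/scripts/ccg/

  ./do_training.sh my_model turian experiments/wsj_train/ experiments/wsj00/
  ./do_experiments.sh turian/my_model experiments/wsj23/ my_model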