COMBO is a jointly trained neural tagger, lemmatizer, and dependency parser implemented in Python 3 using the Keras framework. It took part in the CoNLL 2018 Universal Dependencies shared task and ranked 3rd/4th in the official evaluation.
COMBO is described in the paper [Semi-Supervised Neural System for Tagging, Parsing and Lematization](http://www.aclweb.org/anthology/K18-2004).
Training your own model:
python main.py --mode autotrain --train train_data.conllu --valid valid_data.conllu --embed external_embedding.txt --model model_name.pkl --force_trees
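If training is launched from another Python script, the command above can be assembled programmatically. The following is a minimal sketch, not part of COMBO: the helper names `build_train_command` and `run_training` are hypothetical, only the flags shown in the command above are used, and it assumes `main.py` is in the current working directory.

```python
import subprocess
import sys


def build_train_command(train, valid, embed, model, force_trees=True):
    """Return the argument list for the autotrain command shown above.

    Hypothetical helper: it only mirrors the flags from the README command.
    """
    cmd = [
        sys.executable, "main.py",  # same interpreter that runs this script
        "--mode", "autotrain",
        "--train", train,
        "--valid", valid,
        "--embed", embed,
        "--model", model,
    ]
    if force_trees:
        cmd.append("--force_trees")
    return cmd


def run_training(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    command = build_train_command(
        "train_data.conllu",
        "valid_data.conllu",
        "external_embedding.txt",
        "model_name.pkl",
    )
    # Print the command for inspection; call run_training(command) to start it.
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.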
Making predictions:
python main.py --mode predict --test test_data.conllu --pred output_path.conllu --model model_name.pkl
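The prediction file is written in CoNLL-U, a plain-text format with ten tab-separated columns per token, `#` comment lines, and a blank line after each sentence. The sketch below shows one way to read such a file in Python. It is an illustration under stated assumptions, not COMBO code: the function name `read_conllu` and the inline sample sentence are made up for the example.

```python
import io

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield sentences as lists of token dicts from an iterable of lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError("expected 10 columns, got %d" % len(cols))
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:  # file did not end with a blank line
        yield sentence


# Made-up sample standing in for a file such as output_path.conllu.
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllu(io.StringIO(SAMPLE)):
        for tok in sent:
            print(tok["form"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

To read real predictions, pass an open file object instead of the `io.StringIO` sample, for example `open("output_path.conllu", encoding="utf-8")`.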
Models trained on Universal Dependencies (UD) treebanks:
Language | Treebank | LAS | MLAS | BLEX | Model size |
---|---|---|---|---|---|
Afrikaans | af_afribooms | 84.72 | 72.91 | 74.98 | 377 MB |
Ancient Greek | grc_perseus | 74.20 | 53.30 | 54.29 | 101 MB |
Ancient Greek | grc_proiel | 76.45 | 59.95 | 67.47 | 101 MB |
Arabic | ar_padt | 71.95 | 62.75 | 64.38 | 737 MB |
Armenian | hy_armtdp | 28.15 | 5.02 | 11.25 | 738 MB |
Basque | eu_bdt | 83.12 | 68.82 | 77.96 | 737 MB |
Bulgarian | bg_btb | 89.36 | 81.10 | 79.98 | 738 MB |
Buryat | bxr_bdt | 15.16 | 1.09 | 1.92 | 90 MB |
Catalan | ca_ancora | 90.54 | 83.11 | 85.20 | 737 MB |
Chinese | zh_gsd | 63.92 | 53.48 | 57.84 | 744 MB |
Croatian | hr_set | 86.32 | 71.12 | 79.74 | 737 MB |
Czech | cs_cac | 90.72 | 83.27 | 86.69 | 740 MB |
Czech | cs_fictree | 91.83 | 84.23 | 87.81 | 740 MB |
Czech | cs_pdt | 90.34 | 84.04 | 86.96 | 740 MB |
Danish | da_ddt | 83.43 | 74.22 | 77.58 | 737 MB |
Dutch | nl_alpino | 87.15 | 74.93 | 77.06 | 737 MB |
Dutch | nl_lassysmall | 84.27 | 72.65 | 75.44 | 737 MB |
English | en_ewt | 82.31 | 73.33 | 76.52 | 737 MB |
English | en_gum | 82.82 | 73.24 | 73.57 | 737 MB |
English | en_lines | 80.33 | 72.25 | 74.01 | 737 MB |
Estonian | et_edt | 83.46 | 75.79 | 72.07 | 738 MB |
Finnish | fi_ftb | 86.89 | 78.42 | 81.06 | 739 MB |
Finnish | fi_tdt | 85.93 | 78.65 | 72.39 | 739 MB |
French | fr_gsd | 85.42 | 77.08 | 79.72 | 738 MB |
French | fr_sequoia | 88.99 | 81.48 | 84.67 | 738 MB |
French | fr_spoken | 74.31 | 63.43 | 65.34 | 738 MB |
Galician | gl_ctg | 81.17 | 68.15 | 73.60 | 736 MB |
Galician | gl_treegal | 73.21 | 52.88 | 62.86 | 736 MB |
German | de_gsd | 77.43 | 54.28 | 68.59 | 738 MB |
Gothic | got_proiel | 65.87 | 50.81 | 59.30 | 48 MB |
Greek | el_gdt | 88.49 | 76.15 | 78.57 | 738 MB |
Hebrew | he_htb | 63.69 | 50.26 | 53.58 | 737 MB |
Hindi | hi_hdtb | 91.43 | 76.23 | 86.29 | 593 MB |
Hungarian | hu_szeged | 79.47 | 66.09 | 72.51 | 737 MB |
Indonesian | id_gsd | 78.40 | 67.30 | 75.10 | 737 MB |
Irish | ga_idt | 69.24 | 37.31 | 47.32 | 206 MB |
Italian | it_isdt | 91.03 | 83.18 | 84.76 | 737 MB |
Italian | it_postwita | 73.99 | 61.14 | 62.98 | 737 MB |
Japanese | ja_gsd | 73.69 | 57.82 | 60.62 | 743 MB |
Kazakh | kk_ktb | 22.38 | 4.40 | 7.86 | 738 MB |
Korean | ko_gsd | 80.66 | 74.49 | 66.13 | 741 MB |
Korean | ko_kaist | 84.88 | 76.92 | 72.40 | 743 MB |
Kurmanji | kmr_mg | 21.95 | 2.26 | 5.01 | 45 MB |
Latin | la_ittb | 85.54 | 79.84 | 83.51 | 526 MB |
Latin | la_perseus | 68.07 | 49.77 | 52.75 | 526 MB |
Latin | la_proiel | 70.08 | 56.82 | 64.94 | 526 MB |
Latvian | lv_lvtb | 80.71 | 66.22 | 71.80 | 637 MB |
North Sámi | sme_giella | 57.16 | 39.66 | 45.03 | 47 MB |
Norwegian | no_bokmaal | 89.33 | 79.51 | 84.68 | 737 MB |
Norwegian | no_nynorsk | 88.36 | 79.32 | 82.89 | 737 MB |
Norwegian | no_nynorsklia | 68.26 | 57.51 | 60.98 | 737 MB |
Old Church Slavonic | cu_proiel | 71.14 | 56.52 | 66.04 | 48 MB |
Old French | fro_srcmf | 84.81 | 76.75 | 81.20 | 52 MB |
Persian | fa_seraji | 86.14 | 80.30 | 76.29 | 737 MB |
Polish | pl_lfg | 94.62 | 86.44 | 89.31 | 737 MB |
Polish | pl_sz | 91.38 | 80.45 | 85.59 | 737 MB |
Polish | poleval2018 | 86.11 | 76.18 | 79.86 | 115 MB |
Portuguese | pt_bosque | 87.57 | 74.31 | 80.31 | 737 MB |
Romanian | ro_rrt | 85.31 | 76.84 | 79.54 | 737 MB |
Russian | ru_syntagrus | 91.10 | 85.37 | 87.16 | 741 MB |
Russian | ru_taiga | 74.24 | 61.59 | 64.36 | 741 MB |
Serbian | sr_set | 87.27 | 73.79 | 79.92 | 738 MB |
Slovak | sk_snk | 83.76 | 63.97 | 75.34 | 54 MB |
Slovenian | sl_ssj | 85.72 | 75.07 | 81.11 | 737 MB |
Slovenian | sl_sst | 58.12 | 45.93 | 50.94 | 737 MB |
Spanish | es_ancora | 89.68 | 82.60 | 84.51 | 737 MB |
Swedish | sv_lines | 81.97 | 66.26 | 77.01 | 737 MB |
Swedish | sv_talbanken | 85.89 | 77.68 | 80.74 | 737 MB |
Turkish | tr_imst | 63.54 | 52.51 | 58.89 | 737 MB |
Ukrainian | uk_iu | 84.71 | 69.88 | 77.97 | 738 MB |
Upper Sorbian | hsb_ufal | 21.30 | 1.45 | 4.53 | 139 MB |
Urdu | ur_udtb | 81.53 | 55.70 | 72.49 | 485 MB |
Uyghur | ug_udt | 63.10 | 40.71 | 52.76 | 165 MB |
Vietnamese | vi_vtb | 42.53 | 35.11 | 38.47 | 736 MB |
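
LAS (labeled attachment score) in the table above counts a word as correct when both its predicted head and its dependency label match the gold annotation. The sketch below is a simplified illustration of that idea, assuming gold and predicted tokenization are identical. The shared-task scores were computed with the official evaluation script, which also aligns system and gold tokens, so this function is not a replacement for it. The name `las` and the example data are made up.

```python
def las(gold, pred):
    """Simplified labeled attachment score in percent.

    gold and pred are equal-length lists of (head, deprel) pairs, one per word.
    Label subtypes after a colon (e.g. "nmod:poss") are ignored.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of words")
    if not gold:
        return 0.0
    correct = 0
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
        same_head = g_head == p_head
        same_label = g_rel.split(":")[0] == p_rel.split(":")[0]
        if same_head and same_label:
            correct += 1
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
    pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
    print(las(gold, pred))  # prints 50.0
```

In the example, the third word has the wrong label and the fourth the wrong head, so two of the four words count as correct.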
License: CC BY-NC-SA 4.0
@InProceedings{rybak-wrblewska:2018:K18-2,
author = {Rybak, Piotr and Wr{\'{o}}blewska, Alina},
title = {Semi-Supervised Neural System for Tagging, Parsing and Lematization},
booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
month = {October},
year = {2018},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
pages = {45--54},
url = {http://www.aclweb.org/anthology/K18-2004}
}