Skip to content
This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Latest commit

 

History

History
438 lines (397 loc) · 11.6 KB

README.md

File metadata and controls

438 lines (397 loc) · 11.6 KB


The FLORES-200 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

The creation of FLORES-200 doubles the existing language coverage of FLORES-101. Given the nature of the new languages, which have less standardization and require more specialized professional translations, the verification process became more complex. This required modifications to the translation workflow. FLORES-200 has several languages which were not translated from English. Specifically, several languages were translated from Spanish, French, Russian and Modern Standard Arabic. Moreover, FLORES-200 also includes two script alternatives for four languages.


Composition

FLORES-200 consists of translations from 842 distinct web articles, totaling 3001 sentences. These sentences are divided into three splits: dev, devtest, and test (hidden). On average, sentences are approximately 21 words long.

Download

⚠️ This repository is no longer being updated ⚠️

For newer versions of this dataset, see https://github.com/openlanguagedata/flores and https://www.oldi.org.

The original version of the dataset can still be downloaded here and is also available on HuggingFace here.

SPM and Dictionary

  • Dictionary Download here
  • SPM Model Download here

Example SentencePiece Usage

Note: Install SentencePiece from here

flores_dataset=/path/to/flores_dataset
fairseq=/path/to/fairseq
cd $fairseq

python scripts/spm_encode.py \
    --model flores_spm_model_here \
    --output_format=piece \
    --inputs=data_input_path_here \
    --outputs=data_output_path_here

Evaluation

We primarily evaluate with chrf :

sacrebleu -m chrf --chrf-word-order 2 {ref_file} < {hyp_file}

and also evaluate with spBLEU:

# tokenize with SPM
python scripts/spm_encode.py \
    --model flores_spm_model_here \
    --output_format=piece \
    --inputs={untok_hyp_file} \
    --outputs={hyp_file}

# calculate with sacrebleu
cat {hyp_file} | sacrebleu {ref_file}

Languages in FLORES-200

Language FLORES-200 code
Acehnese (Arabic script) ace_Arab
Acehnese (Latin script) ace_Latn
Mesopotamian Arabic acm_Arab
Ta’izzi-Adeni Arabic acq_Arab
Tunisian Arabic aeb_Arab
Afrikaans afr_Latn
South Levantine Arabic ajp_Arab
Akan aka_Latn
Amharic amh_Ethi
North Levantine Arabic apc_Arab
Modern Standard Arabic arb_Arab
Modern Standard Arabic (Romanized) arb_Latn
Najdi Arabic ars_Arab
Moroccan Arabic ary_Arab
Egyptian Arabic arz_Arab
Assamese asm_Beng
Asturian ast_Latn
Awadhi awa_Deva
Central Aymara ayr_Latn
South Azerbaijani azb_Arab
North Azerbaijani azj_Latn
Bashkir bak_Cyrl
Bambara bam_Latn
Balinese ban_Latn
Belarusian bel_Cyrl
Bemba bem_Latn
Bengali ben_Beng
Bhojpuri bho_Deva
Banjar (Arabic script) bjn_Arab
Banjar (Latin script) bjn_Latn
Standard Tibetan bod_Tibt
Bosnian bos_Latn
Buginese bug_Latn
Bulgarian bul_Cyrl
Catalan cat_Latn
Cebuano ceb_Latn
Czech ces_Latn
Chokwe cjk_Latn
Central Kurdish ckb_Arab
Crimean Tatar crh_Latn
Welsh cym_Latn
Danish dan_Latn
German deu_Latn
Southwestern Dinka dik_Latn
Dyula dyu_Latn
Dzongkha dzo_Tibt
Greek ell_Grek
English eng_Latn
Esperanto epo_Latn
Estonian est_Latn
Basque eus_Latn
Ewe ewe_Latn
Faroese fao_Latn
Fijian fij_Latn
Finnish fin_Latn
Fon fon_Latn
French fra_Latn
Friulian fur_Latn
Nigerian Fulfulde fuv_Latn
Scottish Gaelic gla_Latn
Irish gle_Latn
Galician glg_Latn
Guarani grn_Latn
Gujarati guj_Gujr
Haitian Creole hat_Latn
Hausa hau_Latn
Hebrew heb_Hebr
Hindi hin_Deva
Chhattisgarhi hne_Deva
Croatian hrv_Latn
Hungarian hun_Latn
Armenian hye_Armn
Igbo ibo_Latn
Ilocano ilo_Latn
Indonesian ind_Latn
Icelandic isl_Latn
Italian ita_Latn
Javanese jav_Latn
Japanese jpn_Jpan
Kabyle kab_Latn
Jingpho kac_Latn
Kamba kam_Latn
Kannada kan_Knda
Kashmiri (Arabic script) kas_Arab
Kashmiri (Devanagari script) kas_Deva
Georgian kat_Geor
Central Kanuri (Arabic script) knc_Arab
Central Kanuri (Latin script) knc_Latn
Kazakh kaz_Cyrl
Kabiyè kbp_Latn
Kabuverdianu kea_Latn
Khmer khm_Khmr
Kikuyu kik_Latn
Kinyarwanda kin_Latn
Kyrgyz kir_Cyrl
Kimbundu kmb_Latn
Northern Kurdish kmr_Latn
Kikongo kon_Latn
Korean kor_Hang
Lao lao_Laoo
Ligurian lij_Latn
Limburgish lim_Latn
Lingala lin_Latn
Lithuanian lit_Latn
Lombard lmo_Latn
Latgalian ltg_Latn
Luxembourgish ltz_Latn
Luba-Kasai lua_Latn
Ganda lug_Latn
Luo luo_Latn
Mizo lus_Latn
Standard Latvian lvs_Latn
Magahi mag_Deva
Maithili mai_Deva
Malayalam mal_Mlym
Marathi mar_Deva
Minangkabau (Arabic script) min_Arab
Minangkabau (Latin script) min_Latn
Macedonian mkd_Cyrl
Plateau Malagasy plt_Latn
Maltese mlt_Latn
Meitei (Bengali script) mni_Beng
Halh Mongolian khk_Cyrl
Mossi mos_Latn
Maori mri_Latn
Burmese mya_Mymr
Dutch nld_Latn
Norwegian Nynorsk nno_Latn
Norwegian Bokmål nob_Latn
Nepali npi_Deva
Northern Sotho nso_Latn
Nuer nus_Latn
Nyanja nya_Latn
Occitan oci_Latn
West Central Oromo gaz_Latn
Odia ory_Orya
Pangasinan pag_Latn
Eastern Panjabi pan_Guru
Papiamento pap_Latn
Western Persian pes_Arab
Polish pol_Latn
Portuguese por_Latn
Dari prs_Arab
Southern Pashto pbt_Arab
Ayacucho Quechua quy_Latn
Romanian ron_Latn
Rundi run_Latn
Russian rus_Cyrl
Sango sag_Latn
Sanskrit san_Deva
Santali sat_Olck
Sicilian scn_Latn
Shan shn_Mymr
Sinhala sin_Sinh
Slovak slk_Latn
Slovenian slv_Latn
Samoan smo_Latn
Shona sna_Latn
Sindhi snd_Arab
Somali som_Latn
Southern Sotho sot_Latn
Spanish spa_Latn
Tosk Albanian als_Latn
Sardinian srd_Latn
Serbian srp_Cyrl
Swati ssw_Latn
Sundanese sun_Latn
Swedish swe_Latn
Swahili swh_Latn
Silesian szl_Latn
Tamil tam_Taml
Tatar tat_Cyrl
Telugu tel_Telu
Tajik tgk_Cyrl
Tagalog tgl_Latn
Thai tha_Thai
Tigrinya tir_Ethi
Tamasheq (Latin script) taq_Latn
Tamasheq (Tifinagh script) taq_Tfng
Tok Pisin tpi_Latn
Tswana tsn_Latn
Tsonga tso_Latn
Turkmen tuk_Latn
Tumbuka tum_Latn
Turkish tur_Latn
Twi twi_Latn
Central Atlas Tamazight tzm_Tfng
Uyghur uig_Arab
Ukrainian ukr_Cyrl
Umbundu umb_Latn
Urdu urd_Arab
Northern Uzbek uzn_Latn
Venetian vec_Latn
Vietnamese vie_Latn
Waray war_Latn
Wolof wol_Latn
Xhosa xho_Latn
Eastern Yiddish ydd_Hebr
Yoruba yor_Latn
Yue Chinese yue_Hant
Chinese (Simplified) zho_Hans
Chinese (Traditional) zho_Hant
Standard Malay zsm_Latn
Zulu zul_Latn

Updates to Previous Languages

Based on feedback and further Q/A, we've improved the quality of several languages:

  • Quechua (quy_Latn)
  • Aymara (ayr_Latn)
  • Cebuano (ceb_Latn)
  • Kimbundu (kmb_Latn)
  • Umbundu (umb_Latn)

As a result, the results between FLORES-101 and FLORES-200 for these languages will differ slightly.

Map between FLORES-101 Language Codes and FLORES-200 Language Codes

FLORES-200 code FLORES-101 code
afr_Latn afr
amh_Ethi amh
arb_Arab ara
asm_Beng asm
ast_Latn ast
azj_Latn azj
bel_Cyrl bel
ben_Beng ben
bos_Latn bos
bul_Cyrl bul
cat_Latn cat
ceb_Latn ceb
ces_Latn ces
ckb_Arab ckb
cym_Latn cym
dan_Latn dan
deu_Latn deu
ell_Grek ell
eng_Latn eng
est_Latn est
fin_Latn fin
fra_Latn fra
fuv_Latn ful
gle_Latn gle
glg_Latn glg
guj_Gujr guj
hau_Latn hau
heb_Hebr heb
hin_Deva hin
hrv_Latn hrv
hun_Latn hun
hye_Armn hye
ibo_Latn ibo
ind_Latn ind
isl_Latn isl
ita_Latn ita
jav_Latn jav
jpn_Jpan jpn
kam_Latn kam
kan_Knda kan
kat_Geor kat
kaz_Cyrl kaz
khm_Khmr khm
kir_Cyrl kir
kor_Hang kor
lao_Laoo lao
lij_Latn Latvian
lim_Latn kea
lin_Latn lin
lit_Latn lit
ltz_Latn ltz
lug_Latn lug
luo_Latn luo
lvs_Latn lav
mal_Mlym mal
mar_Deva mar
mkd_Cyrl mkd
mlt_Latn mlt
khk_Cyrl mon
mri_Latn mri
mya_Mymr mya
nld_Latn nld
nob_Latn nob
npi_Deva npi
nso_Latn nso
nya_Latn nya
oci_Latn oci
gaz_Latn orm
ory_Orya ory
pan_Guru pan
pes_Arab fas
pol_Latn pol
por_Latn por
pbt_Arab pus
ron_Latn ron
rus_Cyrl rus
slk_Latn slk
sna_Latn sna
snd_Arab snd
som_Latn som
spa_Latn spa
srp_Cyrl srp
swe_Latn swe
swh_Latn swh
tam_Taml tam
tel_Telu tel
tgk_Cyrl tgk
tgl_Latn tgl
tha_Thai tha
tur_Latn tur
ukr_Cyrl ukr
umb_Latn umb
urd_Arab urd
uzn_Latn uzb
vie_Latn vie
wol_Latn wol
xho_Latn xho
yor_Latn yor
zho_Hans zho_simpl
zho_Hant zho_trad
zsm_Latn msa
zul_Latn zul

Previous FLORES Releases

FLORES-101

FLORES-101 is a Many-to-Many multilingual translation benchmark dataset for 101 languages.

FLORESv1

FLORESv1 included Nepali, Sinhala, Pashto, and Khmer.

Citation

If you use this data in your work, please cite:

@article{nllb2022,
  author    = {NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi,  Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang},
  title     = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  year      = {2022}
}

@inproceedings{,
  title={The FLORES-101  Evaluation Benchmark for Low-Resource and Multilingual Machine Translation},
  author={Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc'Aurelio and Guzm\'{a}n, Francisco and Fan, Angela},
  year={2021}
}

@inproceedings{,
  title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
  author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1902.01382},
  year={2019}
}