An N-Gram Based Language Identifier

The main modules are in the langid package:

Required package(s): pyyaml

Sample code is in the sample folder, which includes:

Data in 4 languages (English, German, Italian and Spanish) were downloaded from Leipzig Corpora.
List of used data:
The sentence file in each archive was split into 20K chunks. The first 10 chunks were used as testing data, and the 11th chunk was used as training data.
Each training file is named after its language code, and each testing file name starts with its language code.

The path of this project must be in PYTHONPATH to run the sample code.
train_languages.py: Reads each training file in folder sample/training and writes the corresponding model to folder sample/models
test_languages.py: Reads each testing file in folder sample/testing and reports the accuracy
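
For readers unfamiliar with the approach, the train-then-test flow of the two scripts can be sketched with a generic character n-gram classifier. This is not the langid package's code or API: the function names, the n-gram orders (1 to 3), the log-frequency scoring, and the smoothing floor are all assumptions made for illustration, and the toy sentences stand in for the real training and testing files.

```python
import math
from collections import Counter


def char_ngrams(text, n_max=3):
    """Yield all character n-grams of length 1..n_max from whitespace-normalized text."""
    text = " " + " ".join(text.lower().split()) + " "
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def train(sentences, n_max=3):
    """Build a model: relative frequency of each n-gram over the training sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_ngrams(sentence, n_max))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def score(model, text, n_max=3, floor=1e-7):
    """Sum of log frequencies; unseen n-grams get an arbitrary small floor value."""
    return sum(math.log(model.get(gram, floor)) for gram in char_ngrams(text, n_max))


def identify(models, text):
    """Return the language code whose model scores the text highest."""
    return max(models, key=lambda lang: score(models[lang], text))


def accuracy(models, labelled):
    """Fraction of (language, text) pairs that are identified correctly."""
    correct = sum(identify(models, text) == lang for lang, text in labelled)
    return correct / len(labelled)


# Toy data standing in for the training and testing files.
models = {
    "en": train(["the quick brown fox jumps", "this is a sentence in english",
                 "the dog is lazy and old"]),
    "de": train(["der schnelle braune fuchs springt", "das ist ein satz auf deutsch",
                 "der hund ist faul und alt"]),
}
test_set = [("en", "the fox is old"), ("de", "der fuchs ist alt")]
print(accuracy(models, test_set))
```

In this sketch a model is a plain dictionary, so it could be serialized in any format; how the project actually stores its models in sample/models is not specified here.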
