This project implements the approach described in William B. Cavnar and John M. Trenkle, "N-Gram-Based Text Categorization".
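For readers unfamiliar with the paper, its core comparison is the "out-of-place" measure between two rank-ordered n-gram lists. The following is a minimal sketch of that measure, not this project's actual code: the function names `out_of_place` and `classify` and the `max_penalty` parameter are hypothetical, and the penalty for an unseen n-gram is assumed to default to the length of the language profile.

```python
def out_of_place(doc_profile, lang_profile, max_penalty=None):
    """Sum of rank differences between two ranked n-gram lists.

    Both arguments are lists of n-grams ordered from most to least frequent.
    An n-gram missing from lang_profile costs max_penalty (assumed default:
    the length of lang_profile).
    """
    if max_penalty is None:
        max_penalty = len(lang_profile)
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    total = 0
    for rank, gram in enumerate(doc_profile):
        if gram in lang_rank:
            total += abs(rank - lang_rank[gram])
        else:
            total += max_penalty
    return total


def classify(doc_profile, lang_profiles):
    """Return the language code whose profile is closest to doc_profile."""
    return min(
        lang_profiles,
        key=lambda code: out_of_place(doc_profile, lang_profiles[code]),
    )
```

Here `lang_profiles` is assumed to be a dict mapping a language code to its ranked n-gram list; the language with the smallest total distance wins.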
Main modules are in package langid:
- ngramprofile.py: language profile class and methods
- utils.py: helper functions to iterate n-grams in a file
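To illustrate what iterating n-grams in a file and building a language profile can look like, here is a rough sketch. It does not reproduce the real contents of ngramprofile.py or utils.py: the names `iter_ngrams`, `iter_file_ngrams` and `build_profile` are hypothetical, the padding (one underscore on each side of a word) is a simplification of the scheme in the paper, and the profile size of 300 is an assumed default.

```python
import re
from collections import Counter


def iter_ngrams(text, min_n=1, max_n=5):
    """Yield character n-grams of every word in text, padded with '_'."""
    # [^\W\d_]+ matches runs of letters only (digits and punctuation are dropped).
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def iter_file_ngrams(path, encoding="utf-8", **kwargs):
    """Yield n-grams line by line so the whole file is never held in memory."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            yield from iter_ngrams(line, **kwargs)


def build_profile(ngrams, size=300):
    """Return the `size` most frequent n-grams, most frequent first."""
    counts = Counter(ngrams)
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:size]
```

A profile for one training file could then be built with `build_profile(iter_file_ngrams(path))`, where `path` points at any UTF-8 text file.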
Required package(s): pyyaml
Sample code is in the folder sample. This folder includes:
- Data:
  - Data were downloaded from Leipzig Corpora in 4 languages: English, German, Italian and Spanish.
  - List of used data:
  - The sentence file in each archive was split into 20K chunks. The first 10 chunks were used as testing data, and the 11th chunk was used as training data.
  - The language code is the name of each training file and the prefix of each testing file.
- Code:
  - The path of this project must be in PYTHONPATH in order to run the sample code.
  - train_languages.py: reads each training file in folder sample/training and writes the corresponding model to folder sample/models
  - test_languages.py: reads each testing file in folder sample/testing and reports the accuracy
- Result: The accuracy on this dataset is 1.0.
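The file naming convention and the accuracy report described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in test_languages.py: `language_of` and `accuracy` are hypothetical helpers, the example file name `en_03.txt` is invented, and whether the real script scores per sentence or per file is not stated here.

```python
from pathlib import Path


def language_of(test_file, known_codes):
    """Return the language code that prefixes the testing file's name."""
    name = Path(test_file).name
    matches = [code for code in known_codes if name.startswith(code)]
    if not matches:
        raise ValueError(f"no known language code prefixes {name!r}")
    # Prefer the longest match in case one code is a prefix of another.
    return max(matches, key=len)


def accuracy(labelled_items, predict):
    """Fraction of (text, expected_code) pairs that predict() gets right."""
    total = 0
    correct = 0
    for text, expected in labelled_items:
        total += 1
        if predict(text) == expected:
            correct += 1
    return correct / total if total else 0.0
```

For example, `language_of("sample/testing/en_03.txt", ["en", "de"])` returns `"en"`, and `predict` can be any callable that maps a piece of text to a language code.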