bichngocdo/language-identifier

An N-Gram Based Language Identifier

This project implements the approach described in William B. Cavnar and John M. Trenkle, "N-Gram-Based Text Categorization".
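As a rough illustration of the idea in that paper (not this project's actual code), a language can be represented by a ranked list of its most frequent character n-grams, and a text is assigned to the language whose profile has the smallest "out-of-place" rank distance. The function names `iter_ngrams`, `build_profile`, `out_of_place` and `identify`, the profile size of 300, and the single-underscore padding are assumptions made for this sketch; the paper's padding scheme is slightly different.

```python
import re
from collections import Counter


def iter_ngrams(text, min_n=1, max_n=5):
    """Yield character n-grams of each word, padded with '_' on both sides.

    Simplification: one padding character per side.
    """
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + token + "_"
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip("_"):  # skip n-grams made only of padding
                    yield gram


def build_profile(text, max_size=300):
    """Map each of the most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(iter_ngrams(text))
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:max_size])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = build_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )
```

In this sketch, `lang_profiles` is a plain dictionary from language code to profile, for example `{"en": build_profile(english_text), "de": build_profile(german_text)}`. Real profiles need far more training text than a single sentence to be reliable.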

The main modules are in the package langid:

  • ngramprofile.py: language profile class and methods
  • utils.py: helper functions to iterate n-grams in a file

Required package(s): pyyaml

Sample Code

Sample code is in the folder sample, which includes:

  1. Data:
     • Data were downloaded from the Leipzig Corpora in 4 languages: English, German, Italian and Spanish.
     • List of used data:
     • The sentence file in each archive was split into 20K chunks; the first 10 chunks were used as testing data, and the 11th chunk was used as training data.
     • The language code is the name of each training file and the prefix of each testing file.
  2. Code:
     • The path of this project must be in PYTHONPATH in order to run the sample code.
     • train_languages.py: reads each training file in the folder sample/training and writes the corresponding model to the folder sample/models.
     • test_languages.py: reads each testing file in the folder sample/testing and reports the accuracy.
  3. Result: The accuracy on this dataset is 1.0.
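The evaluation step described above can be sketched as follows. This is a minimal illustration, not the code in test_languages.py: the helper names `expected_language` and `accuracy`, and the file names used in the example, are hypothetical. It only relies on the convention stated above that each testing file name starts with its language code.

```python
def expected_language(filename, language_codes):
    """Return the language code that prefixes a testing file name, or None."""
    # Prefer the longest matching code so a short code cannot shadow a longer one.
    matches = [code for code in language_codes if filename.startswith(code)]
    return max(matches, key=len) if matches else None


def accuracy(predictions, language_codes):
    """Fraction of testing files whose predicted language matches the prefix.

    `predictions` maps a testing file name to the predicted language code.
    """
    if not predictions:
        raise ValueError("no predictions to score")
    correct = sum(
        1
        for name, predicted in predictions.items()
        if expected_language(name, language_codes) == predicted
    )
    return correct / len(predictions)
```

With hypothetical file names, `accuracy({"en_01.txt": "en", "de_01.txt": "de"}, ["en", "de"])` scores both files as correct, since each prediction matches the prefix of its file name.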
