bbdouglas/tweet-lang
This is a simple example of how n-grams and cosine similarity can be used to statistically detect language. The system is designed to work with Twitter and has rudimentary support for Twitter metadata and formatting, but the technique could easily be adapted to other domains.

Since each tweet has an associated user profile language, guessing which language a tweet is written in is a rather academic exercise. The more practical use is detecting when a tweet is not written in its designated profile language, which actually happens quite often. The idea is to gather tweets from a single profile language, on the assumption that the majority are in fact in that language; the similarity metric can then be used to identify outliers.

There are two command-line Python scripts: train.py and detect.py. train.py is run over a set of JSON-formatted tweet files, each file corresponding to one profile language, and produces a language model. detect.py uses that model to predict the language of new tweets. The language detection is done purely on the text of the tweet, without any help from the tweet metadata.

The following shows sample usage of the scripts:

    $ ./train.py en.json de.json ja.json
    processing en.json...
    processing de.json...
    processing ja.json...
    model written to langid.model

    $ ./detect.py langid.model "Go #Giants! Beat the #Tigers"
    en

For more background, see the blog posts and demos at http://blog.bbdouglas.com
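To make the underlying idea concrete, here is a minimal sketch of n-gram profiles plus cosine similarity, not the actual train.py/detect.py implementation. The function names (build_profile, detect), the choice of character trigrams, and the toy training sentences are assumptions for illustration only.

    # Minimal sketch: character n-gram counts + cosine similarity for language detection.
    # Illustrative only; does not mirror the real train.py / detect.py code.
    import math
    from collections import Counter

    def ngrams(text, n=3):
        """Yield character n-grams (trigrams by default) from a tweet's text."""
        text = text.lower()
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

    def build_profile(texts, n=3):
        """Aggregate n-gram counts over a collection of tweets in one language."""
        profile = Counter()
        for text in texts:
            profile.update(ngrams(text, n))
        return profile

    def cosine_similarity(a, b):
        """Cosine similarity between two sparse count vectors (Counters)."""
        dot = sum(a[g] * b[g] for g in set(a) & set(b))
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def detect(text, profiles):
        """Return the language whose profile is most similar to the tweet text."""
        tweet_vector = Counter(ngrams(text))
        return max(profiles, key=lambda lang: cosine_similarity(tweet_vector, profiles[lang]))

    # Hypothetical per-language training data (one list of tweets per profile language).
    profiles = {
        "en": build_profile(["Go #Giants! Beat the #Tigers", "What a great game tonight"]),
        "de": build_profile(["Was für ein tolles Spiel heute Abend", "Die Mannschaft hat gewonnen"]),
    }
    print(detect("Beat the Tigers tonight!", profiles))  # expected: en

The same similarity scores can be used the other way around: a tweet whose score against its own profile language is unusually low is a candidate outlier.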
About
Python-based language identification of tweets.