bbdouglas/tweet-lang
This is a simple example of how n-grams and cosine similarity can be used to statistically detect language. The system is designed to work with Twitter and has rudimentary support for Twitter metadata and formatting, but the technique could easily be adapted to other domains.

Since each tweet has an associated user profile language, guessing which language a tweet is written in is a rather academic exercise. The more practical use is detecting when a tweet is not written in its designated profile language, which actually happens quite often. The idea is to gather tweets from a single profile language, on the assumption that the majority are in fact in that language; the similarity metric can then be used to identify outliers.

There are two command-line Python scripts: train.py and detect.py. train.py is run over a set of JSON-formatted tweet files, each file corresponding to one profile language, and produces a language model. detect.py uses that model to predict the language of new tweets. The language detection is done purely on the text of the tweet, without any help from the tweet metadata.

The following shows sample usage of the scripts:

    $ ./train.py en.json de.json ja.json
    processing en.json...
    processing de.json...
    processing ja.json...
    model written to langid.model

    $ ./detect.py langid.model "Go #Giants! Beat the #Tigers"
    en

For more background, see the blog posts and demos at http://blog.bbdouglas.com
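To make the underlying idea concrete, here is a minimal sketch of n-gram profiles plus cosine similarity, not the actual train.py/detect.py implementation. The function names (build_profile, detect), the choice of character trigrams, and the toy training sentences are assumptions for illustration only.

    # Minimal sketch: character n-gram counts + cosine similarity for language detection.
    # Illustrative only; does not mirror the real train.py / detect.py code.
    import math
    from collections import Counter

    def ngrams(text, n=3):
        """Yield character n-grams (trigrams by default) from a tweet's text."""
        text = text.lower()
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

    def build_profile(texts, n=3):
        """Aggregate n-gram counts over a collection of tweets in one language."""
        profile = Counter()
        for text in texts:
            profile.update(ngrams(text, n))
        return profile

    def cosine_similarity(a, b):
        """Cosine similarity between two sparse count vectors (Counters)."""
        dot = sum(a[g] * b[g] for g in set(a) & set(b))
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def detect(text, profiles):
        """Return the language whose profile is most similar to the tweet text."""
        tweet_vector = Counter(ngrams(text))
        return max(profiles, key=lambda lang: cosine_similarity(tweet_vector, profiles[lang]))

    # Hypothetical per-language training data (one list of tweets per profile language).
    profiles = {
        "en": build_profile(["Go #Giants! Beat the #Tigers", "What a great game tonight"]),
        "de": build_profile(["Was für ein tolles Spiel heute Abend", "Die Mannschaft hat gewonnen"]),
    }
    print(detect("Beat the Tigers tonight!", profiles))  # expected: en

The same similarity scores can be used the other way around: a tweet whose score against its own profile language is unusually low is a candidate outlier.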
About
Python-based language identification of tweets.