This is a simple example of how n-grams and cosine similarity can be used to
statistically detect language. The system is designed to work with Twitter,
and has rudimentary support for Twitter metadata and formatting. However, the
technique could easily be adapted to other domains.
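
As a rough illustration of the idea, the sketch below counts character
n-grams in a piece of text and compares two count vectors with cosine
similarity. The function names, the trigram size (n=3), the lowercasing and
the space padding are assumptions made for this example; they are not taken
from train.py or detect.py.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams (n=3 is an assumed default)."""
    padded = " " + text.lower() + " "  # pad so word boundaries form n-grams too
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

english = char_ngrams("the quick brown fox jumps over the lazy dog")
german = char_ngrams("der schnelle braune fuchs springt über den faulen hund")
sample = char_ngrams("the dog jumps over the fox")

# The sample should score higher against the English profile than the German one.
print(cosine_similarity(sample, english), cosine_similarity(sample, german))
```

Cosine similarity depends only on the direction of the two vectors, so a
short tweet can be compared against a profile built from a much larger body
of text without normalizing the counts first.
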
Since each tweet has an associated user profile language, guessing what
language a tweet is in is a rather academic exercise. The more practical use
is detecting when a tweet is not written in the designated profile language,
which actually happens quite often. The idea would be to gather tweets from a
single profile language, on the assumption that the majority are, in fact, in
that language. The similarity metric can then be used to identify outliers.
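
One possible way to turn that idea into code is sketched below: each tweet
is scored against a profile built from all the other tweets in the same
group, and tweets scoring below a cutoff are flagged. The leave-one-out
scoring, the function names and the 0.1 threshold are all assumptions chosen
for illustration, not the method used by the scripts in this repository.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    # Character n-gram counts; n=3 is an assumed setting.
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_similarity(a, b):
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def score_tweets(tweets):
    """Score each tweet against the combined profile of all the other tweets."""
    profiles = [char_ngrams(t) for t in tweets]
    total = Counter()
    for p in profiles:
        total.update(p)
    # Subtracting a tweet's own counts leaves the profile of everyone else.
    return [cosine_similarity(p, total - p) for p in profiles]

def find_outliers(tweets, threshold=0.1):
    """Return tweets whose score falls below the (hypothetical) threshold."""
    return [t for t, s in zip(tweets, score_tweets(tweets)) if s < threshold]

tweets = [
    "the game is on tonight and the team looks great",
    "watching the game with the whole family tonight",
    "what a great night for the team and the fans",
    "今日はとてもいい天気ですね",
]
print(find_outliers(tweets))
```

A real threshold would need tuning on actual data; tweets in related
languages share many n-grams and will not separate as cleanly as this toy
example.
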
There are two command-line Python scripts: train.py and detect.py. train.py is
run over a set of JSON-formatted tweet files, each file corresponding to one
profile language, and produces a language model. detect.py uses the model to
predict the language of new tweets. Detection is done purely on the text of
the tweet, without any help from the tweet metadata. The following is sample
usage of the scripts.
$ ./train.py en.json de.json ja.json
processing en.json...
processing de.json...
processing ja.json...
model written to langid.model
$ ./detect.py langid.model "Go #Giants! Beat the #Tigers"
en
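
The train-then-detect workflow shown above can be approximated in memory
with the following sketch. It is not the code in train.py or detect.py: the
train and detect functions, the dictionary-of-profiles model and the corpus
layout are hypothetical, and the real scripts read JSON tweet files and
write the model to disk.

```python
import math
from collections import Counter

def profile(text, n=3):
    # Character n-gram counts; n=3 is an assumption, not the repository's setting.
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train(corpus):
    """Build one merged n-gram profile per language.

    corpus maps a language code to a list of texts (hypothetical input shape).
    """
    model = {}
    for lang, texts in corpus.items():
        merged = Counter()
        for text in texts:
            merged.update(profile(text))
        model[lang] = merged
    return model

def detect(model, text):
    """Return the language whose profile is most similar to the text."""
    query = profile(text)
    return max(model, key=lambda lang: cosine(query, model[lang]))

corpus = {
    "en": ["the weather is nice today", "we are going to the game tonight"],
    "de": ["das wetter ist heute schön", "wir gehen heute abend zum spiel"],
}
model = train(corpus)
print(detect(model, "the game was great"))
```

With only two sentences per language this is a toy model; a usable one
would be trained on many tweets per language, as the train.py run above
suggests.
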
For more background, see the blog posts and demos at:
http://blog.bbdouglas.com