To support T99666 (Provide a service to detect which language the user is writing in), a language identification system is to be created, based on the fastText model outlined in https://arxiv.org/pdf/2305.13820.pdf
- Source: https://github.com/laurieburchell/open-lid-dataset
- Model: https://data.statmt.org/lid/lid201-model.bin.gz
- Model license: the GNU General Public License v3.0.
Additional information
- Given a text, the system should predict its language and also report a score indicating the confidence level of the prediction.
- Ownership of the service: @santhosh, Language-Team
- fastText is very fast for language detection; inference is almost instant. No GPU required.
- The model supports detecting 201 languages. Supporting a larger set of languages, for example all 320 languages in which Wikipedia exists, is a future goal. This requires a good-quality dataset, and @santhosh has been looking into this. https://github.com/santhoshtr/wikisentences is a program to prepare such a dataset. It is currently being checked whether the authors of the above-mentioned paper are interested in this exploration. The hardware setup and time required to train a new model are modest: about 2 hours on a sufficiently powerful machine is enough. No GPU required.
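
As a rough illustration of the "predict the language and report a score" requirement, the sketch below wraps a fastText-style `predict(text, k)` call and returns `(language, score)` pairs. It is only a sketch under stated assumptions, not the service's implementation: `detect_language` and `StubModel` are hypothetical names, the stub stands in for the real model so the example runs without the `fasttext` package or the model file, and the `eng_Latn`-style label format is assumed from the model's label set.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def detect_language(model, text: str, k: int = 1) -> List[Tuple[str, float]]:
    """Return up to k (language, score) pairs, best guess first.

    `model` is anything exposing a fastText-style predict(text, k=...)
    that returns (labels, scores). Hypothetical helper, for illustration.
    """
    # fastText's predict() works on one line at a time and rejects "\n",
    # so collapse all whitespace (including newlines) into single spaces.
    single_line = " ".join(text.split())
    labels, scores = model.predict(single_line, k=k)
    results = []
    for label, score in zip(labels, scores):
        # Strip the "__label__" prefix, leaving e.g. "eng_Latn".
        if label.startswith(LABEL_PREFIX):
            label = label[len(LABEL_PREFIX):]
        results.append((label, float(score)))
    return results


class StubModel:
    """Hypothetical stand-in for a loaded fastText model.

    It always returns the same fixed ranking; the labels and scores
    are made up for the example and are not real model output.
    """

    _ranking = (("__label__eng_Latn", 0.9), ("__label__fra_Latn", 0.1))

    def predict(self, text: str, k: int = 1):
        if "\n" in text:
            # Mimic fastText, which refuses multi-line input.
            raise ValueError("predict processes one line at a time")
        top = self._ranking[:k]
        return tuple(l for l, _ in top), tuple(s for _, s in top)


if __name__ == "__main__":
    # With the real model one would presumably pass
    # fasttext.load_model("lid201-model.bin") instead of StubModel(),
    # where the path to the decompressed model file is an assumption.
    print(detect_language(StubModel(), "Hello,\nworld!", k=2))
```

Keeping the model behind a small wrapper like this means the service can return plain language codes and float scores regardless of how the underlying model formats its labels.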