Distinguish between different variations of the same language #46

BLKSerene · 2022-07-15T11:01:04Z

Hi, I'm wondering whether it is possible for lingua to distinguish between variations of the same language, for example: Simplified Chinese and Traditional Chinese, Norwegian Bokmål and Norwegian Nynorsk.
AFAIK, langdetect can distinguish between Simplified and Traditional Chinese, while the other alternatives can't.


pemistahl · 2022-07-19T20:39:36Z

Hi @BLKSerene, thank you for your request.

The library already distinguishes between Bokmål and Nynorsk. As for Simplified and Traditional Chinese, I have not yet found suitable training corpora that consist solely of either Simplified or Traditional Chinese. Do you perhaps know a good source for those?

BLKSerene · 2022-07-20T03:49:12Z

There are two UD Chinese corpora.
Simplified Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSDSimp
Traditional Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSD
What are the requirements for the training data? And for the license?

pemistahl · 2022-07-20T09:06:36Z

Ah, those look suitable, thank you.

To create the language models, LanguageModelFilesWriter needs training data in plain text without any annotations. So I would first need a custom parser for the UD files. The license should allow the use of the language models created from the training data.

BLKSerene · 2022-07-20T13:44:54Z

The conllu package should suffice for parsing UD corpora: https://github.com/EmilStenstrom/conllu
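
For reference, the extraction step discussed here could look roughly like the sketch below. It uses only the standard library instead of the conllu package, and it assumes that every sentence in the UD files carries the usual `# text = ...` comment line. The names `iter_sentences` and `conllu_to_plain_text` are made up for illustration and are not part of lingua or conllu.

```python
from typing import Iterable, Iterator

# CoNLL-U stores the raw sentence in a comment line with this prefix
# (assumption: the corpus follows this convention for every sentence).
TEXT_PREFIX = "# text = "


def iter_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the raw sentence text found in CoNLL-U comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(TEXT_PREFIX):
            sentence = line[len(TEXT_PREFIX):].strip()
            if sentence:
                yield sentence


def conllu_to_plain_text(conllu_path: str, output_path: str) -> int:
    """Write one sentence per line to output_path and return the sentence count."""
    count = 0
    with open(conllu_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for sentence in iter_sentences(src):
            dst.write(sentence + "\n")
            count += 1
    return count
```

The resulting file has one unannotated sentence per line, which should match the plain-text requirement mentioned above. Whether that layout is exactly what LanguageModelFilesWriter expects would still need to be checked.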

yanqianglu · 2023-08-26T22:26:14Z

+1 on the feature request 🙏

yudelevi · 2024-07-03T17:29:53Z

If it helps anyone: in the meantime, I've had some success identifying Traditional and Simplified Chinese with hanzidentifier, which is based on zhon.
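
To illustrate the general idea of a character-set check, here is a toy sketch. It does not use hanzidentifier or zhon. The two tiny character sets are hand-picked for illustration only, and `classify_script` is a hypothetical name. A real tool would rely on much larger character lists.

```python
# Hand-picked example pairs (assumption: illustrative only, far from complete).
# Each simplified character below has the traditional counterpart at the same index.
SIMPLIFIED_ONLY = frozenset("国汉语书学简门马")
TRADITIONAL_ONLY = frozenset("國漢語書學簡門馬")


def classify_script(text: str) -> str:
    """Return 'simplified', 'traditional', 'mixed', or 'unknown' for the text."""
    chars = set(text)
    has_simplified = bool(chars & SIMPLIFIED_ONLY)
    has_traditional = bool(chars & TRADITIONAL_ONLY)
    if has_simplified and has_traditional:
        return "mixed"
    if has_simplified:
        return "simplified"
    if has_traditional:
        return "traditional"
    # Text with only shared characters (or no Chinese at all) cannot be decided.
    return "unknown"
```

Text made up only of characters shared by both scripts comes back as "unknown", which is a built-in limit of this kind of check on short inputs.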

pemistahl added the enhancement and new feature labels Jul 19, 2022

pemistahl changed the title ~~[Feature Request] Distinguish between different variations of the same language~~ Distinguish between different variations of the same language Jul 19, 2022

pemistahl removed the new feature label Jul 19, 2022

BLKSerene closed this as not planned Aug 26, 2023

pemistahl reopened this Aug 26, 2023
