Distinguish between different variations of the same language #46

Open
BLKSerene opened this issue Jul 15, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@BLKSerene

Hi, I'm wondering whether lingua could distinguish between variations of the same language, for example Simplified Chinese and Traditional Chinese, or Norwegian Bokmål and Norwegian Nynorsk.
AFAIK, langdetect can distinguish between Simplified and Traditional Chinese, while the other alternatives can't.

@pemistahl
Owner

Hi @BLKSerene, thank you for your request.

The library already distinguishes between Bokmål and Nynorsk. As for Simplified and Traditional Chinese, I have not yet found suitable training corpora that consist solely of either Simplified or Traditional Chinese. Do you perhaps know a good source for those?

@pemistahl added the "enhancement" and "new feature" labels on Jul 19, 2022
@pemistahl changed the title from "[Feature Request] Distinguish between different variations of the same language" to "Distinguish between different variations of the same language" on Jul 19, 2022
@BLKSerene
Author

There are two UD Chinese corpora.
Simplified Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSDSimp
Traditional Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSD
What are the requirements for the training data, and for its license?

@pemistahl
Owner

Ah, those look suitable, thank you.

To create the language models, LanguageModelFilesWriter needs training data in plain text without any annotations. So I would first need a custom parser for the UD files. The license should allow the use of language models created from the training data.

@BLKSerene
Author

The conllu package should suffice for parsing UD corpora: https://github.com/EmilStenstrom/conllu
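
For anyone who wants to see roughly what "plain text without annotations" means here, below is a minimal sketch that uses only the Python standard library instead of conllu. It assumes each sentence in the CoNLL-U file carries a `# text = ...` comment line, as UD treebanks conventionally do. The function name `plain_sentences` and the sample data are made up for illustration and are not part of lingua or conllu.

```python
from typing import Iterable, Iterator


def plain_sentences(conllu_lines: Iterable[str]) -> Iterator[str]:
    """Yield the raw sentence text stored in `# text = ...` comment lines."""
    prefix = "# text ="
    for line in conllu_lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            # Drop the comment marker and keep only the sentence itself.
            yield line[len(prefix):].strip()


# Tiny hand-written sample in CoNLL-U layout (not taken from the UD corpora).
sample = (
    "# sent_id = demo-1\n"
    "# text = 我们学习。\n"
    "1\t我们\t我们\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\t学习\t学习\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t。\t。\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "# text = 這是書。\n"
    "1\t這\t這\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\t是\t是\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\t書\t書\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "4\t。\t。\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

sentences = list(plain_sentences(sample.splitlines()))
print("\n".join(sentences))  # one unannotated sentence per line
```

For a real treebank the same function could be fed an open file object, and the resulting lines written to a text file.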

@BLKSerene closed this as not planned on Aug 26, 2023
@pemistahl reopened this on Aug 26, 2023
@yanqianglu

+1 on the feature request 🙏

@yudelevi

yudelevi commented Jul 3, 2024

If it helps anyone: in the meantime, I've had some success identifying Traditional and Simplified Chinese with hanzidentifier, which is based on zhon.
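
To illustrate the general idea of telling the two scripts apart by their characters, here is a rough sketch. It does not use hanzidentifier's API: the two character sets are a tiny hand-picked sample (a real tool would rely on full character tables), and `guess_script` is a hypothetical helper name.

```python
# Hypothetical, hand-picked pairs of characters that differ between the scripts.
# A real tool would use complete character tables instead of eight pairs.
SIMPLIFIED_ONLY = set("们这国学说语书长")
TRADITIONAL_ONLY = set("們這國學說語書長")


def guess_script(text: str) -> str:
    """Return 'simplified', 'traditional', 'mixed', or 'unknown'."""
    has_simplified = any(ch in SIMPLIFIED_ONLY for ch in text)
    has_traditional = any(ch in TRADITIONAL_ONLY for ch in text)
    if has_simplified and has_traditional:
        return "mixed"
    if has_simplified:
        return "simplified"
    if has_traditional:
        return "traditional"
    # No distinguishing character found: the text may be non-Chinese, or it
    # may consist only of characters shared by both scripts.
    return "unknown"


print(guess_script("我们说汉语"))
print(guess_script("我們說漢語"))
```

With a table this small, most inputs made only of shared characters (for example 你好) come back as 'unknown', which is the main limitation of the approach.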


4 participants