Unbalanced Language Detection Confidence with Mixed French and English Text #236

Open
yenaing-oo opened this issue Aug 28, 2024 · 0 comments

When I run detection on text that contains nearly equal amounts of French and English, the result is consistently a confidence value of 1.0 for French and 0.0 for English.

Here’s a minimal example:

from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

text = "In today’s interconnected world, learning multiple languages has become more important than ever. With globalization, the ability to communicate across cultures is a significant advantage. Dans le monde interconnecté d'aujourd'hui, apprendre plusieurs langues est plus important que jamais. Avec la mondialisation, la capacité à communiquer à travers les cultures est un avantage considérable."

confidence_values = detector.compute_language_confidence_values(text)
for confidence in confidence_values:
    print(f"{confidence.language.name}: {confidence.value:.2f}")

Output:

FRENCH: 1.00
ENGLISH: 0.00

I attempted to reduce the influence of shared words between French and English by adding with_minimum_relative_distance:

detector = LanguageDetectorBuilder.from_languages(*languages).with_minimum_relative_distance(0.9).build()

However, the output remains the same:

FRENCH: 1.00
ENGLISH: 0.00

When I replace one of the French sentences with an additional English sentence, the confidence shifts entirely to English:

text = "In today’s interconnected world, learning multiple languages has become more important than ever. With globalization, the ability to communicate across cultures is a significant advantage. Being multilingual not only enhances communication but also opens up opportunities for personal and professional growth. Dans le monde interconnecté d'aujourd'hui, apprendre plusieurs langues est plus important que jamais."

Output:

ENGLISH: 1.00
FRENCH: 0.00

Is there a way to adjust the confidence values so that they better reflect the actual balance of languages in the text?
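
To make the desired behavior concrete, here is a minimal sketch of one possible workaround, not something the library is confirmed to do internally: split the text into sentences, classify each sentence on its own, and weight each language by the number of characters assigned to it. The helper `language_shares`, its `detect` callback, and the naive regex sentence splitter are all hypothetical names and assumptions introduced for illustration.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Optional


def language_shares(
    text: str, detect: Callable[[str], Optional[str]]
) -> Dict[str, float]:
    """Return each language's share of `text`, weighted by sentence length.

    `detect` is a hypothetical callback that maps one sentence to a language
    label, or None when it cannot decide.
    """
    # Naive split after '.', '!' or '?' followed by whitespace. This is an
    # assumption; a real sentence tokenizer would be more robust.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    weights: Dict[str, float] = defaultdict(float)
    for sentence in sentences:
        label = detect(sentence)
        if label is not None:
            # Weight by character count so longer sentences count for more.
            weights[label] += len(sentence)

    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()} if total else {}


# Possible wiring with the detector from the example above (assumes lingua is
# installed; detect_language_of may return None for undecidable input):
#
# def detect(sentence):
#     language = detector.detect_language_of(sentence)
#     return language.name if language is not None else None
#
# print(language_shares(text, detect))
```

With this approach the result depends on how well each individual sentence is classified, so short or ambiguous sentences may still be mislabeled.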
