I'm encountering an issue where detecting text that contains nearly equal amounts of French and English consistently yields a confidence value of 1.0 for French, while English receives 0.0 confidence.
Here’s a minimal example:
```python
from lingua import Language, LanguageDetectorBuilder

languages = [Language.ENGLISH, Language.FRENCH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

text = "In today’s interconnected world, learning multiple languages has become more important than ever. With globalization, the ability to communicate across cultures is a significant advantage. Dans le monde interconnecté d'aujourd'hui, apprendre plusieurs langues est plus important que jamais. Avec la mondialisation, la capacité à communiquer à travers les cultures est un avantage considérable."
confidence_values = detector.compute_language_confidence_values(text)
for confidence in confidence_values:
    print(f"{confidence.language.name}: {confidence.value:.2f}")
```
Output:

```
FRENCH: 1.00
ENGLISH: 0.00
```
I attempted to reduce the influence of shared words between French and English by adding with_minimum_relative_distance, but the output remained the same.
When I remove one of the French sentences, the confidence shifts entirely to English:
```python
text = "In today’s interconnected world, learning multiple languages has become more important than ever. With globalization, the ability to communicate across cultures is a significant advantage. Being multilingual not only enhances communication but also opens up opportunities for personal and professional growth. Dans le monde interconnecté d'aujourd'hui, apprendre plusieurs langues est plus important que jamais."
```
Output:

```
ENGLISH: 1.00
FRENCH: 0.00
```
Is there a way to adjust the confidence values so that they better reflect the actual balance of languages in the text?