Unbalanced Language Detection Confidence with Mixed French and English Text #236

yenaing-oo · 2024-08-28T18:14:32Z

I'm encountering an issue where detecting text that contains nearly equal amounts of French and English consistently yields a confidence value of 1.0 for French, while English receives 0.0 confidence.

Here’s a minimal example:

from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

text = "In today’s interconnected world, learning multiple languages has become more important than ever. With globalization, the ability to communicate across cultures is a significant advantage. Dans le monde interconnecté d'aujourd'hui, apprendre plusieurs langues est plus important que jamais. Avec la mondialisation, la capacité à communiquer à travers les cultures est un avantage considérable."

confidence_values = detector.compute_language_confidence_values(text)
for confidence in confidence_values:
    print(f"{confidence.language.name}: {confidence.value:.2f}")

Output:

FRENCH: 1.00
ENGLISH: 0.00

I tried to reduce the influence of words shared between French and English by setting with_minimum_relative_distance(0.9) on the builder:

detector = LanguageDetectorBuilder.from_languages(*languages).with_minimum_relative_distance(0.9).build()

However, the output remains the same:

FRENCH: 1.00
ENGLISH: 0.00
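
One plausible explanation, stated here as an assumption rather than a confirmed description of Lingua's internals: if the confidence values are obtained by normalizing per-language scores that are summed over all n-grams of the text, a small per-n-gram advantage grows with text length until the normalized value saturates at 1.0. The sketch below uses made-up scores (the `avg` numbers and the `normalize` helper are hypothetical, not taken from Lingua) to show the effect. It would also be consistent with with_minimum_relative_distance having no visible effect here, since that setting is documented as a threshold for when detect_language_of returns a result, not as a rescaling of the confidence values.

```python
import math


def normalize(log_scores):
    """Turn summed log-probabilities into values that sum to 1 (a softmax)."""
    peak = max(log_scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(score - peak) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


# Hypothetical average log-probability per n-gram; FRENCH is only slightly ahead.
avg = {"FRENCH": -4.00, "ENGLISH": -4.05}

for ngram_count in (10, 100, 1000):
    # Summing over more n-grams widens the absolute gap between the two scores.
    summed = {lang: a * ngram_count for lang, a in avg.items()}
    values = normalize(summed)
    print(ngram_count, {lang: round(v, 2) for lang, v in values.items()})
```

With only a few n-grams the two values stay close together; with a paragraph-sized input the leading language takes essentially all of the probability mass, which matches the 1.00 / 0.00 split above.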

When I replace one of the French sentences with an additional English sentence, the confidence shifts entirely to English:

text = "In today’s interconnected world, learning multiple languages has become more important than ever. With globalization, the ability to communicate across cultures is a significant advantage. Being multilingual not only enhances communication but also opens up opportunities for personal and professional growth. Dans le monde interconnecté d'aujourd'hui, apprendre plusieurs langues est plus important que jamais."

Output:

ENGLISH: 1.00
FRENCH: 0.00

Is there a way to adjust the confidence values so that they better reflect the actual balance of languages in the text?
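
A possible workaround, sketched under assumptions rather than offered as Lingua's recommended approach: classify the text sentence by sentence and report each language's share of the characters. The helper name `language_shares`, the regex-based sentence splitter, and the `detect` callback are all hypothetical; `detect` can be any function that maps a string to a language label or to None.

```python
import re
from collections import defaultdict


def language_shares(text, detect):
    """Estimate each language's share of `text`, weighted by character count.

    `detect` is any callable mapping a string to a language label,
    or to None when it cannot decide.
    """
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chars = defaultdict(int)
    for sentence in sentences:
        label = detect(sentence)
        if label is not None:  # undecided sentences are left out of the total
            chars[label] += len(sentence)
    total = sum(chars.values())
    return {label: n / total for label, n in chars.items()} if total else {}
```

With the detector from the example above, this could be called as `language_shares(text, lambda s: getattr(detector.detect_language_of(s), "name", None))`. Recent Lingua versions also provide detect_multiple_languages_of, which returns the detected language for each contiguous section of the text together with start and end indices, and may give a similar breakdown without a hand-written splitter.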
