Strange behaviour of LatinBackoffLemmatizer with plural nouns of the second declension #1198

Open
DavideMassidda opened this issue Jan 17, 2023 · 4 comments

DavideMassidda commented Jan 17, 2023

When lemmatizing plural forms of second-declension Latin nouns, the LatinBackoffLemmatizer sometimes appends a trailing digit to the lemma.

I observed this strange behaviour with the term "lupus":

from cltk.lemmatize.lat import LatinBackoffLemmatizer
lemmatizer = LatinBackoffLemmatizer()

lupus = ['lupi','luporum','lupis','lupos','lupi','lupis']

lemmatizer.lemmatize(lupus)
[('lupi', 'lupus'), ('luporum', 'lupus1'), ('lupis', 'lupus1'), ('lupos', 'lupus1'), ('lupi', 'lupus'), ('lupis', 'lupus1')]

On the other hand, the term "amicus" does not present this bug:

amicus = ['amici','amicorum','amicis','amicos','amici','amicis']

lemmatizer.lemmatize(amicus)
[('amici', 'amicus'), ('amicorum', 'amicus'), ('amicis', 'amicus'), ('amicos', 'amicus'), ('amici', 'amicus'), ('amicis', 'amicus')]

I guess the fault lies with the DictLemmatizer:

lemmatizer = LatinBackoffLemmatizer(verbose=True)
lemmatizer.lemmatize(lupus)
[('lupi', 'lupus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('luporum', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'), ('lupis', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'), ('lupos', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'), ('lupi', 'lupus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('lupis', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>')]
lemmatizer.lemmatize(amicus)
[('amici', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amicorum', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amicis', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amicos', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amici', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amicis', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>')]
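
To make the pattern easier to see, here is a minimal sketch that groups the verbose output by the backoff stage that produced each lemma. It only post-processes the triples printed above (copied in as plain data) and does not call cltk; the helper name `group_by_source` is hypothetical and not part of the library.

```python
from collections import defaultdict

# Verbose output copied from the lemmatize(lupus) call above.
verbose_result = [
    ('lupi', 'lupus', '<UnigramLemmatizer: CLTK Sentence Training Data>'),
    ('luporum', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'),
    ('lupis', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'),
    ('lupos', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'),
    ('lupi', 'lupus', '<UnigramLemmatizer: CLTK Sentence Training Data>'),
    ('lupis', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'),
]


def group_by_source(triples):
    """Map each backoff stage to the (token, lemma) pairs it produced.

    Hypothetical helper for inspecting verbose output; not a cltk function.
    """
    groups = defaultdict(list)
    for token, lemma, source in triples:
        groups[source].append((token, lemma))
    return dict(groups)


by_source = group_by_source(verbose_result)
for source, pairs in by_source.items():
    print(source, pairs)
```

In this sample, every lemma with a trailing digit comes from the DictLemmatizer stage, while the forms resolved by the UnigramLemmatizer have none.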

Environment: Windows 10, Python 3.9.15, cltk 1.1.6

DavideMassidda changed the title from "Strange behavior of LatinBackoffLemmatizer with plural nouns of the second declension" to "Strange behaviour of LatinBackoffLemmatizer with plural nouns of the second declension" on Jan 17, 2023

clemsciences (Member) commented

Different lemmas can share an identical form. For example, jus is the form of one lemma meaning "law" or "right" and of another lemma meaning "gravy" or "juice". To distinguish them, ambiguous lemmas get a trailing number; here they would be jus1 and jus2.
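
If the disambiguation index is not needed downstream, it can be separated from the lemma after the fact. The following is a minimal sketch, assuming the index is always a run of digits at the very end of the lemma string, as in the outputs quoted in this thread. The names `split_lemma` and `strip_indices` are hypothetical helpers, not part of cltk.

```python
import re

# Assumption: a homograph index is one or more digits at the end of the
# lemma, preceded by at least one non-digit character (e.g. "lupus1").
TRAILING_INDEX = re.compile(r"^(?P<base>.*\D)(?P<index>\d+)$")


def split_lemma(lemma):
    """Split a lemma into (base, index); index is None if there is none."""
    match = TRAILING_INDEX.match(lemma)
    if match is None:
        return lemma, None
    return match.group("base"), int(match.group("index"))


def strip_indices(pairs):
    """Return (token, lemma) pairs with any trailing index removed."""
    return [(token, split_lemma(lemma)[0]) for token, lemma in pairs]


# Sample (token, lemma) pairs taken from the output reported in this issue.
sample = [('lupi', 'lupus'), ('luporum', 'lupus1'), ('lupis', 'lupus1')]
print(strip_indices(sample))
print(split_lemma('jus2'))
```

Note that stripping the index discards the distinction between homographs such as jus1 and jus2, so it is only appropriate when that distinction does not matter for the task at hand.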

As far as I know, the rule-based lemmatizer is this one: https://github.com/cltk/lat_models_cltk/blob/master/lemmata/latin_lemmata_cltk.py

clemsciences (Member) commented

@diyclassics can probably give you more details on how to tell which meaning is attached to which numbered lemma.

DavideMassidda (Author) commented

Thank you very much, Clément! So this isn't a bug but a deliberate choice: the trailing number is used to disambiguate lemmas. Good to know!

clemsciences removed the bug label on Jan 21, 2023

clemsciences (Member) commented

This is not a bug, but it should be better documented.
