Wikidata:Wikidata Lexeme Forms

Lexeme Form through which nanosized (L601984) was created.
Video introduction to the tool (recorded for Lexicodays 2024)

Wikidata Lexeme Forms is a tool to create a Lexeme with a set of Forms, e. g. the declensions of a noun or the conjugations of a verb, or to edit the Forms of an existing Lexeme.

Usage

You select a template on the index page (e. g. “English noun”), fill in the Forms based on the example sentences (the first Form will become the lemma), and then submit the page to create the Lexeme (to which you will be redirected).

If a Lexeme with the same lemma already exists, you will be warned of the duplicate and can decide whether you want to go ahead or not.

Finding words to add

There is no list of Lexemes that need to be added. The tool does not suggest any. You can randomly guess missing Lexemes, or you can search the internet or books for word lists, and see how many of those need to be added.

If you guess a lemma that matches an existing Lexeme, the tool will warn you not to create a duplicate. The “advanced” mode will let you specify which pre-existing Lexeme to add the Forms to.

Multiple variants

In some languages, there can be multiple Forms with the same grammatical features. For example, some German words have two genitive or dative singular Forms („des Hunds/Hundes“, „dem Kind/Kinde“). To create several Forms for them, you can specify the different variants, separated by slashes.

This should not be confused with multiple spelling variants of the same Form, e.g., “color/colour” in English. Those should be added as additional representations of the same Form, with different language codes indicating where the spelling is used, as seen on colour/colour/color (L1347); you can do this using the ?language_code parameter of § Edit mode, see below.

Gadget

There is a gadget to automatically add links to this tool on Lexeme pages, in case the Lexeme’s Forms are incomplete. You can enable it in the gadget preferences:

Lexeme Forms

When on a lexeme page, this gadget checks whether Wikidata Lexeme Forms has a matching template. If it does, links to edit the lexeme using the tool are added to the sidebar.

Note: This gadget depends on website lexeme-forms.toolforge.org.

The gadget will automatically determine which template(s) best match the current Lexeme, based on its language, lexical category, and statements, and link them in the “Lexicographical data” sidebar section. This is not always unambiguous – when it suggests more than one template, you’ll have to decide which one matches the current Lexeme best. (Even when it only suggests one template, you should still make sure it’s correct, of course.) When it doesn’t suggest any templates at all, the tool might not support this kind of Lexeme yet – see § Language support for how you can add support for them.

Bulk mode

In bulk mode, you can create many Lexemes from the same template at once. You specify the Form representations in a single text field, where each line creates one Lexeme and the Forms of that Lexeme are separated by vertical pipe (|) or Tab characters. (Many spreadsheet and similar programs separate columns by tabs when copying into plain text, so you can prepare your Lexemes there and then directly paste them into bulk mode.) Lines may also begin with a Lexeme ID, also separated from the Forms by pipes or tabs, to add (some) Forms to existing Lexemes.

As in other modes, you can separate multiple variants of a Form with slashes. As in advanced mode, you can leave Form representations blank (i.e., have two or more consecutive separators, e.g., A||C) to skip creating those Forms. Lexemes that look like potential duplicates are unconditionally skipped; if you still want to create them, do it in one of the other modes, where you can confirm that you’re aware of the potential duplicate and still want to go ahead.

After submitting the form, you will be shown the URLs of all the newly created Lexemes, or warnings for Lexemes that were skipped as duplicates.
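
If you prepare your Lexemes in a script rather than a spreadsheet, the line format described above can be produced with a few lines of Python. This is only a sketch of the text format; the function names bulk_line and bulk_text are made up for this example and are not part of the tool.

```python
def bulk_line(forms, lexeme_id=None, separator="|"):
    """Build one bulk-mode line (hypothetical helper, not part of the tool).

    forms: one entry per template Form, in template order. Each entry is
    None or "" (skip this Form), a string (one representation), or a list
    of strings (multiple variants, joined with slashes).
    lexeme_id: optional ID of an existing Lexeme to add the Forms to.
    """
    cells = []
    if lexeme_id is not None:
        cells.append(lexeme_id)
    for form in forms:
        if not form:
            cells.append("")  # blank cell: this Form is skipped
        elif isinstance(form, str):
            cells.append(form)
        else:
            cells.append("/".join(form))
    return separator.join(cells)


def bulk_text(rows, separator="|"):
    """Join several (forms, lexeme_id) rows into the text field content."""
    return "\n".join(
        bulk_line(forms, lexeme_id, separator) for forms, lexeme_id in rows
    )


print(bulk_line(["musher", "mushers"]))   # musher|mushers
print(bulk_line(["A", None, "C"]))        # A||C
```

Pass separator="\t" if you prefer tabs, which is also what you get when copying cells from most spreadsheet programs.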

Bulk mode is currently restricted to users who are autoconfirmed.

Edit mode

In edit mode, you edit the Forms of a particular Lexeme, specified in the URL: for example, to edit example (L2237) using the english-noun template, go to the URL https://lexeme-forms.toolforge.org/template/english-noun/edit/L2237. The tool will try to match the Lexeme’s existing Forms to the Forms in the template and sort them into the input fields accordingly. By editing the contents of the input fields, you can add, edit, or remove Forms: as usual, a slash can be used to separate multiple Forms with the same grammatical features, see § Multiple variants. If any Forms cannot be matched to a single template Form, they are listed at the top; you can drag’n’drop them into an input field to manually match them to a template Form, and grammatical features and statements will be added as needed.

You can also edit form representations in a different language code than the “main” code of the template, by adding a ?language_code parameter to the URL; for example, https://lexeme-forms.toolforge.org/template/english-noun/edit/L1347?language_code=en-gb will edit the en-gb form representations of colour/colour/color (L1347) instead of the usual en ones. The “main” form representations are shown as the placeholder (for instance, if in the previous URL you remove colour from the first input, you should see the placeholder color); when there are § Multiple variants, make sure to enter the form representations in the same order as in the “main” language, so that the form representations all match up.
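
For scripts that link to edit mode, the two URL shapes above can be built like this. It is a minimal sketch: edit_url is a name invented for this example, and only the URL pattern itself comes from this page.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://lexeme-forms.toolforge.org"


def edit_url(template_name, lexeme_id, language_code=None):
    """Build an edit-mode URL (hypothetical helper, not part of the tool)."""
    url = f"{BASE_URL}/template/{quote(template_name)}/edit/{quote(lexeme_id)}"
    if language_code is not None:
        # Optional: edit the representations in a non-"main" language code.
        url += "?" + urlencode({"language_code": language_code})
    return url


print(edit_url("english-noun", "L2237"))
print(edit_url("english-noun", "L1347", language_code="en-gb"))
```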

Wikifunctions support

The tool features experimental support for generating forms using Wikifunctions (Q104587954). If you enable it in the tool’s settings, then supported templates will show one or more buttons next to the heading. Clicking one of those buttons will generate forms based on the first form (the lemma) and put the results into any form fields that are still empty. Please give feedback on how this feature works for you so we can evaluate it and decide when to make it available for everyone! To add support to a template, add the functions to the wiki page (see /English, /French or /Croatian for examples), then ping Lucas Werkmeister on the talk page.

Language support

To start adding support for a new language, enter the language name in English here, follow the instructions, and then {{Ping}} Lucas Werkmeister on the talk page:


To add a new template for an already supported language, go to the subpage for that language and start with the inputbox there. The following languages are currently supported:

Additionally, the following languages have translations but no templates yet, or their templates still need some more work before they can be added, or Lucas Werkmeister just hasn’t found the time to add them yet:

Please only add templates for languages you speak yourself, and speak well – there isn’t yet any tool that can be used to automatically migrate a large set of Forms to a different data model (e. g. replace a certain grammatical feature item ID with another one across all Lexemes in a certain language), so we should try to get this right from the start.

There are also instructions for transcribing these templates, though you shouldn’t need to worry about that part (that’s Lucas Werkmeister’s responsibility).

Monitoring

You can see edits made using this tool on the recent changes list. Updates to the tool are usually logged on Wikitech.

For edits made before the tool switched OAuth consumers (due to T286414), use this recent changes link instead.

Known issues

Lexemes you create using this tool will not be added to your watchlist, even if you have the Add pages I create and files I upload to my watchlist setting enabled. (I have not tested this, but I assume that for the same reason, Lexemes you edit using this tool will not be added to your watchlist, even if you have the Add pages and files I edit to my watchlist setting enabled.)

Programmatic usage

The tool can also be used programmatically, e. g. by other tools or external code. Please don’t flood the tool with requests, though.

No promises as to the stability of any API are made, but breaking changes will most likely increase the API version number at the beginning of the path, and in that case the old path will likely be changed to return HTTP 410 Gone.

Duplicates API

To search for duplicates of a potential new Lexeme by its lemma (or, equivalently, to search for existing Lexemes by lemma), send a GET request to https://lexeme-forms.toolforge.org/api/v1/duplicates/www/language-code/lemma, where language-code is a language code like en or de-at and lemma is the lemma you’re looking for (which may contain slashes, if necessary). To search test.wikidata.org, replace the www with test.

The response is either a JSON array with objects for the search results, where each object has id, label, description and uri members, or HTTP 204 No Content if there are no results.

You must specify a header Accept: application/json when sending requests to this API, otherwise the results may be returned in an HTML format that’s specific to this tool and not useful outside of it. (Note that the curl command-line tool sends Accept: */* by default, which means you get HTML back if you don’t explicitly specify a different Accept header.)
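
A request to this API could look roughly like the following, using only the Python standard library. The function names and the User-Agent string are placeholders invented for this example. Percent-encoding the lemma while leaving slashes alone is an assumption based on the note above that lemmas may contain slashes.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://lexeme-forms.toolforge.org"


def duplicates_request(language_code, lemma, wiki="www"):
    """Prepare (but do not send) a GET request to the duplicates API."""
    path = (
        f"/api/v1/duplicates/{wiki}/"
        f"{quote(language_code)}/{quote(lemma, safe='/')}"
    )
    return urllib.request.Request(
        BASE_URL + path,
        headers={
            # Without this header, the tool may answer with HTML.
            "Accept": "application/json",
            # Placeholder: replace with something identifying your script.
            "User-Agent": "example-duplicates-check/0.1",
        },
    )


def find_duplicates(language_code, lemma, wiki="www"):
    """Return the list of result objects, or [] on HTTP 204 No Content."""
    request = duplicates_request(language_code, lemma, wiki)
    with urllib.request.urlopen(request) as response:
        if response.status == 204:
            return []
        return json.load(response)


if __name__ == "__main__":
    for result in find_duplicates("en", "musher"):
        print(result["id"], result["label"], result["uri"])
```

Use wiki="test" to search test.wikidata.org instead.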

Matching API

To match a Lexeme against all the templates the tool knows, send a GET request to https://lexeme-forms.toolforge.org/api/v1/match_template_to_lexeme/www/lexeme-id, where lexeme-id is the ID of the Lexeme (e. g. L123). To match a Lexeme against just one template, append it to the URL, i. e. https://lexeme-forms.toolforge.org/api/v1/match_template_to_lexeme/www/lexeme-id/template-name. To use test.wikidata.org Lexemes and templates, replace the www with test.

The response is a JSON object; for the first version, it maps template names to match objects, whereas the second version returns a single match object directly. Match objects have the following structure:

{
  "language": true,
  "lexical_category": false,
  "matched_statements": {},
  "missing_statements": {},
  "conflicting_statements": {}
}

"language" and "lexical_category" indicate whether the Lexeme matches the template’s language and lexical category, respectively. "matched_statements", "missing_statements" and "conflicting_statements" are statement lists, with the same format as the "claims" in the Wikibase JSON data model, containing statements in the Lexeme that match statements expected by the template, statements in the template that don’t match any statements in the Lexeme, and statements in the Lexeme that conflict with statements expected by the template. (Whether an extra Lexeme statement is considered a conflict or not currently depends on the property used: extra instance of (P31) are fine, but extra values on most other properties, e.g. grammatical gender (P5185) or transitivity (P9295), are not.)

This API is used by the gadget described in § Gadget.
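
Once you have the response of the first version (a map from template names to match objects), you might want to sort the templates by how well they fit. The sketch below shows one possible ordering. It is a heuristic made up for this example and not necessarily what the gadget does; the function names are likewise hypothetical. It assumes the statement lists are maps from property ID to a list of statements, as in the Wikibase JSON "claims" format.

```python
def count_statements(statements):
    """Count statements in a claims-style map (property ID -> list)."""
    return sum(len(group) for group in statements.values())


def is_clean_match(match):
    """True if language and lexical category match and nothing is
    missing or conflicting."""
    return bool(
        match["language"]
        and match["lexical_category"]
        and not match["missing_statements"]
        and not match["conflicting_statements"]
    )


def rank_templates(matches):
    """Sort template names from best to worst fit (example heuristic)."""
    def sort_key(name):
        match = matches[name]
        return (
            not match["language"],           # matching language first
            not match["lexical_category"],   # then matching category
            count_statements(match["conflicting_statements"]),
            count_statements(match["missing_statements"]),
            name,                            # stable tie-breaker
        )
    return sorted(matches, key=sort_key)
```

Even for a clean match, it is still worth checking by hand that the template is the right one.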

Templates API

To get the templates which the tool uses, send a GET request to https://lexeme-forms.toolforge.org/api/v1/template/template-name, where template-name is the name of a single template to return, or omit template-name altogether to get all templates at once. The response is a JSON object, either a single template or a map from template names to templates. Template redirects are represented as HTTP redirects in the former case and as string values (name of the target template) in the latter; ambiguous former template names (such as portuguese-adjective, which was split into portuguese-adjective-biform and portuguese-adjective-uniform) are represented as lists of replacement templates in the former case and lists of those templates’ names in the latter case.
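
When working with the map of all templates, each value can therefore be a template object, a string (redirect), or a list of names (ambiguous former template). A small helper along these lines can follow redirects and expand ambiguous names; resolve_template is a name invented for this example, and the depth limit is just a guard against unexpected loops.

```python
def resolve_template(templates, name, _depth=0):
    """Return the list of template objects that `name` currently stands for.

    templates: the JSON object returned when requesting all templates.
    """
    if _depth > 10:
        raise ValueError(f"too many redirects while resolving {name!r}")
    entry = templates[name]
    if isinstance(entry, str):
        # Redirect: the value is the name of the target template.
        return resolve_template(templates, entry, _depth + 1)
    if isinstance(entry, list):
        # Ambiguous former name: the value lists the replacement names.
        resolved = []
        for target in entry:
            resolved.extend(resolve_template(templates, target, _depth + 1))
        return resolved
    # Regular template object.
    return [entry]
```

For a regular template this returns a one-element list, so callers can treat all three cases the same way.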

Note that most templates were contributed by Wikidata users on wiki pages, and they were not asked to license them under any special license, so the templates are published under the same license as the non-structured data of Wikidata, CC BY-SA 4.0. The "@attribution" member of each template object contains the names of the "users" who contributed to the template, as well as the "title" of the Wikidata page where the full history may be seen.

Automatically generating Forms

You can pre-populate the form shown on the page by specifying form parameters in the URL. Form representations can be given as a form_representation parameter (which usually occurs more than once). You can also specify where the pre-populated data comes from in a generated_via parameter, which is included in the edit summary (i. e. you can use [[these links]] but not [these ones]). This works both in “regular” mode (to create a new lexeme) and in “edit” mode (to edit an existing lexeme); in “edit” mode, the URL parameters are only used for forms that don’t exist yet.

For example, the following URL was used to create the Lexeme musher (L42850) with just a single button press:

https://lexeme-forms.toolforge.org/template/english-noun/?form_representation=musher&form_representation=mushers&generated_via=manual+input

The user can still verify all the Forms and check that they are correct before submitting the page, and the tool will also ask them to confirm the new Lexeme is not a duplicate (as will be the case if you load the above URL now, since musher (L42850) was already created).

You can use this feature to write tools or user scripts for automatically generating Forms: you only have to take care of the Form generation itself, while Wikidata Lexeme Forms handles OAuth, grammatical feature item IDs and duplicate detection. The user still has an opportunity to correct any problems with the auto-generated content before it is added to Wikidata.
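
A generator script mostly needs to build such a URL and hand it to the user. The following sketch shows one way to do that with the standard library; prefill_url is a name invented for this example, and the form generation itself is left to you.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://lexeme-forms.toolforge.org"


def prefill_url(template_name, form_representations, generated_via=None):
    """Build a URL that opens the template with pre-populated Forms.

    form_representations: representations in the order of the template's
    input fields; each becomes one form_representation parameter.
    """
    params = [("form_representation", r) for r in form_representations]
    if generated_via is not None:
        params.append(("generated_via", generated_via))
    # urlencode percent-encodes the values and writes spaces as "+".
    return f"{BASE_URL}/template/{quote(template_name)}/?" + urlencode(params)


print(prefill_url("english-noun", ["musher", "mushers"], "manual input"))
```

The script only opens the page; nothing is saved to Wikidata until the user submits it.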

If you are confident that your generated forms are correct, you can also use § Bulk mode to create or update many lexemes at once.

Passing data through the tool

The target_hash URL parameter can be used to pass data through the tool: with ...&target_hash=something in the URL that was used to load the tool, #something will be appended to the URL to which you are redirected after making an edit. This is useful in combination with § Automatically generating Forms: you can automatically generate not just the forms, but also any other data, and while the target_hash won’t do anything within the tool itself, you can then read it and do things with it on the page of the generated lexeme, using a user script. (Like form_representation, target_hash works both in “regular” and “edit” mode.)
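
As a rough illustration, the round trip could look like this in Python: one helper adds target_hash to a tool URL, the other reads the fragment back from the URL you end up on. Both function names are made up for this example, and the redirect URL in the usage lines is only an assumed shape. How the tool encodes special characters in the fragment is not described here, so it is probably safest to stick to simple URL-safe values.

```python
from urllib.parse import unquote, urlencode, urlsplit


def with_target_hash(tool_url, data):
    """Append a target_hash parameter to a tool URL."""
    separator = "&" if "?" in tool_url else "?"
    return tool_url + separator + urlencode({"target_hash": data})


def read_target_hash(redirect_url):
    """Return the fragment (the part after #) of the redirect URL."""
    return unquote(urlsplit(redirect_url).fragment)


url = with_target_hash(
    "https://lexeme-forms.toolforge.org/template/english-noun/"
    "?form_representation=musher",
    "something",
)
print(url)
# Assumed shape of the URL after the redirect, for illustration only:
print(read_target_hash("https://www.wikidata.org/wiki/Lexeme:L42850#something"))
```

In practice, the reading side would usually live in a user script on the lexeme page rather than in Python.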