
[Research Area] Determine a strategy for controlled vocabularies in Toolhub
Closed, Resolved, Public

Description

Research Area: Controlled Vocabulary

Tool discoverability is currently hindered by the absence of a standard system for defining tools, their functionality, their audiences, etc. The current approach used by tool directories involves freeform keywords, which are i18n-unfriendly and lead to keyword duplication. (It turns out that software directories in general, including Packagist and npm, work this way, which is not exactly inspiring.)

The challenge is to create a system of standardized vocabulary, maintained by the community, that can be used to describe tools. Where possible, we should re-use existing software/platforms and re-use existing vocabularies, rather than invent new ones. One idea was to use Wikidata identifiers, which may work for most terms but not all. Creating a totally novel system risks duplication of effort.

Another design challenge would be to create a system that would be easy to use in the context of manually updating a JSON toolinfo file, if this is a use case we want to support. What would be the language-neutral "tokens" used?
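
One way to picture such tokens is the minimal sketch below. It assumes a keyword is either a Wikidata item ID (a "Q" followed by digits) or a short lowercase slug for concepts with no Wikidata item. The slug format, the `keyword_tokens` field, and the `classify_token` helper are all hypothetical and are not part of the toolinfo standard.

```python
import re

# Wikidata item IDs are "Q" followed by a positive integer, e.g. "Q42".
WIKIDATA_ID = re.compile(r"Q[1-9][0-9]*")
# Hypothetical format for tokens with no Wikidata item: a lowercase ASCII slug.
LOCAL_TOKEN = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")


def classify_token(token: str) -> str:
    """Return "wikidata", "local", or "invalid" for a keyword token."""
    if WIKIDATA_ID.fullmatch(token):
        return "wikidata"
    if LOCAL_TOKEN.fullmatch(token):
        return "local"
    return "invalid"


# Hypothetical toolinfo-style record; "keyword_tokens" is an invented field.
toolinfo = {
    "name": "example-tool",
    "description": "An example tool.",
    "keyword_tokens": ["Q42", "citation-cleanup", "Free Text!"],
}

classified = {t: classify_token(t) for t in toolinfo["keyword_tokens"]}
```

Both token shapes can be typed by hand in a JSON file and neither depends on a particular language. The cost is that a bare Q number is opaque to the person editing the file.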

If we do not use an existing system, we need to figure out where to store this new system, in such a way that the content can be updated if needed.

Different options we can consider:

  • Use Wikidata as the backend
    • Advantages: Reasonably well understood; very flexible and accommodating; many terms are already included and translated
    • Disadvantages: Concepts people may want to use to describe their tool may not be considered notable by Wikidata standards. I am researching the extent to which this is going to be a problem.
  • Create our own Wikibase to use as the backend
    • Advantages: Maintains compatibility with Wikidata while allowing us to expand beyond its usual limits
    • Disadvantages: Wouldn't be as well known (Q numbers would not be Wikidata Q numbers); requires significant effort to maintain the Wikibase software and an active community to maintain the content; could potentially duplicate the effort of Wikidata
  • Create a project on Translatewiki.net
    • Advantages: The platform is designed for i18n which is what we're trying to do here; it has some familiarity within our community; we could tap into an existing volunteer base
    • Disadvantages: Not a usual platform for many Wikimedians; could complicate updating the standard over time
  • Have a page on Meta translated with the Translate extension
    • Advantages: A workflow that certain community members would be familiar with; the content would "live" on a wiki with SUL rather than somewhere else
    • Disadvantages: The Translate extension

The problem is resolved when:

  • We know what vocabularies are going to be used and for what purposes (e.g. relevant audiences, tool purpose)
  • We determine which existing vocabulary/vocabularies to use
  • For novel vocabularies, we determine where the data will live and how this data gets updated, including technical and social considerations.
  • A decision is made as to whether we can use user-friendly, language-neutral tokens to describe concepts in this controlled vocabulary for the purposes of toolinfo files, or whether we will keep information described via controlled vocabulary out of the toolinfo standard.

Event Timeline

Harej renamed this task from Determine a strategy for controlled vocabularies in the tools catalog to [Research Area] Determine a strategy for controlled vocabularies in the tools catalog. Feb 7 2018, 1:55 AM
Harej updated the task description.

Straw proposal:

  • Keywords will no longer be included in toolinfo files. For initial discovery of these tools, the text of the tool description (and the tool name) is adequate. Keywords are then appended later.
  • The existing keyword field will be used as plain text as a supplement to tool descriptions. Over time it will be phased out to avoid confusion.
  • Volunteers append keywords via UI. User enters text and is recommended Wikidata items to use.
  • Satisfactory result found? Then the Wikidata item is used. Otherwise? User can create a novel keyword token.
  • Novel keyword tokens are part of the broader application i18n system, and would be translated via translatewiki.
  • We could use this same system for controlled vocabularies where we want users to only select from a pre-specified list of options. (Of course, the list can be changed if needed.)
  • If a novel token ends up overlapping with a Wikidata item (i.e. a Wikidata item is created later) then a backend process can be used to consolidate the two. In the long term we may want to create an interface.
  • In the long term, we could use the text of tool descriptions to recommend keyword tags and other pieces of metadata.
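
The keyword flow in this proposal could be sketched roughly as follows. This is an illustration under stated assumptions, not a design: the `Keyword` type, the `keyword-` message-key prefix, and the `search`/`choose` callables are hypothetical. The search step is passed in as a function, so the sketch does not depend on any particular Wikidata API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Keyword:
    kind: str  # "wikidata" or "novel"
    ref: str   # Wikidata item ID, or a hypothetical i18n message key


# Candidates are (item_id, label) pairs returned by some search backend.
Candidates = List[Tuple[str, str]]
SearchFn = Callable[[str], Candidates]
# The chooser stands in for the UI: it returns the picked item ID, or None.
ChooseFn = Callable[[Candidates], Optional[str]]


def slugify(text: str) -> str:
    """Lowercase the text and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def resolve_keyword(text: str, search: SearchFn, choose: ChooseFn) -> Keyword:
    """Use a Wikidata item if the user accepts one, else mint a novel token."""
    candidates = search(text)
    chosen = choose(candidates) if candidates else None
    if chosen is not None:
        return Keyword("wikidata", chosen)
    # Assumption: novel tokens become message keys that get translated later.
    return Keyword("novel", "keyword-" + slugify(text))


def consolidate(
    tool_keywords: Dict[str, List[Keyword]], novel_ref: str, item_id: str
) -> Dict[str, List[Keyword]]:
    """Replace a novel token with a Wikidata item everywhere, dropping duplicates."""
    replacement = Keyword("wikidata", item_id)
    merged: Dict[str, List[Keyword]] = {}
    for tool, keywords in tool_keywords.items():
        out: List[Keyword] = []
        for kw in keywords:
            if kw.kind == "novel" and kw.ref == novel_ref:
                kw = replacement
            if kw not in out:
                out.append(kw)
        merged[tool] = out
    return merged
```

Here `consolidate` plays the role of the backend process mentioned above. It returns a new mapping instead of editing in place, so a run can be reviewed before it is saved.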

Next questions:

  • What metadata fields are going to rely on Wikidata fields / controlled vocabularies? (Depends on what metadata fields we want to use.)
  • For controlled vocabularies, what will the process look like for updates? How do we co-ordinate with other tool developers if we decide to phase out a given term in favor of other ones?
Harej renamed this task from [Research Area] Determine a strategy for controlled vocabularies in the tools catalog to [Research Area] Determine a strategy for controlled vocabularies in Toolhub. Mar 29 2018, 12:39 AM