
Represent Statement and Reference URIs as Skolem IRIs consistent with RFC5785
Closed, Declined · Public

Description

Note: this relates more to my localized use of Wikibase RDF serialization than to the Wikidata Query Service directly, though it may also be relevant to the WDQS.

It is my opinion that the RDF representation of statement and reference URIs should follow the W3C recommendation on skolemization, which relies on the well-known URI mechanism defined in RFC5785, so that other libraries (JSON-LD libraries, for example) can recognize them as Skolem IRIs (that is, uniquely minted identifiers).

One possible scenario is that a JSON-LD consumer may want to frame an entity, and this would require it to make the statements and references into bnodes so that their values can be formatted as sets or lists. Having these types of URIs in the .well-known namespace simplifies the parsing task.

This seems relatively trivial to do. I have already made experimental changes in my instance that touch these files: https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/repo/includes/Rdf/RdfVocabulary.php
and
https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/repo/includes/Rdf/FullStatementRdfBuilder.php

to produce the intended output attached.

Skolemization
RFC5785

Event Timeline

Restricted Application added a subscriber: Aklapper.

I'm not sure how skolemization applies to statements: as I understand it, skolemization is applied to bnodes, which have no identity of their own, whereas statements do have their own IDs. With references and values, we are essentially already doing something like skolemization, just without /.well-known/ URIs. I am not sure how useful it would be to add the /.well-known/ prefix. Are there tools that rely on it?

A JSON-LD framing pipeline (like the one I have developed for Fedora Commons) is one use case. The JSON-LD library does not understand Skolem IRIs, so these types of identifier URIs would need to be converted into bnodes first. A Skolem IRI parsing function can be generic only if identifiers are recognizable in a standard namespace. Of course, one could hard-code one's own namespaces into the parser as well, but this does not facilitate interoperability of the data, which is what matters and is why there are recommendations like RFC5785.
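As a rough illustration of the generic check described above, the sketch below treats any HTTP(S) IRI whose path starts with `/.well-known/genid/` (the well-known name that RDF 1.1 Concepts suggests for Skolem IRIs) as a Skolem IRI and maps it back to a blank node label. The function names and the example.org IRIs are hypothetical; this is not code from Wikibase or from any JSON-LD library.

```python
from urllib.parse import urlparse

# Path prefix suggested by RDF 1.1 Concepts for Skolem IRIs.
GENID_PREFIX = "/.well-known/genid/"


def is_skolem_iri(iri: str) -> bool:
    """Return True if the IRI looks like a .well-known/genid Skolem IRI."""
    parsed = urlparse(iri)
    return (
        parsed.scheme in ("http", "https")
        and parsed.path.startswith(GENID_PREFIX)
        and len(parsed.path) > len(GENID_PREFIX)
    )


def to_bnode_label(iri: str) -> str:
    """Map a Skolem IRI to a blank node label; leave other IRIs untouched.

    Assumption: the part after the prefix is already usable as a label
    (for example a hash or UUID); no escaping is attempted here.
    """
    if not is_skolem_iri(iri):
        return iri
    return "_:" + urlparse(iri).path[len(GENID_PREFIX):]


if __name__ == "__main__":
    print(to_bnode_label("http://example.org/.well-known/genid/abc123"))
    print(to_bnode_label("http://example.org/entity/Q42"))
```

Because the check depends only on the path prefix, the same function works for any data source that follows the convention, with no per-site namespace list.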

Here is a basic gist of how an entity can be framed and it demonstrates the utility of JSON-LD framing for presentation tools.
http://json-ld.org/playground/#/gist/2e4059a897a90acd22d778267f230bd4

You may notice that this only includes the item graph and that the reference exists within its associated statement node, something that is rather cumbersome to resolve using the default RDF serialization formats.

I believe that it would benefit Wikibase and Wikidata to standardize the statement and reference identifiers not only as 128-bit UUIDs (which the current reference hash ID is not) but also as Skolem IRIs according to the W3C recommendation [1] (even if the immediate utility of a breaking change may not seem worthwhile).

[1] rdf11-concepts/#section-skolemization

But I'm not sure converting statements, references and values to bnodes is the right thing. References and values are shared between items, and I'm afraid converting them to bnodes may create the wrong impression. This is especially true with the introduction of normalized values.

also as Skolem IRIs according to the W3C recommendation

The W3C recommendation talks about bnodes, so I'm not sure how it is relevant here.

Statement IDs should definitely be represented as bnodes (internally) and skolem IRIs externally because they are uniquely defined within an entity node representation. They have no meaning outside of the entity.

The typing semantics of Wikibase values are very obscure and entirely too complex for most normal reuse implementations of the data. If values are intended to be "shared between items" by an external consumer, then they should be represented as another entity type, and optimally their URIs should be dereferenceable. However, we know that this is not the case, so my personal "impression" of these things is already wrong.

Similarly confusing is the muddled reference implementation. My use case simply needs the references to be presented in the context of the statement that gives the reference its meaning. In my estimation, a reference is just a statement about a statement in the context of one item, so I do not see how or why a reference can be "shared between items". Note that if the reference statement itself were semantically equal to another in the same item, it should therefore simply be a bnode!

because they are uniquely defined within an entity node representation. They have no meaning outside of the entity.

Not exactly the case, e.g. see: https://www.mediawiki.org/wiki/Wikibase/API#wbsetclaimvalue
As you notice, claims have externally-visible IDs.

The fact remains that the claim without its entity relationship, represented in the GUID by the Q prefix, would be lost into a vacuum of nothing. And really, the concatenation of an entity ID with its statement UUID (with the expectation that a parser can understand the $ as a delimiter) is a rather questionable convention. I guess I am not clear on why the MW API should constrain RDF serialization. They are separate implementations. Is there a convenient "round trip" import from RDF mechanism available in the API? If not, who cares about what the MW API expects.
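To make the delimiter convention discussed above concrete, here is a minimal sketch of what a consumer has to do to take such a GUID apart: split on the first `$` and check that the remainder is a UUID. The function name and the sample GUID are hypothetical and only illustrate the `<entity ID>$<UUID>` shape described in this thread.

```python
import uuid


def split_statement_guid(guid: str) -> tuple[str, uuid.UUID]:
    """Split an '<entity ID>$<UUID>' statement GUID into its two parts.

    Raises ValueError if the '$' delimiter is missing or the second
    part is not a well-formed UUID.
    """
    entity_id, sep, local_id = guid.partition("$")
    if not sep or not entity_id or not local_id:
        raise ValueError(f"not an entity$uuid GUID: {guid!r}")
    # uuid.UUID accepts upper- or lower-case hex and normalizes it.
    return entity_id, uuid.UUID(local_id)


if __name__ == "__main__":
    entity, statement_uuid = split_statement_guid(
        "Q42$D8404CDA-25E4-4334-AF13-A3290BCD9C0F"  # made-up sample value
    )
    print(entity, statement_uuid)
```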

The basic problem is with the "claim" design. It seems to me that Statement GUIDs are actually unnecessary overhead because the subject of a claim is always the item/entity. There is really no need to mint a GUID subject for the claim. If you needed to have a separate statement node, it may have been better to do something like this:

<> wikibase:hasClaim _:b1 .
_:b1 wdt:someprop "somevalue" .

A bnode is always an object of a <> resource first.

I can add here that fcrepo4, with PR #1187, has stopped using RFC5785 to represent Skolemized bnodes. Instead, a new fragment URI convention has been implemented: internally minted UUIDs are appended to the resource subject as a fragment (aka Hash URI identifier) rather than creating a new resource node. I suspect this convention actually makes more sense than RFC5785 for statements and references. Graph serializations would then "naturally" entail these identifier bnodes in a single resource/entity context, which facilitates round-tripping and other downstream operations on the RDF, like JSON-LD framing.
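A small sketch of the fragment convention as described above, assuming it amounts to "resource IRI + `#` + freshly minted UUID". The helper names and the example.org IRI are hypothetical, and this is not taken from the fcrepo4 code base.

```python
import uuid
from urllib.parse import urldefrag


def mint_fragment_iri(resource_iri: str) -> str:
    """Mint a hash URI for a node that belongs to the given resource."""
    base = urldefrag(resource_iri).url  # drop any existing fragment
    return f"{base}#{uuid.uuid4()}"


def owning_resource(fragment_iri: str) -> str:
    """Recover the resource a fragment identifier belongs to."""
    return urldefrag(fragment_iri).url


if __name__ == "__main__":
    statement_iri = mint_fragment_iri("http://example.org/entity/Q42")
    print(statement_iri)
    print(owning_resource(statement_iri))
```

The point of the design is that the owning entity can be recovered from the identifier with a standard URI operation, so a consumer needs no knowledge of a site-specific delimiter or namespace.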

Smalyshev changed the task status from Open to Stalled. · Dec 21 2017, 2:15 AM
dcausse subscribed.

I don't think we will use RFC5785 for bnode skolemization either (see T245541).
The statement IDs are already parsed by some Apache rewrite rules (T203397), and given the discussions going on in T214680, I'm assuming that the format of the reified statement IRIs is already interpreted by tools, so changing it now seems a bit complicated.
Please feel free to re-open if you believe this is still worth pursuing.