Closed-world approach to SHACL shape for schema.org #3408

Open
mhoangvslev opened this issue Nov 15, 2023 · 4 comments
Labels
no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!).

Comments

@mhoangvslev

I encountered this problem:
RDFLib/pySHACL#215

In summary, I purposefully generated erroneous markup in which nutrition is attached to a Product, then used pySHACL to validate it. Because the shapes contain no constraint that forbids this, pySHACL did not report the error. The issue arises from the open-world nature of RDF; SHACL rules can be used to constrain the usage of schema:nutrition to specific classes if a more closed-world approach is desired.
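To make the scenario concrete, here is a minimal, library-free sketch of that kind of markup and of the check one would like a validator to perform. The markup, the `DOMAIN_INCLUDES` table, and the `misplaced_properties` helper are illustrative assumptions, not part of pySHACL or of the schema.org tooling, and the check deliberately ignores subclasses.

```python
# Hypothetical JSON-LD-like markup: nutrition is attached to a Product.
markup = {
    "@type": "Product",
    "name": "Granola bar",
    "nutrition": {"@type": "NutritionInformation", "calories": "120 calories"},
}

# schema:domainIncludes values for nutrition, copied by hand for this sketch
# (an assumption to re-check against the schema.org release you target).
DOMAIN_INCLUDES = {"nutrition": {"Recipe", "MenuItem"}}


def misplaced_properties(node, domain_includes):
    """Return the properties whose listed domains do not include the node's type."""
    node_type = node["@type"]
    return sorted(
        prop
        for prop in node
        if prop in domain_includes and node_type not in domain_includes[prop]
    )


print(misplaced_properties(markup, DOMAIN_INCLUDES))
```

Properties without an entry in the table (here `name`) are skipped rather than rejected, which is exactly the open-world behaviour the issue describes.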

Is there a possibility to further refine the OWL ontology for schema.org?

@mfhepp
Contributor

mfhepp commented Nov 17, 2023

I do not think we should add any constraints using OWL axioms (e.g. disjointness axioms etc.), for the following reasons:

  1. The semantics of domain and range in OWL and RDFS are not widely understood and are counterintuitive for many developers (see here for the RDFS mechanism for domain and range and here for the refined OWL semantics):
    • In a nutshell, in plain RDFS, the rdfs:domain of a property is a mere cue that any entity the property is applied to is of that type, i.e. an informal hint of an additional type membership for that entity.
    • In OWL, the semantics of rdfs:domain and rdfs:range are more formally defined: an actual additional rdf:type assertion is added to the subject (for rdfs:domain) or the object (for rdfs:range) of the respective triple, e.g. to the product or to the value.
  2. Schema.org defined its own mechanism via schema:domainIncludes and schema:rangeIncludes in order to avoid practical problems that might arise from the naive usage of rdfs:range and rdfs:domain, e.g. that an OWL reasoner adds additional type membership assertions instead of producing any kind of error message.
  3. The exact semantics of these two properties are vague by design, because the most appropriate action depends on the data and the application:
    • A publisher of data will most likely want to detect the usage of incompatible or undefined properties.
    • A consumer of data may want to ignore either the individual property or the entire block of data or even discard the entire dataset. At Web scale, however, a consumer will instead often want to try to repair many such errors in the data (e.g. https vs. http namespace, maybe even some spelling mistakes).
  4. There is no need to add such SHACL constraints or OWL axioms directly to schema.org, because
    • a standard check that only properties defined for the type are used can be implemented with ease, e.g. in SPARQL or another query language;
    • a SHACL file that derives hard constraints from the vague domain and range information can be produced automatically.
  5. It may even be harmful, because such statements are not as generally applicable as the rest of the vocabulary. Note that schema.org tries to strike a fine balance, in many subtle parts of the design, between precision on one hand, and ambiguity on the other hand. This is a very old problem, see e.g. the Wikipedia page on Ontological Commitment.
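As a rough illustration of points 1 and 2, the sketch below applies the RDFS domain entailment rule (rdfs2) to a handful of triples held as plain Python tuples. The `rdfs:domain` statement on schema:nutrition is a hypothetical axiom added only for the demonstration (schema.org publishes schema:domainIncludes instead), and the `ex:` identifiers are made up.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def rdfs_domain_entailment(triples):
    """Apply rule rdfs2: (p rdfs:domain C) and (s p o) entail (s rdf:type C)."""
    domains = [(s, o) for s, p, o in triples if p == RDFS_DOMAIN]
    inferred = set(triples)
    for prop, cls in domains:
        for s, p, _ in triples:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))
    return inferred


data = {
    # Hypothetical axiom: schema.org itself does not state this.
    ("schema:nutrition", RDFS_DOMAIN, "schema:Recipe"),
    ("ex:bar", RDF_TYPE, "schema:Product"),
    ("ex:bar", "schema:nutrition", "ex:facts"),
}

closure = rdfs_domain_entailment(data)
# No error is raised: the product is silently typed as a Recipe as well.
print(("ex:bar", RDF_TYPE, "schema:Recipe") in closure)
```

The point of the sketch is that entailment only ever adds statements; detecting the misuse requires a separate, closed-world style check over the same data.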

As for SHACL: IMO, a good approach would be to automatically produce one set of authoritative SHACL shapes for classes and properties for each release and add it to the release as a separate resource.
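A minimal sketch of what such automatic generation could look like, assuming the schema:domainIncludes information has already been extracted into a Python dict. The `ex:` shape namespace, the function names, and the sample input are assumptions made for illustration; inherited properties and schema:rangeIncludes are not handled here.

```python
PREFIXES = "\n".join([
    "@prefix sh: <http://www.w3.org/ns/shacl#> .",
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
    "@prefix schema: <https://schema.org/> .",
    "@prefix ex: <http://example.org/shapes/> .",  # hypothetical shape namespace
])


def closed_shape_turtle(cls, properties):
    """Render one closed node shape for a class as Turtle text."""
    statements = [
        f"sh:targetClass schema:{cls}",
        "sh:closed true",
        "sh:ignoredProperties ( rdf:type )",
    ]
    statements += [f"sh:property [ sh:path schema:{p} ]" for p in sorted(properties)]
    body = " ;\n    ".join(statements)
    return f"ex:{cls}Shape a sh:NodeShape ;\n    {body} ."


def shapes_from_domain_includes(domain_includes):
    """Invert property -> classes into class -> properties and emit one shape per class."""
    by_class = {}
    for prop, classes in domain_includes.items():
        for cls in classes:
            by_class.setdefault(cls, set()).add(prop)
    shapes = [closed_shape_turtle(c, by_class[c]) for c in sorted(by_class)]
    return PREFIXES + "\n\n" + "\n\n".join(shapes)


# Hand-written sample input; a real run would read it from a schema.org release.
sample = {"nutrition": ["Recipe", "MenuItem"], "recipeYield": ["Recipe"]}
turtle = shapes_from_domain_includes(sample)
print(turtle)
```

Turning the vague domainIncludes hint into `sh:closed true` is a policy decision made by whoever runs the generator, which is why it fits a separate resource better than the core vocabulary.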

There are some tools (I have not tested them myself) that might help with that task (this will require adding support for the schema-specific domain and range properties):

Hope you find this long ;-) comment useful!

@mhoangvslev
Author

For future readers who require citation, here is one:

  • From "Domain Specific Semantic Validation of Schema.org Annotations"

The vocabulary [of schema.org] covers local businesses, products, events, recipes, people and much more and is adapted and supported by the big search engine providers. This naturally makes the vocabulary quite heterogeneous. The vocabulary is also semantically imperfect [9]. For instance classes may inherit properties improperly (e.g. a waterfall can have a telephone number) and not formally strict, but this is rather a design decision to facilitate rapid and decentralized evolution of the vocabulary. The side effect of this feature is that picking the right classes and properties for a domain can be quite challenging and low quality annotations in terms of conforming to the rules of a field (e.g. tourism) may occur.

@mhoangvslev
Author

After a while, I figured out a quick and dirty way to perform the type checking under CWA:

  • Recursively bring all parents' properties to the child class, then close the definition with sh:closed true.
  • Here is the Python function that works with this version of the schema.org shapes.
from rdflib import BNode, ConjunctiveGraph, Literal, Namespace
from rdflib.collection import Collection
from rdflib.namespace import OWL, RDF, XSD

SH = Namespace("http://www.w3.org/ns/shacl#")


def close_ontology(graph: ConjunctiveGraph) -> ConjunctiveGraph:
    """Close each shape in an input SHACL shape graph.

    Copy every property shape from the parent classes onto the current
    class shape, then add sh:closed true and sh:ignoredProperties.
    """
    # Note: only shapes with at least one ancestor that declares
    # sh:property are matched, so only those shapes end up closed.
    query = """
    SELECT DISTINCT ?shape ?parentShape ?parentProp WHERE {
        ?shape  a <http://www.w3.org/ns/shacl#NodeShape> ;
                a <http://www.w3.org/2000/01/rdf-schema#Class> ;
                <http://www.w3.org/2000/01/rdf-schema#subClassOf>* ?parentShape .

        ?parentShape <http://www.w3.org/ns/shacl#property> ?parentProp .
        FILTER(?parentShape != ?shape)
    }
    """

    visited_shapes = set()
    # Materialize the results before mutating the graph.
    for row in list(graph.query(query)):
        shape = row["shape"]
        parent_prop = row["parentProp"]
        graph.add((shape, SH["property"], parent_prop))

        if shape not in visited_shapes:
            visited_shapes.add(shape)
            graph.add((shape, SH["closed"], Literal(True)))
            # shape sh:ignoredProperties ( rdf:type owl:sameAs )
            # https://www.w3.org/TR/turtle/#collections
            # The original overwrote the collection with a plain list, which
            # has no .uri attribute; build the RDF list in the graph instead.
            ignored_props = Collection(graph, BNode(), [RDF.type, OWL.sameAs])
            graph.add((shape, SH["ignoredProperties"], ignored_props.uri))

    # Replace xsd:float with xsd:double (copy the subjects first, since
    # graph.set() modifies the graph while it would be iterated).
    for prop in list(graph.subjects(SH["datatype"], XSD.float)):
        graph.set((prop, SH["datatype"], XSD.double))

    return graph
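The idea behind the function (inherit every parent property, then reject anything not on the resulting list) can also be sketched without rdflib. The class hierarchy and property tables below are a tiny hand-written excerpt used only for illustration, and the helper names are hypothetical.

```python
def inherited_properties(cls, parents, own_props, _seen=None):
    """Collect the properties of cls and of all its ancestors, recursively."""
    seen = set() if _seen is None else _seen
    if cls in seen:  # guard against cycles and repeated ancestors
        return set()
    seen.add(cls)
    props = set(own_props.get(cls, ()))
    for parent in parents.get(cls, ()):
        props |= inherited_properties(parent, parents, own_props, seen)
    return props


def violations(node_type, used_props, parents, own_props, ignored=("@type",)):
    """Closed-world check: report every used property that is not allowed."""
    allowed = inherited_properties(node_type, parents, own_props) | set(ignored)
    return sorted(set(used_props) - allowed)


# Simplified excerpt of the hierarchy, written by hand for this sketch.
parents = {
    "Recipe": ["HowTo"],
    "HowTo": ["CreativeWork"],
    "CreativeWork": ["Thing"],
    "Product": ["Thing"],
}
own_props = {"Thing": ["name"], "Recipe": ["nutrition"], "Product": ["brand"]}

used = ["@type", "name", "nutrition"]
print(violations("Product", used, parents, own_props))
print(violations("Recipe", used, parents, own_props))
```

The `ignored` parameter plays the role of sh:ignoredProperties: without it, the type declaration itself would be flagged by the closed check.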


github-actions bot commented Mar 4, 2024

This issue is being nudged due to inactivity.

@github-actions github-actions bot added the no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). label Mar 4, 2024