Commons talk:Structured data/Modeling/Source
These earlier notes and resources may be inspiring:
- Notes from first data modelling discussion at Wikimania 2019 - quality assessment modelling was discussed here, and the discussion is recorded in the notes.
- Notes from second data modelling discussion at Wikimania 2019
- Properties table - already contains some first ideas on quality assessment modelling
- Interesting Wikimedia Commons files collected because of their structured data modelling challenges
Source from Wikimania 2019
Copy here for future reference.
Get the data! If we look at 100,000 random images, what is in the source field?
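One way to start answering that question is to bucket raw source-field strings into a few coarse categories and count them. The sketch below is only an illustration: the category names, the `classify_source` and `tally_sources` helpers, and the sample strings are assumptions, not an agreed vocabulary, and real file pages would need their wikitext templates parsed first.

```python
import re
from collections import Counter

# Hypothetical coarse buckets for raw "source" field text; not an agreed vocabulary.
URL_RE = re.compile(r"https?://\S+")
OWN_WORK_RE = re.compile(r"\{\{\s*own\s*\}\}|\bown work\b", re.IGNORECASE)


def classify_source(text: str) -> str:
    """Return a coarse category for one source-field string."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if OWN_WORK_RE.search(stripped):
        return "own work"
    urls = URL_RE.findall(stripped)
    if len(urls) == 1:
        return "one url"
    if len(urls) > 1:
        return "several urls"
    return "free text"


def tally_sources(fields):
    """Count how many source fields fall into each category."""
    return Counter(classify_source(f) for f in fields)


# Made-up sample values standing in for a random sample of files.
sample = [
    "{{own}}",
    "Own work",
    "https://www.flickr.com/photos/16782093@N03/3422126298",
    "scan from an art book",
    "",
]
print(tally_sources(sample))
```

The "free text" bucket is the interesting remainder: its size gives a rough idea of how much of the source field would resist simple extraction.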
Immediate source of image
- Own work --> only for photos of people & places ?
- distinguish photos/scans of their own artwork
- created digital original drawings/artwork
- created diagrams -- software used ?
- Easy enough to create a Q-item for "original creation by uploader" (now created as original creation by uploader (Q66458942)) as value for a master "Source of file" property. NB: Temp property P828 is used for this role and will need to be replaced.
- A bot might identify cases that look dubious, and mark them with a qualifier
- BUT -- if we have this model with a top-level statement we can't have any second level of qualifiers to clarify the nature of statements being made in first-level qualifiers -- eg one might want "applies to part", or "sourcing circumstances", or to distinguish immediate vs ultimate source URL
- From the internet
- Q-item for "file available on the internet", with qualifiers specifying detailed provenance
- Q-item for "user modification of file available on the internet"
- Which property will point to this Q item? A new property: Source, taking a value indicating the nature of the source, with qualifiers adding further info. A "source" statement of this kind should become mandatory, with a limited closed vocabulary of possible values. Make upload wizard enforce the making of a choice.
- Commons best practice: URL for image URL for description by source --> two qualifiers for this ?
- ADDED: The "description by source" URL might be well handled by described at URL (P973) "described at url" as a separate main statement.
- ISSUE: We might *only* have the description page URL -- and it might no longer exist. So we might need to specify that an image used to exist at a particular institution (or website), but not be able to say what the URL used to be.
- In practice might have:
- Some url
- Some url with a description
- Some url with a source site (url Flickr, or Europeana, Internet Archive Books)
- Identifier properties are subclass of source url
- --> Q. Do we want to start minting new properties for identifiers from such sites, or just use URLs as per others. What are pros and cons ? Is this a workaround for not being able to find URLs that start with .... in SPARQL (because the indexing isn't there?)
- What sorts of free text do we find in the source fields ?
- -- maybe this is the last 20% we should try to capture, after we've got the easiest 80%. But how to assess/record completeness of extraction from the source field ?
- Institution templates
- Can become quite baroque with Partnership branding, links from the same item to multiple catalogues or services
- Even without an institution template, images from institutions may have more than 2 URLs, eg from LoC : https://commons.wikimedia.org/wiki/File:Portrait_of_Billie_Holiday_LCCN2004663026.jpg
- Ultimate source vs convenience source / proximate source ?
- A library image might be on Flickr; an auction house image might be reproduced in a newspaper
- Can be the other way round: newspaper image cropped from Flickr https://commons.wikimedia.org/wiki/File:Frenkie_de_Jong_(2019).jpg
- Sources which are offline, but eg which have been scanned
- eg images from art books --> full bibliographic
- See also other version section for derived works
- Q-item as top-level value to indicate "derived from file or files on Commons" ?
- Comment: "Other version" is only relevant if we host the work(s) that the file was derived from. But we may not: eg a scan of a page from a book, diagrams based on a diagram in a book (simple enough so no copyright), a photograph of a copyright-expired painting, a photo of a dress based on a Mondrian painting --> "based on" property
Also: some operations -- rotation, colour modification, cropping, etc may have been undertaken by user prior to upload.
- -- so distinguish "scan of image" from "user-modified scan of image" in top-level source statement ?
Source of things shown within the image
(eg : a photo of a 2D collage of objects)
- Esp. important because these things may have different copyright status
- -- qualifiers below "depicts" statement ?
- -- how to indicate things if there is no obvious Q-item for something in the image, but nevertheless one wants to identify it & record information relating to it? Should a "depicts" = "somevalue" statement be created to record information about particular parts of the image ?
Will often be handled by the Q-items for the value(s) of the depicts statements
Other
- copyright checkers may be closely tied to source: should the statements be similarly related -- or is it enough to put verification info as a qualifier or reference on the copyright status. Will SDC even have/display references ?
Metadata has provenance too
- -- on Wikidata we would indicate this in references, statement by statement. But will Commons have references?
End of copy Multichill (talk) 18:36, 12 September 2019 (UTC)
Simple own work source
@Jheald: you and some others worked on this during Wikimania, right? I would like to focus on a specific case to see if we can solve that: Own work uploads. Like for example the files uploaded as part of Wiki Loves Monuments. Would it be as simple as "new property: Source of file" -> original creation by uploader (Q66458942) ("original creation by uploader")? Combined with author and license it would mean we can start converting some data. Multichill (talk) 18:02, 17 September 2019 (UTC)
- Hi @Multichill: thanks for pinging me. Yes, exactly. The strong conclusion I got from that workshop was the usefulness of a top-level property "Nature of source of material", taking as values a very small number of different generic types of origin that a Commons file could have. All Commons files would eventually be expected to have a statement of this kind. For material with some kinds of origin (eg "taken from the internet"), one would then expect further statements to give details of where from and when, etc. But the simplest case would be own work, for which I created the value original creation by uploader (Q66458942); as used eg on File:Petra_Al-Kaznah_by_Night.jpg, using a has cause (P828) property as a stop-gap until the new property was proposed and created. Ideally, the statement should also have a reference (eg imported from: file description page, with date) -- statements like this need provenance, I think: we should say where they have come from, if we're doing a full-scale roll-out (other values might be eg "declared by author via Upload Wizard", etc).
- Unfortunately I see that d:User:MisterSynergy has since deleted Q66458942, but I've asked him to restore it. Jheald (talk) 15:47, 18 September 2019 (UTC)
- @Multichill: One question that might be worth a thought is whether the property should be just "Nature of source of material", or whether it would make sense to also combine in "Nature of material" -- so whether values should just be "original creation by uploader", or whether it would make sense for the value to be eg "original photograph by uploader" / "original drawing by uploader" / "original sculpture made and photographed by uploader" etc. I come and go between which of the two I prefer. On the one hand there is a certain discipline in trying to identify conceptual orthogonality and then represent it with orthogonal properties. On the other hand, the more specific declarations about the nature of the material may bring out more honest statements, and the greater specificity and concreteness may be easier for some people. In practical terms, by making all of the latter classes subclasses of "original creation by uploader", the same "nature of source" information would be easy enough to extract either way, under either approach, whether for templates or querying or whatever. I oscillate as to which of the two approaches would be better to go for. Jheald (talk) 08:51, 19 September 2019 (UTC)
- @Jheald: I let this sink in a bit. If we look at the current situation, we care on Commons about the immediate source: taken and uploaded yourself, transferred from some other wiki, taken from Flickr, from some museum website, etc.
- Right now, that's what I would like to model. I would probably like to call the property "source of file" to keep it generic as we do right now. Once we want to model immediate source and underlying source, we can just use some qualifiers. That way in easy situations we just have a clear statement, but we also keep the ability to model more complex situations. Do you agree? I'm probably just going to propose a new property to complete the basic information properties that are currently mandatory ({{No source}}, {{No author}} & {{No license}}). Multichill (talk) 17:46, 3 October 2019 (UTC)
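Jheald's point about subclasses can be sketched as follows: if the more specific values are modelled as subclasses of original creation by uploader (Q66458942), a consumer can recover the generic "nature of source" by walking up the hierarchy. Only Q66458942 comes from the discussion; the plain-label keys for the specific values, the `SUBCLASS_OF` table and the `is_original_creation` helper are placeholders invented for illustration, since no such items are named here.

```python
from typing import Optional

ORIGINAL_CREATION = "Q66458942"  # original creation by uploader (from the discussion)

# Hypothetical subclass links. The specific values have no item IDs in the
# discussion, so plain labels are used as placeholder keys.
SUBCLASS_OF = {
    "original photograph by uploader": ORIGINAL_CREATION,
    "original drawing by uploader": ORIGINAL_CREATION,
    "original sculpture made and photographed by uploader": ORIGINAL_CREATION,
}


def is_original_creation(value: str) -> bool:
    """True if value is the generic item or a (transitive) subclass of it."""
    seen = set()
    current: Optional[str] = value
    while current is not None and current not in seen:
        if current == ORIGINAL_CREATION:
            return True
        seen.add(current)  # guard against accidental cycles in the table
        current = SUBCLASS_OF.get(current)
    return False


print(is_original_creation("original drawing by uploader"))
print(is_original_creation(ORIGINAL_CREATION))
print(is_original_creation("file available on the internet"))
```

Under this arrangement a template or query only needs the one generic check, whichever level of detail the uploader chose.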
Property proposal
See d:Wikidata:Property proposal/Source of file. Multichill (talk) 16:29, 13 October 2019 (UTC)
- We now have source of file (P7482). Multichill (talk) 09:25, 27 October 2019 (UTC)
Files from the internet
@Jheald: maybe you can describe your proposal on how to model files found on the internet? I proposed:
- File:Verfroller Brug Haarlem.jpg source of file (P7482) → Flickr (Q103204), qualified with
described at URL (P973) (replacing the originally proposed URL (P2699)) → https://www.flickr.com/photos/16782093@N03/3422126298
I think your proposal is to do:
- File:Verfroller Brug Haarlem.jpg source of file (P7482) → file available on the internet, qualified with operator (P137) → Flickr (Q103204) and qualified with
described at URL (P973) (replacing the originally proposed URL (P2699)) → https://www.flickr.com/photos/16782093@N03/3422126298
Correct? Multichill (talk) 09:48, 27 October 2019 (UTC)
- Somewhere else was mentioned that described at URL (P973) is probably better than URL (P2699) because it's more specific and the link is usually not a deeplink to the file, but to a page containing the file. It's suggested that maybe Commons compatible image available at URL (P4765) could be added too in some cases to deeplink to the file.
- I'm not getting any input so I'm just going to go ahead and implement the second proposal. I just created file available on the internet (Q74228490) for this. Multichill (talk) 15:27, 9 November 2019 (UTC)
- Ok, test edit. I did the same thing on the other Geograph files in Category:Dornoch Firth. What do you think? (@Jheald: ). Multichill (talk) 21:03, 9 November 2019 (UTC)
- Looks good, especially P7384. I'm starting to get used to Commons-style "statement groups".
- It's just that somevalue/unknown isn't exactly the best supported feature around Wikidata and even more so here. Jura1 (talk) 00:09, 10 November 2019 (UTC)
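The two candidate models in this section differ only in where the site sits: as the statement value itself, or as an operator (P137) qualifier under the generic value file available on the internet (Q74228490). A minimal sketch of both follows. The dictionary layout is a simplified stand-in chosen for illustration and is not the Wikibase JSON format, and the `source_site` helper is hypothetical.

```python
FLICKR = "Q103204"
FILE_ON_INTERNET = "Q74228490"
PAGE_URL = "https://www.flickr.com/photos/16782093@N03/3422126298"

# First proposal: the site itself is the value of source of file (P7482).
model_a = {
    "property": "P7482",
    "value": FLICKR,
    "qualifiers": {"P973": PAGE_URL},  # described at URL
}

# Second proposal: a generic value, with the site as an operator (P137) qualifier.
model_b = {
    "property": "P7482",
    "value": FILE_ON_INTERNET,
    "qualifiers": {"P137": FLICKR, "P973": PAGE_URL},
}


def source_site(statement: dict):
    """Return the site item, assuming the statement follows one of the two models.

    Under the second model the site is read from the operator qualifier and
    may be None if it was not recorded; under the first it is the value.
    """
    if statement["value"] == FILE_ON_INTERNET:
        return statement["qualifiers"].get("P137")
    return statement["value"]


print(source_site(model_a), source_site(model_b))
```

The second model keeps the set of top-level values small, at the cost of one extra qualifier lookup for consumers that want the site.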
Scanned Files
@Multichill, Jura1, Jheald, and Schlurcher: I was thinking about modeling of files (graphics or text) scanned from books. For example, files like my recent upload File:Chwała olimpijczykom - s.087a- Urszula Stępińska.tif, where I scanned a photo from the book Glory to the Olympians, 1939-1945 (Q97940059). I think the best way to model that would be
source of file (P7482) → (value not shown)
published in (P1433) → Glory to the Olympians, 1939-1945 (Q97940059)
The published in (P1433)=Glory to the Olympians, 1939-1945 (Q97940059) statement is to indicate that the photograph was published in that book, but it might also have been published in other (earlier) books, which would be listed in additional published in (P1433) statements. I do not know if it is worth adding information about who scanned it; it is usually not relevant, except that on Commons you might want to look for other scans by the same person or ask them to rescan at higher resolution, etc. If we want to add an optional qualifier like that we might need to propose a new property, as I do not see anything relevant. Other files like my 2007 upload File:Lokajski - Ślub powstańczej pary (1944).jpg might get statements:
source of file (P7482) → (value not shown)
published in (P1433) → (value not shown)
since I do know where it was published but can no longer find location of the source website. Does that sound reasonable? --Jarekt (talk) 18:15, 3 August 2020 (UTC)
- I missed the ping. I liked the fact that source of file (P7482) has a limited number of options. I don't think changing this for scans is a good plan. Multichill (talk) 20:25, 9 August 2020 (UTC)
Additional URL types
What would people think about using the source of file property, as documented here, with additional URL types for further source linking? In particular, I would like to include the link to the IIIF manifest and direct file location for uploads. This could look like this:
It seems like if we have this data to add, this would probably be the best place to add it. Thoughts? Pinging Multichill, Jheald, Jarekt. Dominic (talk) 16:10, 13 July 2021 (UTC)
- I am OK with this, as long as there is the basic source of file (P7482) part. One thing I would change would be to replace generic URL (P2699) with more specific Commons compatible image available at URL (P4765). --Jarekt (talk) 01:14, 14 July 2021 (UTC)
- Okay, thanks. I've seen that property, but when I read the scope discussion it seems like it's for a different purpose (as currently envisioned). Its constraints currently limit use to Wikibase items, for example. But if you think it's better, that is fine with me. Dominic (talk) 21:40, 14 July 2021 (UTC)
How to tag user-created maps on Commons?
See Commons:Village_pump/Technical#Structured_data_for_user-created_maps?
DOI
Some images (such as File:ETH-BIB-Spiegel b. Bern, Wabern, Bern-Weissenbühl, Liebefeld, Blick nach Südsüdwesten (SSW)-LBS R1-941228.tif) can be identified uniquely by a DOI (10.3932/ethz-a-000283338). Should such images be tagged with DOI (P356) or source of file (P7482)? --1-Byte (talk) 11:18, 27 October 2024 (UTC)
- I would say the DOI is not the source of the file, but an identifier. So I would suggest DOI (P356) --> 10.3932/ethz-a-000283338. The source could also be set, but in the same way as for any other file available via a URL. --Schlurcher (talk) 15:14, 27 October 2024 (UTC)
- So similar to how URN-NBN (P4109) is handled for File:Skeppet Skuldas undergång - SMV - SVA BB 5303 26.wav --Schlurcher (talk) 08:18, 29 October 2024 (UTC)
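Following that suggestion, the DOI would sit in its own identifier statement next to the source statement, rather than inside it. A small sketch under stated assumptions: the flat dictionary is a simplified stand-in for the file's statements, the use of file available on the internet (Q74228490) as the source value is only an example, and `doi_url` is a hypothetical helper that builds a doi.org resolver link.

```python
# Simplified stand-in for a file's statements (not the Wikibase JSON format).
statements = {
    "P356": "10.3932/ethz-a-000283338",  # DOI (P356): an identifier, not a source
    "P7482": "Q74228490",  # source of file (P7482); example value only
}


def doi_url(doi: str) -> str:
    """Build a doi.org resolver link for a DOI name."""
    return "https://doi.org/" + doi.strip()


print(doi_url(statements["P356"]))
```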