License type filter for media search
Closed, ResolvedPublic

Description

As a Commons user, I want to be able to filter my search results by license type so that I can select images that I can use.

Acceptance Criteria:

  • A "license" filter is added that allows users to only see results that have a certain license type.
  • The license filter should only appear on the Images, Audio, Other, and Video tabs.
  • The license filter should have the following options:
    • "Use with attribution" - this is everything with the CC-BY license
    • "Use with attribution and same license" - this is everything with the CC BY-SA license
    • "No restrictions" - this is everything that's either CC0 or in the public domain
    • "Other" - everything else
  • The license filter should use only license data contained in structured data statements
  • The license filter should behave and look like the "media size" filter on desktop. This task does not cover the mobile UI (this will be done as part of T258615)

filters.jpg (755×1 px, 335 KB)

Event Timeline

Unlike some of the other filters, this one won't come cheap: this data is not currently indexed in Elastic so we can't currently build this filter.
My ballpark estimate for indexing this data, populating the index & building the keyword filter etc. would be "weeks", but I'll defer to Discovery-ARCHIVED for a better estimate.
(FYI - license info is available from the CommonsMetadata extension, which parses it out of the wikitext content, so updating the index would likely need an on-save or post-save hook/process)

If this can come from existing data, such as the structured data for licenses (P275=Q18199165, CC BY-SA 4.0, has 2.5M results on commonswiki), then the backend support is already implemented and we need only track down the appropriate P's and Q's. If we want to source license data from somewhere else, then that data needs to be integrated somewhere in the ContentHandler::getFieldsForSearchIndex() and ContentHandler::getDataForSearchIndex() implementations. Once the code to start injecting this data into the search indexes is deployed, it will be approximately 8 weeks before all pages have the content populated.

Sadly, it’s not (consistently) in structured data (yet?).
It lives in wikitext, and Extension:CommonsMetadata extracts it on demand (but it’s not stored anywhere).
I suspect there’ll be hooks in both of those methods that CommonsMetadata could use to add those fields?

I wonder, would the licenses extracted from CommonsMetadata be consistent enough to map directly to Wikidata properties? It would be a lie, but it seems it would be a cleaner way into the future if we could translate the CommonsMetadata license information into an equivalent Wikidata statement, and then add that to the statements indexed.

For adding a new field, ContentHandler::getDataForSearchIndex() calls HookRunner::onSearchDataForIndex(). For whatever reason, it's SearchEngine that calls HookRunner::onSearchIndexFields() after all content models have had their fields from ContentHandler::getFieldsForSearchIndex() merged.
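
As a rough illustration of the translation idea above, a minimal Python sketch could map a license code, in whatever form CommonsMetadata reports it, to a property=item statement string of the kind that is already indexed. The function names, the lookup table, and the input format are assumptions made for illustration; the only pairing taken from this thread is P275=Q18199165 for CC BY-SA 4.0.

```python
from typing import List, Optional

# Hypothetical lookup from a normalized license code to a statement string.
# Only the CC BY-SA 4.0 entry comes from this thread; a real table would be much longer.
LICENSE_TO_STATEMENT = {
    "cc-by-sa-4.0": "P275=Q18199165",
}


def license_to_statement(license_code: str) -> Optional[str]:
    """Return the 'P=Q' statement for a license code, or None if it is unmapped."""
    normalized = license_code.strip().lower()
    return LICENSE_TO_STATEMENT.get(normalized)


def statements_for_index(license_codes: List[str]) -> List[str]:
    """Translate every recognized license code; unmapped codes are skipped."""
    statements: List[str] = []
    for code in license_codes:
        statement = license_to_statement(code)
        if statement is not None and statement not in statements:
            statements.append(statement)
    return statements
```

Unmapped codes are dropped rather than guessed at, which matches the concern raised later in the thread about only importing the easy, unambiguous cases.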

Ramsey and I are talking through ways to encourage the community to more quickly move the license information from the text to structured data so that we don't have to index the text-based license data. Approximately 7.5 million files on Commons have the license information in the SD now; Ramsey estimates that the bot that's currently working through moving this data is about 8 months away from getting through the remaining 52 million files on Commons. If we can speed that up (or maybe even if we can't), there's no reason to do the work to index the text-based metadata. Stay tuned.

Update: We've decided to go with only using the license information in the structured data, and not the text-based license data. I've reflected this in the acceptance criteria in the task description.

I don't see any mention of copyright status. A lot of files on Commons don't have a license because these files are in the public domain, see for example https://commons.wikimedia.org/wiki/File:Georges_Ricard-Cordingley_(1873-1939)_-_Deep_Sea_Fishing_(morning)_-_RCIN_406333_-_Royal_Collection.jpg or https://commons.wikimedia.org/wiki/Special:ListFiles/BotMultichillT Not sure how to fit that exactly in the wording. Flickr has something similar, see https://www.flickr.com/search/?text=house going from the most restrictive to the most liberal. I like that approach.

The Creative Commons search has nice facets, see https://ccsearch.creativecommons.org/search?q=house . I would *LOVE* to have faceted search on Commons and "use" and "licenses" just be two of the many possible facets.

Another approach would be for me to use the output of that extension as input for my robot that adds statements to the structured data.
Mass adding licenses in structured data is extremely easy. Just pick your favorite category (https://commons.wikimedia.org/wiki/Category:CC-BY-SA-3.0-NL) and have a robot add the relevant statement.
Mass adding licenses correctly and all in one edit is much harder. Say that the file has two licenses. I generally want to add both licenses to the structured data and not just one of them. We also have to deal with fun edge cases like fallback licenses on public domain art. I doubt the extension handles all of these cases correctly, so I'd rather stick to bulk importing the easy cases based on wikitext.

Change 622426 had a related patch set uploaded (by Anne Tomasevich; owner: Anne Tomasevich):
[mediawiki/extensions/WikibaseMediaInfo@master] Add license filter

https://gerrit.wikimedia.org/r/622426

@matthiasmullie any thoughts on getting around the search string character limit here? Now that I've rebased this patch onto the one that swaps out the search modules, I'm hitting the default limit of 300 characters for the cc-by and cc-by-sa filter options, and the "other" option is over the hard limit of 2048 (by about 150 characters, but this could always grow as more license items are added).
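
To make the length problem concrete, here is a small Python sketch that builds a client-side haswbstatement: term from a list of statements and reports which limit it exceeds. The 300 and 2048 figures are the limits quoted above; the function names and the generated item IDs are hypothetical, and joining alternatives with `|` is assumed from the haswbstatement: keyword syntax.

```python
DEFAULT_LIMIT = 300   # default search string limit mentioned above
HARD_LIMIT = 2048     # hard limit mentioned above


def build_statement_query(statements):
    """Join statements into one haswbstatement: term (alternatives separated by '|')."""
    return "haswbstatement:" + "|".join(statements)


def check_length(query):
    """Classify a query string against the two limits."""
    if len(query) > HARD_LIMIT:
        return "over hard limit"
    if len(query) > DEFAULT_LIMIT:
        return "over default limit"
    return "ok"


# Hypothetical item IDs, generated only to show how quickly the string grows.
fake_statements = [f"P275=Q{1000000 + i}" for i in range(30)]
query = build_statement_query(fake_statements)
```

Each additional statement adds roughly a dozen characters, so any filter option backed by a long list of license items will eventually cross a fixed limit if the expansion happens in the client.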

I might suggest integrating license filtering a little closer to the backend, such that searching by license is something a user could plausibly type into the search box. The expansion to verbose terms would then happen in the backend and avoid the large character limit. I would have to check with @dcausse on preferred implementation, but likely a transformation stage in the query parser can transform something like haslicense:cc-by into haswbstatement:... prior to keyword parsing.

@matthiasmullie any thoughts on getting around the search string character limit here? Now that I've rebased this patch onto the one that swaps out the search modules, I'm hitting the default limit of 300 characters for the cc-by and cc-by-sa filter options, and the "other" option is over the hard limit of 2048 (by about 150 characters, but this could always grow as more license items are added).

I think you're on the wrong solution path here. You seem to try to do everything on the client side. The number of licenses is huge, see https://commons.wikimedia.org/wiki/Category:Primary_license_tags_(flat_list)

Why are you trying to do everything client side? Feels like you're working around adding relevant indexes. See https://ccsearch.creativecommons.org/ and https://opensource.creativecommons.org/cc-search/ for inspiration.

I'd suggest that haslicense become a plain keyword implementation, possibly reusing existing HasWbStatementFeature methods to build its Elastic query. The constraint is that it will have to be written in the WikibaseCirrusSearch extension (or an extension that depends on it).
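
The idea common to both suggestions (short keyword in, statement list out, with the expansion done on the server) can be sketched in Python as follows. This is not the WikibaseCirrusSearch implementation, which is PHP and builds an Elastic query directly; the mapping contents, the regular expression, and the function name are assumptions for illustration only.

```python
import re

# Hypothetical mapping from haslicense: values to statement lists.
LICENSE_MAP = {
    "cc-by-sa": ["P275=Q18199165"],  # CC BY-SA 4.0, the ID cited earlier in this thread
}

HASLICENSE_RE = re.compile(r"\bhaslicense:([a-z0-9-]+)")


def expand_haslicense(query, license_map=LICENSE_MAP):
    """Replace each known haslicense:<name> term with a haswbstatement: term."""
    def replace(match):
        statements = license_map.get(match.group(1))
        if not statements:
            return match.group(0)  # leave unknown values untouched
        return "haswbstatement:" + "|".join(statements)

    return HASLICENSE_RE.sub(replace, query)
```

Because the user-facing string stays short, the verbose statement list never has to pass through the search box or the URL.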

Change 636006 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseCirrusSearch@master] Introduce haslicense: search keyword

https://gerrit.wikimedia.org/r/636006

Change 636006 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Introduce haslicense: search keyword

https://gerrit.wikimedia.org/r/636006

@Cparle @matthiasmullie I noticed the change was merged. Do you have a pointer where the actual mappings are stored? See https://github.com/multichill/toollabs/blob/master/bot/commons/own_work_sdoc.py#L66 for a list of mappings that I would expect (minus the variants).

The code has been merged, but the complete mapping has not yet been compiled (AFAIK), so I guess that list will come in handy!

Moving back to "ready for development" for remaining work:

  • license mapping
  • UI
AnneT added a subscriber: Cparle.

I've updated the UI patch but will keep it labeled WIP until the license mapping config is in place.

Change 643062 had a related patch set uploaded (by Cparle; owner: Cparle):
[operations/mediawiki-config@master] Add mapping between search terms for haslicense: query and statements

https://gerrit.wikimedia.org/r/643062

Change 643062 merged by jenkins-bot:
[operations/mediawiki-config@master] Add mapping between search terms for haslicense: query and statements

https://gerrit.wikimedia.org/r/643062

@AnneT the config patch has been merged so you're good to go on the UI patch. Options are

  • haslicense:cc-by
  • haslicense:cc-by-sa
  • haslicense:unrestricted
  • haslicense:other
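
For the UI side, a minimal Python sketch of how a selected filter option might be turned into a search string is shown here. The labels come from the task description and the keyword values from the list above; the pairing between them is inferred from the descriptions, and the function name is hypothetical.

```python
# Filter labels from the task description, paired with the keyword values listed above.
LICENSE_FILTER_OPTIONS = {
    "Use with attribution": "cc-by",
    "Use with attribution and same license": "cc-by-sa",
    "No restrictions": "unrestricted",
    "Other": "other",
}


def apply_license_filter(search_term, label=None):
    """Append the haslicense: keyword for the selected label, if any."""
    if label is None:
        return search_term
    value = LICENSE_FILTER_OPTIONS[label]
    return f"{search_term} haslicense:{value}"
```

For example, searching for "house" with "No restrictions" selected would produce the string `house haslicense:unrestricted`.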

I just noticed your patch. Not sure how well this scales. Maybe put it somewhere in MediaWiki namespace on Commons as json in the future so it can be maintained by interface admins?

Some lines are invalid because these are not valid licenses; please remove them:
'P275=Q6908632', // copyright licence = CC-BY-SA - This is a family, should never be used on Commons
'P275=Q6905323', // copyright licence = CC-BY - Same here
'P275=Q7257361', // copyright licence = Creative Commons Public Domain Mark - Public Domain Mark is not a license, should never be used as one

Add to the unrestricted section:
'P6216=Q88088423', // copyright status = copyrighted, dedicated to the public domain by copyright holder (Q88088423)
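
The requested corrections could be applied to a mapping along these lines. This is a Python sketch only: the thread does not say which section each invalid statement sat in or what the real mapping looks like, so the mapping shape, the section names used as keys, and the function name are assumptions.

```python
# Statements flagged above as invalid license values.
INVALID_STATEMENTS = {
    "P275=Q6908632",  # CC-BY-SA family
    "P275=Q6905323",  # CC-BY family
    "P275=Q7257361",  # Public Domain Mark
}

# Statement suggested above for the unrestricted section.
UNRESTRICTED_ADDITION = "P6216=Q88088423"


def clean_mapping(mapping):
    """Return a copy of the mapping without invalid statements, plus the suggested addition."""
    cleaned = {
        name: [s for s in statements if s not in INVALID_STATEMENTS]
        for name, statements in mapping.items()
    }
    unrestricted = cleaned.setdefault("unrestricted", [])
    if UNRESTRICTED_ADDITION not in unrestricted:
        unrestricted.append(UNRESTRICTED_ADDITION)
    return cleaned
```

The input mapping is left unmodified, so the cleaned result can be compared against the original before anything is deployed.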

Also wondering if we've missed some... I'm testing my patch locally, and an image with the license Creative Commons Attribution-Share Alike 3.0 is popping up under "other" instead of share with attribution/cc-by-sa.

I've tested the UI patch and confirmed it's working when pointing to production API endpoints. It's now ready for code review.

Change 643555 had a related patch set uploaded (by Anne Tomasevich; owner: Anne Tomasevich):
[mediawiki/extensions/WikibaseMediaInfo@master] Change order of bitmap tab filters to match design

https://gerrit.wikimedia.org/r/643555

Change 644866 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseCirrusSearch@master] Allow license config from messages

https://gerrit.wikimedia.org/r/644866

I just noticed your patch. Not sure how well this scales. Maybe put it somewhere in MediaWiki namespace on Commons as json in the future so it can be maintained by interface admins?

That is coming soon - it will be at https://commons.wikimedia.org/wiki/MediaWiki:Wikibasecirrus-license-mapping
That page won't be used until the override in existing code is gone (which I plan to deploy Thursday)
It's the same config that was previously in code, but with the fixes you suggested.

Change 646690 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[operations/mediawiki-config@master] Remove license map from config

https://gerrit.wikimedia.org/r/646690

Change 622426 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Add license filter

https://gerrit.wikimedia.org/r/622426

It'll be another week (next Thu) - patch did not get merged in time.
By then, UI should also be out & everything in this ticket is complete.

Change 644866 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Allow license config from messages

https://gerrit.wikimedia.org/r/644866

Change 649617 had a related patch set uploaded (by Cparle; owner: Cparle):
[operations/mediawiki-config@master] Remove license mapping for search for labs

https://gerrit.wikimedia.org/r/649617

Change 643555 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Change order of bitmap tab filters to match design

https://gerrit.wikimedia.org/r/643555

Change 649617 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove license mapping for search for labs

https://gerrit.wikimedia.org/r/649617

Change 649866 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseCirrusSearch@master] Allow dangling commas in whitespace-separated license config

https://gerrit.wikimedia.org/r/649866
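
What tolerating dangling commas in a whitespace-separated list could look like is sketched below in Python. The actual format of the message page and the PHP parsing code are not shown in this thread, so the input format and the function name here are assumptions.

```python
def parse_statements(text):
    """Split on whitespace and drop trailing commas left over from array-style config."""
    statements = []
    for token in text.split():
        cleaned = token.rstrip(",")
        if cleaned:  # a bare "," token becomes empty and is skipped
            statements.append(cleaned)
    return statements
```

With this approach, a list pasted from the earlier array-style config and a clean whitespace-separated list parse to the same result.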

Change 646690 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove license map from config

https://gerrit.wikimedia.org/r/646690

Etonkovidova subscribed.

Verified in commons wmf.22 - moving to Design QA for @mwilliams's quick review (if there are no follow-ups, please move it to Verify in production to check the latest patches on the task in wmf.25).

Notes
(1) The task specifies that

The license filter should only appear on the Images, Audio, and Video tabs.

The license filter appears on the Other tab too.

Screen Shot 2020-12-22 at 11.23.15 AM.png (496×901 px, 82 KB)

(2) (minor) The 'All licenses' filter is the first filter (from left to right) on the Video, Audio, and Other tabs. The Images tab has 'All images sizes' as the first filter. It's slightly inconsistent.

Screen Shot 2020-12-22 at 11.53.02 AM.png (207×610 px, 86 KB)
Screen Shot 2020-12-22 at 11.53.13 AM.png (290×544 px, 40 KB)

(3) (minor) In some languages the label "Use with attribution and same license" might become too long to fit correctly in the UI,
e.g. in German it occupies almost the whole width of the drop-down menu, and on mobile it overflows the width.

desktop: Screen Shot 2020-12-22 at 10.11.13 AM.png (335×673 px, 174 KB)
mobile: Screen Shot 2020-12-22 at 10.11.39 AM.png (478×402 px, 46 KB)

Thanks for your review, @Etonkovidova!

(1) I think this task was created before the Other tab was a thing. I figured since it's a media tab, the license filter applies, but we should confirm this (@Ramsey-WMF?)

(2) A patch to fix this was merged last week but didn't make it onto the train, so it should show up next year 😂

(3) Good find, I'll fix this ASAP!

(1) I think this task was created before the Other tab was a thing. I figured since it's a media tab, the license filter applies, but we should confirm this (@Ramsey-WMF?)

Correct! I've updated the task description ✔

Thanks, @Ramsey-WMF and @AnneT, for the quick feedback! I'll keep the task in Design QA just for @mwilliams's review.

Change 654286 had a related patch set uploaded (by Anne Tomasevich; owner: Anne Tomasevich):
[mediawiki/extensions/WikibaseMediaInfo@master] Improve select menu width styles

https://gerrit.wikimedia.org/r/654286

Change 654286 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Improve select menu width styles

https://gerrit.wikimedia.org/r/654286

Checked in commons wmf.26 - looks fine, including the latest patch for wrapping. I also did an additional cross-browser check.

However, there are some questions (they might not be actual issues!) that I decided to file as a separate task: T272000: MediaSearch - issues for Other License filter for Images. The observations listed there seem to be out of the scope of this task.

Change 649866 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Allow dangling commas in whitespace-separated license config

https://gerrit.wikimedia.org/r/649866