Search engines such as Cirrus should examine the content of all slots when updating the search index.
Description
Details
Subject | Repo | Branch | Lines +/- |
---|---|---|---|
Add a way to extract content scoped search index data | mediawiki/extensions/Wikibase | master | +15 -1 |
Event Timeline
Two big questions here are:
- One document or multiple documents? (I think the current trend is toward one document.)
- If the answer is one document, how to reconcile slots with potential intersections? I.e., if both slots want to put something in opening_text, what happens? Etc.
For now, I'd blindly concatenate. That's the baseline.
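To make the "blindly concatenate" baseline concrete, here is a minimal sketch in Python rather than MediaWiki's PHP. The function name `concatenate_slot_fields`, the slot names, and the sample field values are assumptions made for illustration only; this is not how Cirrus builds documents.

```python
from collections import defaultdict


def concatenate_slot_fields(slot_fields, separator="\n"):
    """Merge per-slot field data into one document.

    slot_fields maps a slot name to a dict of {field name: text}.
    When several slots provide the same field, their texts are joined
    in slot order. No attempt is made to resolve conflicts.
    """
    merged = defaultdict(list)
    for fields in slot_fields.values():
        for field_name, text in fields.items():
            if text:  # skip empty values so we don't emit stray separators
                merged[field_name].append(text)
    return {name: separator.join(parts) for name, parts in merged.items()}


# Hypothetical example: both slots want to contribute to opening_text.
slots = {
    "main": {"opening_text": "Photo of a cat.", "text": "Full wikitext body."},
    "mediainfo": {"opening_text": "A tabby cat on a sofa."},
}
doc = concatenate_slot_fields(slots)
```

With this approach neither slot's `opening_text` is lost, but the combined field may be less useful for features that expect a single short summary.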
We have to answer similar questions for a lot of things, including the generation of the HTML the user will see. I plan an RFC about that question.
- At least for cirrus, it pretty much needs to be one document if we want any kind of interaction between fields of multiple content types.
- I think, again only with respect to Cirrus, this is going to depend heavily on how those fields get into the queries issued. The current method, with a variety of hard-coded field names, really pushes for the ability to overwrite; for example, the work on file media info will overwrite the opening_text field on file pages. The two will have to be figured out in parallel, I suppose.
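As a rough illustration of the overwrite-versus-concatenate question, the sketch below (Python, not actual Cirrus code) applies a per-field policy: fields listed in `overwrite_fields` take the value from the last slot that provides them, and all other fields are concatenated. The function name, the parameter, and the slot ordering are hypothetical.

```python
def merge_with_policy(slot_fields, overwrite_fields=frozenset(), separator="\n"):
    """Merge per-slot fields into one document using a per-field policy.

    Fields named in overwrite_fields are replaced by later slots;
    every other field is concatenated in slot order.
    """
    merged = {}
    for fields in slot_fields.values():
        for name, text in fields.items():
            if not text:
                continue
            if name in overwrite_fields or name not in merged:
                merged[name] = text  # first value, or a later slot overwriting
            else:
                merged[name] = merged[name] + separator + text
    return merged


# Hypothetical file page: the media info slot overwrites opening_text,
# while the generic text field is still concatenated.
file_page_slots = {
    "main": {"opening_text": "Photo of a cat.", "text": "File description."},
    "mediainfo": {"opening_text": "A tabby cat on a sofa.", "text": "cat, sofa"},
}
file_doc = merge_with_policy(file_page_slots, overwrite_fields={"opening_text"})
```

Which fields belong in the overwrite set would depend on how the queries use them, which is why the two questions need to be settled together.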
(sorry I'm very new to MCR)
How will this work regarding namespaces?
I mean can there be a mix of namespaces here or is there a single top level namespace somewhere?
Should we set up some kind of meeting to sync on this and develop a strategy? Maybe at the hackathon? I am personally still rather fuzzy on how this whole thing is supposed to work, and on MCR details too, and I suspect I am not the only one :)
FWIW for the initial release of the SDoC multi-lingual captions stuff, I used the onSearchDataForIndex hook to write search data for MediaInfo slots
Update: We switched to CirrusSearchBuildDocumentParse in 2a0610b8a2d05d872878da292117f140520f5098.
That hook's interface is actually not MCR compatible, since it only takes a single Content object. I commented on the patch here in phab.
I worked around that in MediaInfo by using WikiPage::factory( $title )->getRevisionRecord() ... ought we raise a ticket to make the hook MCR compatible? Not really sure what's using the hook, so I'm not sure how to proceed ...
@Cparle this ticket here *is* about making sure all slots are passed to cirrus. Cirrus should then also pass them on via its own hooks. Changing a hook signature isn't trivial though, it's generally better to introduce a new hook.
I think this ticket here is sufficient to track the need to do this. Your workaround should be fine for MediaInfo for now. Perhaps, add a comment to your hook handler that points to this ticket.
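The "introduce a new hook rather than change the signature" idea can be sketched generically as follows. This is Python pseudostructure, not MediaWiki's hook system; the handler lists, `build_document`, and the simplified slot representation (slot name mapped to plain text) are all assumptions for illustration.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Slots = Dict[str, str]  # slot name -> content text (heavily simplified)

# Old-style handlers only ever see the main slot's content.
legacy_handlers: List[Callable[[Document, str], None]] = []
# New-style handlers receive every slot of the revision.
slot_aware_handlers: List[Callable[[Document, Slots], None]] = []


def build_document(slots: Slots) -> Document:
    """Run both hook generations so existing handlers keep working."""
    doc: Document = {}
    for handler in legacy_handlers:
        handler(doc, slots["main"])  # unchanged signature for old callers
    for handler in slot_aware_handlers:
        handler(doc, slots)
    return doc


def index_main_text(doc: Document, content: str) -> None:
    doc["text"] = content


def index_captions(doc: Document, slots: Slots) -> None:
    # No need to reload the revision: the slot is passed in directly.
    if "mediainfo" in slots:
        doc["captions"] = slots["mediainfo"]


legacy_handlers.append(index_main_text)
slot_aware_handlers.append(index_captions)

hook_doc = build_document({"main": "File description.", "mediainfo": "A tabby cat."})
```

Keeping the old hook untouched means existing extensions do not break, while slot-aware code can migrate to the new hook at its own pace.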
Change 472647 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseMediaInfo@master] Adding note about workaround pending T190066
Change 472647 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Adding note about workaround pending T190066
Change 837128 had a related patch set uploaded (by DCausse; author: DCausse):
[mediawiki/extensions/Wikibase@master] Add a way to extract content scoped search index data
Change 837128 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add a way to extract content scoped search index data