Citoid inserts bad information from Indian news sites
Open, Needs Triage, Public

Description

See these search results:

https://en.wikipedia.org/w/index.php?sort=relevance&search=( "Times of India" OR "India Today" ) insource:/\| *last3 *= *Ist/&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current={}&ns0=1

and this discussion thread:

https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:RefToolbar&oldid=940509175#Indian_sources_mis-read

This problem is currently affecting over 1,000 articles on the English Wikipedia. If Citoid can't find proper author information on these sites, it should not attempt to retrieve author data from them.

Event Timeline

I've opened a new translator request to Zotero for this one: https://github.com/zotero/translators/issues/2118

What's going on is that Zotero's Embedded Metadata translator is putting the entire "byline" in the author field, and then splitting it. PTI stands for Press Trust of India, I think.
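
To illustrate the failure mode, here is a minimal Python sketch. It is not Zotero's actual code: the byline string is made up, and the splitting rules (split on "|" and ",", treat the last word of each piece as a surname, title-case it) are assumptions chosen only to show how a timestamp in a byline could end up as an author named "Ist".

```python
import re


def naive_split_byline(byline: str) -> list[dict[str, str]]:
    """Hypothetical splitter: treat every '|' or ',' separated piece as a person."""
    creators = []
    for piece in re.split(r"[|,]", byline):
        words = piece.split()
        if not words:
            continue  # skip empty pieces
        creators.append({
            "first": " ".join(words[:-1]),   # everything but the last word
            "last": words[-1].capitalize(),  # last word, e.g. "IST" becomes "Ist"
        })
    return creators


# Hypothetical byline mixing an agency name with an update timestamp.
byline = "PTI | Updated: Feb 12 2020, 10:30 IST"

for number, creator in enumerate(naive_split_byline(byline), start=1):
    print(f"last{number}={creator['last']} first{number}={creator['first']}")
```

With this made-up input, the third "author" comes out with the surname "Ist", which is the kind of value the search in the description looks for.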

"If Citoid can't find proper author information on these sites, it should not attempt to retrieve author data from them."

Unfortunately, in the short term, it's easier to just write a translator than to exclude all malformed metadata - or at least, that's kind of a can of worms. I've opened https://github.com/zotero/translators/issues/2119 but that's kind of a band-aid. See also https://github.com/zotero/translators/blob/master/Embedded Metadata.js#L590

We could fork the Zotero translators to disallow using the byline entirely, or to disable the whole "allow low quality metadata" function (a very self-aware function name!). But we did that before and it was a bit of a maintenance headache, so we switched to using their repo directly instead of a fork, since they have a lot more people working on it than we do and it's generally better maintained. There's also https://github.com/zotero/translators/issues/1092, which might address that better, but it's been stalled, I think.

tl;dr: the fastest way to fix this is to fix Times of India explicitly; the other ways are all a bit stalled, because we no longer run the native Citoid scraper but use Zotero's.

Two more examples, from widely cited American sites:

Change 574504 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/zotero@master] Squashed commit of the following:

https://gerrit.wikimedia.org/r/574504

Change 574504 merged by jenkins-bot:
[mediawiki/services/zotero@master] Update Zotero to rMWf0cff95fed17

https://gerrit.wikimedia.org/r/574504

I just wanted to do a no-op deploy of zotero as part of T235411 and noticed that the change never made it to production. It is deployed to staging though.
Is there anything blocking this or is it okay to deploy?

(Adding @Pchelolo, just because you triggered the merge :-) )

This change is only part of what's needed to fix the bug; it also requires https://github.com/zotero/translators/pull/2122, which is not yet merged, and then the submodule that points to that repo needs to be updated in the translation-server (zotero) repository. So basically I never bothered deploying it, but deploying it is harmless.

Deployed that as said on IRC. As far as I can tell it looks good...

"All instances" is a bit misleading. That's searching for the "last3=Ist|" string.

Other instances of garbage/other timezones could be around.
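
As a rough illustration of how much the narrow search can miss, here is a small Python sketch comparing the original pattern with a broader one. The sample citations are made up, and the broader pattern (any `lastN` parameter, plus a short list of timezone-like values) is only an assumption about what other garbage might look like, not a vetted query.

```python
import re

# The pattern from the insource: search in the task description.
NARROW = re.compile(r"\| *last3 *= *Ist")

# Hypothetical broader pattern: any lastN parameter and a few timezone-like
# values. The timezone list is illustrative, not exhaustive.
BROAD = re.compile(r"\| *last\d* *= *(?:Ist|Est|Pst|Gmt|Utc)\b")

# Made-up wikitext fragments for demonstration only.
samples = [
    "{{cite news |last1=Pti |last2=2020 |last3=Ist |title=Example}}",
    "{{cite news |last1=Staff |last2=Ist |title=Example}}",
    "{{cite news |last1=Reporter |last2=Est |title=Example}}",
    "{{cite news |last=Istvan |first=Example |title=Example}}",
]


def count_matches(pattern: re.Pattern, texts: list[str]) -> int:
    """Count how many texts contain at least one match for the pattern."""
    return sum(1 for text in texts if pattern.search(text))


print("narrow:", count_matches(NARROW, samples))
print("broad:", count_matches(BROAD, samples))
```

The word boundary at the end of the broader pattern keeps a real surname such as "Istvan" from being flagged; a pattern like this would still need manual review before being used for cleanup.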

@Jonesey95: That's not surprising, as I don't see any change at all on enwiki in the way it handles the ToI or IT examples I gave in the original discussion there or the USA Today and LA Times examples given here.

Hi folks!

I hope it's appropriate to do this on a Phabricator thread -- but we've identified a couple of threads such as this one where there are some issues with Citoid and automatic extraction / generation of metadata, and @diegodlh has been working on a community-based solution for this problem, called Web2Cit. Web2Cit aims to solve some of those problems without requiring users to fiddle with Zotero translators or to have a lot of technical skills.

On May 11 at 4 PM UTC we will be running a workshop to show the tool and allow users to test the early-adopter version. If you're interested, you can register here: https://us06web.zoom.us/meeting/register/tZIpfu2upj4sE9ZrqblmM3-QujaeqekAAINK

If you want to know more about Web2Cit or the workshop, check here: https://meta.wikimedia.org/wiki/Web2Cit/Workshops

We would also greatly appreciate it if you happen to know anyone that might be interested in attending such a workshop and can handle some technical complexity.

cheers,
scann