Page MenuHomePhabricator

Handle property datatype changing gracefully
Open, HighPublic

Description

Pywikibot will completely break whenever T95686: [Task] write a maintenance script to migrate properties from string to new identifier datatype is used, as it caches the property datatype for one hundred years, and has done that since April 2013 {6ea6d333}.

The simple workaround, e.g. for people using old versions of pywikibot, is to delete the apicache each time a property datatype changes, but that isnt a user-friendly solution.

Changes that could mean Pywikibot is less impacted by this:

Pywikibot could force a cache refresh more regularly.

Ideally we detect that a property has a new datatype, and clear the relevant cache entry.

Also, Pywikibot can eliminate the reliance on getPropertyType, as many code paths already have the properties datatype from when the entity was loaded, e.g. via wbgetentities. If getPropertyType is only used infrequently, the caching can be eliminated.

Event Timeline

jayvdb raised the priority of this task from to High.
jayvdb updated the task description. (Show Details)
jayvdb added subscribers: jayvdb, Legoktm.
jayvdb set Security to None.

https://gerrit.wikimedia.org/r/137737 and https://gerrit.wikimedia.org/r/227252 basically mean that getPropertyType only gets called when Property is constructed manually, right?

https://gerrit.wikimedia.org/r/137737 and https://gerrit.wikimedia.org/r/227252 basically mean that getPropertyType only gets called when Property is constructed manually, right?

Yes. Property is a parent class of Claim, so the current exposure is quite large; roughly any time a Claim is instantiated directly in order to be used with addClaim or similar. i.e. reading will likely not break, at least not frequently and probably easily fixable, however writing to Wikibase will likely be very broken.

Do we expect this to happen again? A property's datatype is supposed to be immutable, and except for this one weird case, it is. Unless we can also vary on some other constant changing, I don't see any good cache invalidation strategy here.

If we expect this kind of change to happen again, I think we should lower the cache time to 1 day and accept the minor performance hit. If not, lets just tell users to rm -rf their apicache and continue on.

On cache invalidation, we can force a refresh of the getPropertyType property-datatype mapping whenever a new datatype is encountered. What would be nice is if Wikibase provided an API call to list all datatypes, so we can detect when the list of datatypes changes. i.e. expose https://test.wikidata.org/wiki/Special:ListDatatypes via the API.
However we dont need that, as we can get the list of datatypes from the paraminfo of wbformatvalue.datatype. However we need to be mindful of T55265, if that problem hasnt been fixed yet.

The harder, and much lower exposed problem, is if pywikibot users are using CachedRequest for entities, as the cached serialisation will have the old datatype in it, not the new datatype. IMO this problem is the callers responsibility, and our solution should be "delete your apicache".

I'm not sure what's the best to do for pywikibot here, but the plan is to perform such a change only once, on a previously discussed (thus known in advance) list of properties. Because of that, I think having a one-off solution would be ok here.

If there's anything else you need from us in order to reflect the migration in pywikibot, please let us know and we'll try to figure out a way.

I would recommend lowering the caching. We really try to not change the datatype of a property but I can't promise this will be the only time.

By the way the cache time doesn't matter that much as the cached item does not store the time of expiration but the time of creation. So we could reduce the time and anyone who uses a version after that won't have cached it for about 100 years.

What would be nice is if Wikibase provided an API call to list all datatypes, so we can detect when the list of datatypes changes. i.e. expose https://test.wikidata.org/wiki/Special:ListDatatypes via the API.

I'm not too sure how this relates to this bug, but submitted https://gerrit.wikimedia.org/r/247080 for this anyways.

Does pywikibot differentiate between value type and data type? Does pywikibot even *use* the data type as such? The value type is not going to change, only the data type.

For example, for the "url" data type, the value type is "string". For the new ID data type, the value type will remain "string" when the data type is changed from "string" to "ID".

Also note that this kind of change is not going to happen often. It's a breaking change (to the knowledge base, not the software), and will be announced as such.

Pywikibot does use the data type, quite extensively. e.g.
https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/page.py#L3951

They are mapped to classes, and mapped to value-types.

With something like https://gerrit.wikimedia.org/r/247080 , we can easily create those mappings dynamically at startup, and provide fallback classes to unknown data types.

I see the current patch puts this new data structure into meta=siteinfo. I am wondering why there, instead of meta=wikibase. Is meta=wikibase being deprecated?

IMO ideally this mapping should also be exposed on Special:ListDatatypes (e.g. mention that a 'URL' is stored using value type 'string') and then build an api query module to provide access all the information on ListDatatypes, including the help strings. That would mean this new api dataset is aligned with an existing special page dataset, and the two should naturally grow together consistently.

Have a look into the file ApiQuerySiteinfo.php and how it works. It's impossible to inject something like meta=wikibase siprop=wikibase. Why do you call this "deprecated"? This never existed, as far as I can see.

What we currently have in https://gerrit.wikimedia.org/r/247080 is good enough for now, in my opinion. If people prefer a separate API module, somebody should write a specification for that, or even better: submit actual code. Otherwise I suggest we merge what we have.

Have a look into the file ApiQuerySiteinfo.php and how it works. It's impossible to inject something like meta=wikibase. Why do you call this "deprecated"? This never existed, as far as I can see.

It isnt related to Siteinfo. It exists, but on the client (client/includes/api/ApiClientInfo.php)

https://www.wikidata.org/w/api.php?action=query&meta=wikibase

Sorry, I got confused. I fixed my comment above. So what you are proposing is a new ApiRepoInfo? Sure, "somebody" could do that, as said above.