
Look in the database for malformed globe-coordinate precisions
Closed, Resolved · Public

Description

In T144248: No RDF builder defined for data type globe-coordinate nor for value type bad in DispatchingValueSnakRdfBuilder::getValueBuilder it appears that some invalid globe-coordinate data found its way into the database. To enable us to form some kind of strategy for dealing with these malformed precision values, we need to look into how many occurrences of invalid data there are, and perhaps on which items they exist. (A sketch of the value shape in question appears below the acceptance criteria.)

Acceptance Criteria:

  • Generate some kind of report detailing occurrences of the malformed precisions on globe-coordinate values
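
For context, a well-formed globe-coordinate value in the JSON dumps has roughly the following shape (field values here are illustrative, not taken from a real item):

# Shape of a globe-coordinate datavalue as it appears in the JSON dump.
# Precision is expressed in degrees; the malformed values this task is
# about fall far outside any sensible degree-based range.
coordinate_value = {
    'latitude': 52.516666666667,
    'longitude': 13.383333333333,
    'altitude': None,
    'precision': 0.016666666666667,  # roughly one arcminute, in degrees
    'globe': 'http://www.wikidata.org/entity/Q2',  # Q2 = Earth
}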

Event Timeline

ItamarWMDE renamed this task from Look in the database for other such malformed precisions to Look in the database malformed globe-coordinate precisions. May 25 2021, 10:37 AM
ItamarWMDE renamed this task from Look in the database malformed globe-coordinate precisions to Look in the database for malformed globe-coordinate precisions. May 25 2021, 10:40 AM

I ran this Python script on https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 and it ran for about 5 hours (it did not complete; I terminated it, as the JSON dump is about 62G of data). Nevertheless, in those 5 hours I did not get a single malformed precision in the output file (badprecisions.txt). In other words, they are either extremely rare in the database, not present at all, or the script is missing something.

import sys
import json
import os
import bz2
import gzip

# Properties whose values are globe coordinates.
GLOBE_COORDINATE_PROPERTIES = [
    'P625', 'P626', 'P1259', 'P1332',
    'P1333', 'P1334', 'P1335', 'P2786',
    'P5140', 'P8981', 'P9149'
]


def read_dump(path):
    """Stream entities from a (possibly compressed) JSON dump, one per line."""
    mode = 'r'
    file_ = os.path.split(path)[-1]
    if file_.endswith('.gz'):
        f = gzip.open(path, mode)
    elif file_.endswith('.bz2'):
        f = bz2.BZ2File(path, mode)
    elif file_.endswith('.json'):
        f = open(path, mode)
    else:
        raise NotImplementedError(f'Reading file {file_} is not supported')
    try:
        for line in f:
            if isinstance(line, bytes):
                line = line.decode('utf-8')
            try:
                # Each line of the dump is one entity, with a trailing comma.
                yield json.loads(line.strip().strip(','))
            except json.JSONDecodeError:
                continue
    finally:
        f.close()


# Truncate the output file before starting.
with open('badprecisions.txt', 'w') as f:
    f.write('')

for item in read_dump(sys.argv[1]):
    for geo_property in GLOBE_COORDINATE_PROPERTIES:
        for claim in item.get('claims', {}).get(geo_property, []):
            try:
                precision = claim['mainsnak']['datavalue']['value']['precision']
            except (KeyError, TypeError):
                continue
            if precision is None:
                # Null precisions exist too; skip them here to avoid a
                # TypeError in the range comparison below.
                continue
            if precision >= 360 or precision <= -360:
                with open('badprecisions.txt', 'a') as f:
                    f.write(item['id'] + '\t' + str(precision) + '\n')
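
To sanity-check the range filter without waiting on the full dump, one can feed it a synthetic item shaped like one line of the dump (everything below is made up for illustration):

# A synthetic item mimicking the dump's JSON shape, with an absurd precision.
fake_item = {
    'id': 'Q0',
    'claims': {
        'P625': [{
            'mainsnak': {
                'datavalue': {
                    'value': {'latitude': 0.0, 'longitude': 0.0,
                              'precision': 10**15}
                }
            }
        }]
    }
}

value = fake_item['claims']['P625'][0]['mainsnak']['datavalue']['value']
precision = value['precision']
# This one trips the check and would land in badprecisions.txt.
assert precision >= 360 or precision <= -360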

I got a list of items that give a precision = null for statements with coordinate location (P625), but for some reason Q3642430 is not part of the list.

https://phabricator.wikimedia.org/P16319
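
For what it's worth, a variant of the check that also buckets null precisions could look like this (a sketch; this is not necessarily how the paste above was generated, and the function name is made up):

def classify_precision(precision):
    """Bucket a precision value into null / out-of-range / ok."""
    if precision is None:
        return 'null'
    if precision >= 360 or precision <= -360:
        return 'out-of-range'
    return 'ok'

assert classify_precision(None) == 'null'
assert classify_precision(146706019195900) == 'out-of-range'
assert classify_precision(1 / 3600) == 'ok'  # about one arcsecond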

Three notes:

  • Please edit your message and link to the paste instead of transcluding it (with {}). It's around 100K lines, it takes a really long time to open this ticket now, and my computer basically freezes because it's massive.
  • We need to check for more than P625; you can check for the datatype in the snak instead of the property (see the sketch after this list).
  • The ones that have None are interesting, but the main ones are the ones that are not between -365 and 365. It seems the json output just gives an error: https://www.wikidata.org/wiki/Special:EntityData/Q3642430.json
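
Regarding the second note, a sketch of filtering by the snak's datatype rather than a hard-coded property list (dictionary shape as in the dump; the helper name is hypothetical):

def iter_globe_coordinate_precisions(item):
    """Yield the precision of every globe-coordinate mainsnak on an item,
    regardless of which property carries it."""
    for claims in item.get('claims', {}).values():
        for claim in claims:
            mainsnak = claim.get('mainsnak', {})
            if mainsnak.get('datatype') != 'globe-coordinate':
                continue
            value = mainsnak.get('datavalue', {}).get('value', {})
            yield value.get('precision')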

New script: https://gist.github.com/rosalieper/83354c6677c92a008f412b21d7d41575. Thanks for the feedback, @Ladsgroup. I will post the results of the script once it is done running.

There are 2 affected items with malformed coordinate precisions:

Item ID     Precision
Q3629997    146706019195900
Q3642430    146706019195900

The script looks good and the findings are consistent with error messages found in logstash: 262 corresponding "RDF builder" error messages have been logged in the last 90 days, but none are found when excluding these two item IDs.

          #
  #      ##
  #     # #
#####     #
  #       #
  #       #
        #####

Quoting the earlier note:

  • The ones that have None are interesting, but the main ones are the ones that are not between -365 and 365. It seems the json output just gives an error: https://www.wikidata.org/wiki/Special:EntityData/Q3642430.json

Why -365 and 365? If the precision is in degrees (as suggested by Wikibase DataModel § Geographic locations), then I assume that range should use 360, not 365. But also… what’s a negative precision supposed to mean anyway? (I don’t really understand this field, I think.)
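
For reference, the precisions one normally sees in the wild are small fractions of a degree (the constants below are illustrative):

# Typical well-formed precisions, expressed in degrees:
ARCMINUTE = 1 / 60    # ~0.01667
ARCSECOND = 1 / 3600  # ~0.000278
# The malformed value found above, 146706019195900, works out to roughly
# 4e11 full circles -- far outside any degree-based reading.
print(146706019195900 / 360)  # ~4.075e11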

This part of this task looks good to me.