
Import UK lakes in Wikidata
Open, Medium, Public

Description

The UK Lakes Portal lists lakes in the UK. For each lake, it gives various useful statistics (surface area, volume, …) which could be imported into Wikidata.

@Jheald has recently created UK Lakes Portal ID (P7548) to link to this portal from Wikidata.

It would be a good OpenRefine exercise to import these IDs, as well as other statistics, from this database.

A table of the 1000 biggest lakes in their database can be found here: http://pintoch.ulminfo.fr/adc2c9aaba/lakes-portal.tsv

Event Timeline

Ecritures triaged this task as Medium priority.
Ecritures moved this task from Backlog to SPARQLstation on the Wiki-Techstorm-2019 board.
Ecritures removed a subscriber: Pintoch.

Hi @Pintoch. Thanks for this.

I would say that the critical thing for good matching of data like this is coordinates -- in particular, because there are a number of lochs and lakes in the UK that often share names with other lochs or lakes. So if you could adjust your scraper to get the Grid Reference as well, that would be incredibly useful.

When it comes to matching, my preferred strategy would be to try to match on geographical proximity, then use the name for validation (in contrast to OpenRefine, which I think does things the other way round). So my steps for something like this would be:

  • first try to scrape *all* of the IDs and names and coordinates from the site; and run a WDQS query for all "bodies of water" in the UK and Northern Ireland.
  • then run the WDQS results through a script to give me the 8 nearest hits from the portal within 5 km of the WD coordinates.
  • then examine these potential matches in stages -- first identify exact name matches, and upload those with QS; then see if there are any matches with a Levenshtein distance of 1 or 2, sanity-check those, and upload them; then go through the rest of the match file by hand, identifying any further matches for upload.
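The proximity-first steps above could be sketched roughly as follows. This is only an illustration, not the script actually used: the function names, the dictionary keys (`lat`, `lon`) and the stage labels are assumptions, and the distance is a plain haversine great-circle distance rather than anything specific to the portal data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, adequate for a 5 km cut-off


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def nearest_candidates(wd_item, portal_rows, max_km=5.0, limit=8):
    """Return up to `limit` (distance_km, row) pairs within `max_km`, nearest first."""
    hits = []
    for row in portal_rows:
        d = haversine_km(wd_item["lat"], wd_item["lon"], row["lat"], row["lon"])
        if d <= max_km:
            hits.append((d, row))
    hits.sort(key=lambda h: h[0])
    return hits[:limit]


def stage(wd_name, portal_name):
    """Sort a candidate pair into a review stage by case-insensitive name distance."""
    d = levenshtein(wd_name.casefold(), portal_name.casefold())
    if d == 0:
        return "exact"    # upload directly
    if d <= 2:
        return "near"     # sanity-check, then upload
    return "manual"       # go through by hand
```

Each Wikidata item would be passed through `nearest_candidates`, and every returned pair through `stage`, so that the "exact" pairs can be uploaded first and the rest reviewed in order.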

Of course, this won't find any matches if the WD items don't have P31 statements to identify them as bodies of water, or don't have good coordinates. So once the above has been done, it's still worth running through the remainder of the extract in stages with OpenRefine. Something I may need a help-note on is the best way to then filter proposed matches to exclude candidates that have a country that is not 'UK', or to de-prioritise candidates that have a P31 that is not in the "body of water" tree (e.g. any settlements named after the lochs).
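For reference, the country and "body of water" conditions mentioned here might be expressed on the WDQS side along these lines. This is a sketch under assumptions: Q145 (United Kingdom) and Q15324 (body of water) are the item IDs assumed for the two constraints and should be checked before use, the function names and the User-Agent string are made up for illustration, and a subclass-tree query like this may need narrowing if it times out. It does not show how to apply the same filter inside OpenRefine.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(country_qid="Q145", class_qid="Q15324"):
    """Build a SPARQL query for items in the class tree, in the country, with coordinates.

    The default QIDs are assumptions (United Kingdom, body of water).
    """
    return f"""
SELECT ?item ?itemLabel ?coord WHERE {{
  ?item wdt:P31/wdt:P279* wd:{class_qid} ;
        wdt:P17 wd:{country_qid} ;
        wdt:P625 ?coord .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def run_query(query):
    """Send the query to WDQS and return the list of result bindings (needs network)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # A descriptive User-Agent is expected by the service; this one is a placeholder.
    req = urllib.request.Request(url, headers={"User-Agent": "uk-lakes-matching-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]
```

Items that fail either condition would simply not appear in the result, which is why the text above still recommends a second pass for items lacking P31 or coordinates.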

But the really useful thing would be if you could re-do the scrape to include the Grid References, so those can be turned into coordinates. As the data isn't directly exposed as HTML, if you already have a utility that can extract this, that would be very very helpful.
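Turning a Grid Reference into numbers could start with something like the sketch below, which parses a two-letter Ordnance Survey National Grid reference (Great Britain) into an easting and northing in metres. The function name is hypothetical, the format of the portal's Grid References is assumed rather than confirmed, and single-letter Irish Grid references (used in Northern Ireland) are not handled. Converting the resulting easting/northing to latitude/longitude is a separate projection step (for example with a library such as pyproj, from EPSG:27700 to EPSG:4326) and is not shown.

```python
GRID_LETTERS = "ABCDEFGHJKLMNOPQRSTUVWXYZ"  # 25 letters: the grid skips "I"
DIGITS = "0123456789"


def parse_gb_grid_ref(ref):
    """Parse e.g. 'NH 535 295' into (easting, northing) in metres.

    The result is the south-west corner of the referenced square.
    """
    s = "".join(ref.split()).upper()
    letters, digits = s[:2], s[2:]
    if len(letters) != 2 or any(c not in GRID_LETTERS for c in letters):
        raise ValueError(f"bad grid letters in {ref!r}")
    if len(digits) % 2 or len(digits) > 10 or any(c not in DIGITS for c in digits):
        raise ValueError(f"bad grid digits in {ref!r}")

    l1 = GRID_LETTERS.index(letters[0])  # 500 km square
    l2 = GRID_LETTERS.index(letters[1])  # 100 km square within it
    # Letters run west to east, north to south, in 5x5 blocks;
    # the false origin lies at the south-west corner of square SV.
    e100km = ((l1 - 2) % 5) * 5 + l2 % 5
    n100km = 19 - (l1 // 5) * 5 - l2 // 5

    half = len(digits) // 2
    # Right-pad each half to 5 digits so '535' means 53,500 m.
    e = int(digits[:half].ljust(5, "0")) if half else 0
    n = int(digits[half:].ljust(5, "0")) if half else 0
    return e100km * 100_000 + e, n100km * 100_000 + n
```

For example, `parse_gb_grid_ref("TQ 30 80")` gives `(530000, 180000)`.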

Pintoch subscribed.

@Jheald I don't think anyone is working on this anymore: if you are still interested in the scraped dataset, it is here: http://pintoch.ulminfo.fr/adc2c9aaba/lakes-portal.tsv

@Pintoch That's great. I'll try and get matching these this weekend. Is there any chance of the full dataset, beyond the first 1000? We currently have 1660 items for bodies of water in the UK (plus more that possibly don't have P31s), so it would be nice to be able to try to match them all. But thanks again for this!

I think it is a bit harder to extract more than 1000 records (the cap was not a deliberate choice on my part to keep it manageable for the workshop).