Import UK lakes in Wikidata
Open, Medium, Public

Assigned To

None

Authored By

	Pintoch
	Nov 14 2019, 3:48 PM

Description

The UK Lakes Portal lists lakes in the UK. For each lake, it lists various useful statistics (surface area, volume, …) which could be imported into Wikidata.

@Jheald has recently created UK Lakes Portal ID (P7548) to link to this portal from Wikidata.

It would be a good OpenRefine exercise to import these IDs, as well as other statistics, from this database.

A table of the 1000 biggest lakes in their database can be found here: http://pintoch.ulminfo.fr/adc2c9aaba/lakes-portal.tsv

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Pintoch	T236038 WORKSHOP: OpenRefine (10.30 - 12.15)
		Open		None	T238340 Import UK lakes in Wikidata

Event Timeline

Pintoch created this task. Nov 14 2019, 3:48 PM

Pintoch removed Pintoch as the assignee of this task. Nov 14 2019, 4:12 PM

Pintoch moved this task from Backlog to Community data imports with OpenRefine on the OpenRefine board. Nov 14 2019, 6:18 PM

Ecritures assigned this task to Pintoch. Nov 14 2019, 6:19 PM

Ecritures triaged this task as Medium priority.

Ecritures moved this task from Backlog to SPARQLstation on the Wiki-Techstorm-2019 board.

Ecritures removed a subscriber: Pintoch.

Hi @Pintoch. Thanks for this.

I would say that the critical thing for good matching of data like this is coordinates -- in particular, because there are a number of lochs and lakes in the UK that often share names with other lochs or lakes. So if you could adjust your scraper to get the Grid Reference as well, that would be incredibly useful.

When it comes to matching, my preferred strategy would be to try to match on geographical proximity, then use the name for validation. (In contrast to OpenRefine, which I think does things the other way round.) So my steps for something like this would be:

first try to scrape *all* of the IDs and names and coordinates from the site; and run a WDQS query for all "bodies of water" in the UK and Northern Ireland.
then run the WDQS results through a script to give me the 8 nearest hits from the portal within 5 km of the WD coordinates.
then examine these potential matches in stages -- first identify exact name matches, and upload those with QS; then see if there are any matches with a Levenshtein distance of 1 or 2, sanity-check those, and upload them; then go through the rest of the match file by hand, identifying any further matches for upload.
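The proximity-first steps above can be sketched roughly as follows. This is only an illustration under assumptions, not the script actually used: the dictionary keys (`name`, `lat`, `lon`) and the function names are hypothetical, and the portal rows are assumed to have already been converted to latitude/longitude.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (degrees)."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def nearest_candidates(item, portal_rows, max_km=5.0, k=8):
    """Up to k portal rows within max_km of a Wikidata item, nearest first."""
    scored = []
    for row in portal_rows:
        d = haversine_km(item["lat"], item["lon"], row["lat"], row["lon"])
        if d <= max_km:
            scored.append((d, row))
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]

def stage_candidates(item, candidates):
    """Label each candidate 'exact', 'near' (edit distance 1 or 2) or 'manual'."""
    staged = []
    for _distance_km, row in candidates:
        edits = levenshtein(item["name"].lower(), row["name"].lower())
        label = "exact" if edits == 0 else "near" if edits <= 2 else "manual"
        staged.append((label, row))
    return staged
```

The 'exact' rows would go to QuickStatements first, the 'near' rows after a sanity check, and the 'manual' rows would be reviewed by hand. A linear scan over all portal rows is used for simplicity; for a few thousand records on each side that should be fast enough.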

Of course, this won't find any matches if the WD items don't have P31 statements to identify them as bodies of water; or don't have good coordinates. So once the above has been done, it's still worth running through the remainder of the extract in stages with OpenRefine. Something I may need a help-note on is the best way to then filter proposed matches to exclude candidates that have a country that is not 'UK', or to de-prioritise candidates that have a P31 that is not in the "body of water" tree (eg any settlements named after the lochs).
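The two conditions mentioned here (country is the UK, P31 somewhere in the "body of water" tree) are the same ones a WDQS extract would use, so a query along the following lines might serve for both the initial extract and for checking candidates afterwards. This is a sketch under assumptions: Q145 (United Kingdom) and Q15324 (body of water) are the identifiers assumed for the defaults, the function name is hypothetical, and the `P31/P279*` path may be slow or time out on the live service.

```python
def build_wdqs_query(country_qid="Q145", class_qid="Q15324"):
    """Build a SPARQL query for items of a class tree in a country, with coordinates.

    Default QIDs are assumptions: Q145 = United Kingdom, Q15324 = body of water.
    """
    return f"""
SELECT ?item ?itemLabel ?coord WHERE {{
  ?item wdt:P31/wdt:P279* wd:{class_qid} ;
        wdt:P17 wd:{country_qid} ;
        wdt:P625 ?coord .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()

if __name__ == "__main__":
    # Paste the printed query into the Wikidata Query Service.
    print(build_wdqs_query())
```

Items without a P31, P17 or P625 statement will not appear in the results, which is exactly the gap described above.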

But the really useful thing would be if you could re-do the scrape to include the Grid References, so those can be turned into coordinates. As the data isn't directly exposed as HTML, if you already have a utility that can extract this, that would be very very helpful.
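As a first step towards coordinates, a British National Grid reference can be turned into an easting and northing in metres. The sketch below assumes the portal gives references in the usual two-letter form (for example `NN 166 712`); that format is an assumption, the function name is hypothetical, and sites in Northern Ireland use the separate Irish Grid, which this does not handle.

```python
def parse_gb_grid_ref(ref):
    """Convert a GB National Grid reference to (easting, northing) in metres.

    Returns the south-west corner of the referenced square.
    """
    ref = ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    valid = "ABCDEFGHJKLMNOPQRSTUVWXYZ"  # the grid lettering skips 'I'
    if len(letters) != 2 or any(c not in valid for c in letters):
        raise ValueError(f"bad grid letters in {ref!r}")
    if len(digits) % 2 or len(digits) > 10 or (digits and not digits.isdigit()):
        raise ValueError(f"bad grid digits in {ref!r}")

    # Letter indices with 'I' skipped, so A=0 ... H=7, J=8 ... Z=24.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # 100 km square indices, measured from the grid's false origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Split the digits evenly and scale to metres (3+3 digits means 100 m units).
    half = len(digits) // 2
    scale = 10 ** (5 - half)
    east = int(digits[:half]) * scale if half else 0
    north = int(digits[half:]) * scale if half else 0
    return e100km * 100_000 + east, n100km * 100_000 + north
```

The result is still on the OSGB36 National Grid (EPSG:27700); getting WGS84 latitude/longitude for comparison with P625 needs a further datum transformation, for which a projection library would normally be used.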

Pintoch removed Pintoch as the assignee of this task. Nov 15 2019, 4:31 PM

Pintoch subscribed.

Pintoch updated the task description. Nov 15 2019, 9:35 PM

Ecritures moved this task from SPARQLstation to Backlog on the Wiki-Techstorm-2019 board. Nov 15 2019, 9:54 PM

Jdforrester-WMF subscribed. Nov 15 2019, 10:15 PM

@Jheald I don't think anyone is working on this anymore: if you are still interested in the scraped dataset, it is here: http://pintoch.ulminfo.fr/adc2c9aaba/lakes-portal.tsv

@Pintoch That's great. I'll try and get matching these this weekend. Is there any chance of the full dataset, beyond the first 1000? We currently have 1660 items for bodies of water in the UK (plus more that possibly don't have P31s), so it would be nice to be able to try to match them all. But thanks again for this!

I think it is a bit harder to extract more than 1000 records (I didn't cap it on purpose to make it manageable for the workshop).
