This repository has been archived by the owner on May 19, 2021. It is now read-only.

Assess the quality of open data in an open data portal #9

Open
Stephen-Gates opened this issue Mar 12, 2016 · 14 comments

Comments

@Stephen-Gates

Create a tool to assess the quality of open data in an open data portal.
a challenge by ODI Queensland

Build on prior R work:

Leverage existing validation tools:

Apply standards, best practices or quality measures:

Assess an open data portal or two:

Use any or none of these suggestions to provide insights about the quality of open data and how it is published.

Help open data publishers improve so the data they publish can be used to deliver ongoing value.

Thinking about taking the challenge? Got questions? Reply below and we'll do our best to answer.

@Stephen-Gates Stephen-Gates changed the title Assess the quality of open data Assess the quality of open data in an open data portal Mar 12, 2016
@RMHogervorst

Perhaps a web tool where you drop your file and it tells you what is needed to comply with standards?

@pwalsh

pwalsh commented Mar 14, 2016

@RMHogervorst that is essentially what GoodTables does:

It is also available as a CLI or a python lib:

And, we are currently finishing off our Data Quality Dashboards, which could be used (they pretty much meet the challenge already :)):

Example data for quality assessment:

We are currently working on the feature/refactor branch of all these data-quality-* codebases, and will be happy to take contributions and questions in about a week.

@RMHogervorst

Oh great! That is very useful

@Stephen-Gates
Author

Thanks @pwalsh, great to see you here. I'll check out the data quality dashboard.
Hi @RMHogervorst, my motivation is to return some stats to a portal owner, and the data publishers that use it, to illustrate quality. My experience is that some data is poorly published and not reliably refreshed. I'd like to quantify that, show the publishers and encourage some corrective actions.

@MilesMcBain
Collaborator

Thanks for this suggestion @Stephen-Gates. Despite work already done in this area, I think there is still some scope for R tools. Ideas that come to mind:

  • A tool which could be a combination of testdat and @tierneyn's visdat that tests data for compliance with open data standards and visualises where departures occur within the data frame.
  • A wrapper for write.csv() that validates data against ODI standards before writing it.
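The "validate before writing" wrapper idea above could look something like this. A minimal sketch in Python for illustration (the R version would wrap write.csv()); the function name and the notion of "required fields" standing in for an ODI-style standard are assumptions, not an existing API:

```python
import csv

def write_checked_csv(rows, path, required_fields):
    """Write rows (a list of dicts) to CSV only if basic quality checks pass.

    Hypothetical sketch: 'required_fields' stands in for whatever a
    real open-data standard would actually require.
    """
    problems = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if not str(row.get(field, "")).strip():
                problems.append(f"row {i}: missing value for '{field}'")
    if problems:
        # Refuse to publish data that fails its own checks.
        raise ValueError("data failed quality checks:\n" + "\n".join(problems))
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=required_fields,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The point of the design is that publishing fails loudly: a publisher cannot write a file that breaks the declared contract without seeing why.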

@ivanhanigan
Contributor

Speaking as someone who has both a) worked at a data portal and b) published my own data, I agree with your aims to ensure quality. However, I hope this will be a conscientiously constructive and collegial process, rather than one that could quite easily (i.e. without meaning to) become a bit embarrassing for people who are shown that their publishing systems/publications are considered 'poor quality'. We don't want to provide disincentives or shame those who are essentially publishing data altruistically at a time when there is no real incentive to do so.

It will also be important to define what is considered 'good quality'... e.g. some non-tidy data are well suited to their purpose, as Jeff Leek points out here: http://simplystatistics.org/2016/02/17/non-tidy-data/


@Stephen-Gates
Author

@MilesMcBain I looked at visdat and was totally excited to see it was inspired by CSV Fingerprints. I think your suggestion would be a wicked combo.

@Stephen-Gates
Author

@ivanhanigan Totally agree. This is not a name-and-shame exercise. I have spoken with some portal owners and data publishers, and they're keen to understand how to improve and to demonstrate that they are improving over time. So perhaps a tool that graphs progress over time would be useful?

Re: what is good quality data?

My simple approach is "is it published as promised".
E.g.

  • if you said you'd release it monthly, it should be
  • if you said find it here, it should be there
  • if you said it's a CSV, it should be
  • if you said column 2 is a date, it should be.

I'm sure there are more scientific definitions of data quality... feel free to use those also.
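The "published as promised" checks above are simple enough to sketch. A minimal Python illustration (the function name, the column index, and the ISO date format are assumptions for the example, not anything the thread specifies):

```python
import csv
import io
from datetime import datetime

def check_published_as_promised(csv_text, date_column=1,
                                date_format="%Y-%m-%d"):
    """Check two of the promises: 'it's a CSV' and 'column 2 is a date'.

    Returns a list of problems; an empty list means the promises held.
    """
    problems = []
    try:
        rows = list(csv.reader(io.StringIO(csv_text)))
    except csv.Error as exc:
        return [f"not parseable as CSV: {exc}"]
    for lineno, row in enumerate(rows[1:], start=2):  # skip the header row
        if date_column >= len(row):
            problems.append(f"line {lineno}: missing column {date_column + 1}")
            continue
        try:
            datetime.strptime(row[date_column], date_format)
        except ValueError:
            problems.append(
                f"line {lineno}: '{row[date_column]}' is not a "
                f"{date_format} date")
    return problems
```

The release-frequency and it-should-be-where-you-said promises would need the portal's metadata as well, which is where an API like CKAN's comes in.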

@RMHogervorst

So these are actually two different use cases: checking the metadata and checking the data itself.

@Stephen-Gates
Author

@RMHogervorst that's correct but the challenge is totally flexible.
Focus on what helps you, the community, and data publishers - or something else entirely ;-)

@ivanhanigan
Contributor

@Stephen-Gates I suggest these use cases have many dimensions. I'd like to specify the aim more before exploring the possibilities. In particular I note the different 'quality benchmarks' applicable to data portals run for government depts vs portals run for scientists. The former might be replete with administrators with data curation high in their work priorities, while the latter may be cobbled together by scientists eschewing the compulsion to compete and instead opting for open science, or alternately reacting to funders/journals requirements to publish supporting information and data with papers. The expectations you might have for quality metadata/data in the former might well be a lot higher than for the latter (and this would be justifiable given the lack of resourcing funders/universities give scientists to engage in data publishing activities).

Another dimension that is not clear in this thread is the spectrum between open data and mediated data. Often mediated data is easily available, with portals simply requiring user registration so they can collect download statistics and analyse usage by demographic groups, or to meet data depositors' requests to be made aware of proposed re-use so that they can keep in contact and provide collegial support for downstream users of their data. These data are not technically open, but in practice they are essentially open. I suspect quality may differ between purely open and mediated-but-easy-to-get-at data portals, and this might be worth thinking about too.

My 2cents.

@Stephen-Gates
Author

@ivanhanigan Great points. I think Governments are equally resource constrained when it comes to publishing open data and the variation in quality will be equally diverse. I understand that many research data portals are not technically open and may not present an API to the catalogue. So if anyone was considering the challenge, I'd suggest using a government CKAN portal that presents an open API. You could explore data.gov.au or data.qld.gov.au (see http://docs.ckan.org/en/latest/api/index.html).
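For anyone taking up the CKAN suggestion, the Action API linked above is plain JSON over HTTP. A hedged sketch in Python: the URL shape follows the documented CKAN v3 Action API, but the sample response in the usage note is illustrative, and the helper names are my own:

```python
from urllib.parse import urlencode

def package_show_url(portal, dataset_id):
    """Build a CKAN Action API URL for package_show on a given portal."""
    return f"{portal}/api/3/action/package_show?{urlencode({'id': dataset_id})}"

def summarize_resources(response):
    """Pull each resource's declared format and last-modified stamp from a
    package_show response (a dict with 'success' and 'result' keys)."""
    if not response.get("success"):
        raise ValueError("CKAN call reported failure")
    return [(r.get("format"), r.get("last_modified"))
            for r in response["result"].get("resources", [])]
```

Against a live portal you would fetch the URL (e.g. with urllib.request.urlopen), json.load the body, and pass it to summarize_resources; comparing the declared format and last_modified against the dataset's stated update frequency gets at the "published as promised" question.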

@Stephen-Gates
Author

More food for thought:

@cofiem

cofiem commented Apr 18, 2016

To me, thinking about data science goes together with assessing quality. I think this collection of data science links and this list of public datasets are relevant to this topic.
