This repository has been archived by the owner on May 19, 2021. It is now read-only.

Assess the quality of open data in an open data portal #9

Open
Stephen-Gates opened this issue Mar 12, 2016 · 14 comments

Comments

@Stephen-Gates

Create a tool to assess the quality of open data in an open data portal.
a challenge by ODI Queensland

Build on prior R work:

Leverage existing validation tools:

Apply standards, best practices or quality measures:

Assess an open data portal or two:

Use any or none of these suggestions to provide insights about the quality of open data and how it is published.

Help open data publishers improve so the data they publish can be used to deliver ongoing value.

Thinking about taking the challenge? Got questions? Reply below and we'll do our best to answer.

@Stephen-Gates Stephen-Gates changed the title Assess the quality of open data Assess the quality of open data in an open data portal Mar 12, 2016
@RMHogervorst

Perhaps a web tool where you drop your file and it tells you what is needed to comply with standards?

@pwalsh

pwalsh commented Mar 14, 2016

@RMHogervorst that is essentially what GoodTables does:

It is also available as a CLI or a python lib:

And, we are currently finishing off our Data Quality Dashboards, which could be used (they pretty much meet the challenge already :)):

Example data for quality assessment:

We are currently working on the feature/refactor branch of all these data-quality-* codebases, and will be happy to take contributions and questions in about a week.

@RMHogervorst

Oh great! That is very useful

@Stephen-Gates
Author

Thanks @pwalsh, great to see you here. I'll check out the data quality dashboard.
Hi @RMHogervorst, my motivation is to return some stats to a portal owner, and the data publishers that use it, to illustrate quality. My experience is that some data is poorly published and not reliably refreshed. I'd like to quantify that, show the publishers and encourage some corrective actions.

@MilesMcBain
Collaborator

Thanks for this suggestion @Stephen-Gates. Despite work already done in this area, I think there is still some scope for R tools. Ideas that come to mind:

  • A tool which could be a combination of testdat and @tierneyn's visdat that tests data for compliance with open data standards and visualises where departures occur within the data frame.
  • A wrapper for write.csv() that validates data against ODI standards before writing it.
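The "validate before writing" wrapper idea above could look something like this. A minimal sketch in Python for illustration (the R version would wrap write.csv()); the function name and the notion of "required fields" standing in for an ODI-style standard are assumptions, not an existing API:

```python
import csv

def write_checked_csv(rows, path, required_fields):
    """Write rows (a list of dicts) to CSV only if basic quality checks pass.

    Hypothetical sketch: 'required_fields' stands in for whatever a
    real open-data standard would actually require.
    """
    problems = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if not str(row.get(field, "")).strip():
                problems.append(f"row {i}: missing value for '{field}'")
    if problems:
        # Refuse to publish data that fails its own checks.
        raise ValueError("data failed quality checks:\n" + "\n".join(problems))
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=required_fields,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The point of the design is that publishing fails loudly: a publisher cannot write a file that breaks the declared contract without seeing why.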

@ivanhanigan
Contributor

Speaking as someone who has both a) worked at a data portal and b) published my own data, I agree with your aims to ensure quality. However, I hope this will be a conscientiously constructive and collegial process, rather than one that could quite easily (i.e. without meaning to) become a bit embarrassing for people who are shown that their publishing systems/publications are considered 'poor quality'. We don't want to provide disincentives or shame those who are essentially publishing data altruistically at a time when there is no real incentive to do so.

It will also be important to define what is considered 'good quality'... e.g. some non-tidy data are well suited to their purpose, as Jeff Leek points out here: http://simplystatistics.org/2016/02/17/non-tidy-data/


@Stephen-Gates
Author

@MilesMcBain I looked at visdat and was totally excited to see it was inspired by CSV Fingerprints. I think your suggestion would be a wicked combo.

@Stephen-Gates
Author

@ivanhanigan Totally agree. This is not a name-and-shame exercise. I have spoken with some portal owners and data publishers, and they're keen to understand how to improve and to demonstrate that they are improving over time. So perhaps a tool that graphs progress over time would be useful?

Re: what is good quality data?

My simple approach is "is it published as promised".
E.g.

  • if you said you'd release it monthly, it should be
  • if you said find it here, it should be there
  • if you said it's a CSV, it should be
  • if you said column 2 is a date, it should be.

I'm sure there are more scientific definitions of data quality... feel free to use those also.
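The "published as promised" checks above are simple enough to sketch. A minimal Python illustration (the function name, the column index, and the ISO date format are assumptions for the example, not anything the thread specifies):

```python
import csv
import io
from datetime import datetime

def check_published_as_promised(csv_text, date_column=1,
                                date_format="%Y-%m-%d"):
    """Check two of the promises: 'it's a CSV' and 'column 2 is a date'.

    Returns a list of problems; an empty list means the promises held.
    """
    problems = []
    try:
        rows = list(csv.reader(io.StringIO(csv_text)))
    except csv.Error as exc:
        return [f"not parseable as CSV: {exc}"]
    for lineno, row in enumerate(rows[1:], start=2):  # skip the header row
        if date_column >= len(row):
            problems.append(f"line {lineno}: missing column {date_column + 1}")
            continue
        try:
            datetime.strptime(row[date_column], date_format)
        except ValueError:
            problems.append(
                f"line {lineno}: '{row[date_column]}' is not a "
                f"{date_format} date")
    return problems
```

The release-frequency and it-should-be-where-you-said promises would need the portal's metadata as well, which is where an API like CKAN's comes in.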

@RMHogervorst

So these are actually two different use cases: checking the metadata and checking the data itself.

@Stephen-Gates
Author

@RMHogervorst that's correct but the challenge is totally flexible.
Focus on what helps you, the community, and data publishers - or something else entirely ;-)

@ivanhanigan
Contributor

@Stephen-Gates I suggest these use cases have many dimensions. I'd like to specify the aim more before exploring the possibilities. In particular I note the different 'quality benchmarks' applicable to data portals run for government depts vs portals run for scientists. The former might be replete with administrators with data curation high in their work priorities, while the latter may be cobbled together by scientists eschewing the compulsion to compete and instead opting for open science, or alternately reacting to funders/journals requirements to publish supporting information and data with papers. The expectations you might have for quality metadata/data in the former might well be a lot higher than for the latter (and this would be justifiable given the lack of resourcing funders/universities give scientists to engage in data publishing activities).

Another dimension that is not clear in this thread is the spectrum between open data and mediated data. Often mediated data is easily available, with portals simply requiring user registration so they can collect download statistics and analyse usage by demographic groups, or to meet data depositors' requests to be made aware of proposed re-use so that they can keep in contact and provide collegial support for downstream users of their data. These data are not technically open, but in practice they are essentially open. I suspect quality may differ between purely open and mediated-but-easy-to-get-at data portals, and this might be worth thinking about too.

My 2cents.

@Stephen-Gates
Author

@ivanhanigan Great points. I think Governments are equally resource constrained when it comes to publishing open data and the variation in quality will be equally diverse. I understand that many research data portals are not technically open and may not present an API to the catalogue. So if anyone was considering the challenge, I'd suggest using a government CKAN portal that presents an open API. You could explore data.gov.au or data.qld.gov.au (see http://docs.ckan.org/en/latest/api/index.html).
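For anyone taking up the CKAN suggestion, the Action API linked above is plain JSON over HTTP. A hedged sketch in Python: the URL shape follows the documented CKAN v3 Action API, but the sample response in the usage note is illustrative, and the helper names are my own:

```python
from urllib.parse import urlencode

def package_show_url(portal, dataset_id):
    """Build a CKAN Action API URL for package_show on a given portal."""
    return f"{portal}/api/3/action/package_show?{urlencode({'id': dataset_id})}"

def summarize_resources(response):
    """Pull each resource's declared format and last-modified stamp from a
    package_show response (a dict with 'success' and 'result' keys)."""
    if not response.get("success"):
        raise ValueError("CKAN call reported failure")
    return [(r.get("format"), r.get("last_modified"))
            for r in response["result"].get("resources", [])]
```

Against a live portal you would fetch the URL (e.g. with urllib.request.urlopen), json.load the body, and pass it to summarize_resources; comparing the declared format and last_modified against the dataset's stated update frequency gets at the "published as promised" question.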

@Stephen-Gates
Author

More food for thought:

@cofiem

cofiem commented Apr 18, 2016

To me, thinking about data science goes together with assessing quality. I think this collection of data science links and this list of public datasets are relevant to this topic.
