ODQ2015 - Open Data Quality: from Theory to Practice - Nexa Center for Internet & Society

In theory there is no difference between theory and practice; in practice there is.
Walter J. Savitch

Monday, 30 March 2015
10.00 – 18.00

Technische Universität München | Institut für Informatik

Boltzmannstr. 3 , 85748 Garching bei München – Room 00.08.038

The amount and variety of open data published by governments is increasing. A sufficient level of data quality is one of the preconditions for effective open data reuse, which in turn is supposed to increase governments’ transparency and to create business opportunities. In particular, the “open” nature of a data set magnifies the implications of its contextual quality: open data are supposed to stimulate serendipitous reuse and generate unexpected mixes and matches, but this requires that virtually anybody with the necessary background is able to understand the data, port them to other systems, etc.

ODQ2015 aimed at setting up a multidisciplinary dialogue that translates theoretical inputs for the evaluation of data quality (including a focus on the assessment of linked open data quality) into the adoption of shared practices for quality assurance. To this end, the workshop first reviewed the main theoretical references on open data quality evaluation, including recent contributions by researchers in the field. Second, examples of good and bad practices of open government data publication were presented and discussed. Finally, the workshop took theory and practice as inputs to define pragmatic “tool chains” and pipelines to be adopted by data publishers so as to ensure that their data is of high quality, and can be enriched and versioned over time.

The workshop was held as part of the Joint Research Action on “Evidence and Experimentation” (JRA3) of the Network of Excellence in Internet Science.

Programme and materials

Workshop Session #1: “Theory –> Practice”

The objective of this session is to present ongoing trends in measuring open data quality, as well as ongoing projects that particularly take into account open data quality as an enabler of their effectiveness.

10:00 – 10:10

Welcome and framework

Antonio Vetrò | SLIDES
Federico Morando

10:10 – 10:30

Round table

10:30 – 11:30

Modelling and measuring open data quality

Ulrich Atz, Open Data Institute, presenting the ODI’s Open Data Certificates | SLIDES (HTML).
Many discussions around the quality of open data are focused on the technical aspects. With Open Data Certificates we have taken a broader approach: we provide a framework for measuring how effective a publisher is in sharing a given dataset. For example, a Certificate encompasses rights and licensing, documentation, guarantees about availability and user support. The Certificate therefore offers a guide for leading practice in open data publishing.

Matthew Fullerton, SMARTLANE project, on quality measurement from the point of view of a platform for data exchange | SLIDES.
There has been a massive shift towards governments at all levels providing ready and open access to data. This has been made possible by internet and software technologies. Unfortunately, so much is possible with this technology that the focus can sometimes be on the data provision (usually in the form of a catalog) and not on the quality or usability of the data itself. Furthermore, tools for understanding the data or its quality, potentially in combination with other data sets, are usually not available. Catalogs however have great potential because the metadata can be useful for ensuring some aspects of quality directly, for example when the data was last updated, how often it will be updated, and which other data sets are relevant, but these aspects are often not used. And provision for data exploration, fusion and quality measurement tools can be efficiently realized when integrated into a catalog platform. It is time for these tools to be integrated into data provision, even if some tools are domain specific and complex. And irrespective of how well we can improve the situation with software, we need to lower the usage barriers, so that people start to use the data and software to generate information and knowledge from it. This will in turn improve the tools through feedback and improve the data as quality problems can first appear when trying to use the data. Here interested citizens can help in using the data that already exists, and transparency-driven platforms can help open up the data acquisition process.
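As a rough illustration of how catalog metadata could support a quality check, the following minimal sketch flags a dataset as stale when its last update is older than its declared update frequency allows. It is not taken from the talk: the frequency labels, the `MAX_AGE` thresholds and the `is_stale` function are assumptions made for this example.

```python
from datetime import date, timedelta

# Hypothetical mapping from a declared update frequency to the maximum
# acceptable age of a dataset. The labels and thresholds are illustrative
# assumptions, not part of any catalog standard.
MAX_AGE = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=31),
    "yearly": timedelta(days=366),
}


def is_stale(last_updated: date, frequency: str, today: date) -> bool:
    """Return True if the dataset is older than its declared frequency allows."""
    return today - last_updated > MAX_AGE[frequency]
```

For instance, a dataset declared as weekly and last updated on 1 March 2015 would be reported as stale on the day of the workshop, while one updated on 25 March 2015 would not.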

Marco Torchiano, Politecnico di Torino, and Antonio Vetrò, TUM, on recent research focused on empirical assessment of open data quality | SLIDES

[Paper] Jürgen Umbrich, Sebastian Neumaier and Axel Polleres. Towards assessing the quality evolution of Open Data portals | SLIDES
In this work, we present the Open Data Portal Watch project, a public framework to continuously monitor and assess the (meta-)data quality in Open Data portals. We critically discuss the objectiveness of various quality metrics. Further, we report on early findings based on 22 weekly snapshots of 90 CKAN portals and highlight interesting observations and challenges.
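To give a sense of what a simple metadata quality metric might look like, here is a minimal sketch of a completeness score: the share of expected metadata fields that are actually filled in for a dataset record. This is not the metric used by the Open Data Portal Watch project; the field names in `REQUIRED_FIELDS` and the `completeness` function are assumptions chosen for illustration.

```python
from typing import Iterable, Mapping

# Hypothetical set of metadata fields expected for each dataset record.
# The keys are illustrative and do not follow any particular portal schema.
REQUIRED_FIELDS = ("title", "description", "license", "last_updated")


def is_filled(value: object) -> bool:
    """A value counts as filled if it is not None and not blank text."""
    return value is not None and str(value).strip() != ""


def completeness(record: Mapping[str, object],
                 fields: Iterable[str] = REQUIRED_FIELDS) -> float:
    """Return the fraction of expected fields that are filled in the record."""
    fields = tuple(fields)
    filled = sum(1 for name in fields if is_filled(record.get(name)))
    return filled / len(fields)
```

A score like this is easy to compute over repeated snapshots, but it only says whether a field is present, not whether its content is accurate or useful.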

[Paper] Michael Klaes, Adam Trendowicz and Andreas Jedlitschka. What Makes Big Data Different from a Data Quality Assessment Perspective? Practical Challenges for Data and Information Quality Research | SLIDES
High-quality data is a prerequisite for most types of analysis. However, since data quality does not come for free, it has to be assessed and managed continuously. The increasing quantity, diversity, and velocity that characterize big data today make these tasks even more challenging. We identified challenges that are specific to big data quality assessments and provide some pointers to promising solution ideas. Moreover, we explain why big-data-specific challenges may also be worth considering when the quality of open data is in focus.

[Paper] Ertugrul Bircan Copur. Data quality modelling and measurements: A case study with German open government data | SLIDES
The focus of the work is on the quality of German open government data. We collected information about the negative and positive aspects of the datasets application developers made use of, and later evaluated the respective datasets using a set of metrics. The feedback from the developers and our own evaluations allow us to present empirical results on the quality of the data and good and bad practices during their construction. In the long run, we are hoping to develop a set of guidelines for open government data, in order to increase the overall quality of the data that is available to the public and create a standardised format to allow cross-set data usage.

11:30 – 11:50

Discussion after the presentations

11:50 – 12:40

Publishing and reusing

Giovanni Menduni, Politecnico di Milano, on data quality applied to OpenExpo | SLIDE

Stefano Gatti, CERVED group, on open data quality and commercial reuse | SLIDE

Lorenzo Canova, Nexa Center at Politecnico di Torino. How data quality affects reusability: a case study on published Italian Open Data | SLIDE

[Paper] Claudio Di Ciccio, Javier D. Fernandez and Jürgen Umbrich. Improving the usability of Open Data portals from a business process perspective | SLIDE
Open Data portals are considered to be the cornerstones of the Open Data movement, as they offer an infrastructure to publish, share and consume public information. From a business perspective, such portals can be seen as a non-profit data marketplace, in which users try to satisfy their demand and offer requirements in several different processes. In this work, we argue that studying these so far unexplored interaction processes bears the potential to make the portals more effective. We first outline a research roadmap to better understand the behaviour of consumers and publishers by mining the interaction logs of Open Data portals. Then, we discuss potential services on the basis of these outcomes, which can be integrated in current portals to optimize the interaction, improve data quality and user experience.

[Paper] Johann Höchtl. Institutionalising open data quality: processes, standards, tools | SLIDE
Open Data is a cheap resource; however, if data quality imposes usability constraints, data users will lose faith. The most prevalent data quality issues are mundane and include broken links, missing or poor descriptions, erroneous encodings and irregular CSV files. Fixing these issues has to happen in the process domain, by agreeing on formats according to established standards and by supporting both data publishers and the data user community with tools, acting all along the data lifecycle. This paper briefly assesses the current state of affairs in data quality research, showcases some results of data quality audits and provides some concrete measures for tackling the open data quality issue.
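One of the issues listed above, irregular CSV files, lends itself to a simple automated check. The following minimal sketch, which is not taken from the paper, reports the rows whose number of fields differs from that of the header row. The function name `irregular_rows` and the choice of the header as the reference are assumptions made for this example.

```python
import csv
import io
from typing import List


def irregular_rows(text: str, delimiter: str = ",") -> List[int]:
    """Return 1-based row numbers whose field count differs from the header's.

    The first row is assumed to be the header and is used as the reference.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    expected = len(rows[0])
    # Data rows start at row 2 because row 1 is the header.
    return [number for number, row in enumerate(rows[1:], start=2)
            if len(row) != expected]
```

A check of this kind could run when a file is uploaded, so that the publisher is told about malformed rows before users run into them.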

12:40 – 13:00

Discussion after the presentations

13:00 – 14:00

Lunch

Workshop Session #2 “Practice –> Policy”

Two groups worked in parallel to: 1) discuss how open data quality requirements can be embedded at different institutional levels; 2) discuss how to feed a shared data repository on open data quality measures.

Workshop Session #3 Wrap-up: 17:00 – 18:00

Key dates

Deadline for submission (short papers, 2-4 pages, via EasyChair): 6 February 2015 – Extended deadline: 15 February 2015

Notification of acceptance: 27 February 2015

Deadline for camera ready: 6 March 2015 – Extended deadline: 13 March 2015

Workshop: 30 March 2015, 10:00 – 18:00

No registration fee is required.

Topics

Topics of interest include, but are not limited to, the following:

Tool chains for open data quality improvement (e.g., enrichment pipeline, versioning pipeline, quality assurance pipeline).
Open data quality metrics and empirical evaluations.
Good/bad practices in data disclosure.
Assessment of the quality of data services (e.g., API to expose frequently changing data).
Quality of linked data.

Submission guidelines

The workshop will feature invited talks as well as peer-reviewed paper presentations organized according to the topic categories defined above. Only extended and policy-oriented versions will be considered for publication in the Internet Policy Review.

Papers must not exceed 4 pages, must be submitted electronically in .pdf format through EasyChair (via https://easychair.org/conferences/?conf=odq2015 – you will need to create an account if you do not have one already) and must conform to the Internet Policy Review guidelines (see: http://policyreview.info/authors). Each submission will be reviewed by at least two members of the Program Committee and will be evaluated on the basis of novelty, relevance, and appropriate comparison to related work. The Program Committee will make final decisions about which submissions to accept for presentation at the workshop.

In a second step, papers automatically qualify as manuscripts for the Internet Policy Review. Only the papers that include a policy angle (i.e., refer to or discuss governance, norms, ordering, standards, or further regulation, law) will be considered. Manuscripts can be further developed based on the reviews received at the workshop, but they should not exceed 25,000 characters (including blank spaces).