Open Data Quality: from Theory to Practice
30 March 2015 - Technische Universität München | Institut für Informatik
Boltzmannstr. 3, 85748 Garching bei München - Room 00.08.038
In theory there is no difference between theory and practice; in practice there is.
Walter J. Savitch (remark overheard at a computer science conference)
The amount and variety of open data published by governments is increasing. A sufficient level of data quality is one of the preconditions for effective open data reuse, which in turn is supposed to increase governments’ transparency and to create business opportunities. In particular, the “open” nature of a data set magnifies the implications of its contextual quality: open data are supposed to stimulate serendipitous reuse and generate unexpected mixes and matches, but this requires that virtually anybody with the necessary background is able to understand the data, port them to other systems, etc.
ODQ2015 aimed to set up a multidisciplinary dialogue that translates theoretical inputs for the evaluation of data quality (including a focus on the assessment of linked open data quality) into the adoption of shared practices for quality assurance. To this end, the workshop first reviewed the main theoretical references on open data quality evaluation, including recent contributions by researchers in the field. Second, examples of good and bad practices of open government data publication were presented and discussed. Finally, the workshop took theory and practice as inputs to define pragmatic “tool chains” and pipelines that data publishers can adopt to ensure that their data is of high quality and can be enriched and versioned over time.
The workshop was held as part of the Joint Research Action on “Evidence and Experimentation” (JRA3) of the Network of Excellence in Internet Science (http://internet-science.eu/groups/evidence-and-experimentation).
Programme and materials
Workshop Session #1 “Theory --> Practice”: 10:00 - 13:00
The objective of this session is to present ongoing trends in measuring open data quality, as well as ongoing projects that particularly take into account open data quality as an enabler of their effectiveness.
10:00 - 10:10: Welcome and framework (Antonio Vetrò and Federico Morando). See the slides by Antonio and Federico (PDF, 4.95 MB).
10:10 - 10:30: Round table: 2' each to introduce yourself and say why you’re here
10:30 - 11:30 Modelling and measuring open data quality (presentations: 10' each)
- Ulrich Atz, Open Data Institute, presenting the ODI's Open Data Certificates. See Ulrich's slides (HTML).
Many discussions around the quality of open data are focused on the technical aspects. With Open Data Certificates we have taken a broader approach: we provide a framework for measuring how effective a publisher is in sharing a given dataset. For example, a Certificate encompasses rights and licensing, documentation, guarantees about availability, and user support. The Certificate therefore offers a guide for leading practice in open data publishing.
- Matthew Fullerton, SMARTLANE project, on quality measurement from the point of view of a platform for data exchange. See Matthew's slides (PDF, 2.84 MB).
There has been a massive shift towards governments at all levels providing ready and open access to data. This has been made possible by internet and software technologies. Unfortunately, so much is possible with this technology that the focus can sometimes be on the data provision (usually in the form of a catalog) and not on the quality or usability of the data itself. Furthermore, tools for understanding the data or its quality, potentially in combination with other data sets, are usually not available. Catalogs, however, have great potential because their metadata can be used to ensure some aspects of quality directly (for example, when the data was last updated, how often it will be updated, and which other data sets are relevant), but these aspects are often not used. Provision for data exploration, fusion and quality measurement tools can be realized efficiently when integrated into a catalog platform. It is time for these tools to be integrated into data provision, even if some tools are domain specific and complex. Irrespective of how much we can improve the situation with software, we need to lower the usage barriers, so that people start to use the data and software to generate information and knowledge from it. This will in turn improve the tools through feedback and improve the data, as quality problems often first appear when someone tries to use the data. Here interested citizens can help in using the data that already exists, and transparency-driven platforms can help open up the data acquisition process.
- Marco Torchiano, Politecnico di Torino, and Antonio Vetrò, TUM, on recent research focused on empirical assessment of open data quality. See the slides by Marco and Antonio (PDF, 1.24 MB).
- [Paper] Jürgen Umbrich, Sebastian Neumaier and Axel Polleres. Towards assessing the quality evolution of Open Data portals. See Jürgen's slides (PDF, 3.13 MB).
In this work, we present the Open Data Portal Watch project, a public framework to continuously monitor and assess the (meta-)data quality in Open Data portals. We critically discuss the objectiveness of various quality metrics. Further, we report on early findings based on 22 weekly snapshots of 90 CKAN portals and highlight interesting observations and challenges.
- [Paper] Michael Klaes, Adam Trendowicz and Andreas Jedlitschka. What Makes Big Data Different from a Data Quality Assessment Perspective? Practical Challenges for Data and Information Quality Research. See Michael's slides (PDF, 200 KB).
High-quality data is a prerequisite for most types of analysis. However, since data quality does not come for free, it has to be assessed and managed continuously. The increasing quantity, diversity, and velocity that characterize big data today make these tasks even more challenging. We identified challenges that are specific to big data quality assessments and provide some pointers to promising solution ideas. Moreover, we explain why big-data-specific challenges may also be worth considering when the quality of open data is in focus.
- [Paper] Ertugrul Bircan Copur. Data quality modelling and measurements: A case study with German open government data. See Ertugrul's slides (PDF, 118.5 KB).
The focus of the work is on the quality of German open government data. We collected information about the negative and positive aspects of the datasets that application developers used, and later evaluated the respective datasets using a set of metrics. The feedback from the developers and our own evaluations allow us to present empirical results on the quality of the data and on good and bad practices during their construction. In the long run, we hope to develop a set of guidelines for open government data, in order to increase the overall quality of the data that is available to the public and to create a standardised format that allows cross-set data usage.
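As a minimal illustration of the point made in Matthew Fullerton's abstract above, that catalog metadata such as the last update date and the declared update frequency can be used to check some aspects of quality directly, a freshness check could be sketched as follows. The field names (`last_updated`, `frequency`), the frequency vocabulary and the age thresholds are assumptions made for this example; they are not taken from any specific catalog platform.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a declared update frequency to the maximum
# acceptable age of a dataset. Real catalogs use their own vocabularies.
MAX_AGE = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=31),
    "yearly": timedelta(days=366),
}


def is_stale(last_updated, frequency, now):
    """Return True if the dataset is older than its declared frequency allows.

    Returns None when the metadata needed for the check is missing or the
    frequency value is not in MAX_AGE, so that "unknown" is not confused
    with "fresh".
    """
    if last_updated is None or frequency not in MAX_AGE:
        return None
    return now - last_updated > MAX_AGE[frequency]


# Made-up metadata records, for illustration only.
records = [
    {"name": "bus-stops", "last_updated": datetime(2015, 3, 1), "frequency": "weekly"},
    {"name": "budget", "last_updated": datetime(2015, 1, 15), "frequency": "yearly"},
    {"name": "air-quality", "last_updated": None, "frequency": "daily"},
]

now = datetime(2015, 3, 30)
report = {r["name"]: is_stale(r["last_updated"], r["frequency"], now) for r in records}
print(report)
```

Keeping "metadata missing" as a separate outcome matters here, because the abstract notes that these metadata fields are often not used at all.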
11:30 - 11:50 Discussion after the presentations
11:50 - 12:40 Publishing and reusing (presentations: 10' each)
- Giovanni Menduni, Politecnico di Milano, on data quality applied to OpenExpo. See Giovanni's slides (PDF, 1.51 MB).
- Stefano Gatti, CERVED group, on open data quality and commercial reuse. See Stefano's slides (PDF, 1.51 MB).
This presentation takes a business perspective on both the main open data quality issues and the most important use cases of an Italian information provider. Data and metadata issues, as well as the value of the open datasets used in these projects, are examined in more depth with some quantitative and qualitative KPIs (key performance indicators).
- Lorenzo Canova, Nexa Center at Politecnico di Torino. How data quality affects reusability: a case study on published Italian Open Data. See Lorenzo's slides (PDF, 353 KB).
This presentation focuses on the basic quality dimensions that directly affect data reusability. The final goal is to find possible solutions for improving data quality, also taking some empirical evidence into consideration.
- [Paper] Claudio Di Ciccio, Javier D. Fernandez and Jürgen Umbrich. Improving the usability of Open Data portals from a business process perspective. See Jürgen's slides (PDF, 1.51 MB).
Open Data portals are considered to be the cornerstones of the Open Data movement, as they offer an infrastructure to publish, share and consume public information. From a business perspective, such portals can be seen as a non-profit data marketplace, in which users try to satisfy their demand and offer requirements in several different processes. In this work, we argue that studying these so far unexplored interaction processes bears the potential to make the portals more effective. We first outline a research roadmap to better understand the behaviour of consumers and publishers by mining the interaction logs of Open Data portals. Then, we discuss potential services on the basis of these outcomes, which can be integrated in current portals to optimize the interaction and improve data quality and user experience.
- [Paper] Johann Höchtl. Institutionalising open data quality: processes, standards, tools. See Johann's slides (PDF, 1.51 MB).
Open Data is a cheap resource; however, if data quality imposes usability constraints, data users will lose faith. The most prevalent data quality issues are mundane and include broken links, missing or poor descriptions, erroneous encodings and irregular CSV files. Fixing these issues has to happen in the process domain, by agreeing on formats according to established standards and by supporting both data publishers and the data user community with tools, acting all along the data lifecycle. This paper briefly assesses the current state of affairs in data quality research, showcases some results of data quality audits and proposes some concrete measures to tackle the open data quality issue.
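Two of the routine issues named in Johann Höchtl's abstract, erroneous encodings and irregular CSV files, lend themselves to simple automated checks. The following is a minimal sketch of such a check, not a tool presented at the workshop: the function name `check_csv_bytes` is hypothetical, and the sketch assumes that the publisher declares UTF-8 and that the first row is a header defining the expected number of fields.

```python
import csv
import io


def check_csv_bytes(raw, encoding="utf-8"):
    """Return a list of human-readable problems found in raw CSV bytes.

    Two checks are performed: the bytes must decode with the declared
    encoding, and every row must have as many fields as the first row.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Stop here: field counts are meaningless if the text is garbled.
        return [f"not valid {encoding}: byte offset {exc.start}"]

    # newline="" leaves line endings untouched, as the csv module expects.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ["file is empty"]

    problems = []
    expected = len(rows[0])
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != expected:
            problems.append(f"row {number}: {len(row)} fields, expected {expected}")
    return problems


# Made-up inputs, for illustration only.
regular = check_csv_bytes(b"id,name\n1,Garching\n2,Torino\n")
irregular = check_csv_bytes(b"id,name\n1,Garching\n2\n")
bad_encoding = check_csv_bytes(b"caf\xe9,b\n")  # Latin-1 byte in a file declared as UTF-8
print(regular, irregular, bad_encoding)
```

Row numbers here count parsed CSV rows rather than physical lines, so they can differ from line numbers when a quoted field contains a line break.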
12:40 - 13:00 Discussion after the presentations
13:00 - 14:00 Lunch
Workshop Session #2 “Practice --> Policy”: 14:00 - 17:00
Two groups worked in parallel to discuss: 1) how open data quality requirements can be embedded at different institutional levels; 2) how to feed a shared data repository on open data quality measures.
Workshop Session #3 Wrap-up: 17:00 - 18:00
Technische Universität München | Boltzmannstr. 3, 85748 Garching bei München | Institut für Informatik | Room 00.08.038
Deadline for submission (short papers, 2-4 pages, via EasyChair):
6 February 2015 - Extended deadline: 15 February 2015
Notification of acceptance: 27 February 2015
Deadline for camera ready:
6 March 2015 - Extended deadline: 13 March 2015
Workshop: 30 March 2015, 10:00 - 18:00
No registration fee is required.
Topics of interest include, but are not limited to, the following:
- Tool chains for open data quality improvement (e.g., enrichment pipeline, versioning pipeline, quality assurance pipeline).
- Open data quality metrics and empirical evaluations.
- Good/bad practices in data disclosure.
- Assessment of the quality of data services (e.g., API to expose frequently changing data).
- Quality of linked data.
The workshop will feature invited talks as well as peer-reviewed paper presentations organized according to the topic categories defined above. Only extended and policy-oriented versions will be considered for publication in the Internet Policy Review.
Papers not exceeding 4 pages must be submitted electronically in .pdf format through EasyChair (via https://easychair.org/conferences/?conf=odq2015 - you will need to create an account if you don't already have one) and conform to the Internet Policy Review guidelines (see: http://policyreview.info/authors). Each submission will be reviewed by at least two members of the Program Committee and will be evaluated on the basis of novelty, relevance, and appropriate comparison to related work. The Program Committee will make final decisions about which submissions to accept for presentation at the workshop.
In a second step, papers automatically qualify as manuscripts for the Internet Policy Review. Only the papers that include a policy angle (i.e., refer to or discuss governance, norms, ordering, standards, or further regulation, law) will be considered. Manuscripts can be further developed based on the reviews received at the workshop, but they should not exceed 25,000 characters (including blank spaces).
- Antonio Vetrò - Technische Universität München (Organising Chair)
- Lorenzo Canova - Politecnico di Torino
- Raimondo Iemma - Politecnico di Torino
- Federico Morando - Politecnico di Torino
- Antonio Vetrò - Technische Universität München
- Lorenzo Canova - Politecnico di Torino
- Raimondo Iemma - Politecnico di Torino
- Maximilian Irlbeck - Technische Universität München
- Federico Morando - Politecnico di Torino
- Marco Torchiano - Politecnico di Torino
- Xin Wang - University of Southampton
Links, contacts, and logistics
opendataquality [at] in.tum [dot] de
Some indications on: