Status: ongoing
Period: November 2022 – November 2025
Funding: €79,900
Funding organization: Politecnico di Torino, Nexa Center for Internet & Society
Person(s) in charge: Marco Rondina (PhD student), Juan Carlos De Martin (PhD supervisor), Antonio Vetrò (PhD tutor)
Executive summary
The objective of this project is to develop practical techniques and industrial practices for accountable Artificial Intelligence (AI) decision-making.
Background
Artificial Intelligence (AI) applications automate decisions on significant aspects of people's lives. Such automated decisions often discriminate unfairly against groups of individuals, on grounds that are unreasonable or inappropriate. The goal of the research is to translate the principles and guidelines of responsible and human-centric AI into techniques and actionable industrial practices.
Objectives
The PhD proposal aims to research, implement, and test techniques that can help detect bias and data quality issues in training data, and to mitigate them by experimenting with a variety of approaches. The objectives of the PhD plan are the following:
(i) Identify data quality and bias measures and test them on available datasets, their mutations, and synthetic datasets; experiment with the propagation of bias and quality problems to the output of classification/prediction tasks; identify and test mitigation techniques (an illustrative bias-measure sketch follows this list).
(ii) Identify guidelines and measures for the quality of dataset documentation; set up and perform measurements on available datasets; analyze results and possible consequences as “data cascades”.
(iii) Design and prototype informative, ethically sensitive data labels that can inform stakeholders (data maintainers, model builders, end users, etc.) about the risk of downstream effects from early data problems in AI pipelines. The data labels will be designed and tested with the aim of facilitating early intervention and mitigation of data cascades, including both human intervention (through interactive visualizations) and seamless implementation in the AI pipeline (an illustrative label structure is sketched after this list).
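As an illustration of what a bias measure under objective (i) might look like, the following is a minimal Python sketch that computes a demographic parity gap on toy classification outputs. The records, attribute names, and choice of metric are illustrative assumptions, not the specific measures adopted in the project.

```python
from collections import defaultdict

def demographic_parity_difference(records, group_key, outcome_key):
    """Absolute difference in positive-outcome rates between the groups
    defined by `group_key`. Values near 0 suggest parity; larger values
    indicate that one group receives positive outcomes more often."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        positives[r[group_key]] += int(r[outcome_key])
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates

# Illustrative records: `group` is a protected attribute,
# `hired` is the binary output of a classification task.
records = [
    {"group": "A", "hired": 1},
    {"group": "A", "hired": 1},
    {"group": "A", "hired": 0},
    {"group": "B", "hired": 1},
    {"group": "B", "hired": 0},
    {"group": "B", "hired": 0},
]

gap, rates = demographic_parity_difference(records, "group", "hired")
print(rates)  # {'A': 0.666..., 'B': 0.333...}
print(gap)    # 0.333...
```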
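For objective (iii), the sketch below shows one hypothetical way an ethically sensitive data label could be represented as a machine-readable record that downstream steps of an AI pipeline can inspect before training or deployment. All field names and the coarse risk rating are illustrative assumptions, not the label design produced by the project.

```python
from dataclasses import dataclass, field

@dataclass
class DataLabel:
    """Machine-readable summary of known data issues, attached to a dataset
    so that downstream pipeline steps can react before problems cascade."""
    dataset_name: str
    collection_process_documented: bool
    maintenance_documented: bool
    known_bias_notes: list[str] = field(default_factory=list)
    missing_value_ratio: float = 0.0

    def downstream_risk(self) -> str:
        """Very coarse, illustrative risk rating derived from the fields above."""
        issues = sum([
            not self.collection_process_documented,
            not self.maintenance_documented,
            bool(self.known_bias_notes),
            self.missing_value_ratio > 0.1,
        ])
        return ["low", "moderate", "high", "high", "high"][issues]

label = DataLabel(
    dataset_name="example-faces",
    collection_process_documented=False,
    maintenance_documented=True,
    known_bias_notes=["under-representation of some demographic groups"],
    missing_value_ratio=0.02,
)
print(label.downstream_risk())  # e.g. "high"
```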
Results
An empirical investigation was conducted into the state of dataset documentation practices, measuring the completeness of the documentation of several popular datasets in the ML/AI community. We analyzed the documentation presented in the very same place where the data can be accessed, in order to capture the peculiarities of the communities around some popular dataset repositories. A set of information that should always be clear to dataset users, in order to achieve transparency and accountability, was adapted into a Documentation Test Sheet, which measures the completeness of the documentation. Information related to the use of the dataset turned out to be the most frequently documented; by contrast, maintenance over time and the processes behind data generation were very poorly documented. In general, a lack of relevant information was observed, highlighting a lack of transparency. The analysis also shows the potential of repositories to help dataset curators produce better documentation, especially if they provide a more comprehensive documentation schema.
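To make the idea of measuring documentation completeness concrete, the following is a minimal Python sketch. The checklist items and scoring are illustrative and much coarser than the actual Documentation Test Sheet used in the study.

```python
# Items loosely inspired by the Documentation Test Sheet described above;
# the checklist used in the study is richer than this illustrative subset.
DOCUMENTATION_ITEMS = [
    "purpose_of_collection",
    "collection_process",
    "preprocessing_steps",
    "recommended_uses",
    "maintenance_plan",
    "license",
]

def completeness(documented_items: set[str]) -> float:
    """Fraction of checklist items covered by a dataset's documentation."""
    covered = sum(1 for item in DOCUMENTATION_ITEMS if item in documented_items)
    return covered / len(DOCUMENTATION_ITEMS)

# Hypothetical result for one dataset card found on a repository.
print(completeness({"recommended_uses", "license"}))  # 0.33...
```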
Related Publications
2024
Marco Rondina; Fabiana Vinci; Antonio Vetro’; Juan Carlos De Martin
Facial Analysis Systems and Down Syndrome
Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Springer, 2024.
@conference{11583_2982543,
title = {Facial Analysis Systems and Down Syndrome},
author = {Marco Rondina and Fabiana Vinci and Antonio Vetro' and Juan Carlos De Martin},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
booktitle = {Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
publisher = {Springer},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
2023
Marco Rondina; Antonio Vetro’; Juan Carlos De Martin
Completeness of Datasets Documentation on ML/AI Repositories: An Empirical Investigation
Progress in Artificial Intelligence, Springer, 2023.
@conference{11583_2981538b,
title = {Completeness of Datasets Documentation on ML/AI Repositories: An Empirical Investigation},
author = {Marco Rondina and Antonio Vetro' and Juan Carlos De Martin},
url = {https://link.springer.com/chapter/10.1007/978-3-031-49008-8_7},
doi = {10.1007/978-3-031-49008-8_7},
isbn = {978-3-031-49008-8},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
booktitle = {Progress in Artificial Intelligence},
volume = {14115},
pages = {79–91},
publisher = {Springer},
abstract = {ML/AI is the field of computer science and computer engineering that arguably received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are fully aware of the quality of the datasets that they use, and of the process generating them, so that possible negative impacts on downstream effects can be tracked, analysed, and, where possible, mitigated. One of the tools that can be useful in this perspective is dataset documentation. The aim of this work is to investigate the state of dataset documentation practices, measuring the completeness of the documentation of several popular datasets in ML/AI repositories. We created a dataset documentation schema, the Documentation Test Sheet (dts), that identifies the information that should always be attached to a dataset (to ensure proper dataset choice and informed use), according to relevant studies in the literature. We verified 100 popular datasets from four different repositories with the dts to investigate which information were present. Overall, we observed a lack of relevant documentation, especially about the context of data collection and data processing, highlighting a paucity of transparency.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}