Status: concluded
Period: November 2019 – January 2023
Funding: in kind
Funding organization: Politecnico di Torino, Softeng research group, Nexa Center for Internet & Society
Person(s) in charge: Antonio Vetrò (Senior Researcher), Marco Torchiano (Nexa Faculty Fellow), Mariachiara Mecati (Ph.D. Student)
Executive summary
This PhD project investigates the impact of poor data quality and of bias in the data on the automated decisions made by software applications.
Background
Nowadays, many software systems make use of large amounts of data (often personal) to make recommendations or decisions that affect our daily lives. Consequently, computer-generated recommendations or decisions may be affected by poor quality and bias in the input data. This raises important ethical considerations about the impact, in terms of both relevance and scale, on the lives of the people affected by the output of such systems.
Objectives
The PhD project aims at investigating the impact of poor data quality and bias in the data on the automatic decisions made by software applications. As a minor aspect, the ethical character of the algorithms and their effect on decisions will also be investigated.
The objectives of the PhD plan are the following:
• Build a conceptual and operational data measurement framework for identifying characteristics of input data that potentially affect the risk of wrong or discriminatory software decisions. This goal encompasses both identifying which characteristics have an impact and defining the procedure to measure them.
• Collect empirical evidence on the actual impact of the measured data quality issues on automated decisions made by software systems. The evidence will be built by means of different research methods (case studies, experiments, or simulations), depending on the availability of data, software, and third-party collaborations. In particular, a key achievement is the establishment of relational links between quality issues and features of the output bias.
• Design mitigation and remediation strategies and specific techniques to reduce the problem; a proof-of-concept implementation should be provided. We anticipate that not all aspects of the problem will be solvable computationally; in such cases it will be important to advance knowledge in the area by identifying explanations and providing critical reflections.
In addition, a secondary goal is to investigate how software quality and bias incorporated in the algorithms can contribute to flawed decisions made by software applications:
• If any evidence is found, investigate how this aspect relates to the previous one of data quality and bias.
• Design and prototype remediation techniques for the problem.
Results
First of all, an exploratory study (“Identifying risks in datasets for automated decision-making”) was conducted to investigate measurable characteristics of datasets that can lead to discriminatory automated decisions. This initial study was accepted for publication at the EGOV-CeDEM-ePart 2020 conference and was selected as the Best Paper in the category “The most innovative research contribution or case study” (which “Awards the paper with the most out-of-the-box and forward-looking idea and concept. Relevance is more important than rigor”).
Subsequently, a more detailed study was submitted to Government Information Quarterly (an international journal of information technology management, policies, and practices). This study extends the set of imbalance measures in order to examine in greater depth the capability of such measures to detect imbalance among the classes of a given attribute in a dataset. It also takes into account a much larger number of datasets from various application domains (from criminal justice systems to financial services, as well as social topics such as personal earnings and education), in order to assess whether the existing imbalance measures are able to reveal a discrimination risk when an automated decision-making (ADM) system is trained on such data. The final goal is to ensure a more conscious and responsible use of ADM systems.
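To illustrate the kind of measurement these studies rely on, below is a minimal Python sketch of two common imbalance measures, the imbalance ratio and a normalized Shannon entropy, applied to the class distribution of a single categorical attribute. The function names, the example data, and the choice of these two particular measures are illustrative assumptions; the exact set of measures is defined in the publications cited above.

    import math
    from collections import Counter

    def imbalance_ratio(values):
        # Ratio between the most and least frequent class of a
        # categorical attribute: 1.0 is perfectly balanced, larger
        # values signal increasing imbalance.
        counts = Counter(values)
        return max(counts.values()) / min(counts.values())

    def shannon_balance(values):
        # Shannon entropy of the class distribution, normalized to
        # [0, 1]: 1.0 means all classes are equally represented,
        # values near 0 mean a single class dominates.
        counts = Counter(values)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        return entropy / math.log2(len(counts)) if len(counts) > 1 else 0.0

    # Hypothetical protected attribute in a training set: 150 "F" vs. 850 "M".
    gender = ["F"] * 150 + ["M"] * 850
    print(imbalance_ratio(gender))   # 5.67  -> strongly imbalanced
    print(shannon_balance(gender))   # ~0.61 -> far from the balanced value 1.0

On a perfectly balanced attribute both measures equal 1.0; the threshold at which an attribute should be flagged as a discrimination risk is an empirical question of the kind investigated in these studies.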