Data Bias: Identification of Mitigation and Remediation Strategies, Techniques and Tools

Status: 
concluded
Period: 
November 2019 - January 2023
Funding: 
in kind
Funding organization: 

Politecnico di Torino, Softeng research group, Nexa Center for Internet & Society

Person(s) in charge: 

Antonio Vetrò (Senior Researcher), Marco Torchiano (Nexa Faculty Fellow), Mariachiara Mecati (Ph.D Student)

Executive summary: 

This PhD project aims to investigate the impact of poor data quality and biases in the data on the automatic decisions made by software applications.

Background: 

Nowadays, many software systems use large amounts of data (often personal) to make recommendations or decisions that affect our daily lives. Consequently, computer-generated recommendations or decisions might be affected by poor quality and bias in the input data. This raises relevant ethical considerations about the impact (in terms of relevance and scale) on the lives of the people affected by the output of software systems.

Objectives: 

The PhD proposal aims to investigate the impact of poor data quality and biases in the data on the automatic decisions made by software applications. As a minor aspect, the ethical character of algorithms and the related effects on decisions will also be investigated.

The objectives of the PhD plan are the following:
• Build a conceptual and operational data measurement framework for identifying data input characteristics that potentially affect the risks of wrong or discriminating software decisions. This goal encompasses identifying which characteristics have an impact, and the measurement procedure.
• Collect empirical evidence concerning the actual impact of the measured data quality issues on automated decisions made by software systems. The evidence will be built by means of different research methods: case studies, experiments or simulations, depending on the availability of data, software and third-party collaborations. In particular, a key achievement is the establishment of relational links between quality issues and output bias features.
• Design mitigation and remediation strategies and specific techniques to reduce the problem: a proof-of-concept implementation should be provided. We anticipate that not all aspects of the problem will be solvable computationally; in such cases it will be important to advance the knowledge in the area by identifying explanations and providing critical reflections.

In addition, the secondary goal is to investigate how quality of software and bias incorporated in the algorithms can contribute to flawed decisions made by software applications:
• If any evidence is found, we have to investigate how this aspect is related to the previous one of data quality and bias.
• Design and prototyping of remediation techniques for the problem.

Results: 

First, an exploratory study (“Identifying risks in datasets for automated decision-making”) was conducted to investigate measurable characteristics of datasets that can lead to discriminating automated decisions. This initial study was accepted for publication at the EGOV-CeDEM-ePart 2020 conference and was selected as the Best Paper in the category "The most innovative research contribution or case study" (which "Awards the paper with the most out-of-the-box and forward-looking idea and concept. Relevance is more important than rigor").
Subsequently, a more detailed study was submitted to Government Information Quarterly (an International Journal of Information Technology Management, Policies, and Practices). This study extended the set of imbalance measures in order to examine in more depth the capability of such measures to detect imbalance among the classes of a given attribute in a dataset. It also considered a much larger number of datasets from various application domains (from criminal justice systems to financial services, as well as social topics such as personal earnings and education), in order to assess whether existing imbalance measures can reveal a discrimination risk when an automatic decision-making (ADM) system is trained with such data. The final goal is to ensure a more conscious and responsible use of ADM systems.
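
As an illustration of what an imbalance measure over the classes of an attribute can look like, the following is a minimal sketch. It is not taken from the project's publications: the choice of two indexes (a normalized Gini index and a normalized Shannon entropy), the function names, and the convention of returning 0.0 for an attribute with a single observed class are all assumptions made for this example. Both functions return a value between 0 (all records in one class) and 1 (records evenly spread across the observed classes).

```python
from collections import Counter
import math


def class_proportions(values):
    """Relative frequency of each class observed in an attribute column."""
    counts = Counter(values)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def gini_balance(values):
    """Normalized Gini index (assumed measure): 1 = balanced, 0 = one class only."""
    p = class_proportions(values)
    m = len(p)  # number of classes actually observed in the data
    if m < 2:
        return 0.0  # convention chosen for this sketch
    return (m / (m - 1)) * (1.0 - sum(f * f for f in p))


def shannon_balance(values):
    """Normalized Shannon entropy (assumed measure): 1 = balanced, 0 = one class only."""
    p = class_proportions(values)
    m = len(p)
    if m < 2:
        return 0.0  # convention chosen for this sketch
    return -sum(f * math.log(f) for f in p) / math.log(m)


# Hypothetical attribute column: nine records of one class, one of another.
example_column = ["group_a"] * 9 + ["group_b"]
print(gini_balance(example_column), shannon_balance(example_column))
```

Note that both functions only see the classes that occur in the data: a class that is entirely missing from the dataset does not lower the score, so a real assessment would need the full list of expected classes as an additional input.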

Related Publications:
Mecati, M.; Torchiano, M.; Vetro, A.; De Martin, J.C.
03 March 2023
IEEE ACCESS, 11:(2023), pp. 26996-27011
Mecati M., Adrignola A., Vetrò A., Torchiano M.
17-20 Dec. 2022
Second International Workshop on Data Science for equality, inclusion and well-being challenges (DS4EIW 2022), Osaka, Japan, 17-20 Dec. 2022, pp. 4700-4709
Mariachiara Mecati, Antonio Vetrò, Marco Torchiano
April 2022
Journal of Data and Information Quality, April 2022
Mecati M., Vetrò A., Torchiano M.
15-18 Dec. 2021
In: First International Workshop on Data Science for equality, inclusion and well-being challenges (DS4EIW 2021)
Simonetta A., Vetrò A., Paoletti C.M., Torchiano M.
8 Dec 2021
3rd International Workshop on Experience with SQuaRE Series and Its Future Direction (IWESQ 2021), Taipei (Taiwan), 8 Dec 2021, pp. 17-22
Vetrò, A., Torchiano, M., Mecati, M.
September 4, 2021
GOVERNMENT INFORMATION QUARTERLY, Elsevier, pp. 17, 2021, Vol. 38, Issue 4, ISSN: 0740-624X DOI 10.1016/j.giq.2021.101619
Alessandro Simonetta, Andrea Trenta, Maria Cristina Paoletti, Antonio Vetrò
9 July 2021
ICYRIME 2021 International Conference of Yearly Reports on Informatics Mathematics, and Engineering 2021, Online, July 9, 2021
Mecati, M., Cannavò F.E., Vetrò A., Torchiano, M.
September 2020
EGOV2020 – IFIP EGOV-CeDEM-EPART 2020, Linköping University (Sweden), August 31 - September 2, 2020, pp. 332-344. egov-2020 (BEST PAPER AWARD)