Metrics for Identifying Bias in Datasets

ICYRIME 2021: International Conference of Yearly Reports on Informatics, Mathematics, and Engineering, Online, July 9, 2021
Alessandro Simonetta, Andrea Trenta, Maria Cristina Paoletti, Antonio Vetrò

Nowadays, automated decision-making systems are pervasive and increasingly used to take important decisions in sensitive areas, such as the granting of a bank overdraft, the susceptibility of an individual to a virus infection, or even the likelihood of reoffending. The widespread use of these systems raises growing ethical concerns about the risk of discriminatory impact. In particular, machine-learning systems trained on unbalanced data could give rise to systematic discrimination in the real world. One of the most important challenges is to determine metrics capable of detecting when an unbalanced training dataset may lead to discriminatory behaviour of the model built on it. In this paper, we propose an approach based on the notion of data completeness, using two different metrics: one based on the combinations of values in the dataset, which serves as our benchmark, and a second based on frame theory, which is widely used, among other applications, for quality measures in control systems. It is important to remark that the use of metrics cannot substitute for a broader design process that takes into account the columns that could introduce bias into the data. This line of research does not end with these activities; it aims to continue towards a standardised register of measures.
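
To make the combination-based notion of completeness concrete, the following Python sketch shows one plausible interpretation, not necessarily the paper's exact formulation: completeness is taken as the ratio of distinct value combinations observed for a set of (potentially sensitive) columns to the number of theoretically possible combinations. The function name `combination_completeness` and the toy data are illustrative assumptions.

```python
import pandas as pd

def combination_completeness(df: pd.DataFrame, columns: list[str]) -> float:
    """Fraction of all theoretically possible value combinations of
    the given columns that actually occur in the dataset.
    A value of 1.0 means every combination is represented at least once."""
    # Theoretical number of combinations: the size of the Cartesian
    # product of the observed value domains of the selected columns.
    possible = 1
    for col in columns:
        possible *= df[col].nunique()
    # Distinct combinations actually present in the data.
    observed = len(df[columns].drop_duplicates())
    return observed / possible

# Toy example with two sensitive attributes; the combination
# (gender="F", age_group="31-50") never occurs in the data.
data = pd.DataFrame({
    "gender":    ["F", "F", "M", "M"],
    "age_group": ["18-30", "18-30", "18-30", "31-50"],
})
print(combination_completeness(data, ["gender", "age_group"]))  # 0.75
```

Under this reading, a value below 1.0 signals that some combinations of sensitive attribute values are absent from the training data, which is a possible precursor of discriminatory model behaviour.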