
Status: ongoing
Period: November 2024 – November 2027
Funding: in kind
Funding organization: Italian Ministry of University and Research, University of Pisa, Politecnico di Torino
Person(s) in charge: Giacomo Fantino (PhD Student), Federica Cappelluti (PhD Supervisor), Antonio Vetrò (PhD Tutor), Marco Torchiano (PhD Tutor)
Executive summary
This PhD project investigates AI-enhanced tools and methodologies that support the collaborative development, validation and documentation of software in line with Open Science and FAIR principles.
The project’s main objective is to enhance the reproducibility and quality of research software and to foster the responsible integration of AI within research environments. It is being conducted in collaboration with the Centro Studi Open Science of Politecnico di Torino.
Background
Software plays a fundamental role in scientific research across all disciplines. However, its dynamic, executable and modular nature can make it difficult to manage in accordance with the FAIR principles of findability, accessibility, interoperability and reusability. Synthetic data generation has emerged as a key technique for supporting reproducibility when real data are limited or sensitive. At the same time, AI-driven tools are increasingly being used to improve the clarity, maintainability, and usability of source code in research contexts. This project aims to advance such AI-driven techniques so that they better support the development and dissemination of FAIR-by-design research software, thereby contributing to the broader goals of Open Science.
Objectives
This PhD research has the following main objectives:
- Platform Enhancement for FAIR Software:
  - Integrate AI services within the GitLab@PoliTo platform to support software documentation, reproducibility, and Open Science practices.
  - Leverage and contribute to GitLab’s evolving AI architecture, particularly the AI Gateway, to ensure compatibility and forward-looking development.
- Automatic Source Code Comment Generation (a minimal sketch follows this list):
  - Design and train AI models capable of generating human-like comments from source code.
  - Evaluate model performance across programming languages and project types.
  - Assess usefulness in real-world software development scenarios.
- Synthetic Data Generation and Privacy Evaluation (see the second sketch after this list):
  - Investigate techniques for generating synthetic datasets suitable for AI training and testing.
  - Develop metrics and benchmarks to evaluate the privacy of synthetic data against inference and membership attacks.
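
As a concrete illustration of the comment-generation objective, the sketch below produces a natural-language summary of a code snippet with an off-the-shelf code-summarization checkpoint. The model name, prompt handling, and generation settings are illustrative assumptions, not the models trained in this project.

```python
# Minimal sketch: generate a human-readable comment for a code snippet using a
# publicly available code-summarization model. The checkpoint is a stand-in for
# the project's own models.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "Salesforce/codet5-base-multi-sum"  # assumed off-the-shelf checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_comment(source_code: str, max_new_tokens: int = 48) -> str:
    """Return a short natural-language summary of `source_code`."""
    inputs = tokenizer(source_code, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

snippet = "def f_to_c(f):\n    return (f - 32) * 5.0 / 9.0"
print(generate_comment(snippet))  # e.g. a one-line description of the conversion
```

Evaluating such a pipeline across programming languages and project types, as listed above, then amounts to comparing generated comments against reference comments and developer judgements.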
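
For the privacy-evaluation objective, one widely used check is the distance to closest record (DCR): if synthetic rows lie much closer to the training data than to held-out real data, membership-style attacks become easier. The sketch below is a minimal version of such a check on toy numeric data; the datasets, scaling, and threshold interpretation are illustrative assumptions rather than the project's benchmark suite.

```python
# Minimal sketch of one common privacy check for synthetic tabular data:
# distance to closest record (DCR). If synthetic rows sit much closer to the
# training data than to a holdout set, they may leak training records.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def dcr(synthetic: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to its nearest reference row."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference)
    distances, _ = nn.kneighbors(synthetic)
    return distances.ravel()

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 5))      # stand-in for real training data
holdout = rng.normal(size=(500, 5))    # held-out real data from the same distribution
synthetic = train + rng.normal(scale=0.05, size=train.shape)  # deliberately leaky generator

scaler = StandardScaler().fit(train)
d_train = dcr(scaler.transform(synthetic), scaler.transform(train))
d_holdout = dcr(scaler.transform(synthetic), scaler.transform(holdout))

# A ratio close to 1 suggests the generator generalizes; a ratio well below 1
# flags synthetic rows that are suspiciously close to the training set, which
# is exactly what membership and inference attacks exploit.
print(f"median DCR to train / holdout: {np.median(d_train) / np.median(d_holdout):.2f}")
```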
Results
An empirical evaluation of the privacy robustness of synthetic data was carried out by applying a wide range of inference attacks and privacy metrics, providing a more accurate quantification of the risks and limitations of synthetic data in preserving privacy. Meanwhile, the core research focuses on AI models that generate comments from source code. This work introduces a novel modality that captures the semantic differences in program state before and after code execution, giving the model a deeper contextual understanding and improving the accuracy and relevance of the generated comments. Together, these research areas improve the transparency, documentation, and ethical reuse of software and data, thereby directly supporting the infrastructure needed for reproducible and accountable scientific research.
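
To make the execution-aware modality concrete, the sketch below records how the variable state of a snippet changes when it is run and appends that difference to the code as additional context for a comment-generation model. The diff representation and the way it is concatenated with the code are illustrative assumptions, not the project's actual design; in a real setting the snippet would run in a sandbox.

```python
# Minimal sketch of the execution-aware idea: run a snippet, record how the
# variable state changes, and attach that diff to the code as extra model input.
from typing import Any

def state_diff(before: dict[str, Any], after: dict[str, Any]) -> dict[str, tuple]:
    """Variables that were added or changed by execution, as (old, new) pairs."""
    return {
        name: (before.get(name), value)
        for name, value in after.items()
        if name not in before or before[name] != value
    }

code = "total = sum(values)\naverage = total / len(values)"
env: dict[str, Any] = {"values": [2, 4, 6]}

before = dict(env)
exec(code, {}, env)          # execute the snippet in a scratch namespace
diff = state_diff(before, env)

# The code plus its observed effect becomes the model input, e.g.:
model_input = code + "\n# effect: " + ", ".join(
    f"{name}: {old!r} -> {new!r}" for name, (old, new) in diff.items()
)
print(model_input)
```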