TellMeFirst - A Knowledge Discovery Application - Nexa Center for Internet & Society

Status: concluded
Period: April 2012 – January 2016
Funding: In-kind contributions in 2016
Person(s) in charge: Giuseppe Futia (Nexa project manager & main developer), Alessio Melandri (Nexa Fellow), Federico Cairo (Project founder & technology advisor)

Executive summary

TellMeFirst (TMF) is open source software for classifying and enriching documents using Natural Language Processing (NLP) and Linked Open Data (LOD) technologies. The topics it identifies are expressed as DBpedia resources, each of which is a structured representation of a Wikipedia entry. The input document is then enriched with new information (images, videos, maps, news) retrieved from LOD repositories published on the Web.

Background

The adoption of Linked Data best practices for exposing and connecting information on the Web has had considerable success in several areas, including multimedia publishing, open government, and health care. Moreover, a specific line of research explores the points of convergence between Linked Data and Natural Language Processing (NLP): DBpedia, a central interlinking hub for the Linking Open Data project, has proven to be a very suitable knowledge base for text classification, for both technical and more theoretical reasons. Furthermore, DBpedia is directly linked to Wikipedia, arguably the largest multilingual annotated corpus ever created, which makes it technically well suited to automated NLP tasks.
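The link between DBpedia and Wikipedia can be illustrated with a minimal sketch. It assumes the English DBpedia convention in which a resource URI under http://dbpedia.org/resource/ shares its local name with the corresponding Wikipedia page title. The function names are hypothetical and are not part of the TMF code base.

```python
from urllib.parse import unquote

# Assumed namespaces: English DBpedia resources and English Wikipedia pages.
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
WIKIPEDIA_PREFIX = "https://en.wikipedia.org/wiki/"


def local_name(resource_uri: str) -> str:
    """Return the part of a DBpedia resource URI after the namespace."""
    if not resource_uri.startswith(DBPEDIA_PREFIX):
        raise ValueError("not a DBpedia resource URI: " + resource_uri)
    return resource_uri[len(DBPEDIA_PREFIX):]


def dbpedia_to_wikipedia(resource_uri: str) -> str:
    """Map a DBpedia resource URI to the Wikipedia page with the same name."""
    return WIKIPEDIA_PREFIX + local_name(resource_uri)


def readable_label(resource_uri: str) -> str:
    """Turn the local name into a human-readable topic label."""
    return unquote(local_name(resource_uri)).replace("_", " ")


if __name__ == "__main__":
    topic = "http://dbpedia.org/resource/Natural_language_processing"
    print(readable_label(topic), "->", dbpedia_to_wikipedia(topic))
```

A topic returned as a DBpedia resource can therefore be traced back to the Wikipedia text it describes, which is what makes Wikipedia usable as an annotated corpus for classification.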

Objectives

TMF intends to leverage Linked Data and NLP technologies to extract the main topics from texts in the form of DBpedia resources, and to retrieve new information about those topics from the Web. In previous years, we created a structured and well-defined process to keep the training set up to date: this is a necessary step for classifying documents concerning recent topics. The next step was the development of a module for building ad hoc training sets for documents related to a specific area of knowledge. For these reasons, we have focused on developing a parametrized process that adapts TellMeFirst to different purposes and different semantic areas. With this feature, TMF can be used by companies, public administrations, and cultural institutions that need a classification system for their specific knowledge domains and purposes. The TMF software has now reached maturity, and we will therefore explore use cases of the tool within structured projects.

Results

The latest features developed for TMF were presented in February 2016 at the DBpedia Community Meeting in The Hague (Netherlands) and in September 2016 at the 7th DBpedia Community Meeting in Leipzig.

In 2015 we developed a software pipeline for building a training set to classify documents related to a specific domain of knowledge. The pipeline is currently driven by SPARQL queries, supported by the Linked Data Recommender developed by the SoftEng Group of the Politecnico di Torino, which is used to discover further entities that the queries alone do not identify. More information is available on GitHub.
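As an illustration of a SPARQL-driven selection step, the sketch below builds a query that lists DBpedia entities belonging to a given Wikipedia category and reads the entity URIs out of a response in the standard SPARQL JSON results format. Selecting by category through dct:subject is an assumption made for this example: the actual TMF queries are not described here, and the function names are hypothetical. The code makes no network call, so the response would have to come from a SPARQL endpoint queried separately.

```python
import json
from typing import List

# Assumed namespace for DBpedia category resources.
CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"


def build_category_query(category: str, limit: int = 100) -> str:
    """Build a SPARQL query listing entities filed under one category."""
    category_iri = CATEGORY_PREFIX + category.replace(" ", "_")
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT DISTINCT ?entity WHERE {\n"
        f"  ?entity dct:subject <{category_iri}> .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def extract_entities(sparql_json: str) -> List[str]:
    """Collect the ?entity URIs from a SPARQL JSON results document."""
    data = json.loads(sparql_json)
    return [
        binding["entity"]["value"]
        for binding in data["results"]["bindings"]
        if binding["entity"]["type"] == "uri"
    ]


if __name__ == "__main__":
    # Print the query; sending it to an endpoint is left to the caller.
    print(build_category_query("Machine learning", limit=10))
```

The URIs returned by such a query would form the candidate entities of a domain-specific training set, which a recommender could then extend with related entities.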

Moreover, drawing on the experience gained in developing TellMeFirst, Giuseppe Futia won the “Best tool for multi-lingual journalists” prize at the #newsHACK 2016 event organized by the BBC.

Related Publications

2015

Rocha, Oscar Rodriguez; Vagliano, Iacopo; Martinez, Cristhian Nicolas Figueroa; Cairo, Federico; Futia, Giuseppe; Licciardi, Carlo; Marengo, Marco; Morando, Federico

Semantic Annotation and Classification in Practice (journal article)

In: IT Professional, vol. 17 (issue: IT-Enabled Business Innovation), pp. 33–39, 2015.
