Status: concluded
Period: April 2012 – January 2016
Funding: In-kind contributions in 2016
Person(s) in charge: Giuseppe Futia (Nexa project manager & main developer), Alessio Melandri (Nexa Fellow), Federico Cairo (Project founder & technology advisor)
Executive summary
TellMeFirst (TMF) is open source software for classifying and enriching documents using Natural Language Processing (NLP) and Linked Open Data (LOD) technologies. It expresses the topics it identifies as DBpedia resources, each of which is a structured representation of a Wikipedia entry. The input document is then enriched with new information (images, videos, maps, news) retrieved from LOD repositories published on the Web.
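As a rough illustration of what "topics expressed as DBpedia resources" means in practice, the sketch below maps English Wikipedia article titles to DBpedia resource URIs. It is not TellMeFirst code: the function name `dbpedia_resource_uri`, the sample topics, and their scores are assumptions made for the example, and the set of characters left unescaped is a simplification.

```python
from urllib.parse import quote

# Namespace used by the English DBpedia for resources.
DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def dbpedia_resource_uri(wikipedia_title: str) -> str:
    """Return the DBpedia resource URI for an English Wikipedia title.

    Hypothetical helper written for this example only.
    """
    # Wikipedia identifiers use underscores where the title has spaces.
    local_name = wikipedia_title.strip().replace(" ", "_")
    # Percent-encode characters that are unsafe in a URI path; the
    # punctuation kept here is a simplifying assumption.
    return DBPEDIA_RESOURCE_PREFIX + quote(local_name, safe="_(),'-.:!*")


# Assumed classifier output: (Wikipedia title, relevance score) pairs.
topics = [("Linked data", 0.82), ("Natural language processing", 0.77)]

for title, score in topics:
    print(f"{score:.2f}  {dbpedia_resource_uri(title)}")
```

Because each topic is a URI rather than a free-text label, it can be used directly as a key for looking up related material in other Linked Open Data repositories.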
Background
The adoption of Linked Data best practices for exposing and connecting information on the Web has had considerable success in several areas, including multimedia publishing, open government, and health care. Moreover, a specific line of research explores the points of convergence between Linked Data and Natural Language Processing (NLP): DBpedia, a central interlinking hub for the Linking Open Data project, has proven to be a very suitable knowledge base for text classification, for both technical and more theoretical reasons. Furthermore, DBpedia is directly linked to Wikipedia, arguably the largest multilingual annotated corpus ever created, which makes it technically well suited to automated NLP tasks.
Objectives
TMF leverages Linked Data and NLP technologies to extract the main topics of a text in the form of DBpedia resources and to retrieve new related information from the Web. In previous years, we created a structured, well-defined process for keeping the training set up to date, a necessary step for classifying documents about recent topics. The next step was the development of a module for building ad hoc training sets for documents related to a specific area of knowledge. To this end, we focused on a parametrized process that adapts TellMeFirst to different purposes and semantic areas. With this feature, TMF can be used by companies, public administrations, and cultural institutions that need a classification system for their specific knowledge domains and purposes. The TMF software has now reached maturity, so we will explore use cases of the tool within structured projects.
In 2015 we developed a software pipeline to build a training set for classifying documents related to a specific domain of knowledge. The pipeline is currently driven by SPARQL queries and is supported by the Linked Data Recommender developed by the SoftEng Group of the Politecnico di Torino, which is used to discover additional entities that the queries alone do not identify. More information is available on GitHub.
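The queries actually used by the pipeline are not reproduced here. As a minimal sketch of the general idea, assuming that a knowledge domain can be approximated by a Wikipedia category, the snippet below builds a SPARQL query that lists the DBpedia resources filed under that category. The function name `build_domain_query`, the example category, and the limit are illustrative assumptions; the query is only built and printed, not sent to an endpoint.

```python
# Namespace used by the English DBpedia for Wikipedia categories.
DBPEDIA_CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"

# Characters that must not appear inside a SPARQL IRI reference.
FORBIDDEN_IRI_CHARS = set('<>"{}|\\^`')


def build_domain_query(category: str, limit: int = 100) -> str:
    """Build a SPARQL query selecting resources in one category.

    Hypothetical helper: it illustrates a category-driven selection,
    not the queries of the TellMeFirst pipeline.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    if FORBIDDEN_IRI_CHARS & set(category):
        raise ValueError("category contains characters not allowed in an IRI")
    category_uri = DBPEDIA_CATEGORY_PREFIX + category.strip().replace(" ", "_")
    # dct:subject links a DBpedia resource to its Wikipedia categories.
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT DISTINCT ?resource WHERE {\n"
        f"  ?resource dct:subject <{category_uri}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


print(build_domain_query("Semantic Web", limit=50))
```

The resulting resources would form only a starting set of candidate entities for a domain-specific training set; in the pipeline described above, the Linked Data Recommender is what extends such a set with further related entities.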
Building on the experience gained in developing TellMeFirst, Giuseppe Futia won the “Best tool for multi-lingual journalists” prize at the #newsHACK 2016 event organized by the BBC.
Related Publications
2015
Rocha, Oscar Rodriguez; Vagliano, Iacopo; Martinez, Cristhian Nicolas Figueroa; Cairo, Federico; Futia, Giuseppe; Licciardi, Carlo; Marengo, Marco; Morando, Federico
@article{11583_2585561,
title = {Semantic Annotation and Classification in Practice},
author = {Oscar Rodriguez Rocha and Iacopo Vagliano and Cristhian Nicolas Figueroa Martinez and Federico Cairo and Giuseppe Futia and Carlo Licciardi and Marco Marengo and Federico Morando},
url = {http://www.computer.org/csdl/mags/it/2015/02/mit2015020033-abs.html},
doi = {10.1109/MITP.2015.29},
year = {2015},
date = {2015-01-01},
urldate = {2015-01-01},
journal = {IT PROFESSIONAL},
volume = {17},
number = {IT-Enabled Business Innovation},
pages = {33--39},
publisher = {IEEE},
abstract = {The evolution of the traditional Web into a Semantic Web and the continuous increase in the amount of data published as Linked Data open up new opportunities for annotation and categorization systems to reuse these data as semantic knowledge bases. Accordingly, Linked Data has been used by information extraction systems to exploit the semantic knowledge bases, which can be interconnected and structured in order to increase the precision and recall of annotation and categorization mechanisms. This paper describes TellMeFirst a software for the classification and enrichment of textual documents written in English and Italian. Although nowadays there are various works presenting solutions for text annotation and classification, this work is focused on describing and studying the use case of a Telecommunications Operator that has adopted TellMeFirst in order to generate value-added to two services available to its users: FriendTV and SOCIETY.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
@conference{nokey,
title = {Exploiting Linked Data and Natural Language Processing for the Classification of Political Speech},
author = {Giuseppe Futia and Federico Cairo and Federico Morando and Luca Leschiutta},
url = {https://nexa.polito.it/wp-content/uploads/2024/06/futia2014exploiting.pdf},
year = {2014},
date = {2014-05-21},
organization = {Conference for E-Democracy and Open Government},
abstract = {This paper shows the effectiveness of a DBpedia-based approach for text categorization in the e-government field. Our use case is the analysis of all the speech transcripts of current White House members. This task is performed by means of TellMeFirst, an open-source software that leverages the DBpedia knowledge base and the English Wikipedia linguistic corpus for topic extraction. Analysis results allow to identify the main political trends addressed by the White House, increasing the citizens' awareness to issues discussed by politicians. Unlike methods based on string recognition, TellMeFirst semantically classifies documents through DBpedia URIs, gathering all the synonyms, hypernyms and hyponyms of a lemma under the same unambiguous concept.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}