Geography of the Web

Executive summary

The Web Geography research activity focuses on estimating the country of origin of a piece of web content without relying on explicit geographical references in its text. To this end, the wide diffusion of web pages released under localized Creative Commons licenses was exploited, which made it possible to collect the necessary dataset automatically with a purpose-built web crawler.
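As an illustration of how a localized license can serve as a country label, the sketch below reads the jurisdiction code from a Creative Commons license URL. It is a minimal sketch and not the project's crawler: the function name license_jurisdiction is hypothetical, and it assumes that a ported license URL has the form /licenses/<code>/<version>/<jurisdiction>/ with a two-letter jurisdiction code, so the few jurisdictions with longer codes are ignored.

```python
import re
from urllib.parse import urlparse

# Assumed shape of a ported (localized) license path:
#   /licenses/<license code>/<version>/<two-letter jurisdiction>/
_PORTED_PATH = re.compile(
    r"^/licenses/(?P<code>[a-z\-]+)/(?P<version>\d+\.\d+)/"
    r"(?P<jurisdiction>[a-z]{2})(?:/|$)"
)


def license_jurisdiction(url):
    """Return the two-letter jurisdiction of a CC license URL, or None.

    Hypothetical helper: returns None for unported or international
    licenses and for URLs that are not on creativecommons.org.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != "creativecommons.org" and not host.endswith(".creativecommons.org"):
        return None
    match = _PORTED_PATH.match(parsed.path.lower())
    return match.group("jurisdiction") if match else None
```

A crawler could apply such a function to the license links found on each page and keep only the pages for which it returns a jurisdiction, using that value as the country label.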

Background

In recent years, the problem of inferring geographical information from web pages in order to determine the geographic context of their content has gained increasing attention. This information can be useful in many fields, ranging from localized market analysis and statistics on content production to more efficient content search and retrieval.

Objectives

Our research activity on Web Geography focuses on estimating the country of origin of a piece of web content without relying on explicit geographical references in its text. Estimation is performed by a Machine Learning algorithm that learns a probabilistic model of the correspondence between the country of origin of a web page and features such as the page language, the character encoding, and the physical location of the server hosting the web site.
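To make the idea of a probabilistic model over such features concrete, the sketch below uses a naive Bayes classifier with Laplace smoothing. This is an assumption made for illustration: the text does not say which algorithm the project uses. The class name NaiveBayesCountry, the feature names, and the toy training rows are all invented for the example and are not project data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesCountry:
    """Illustrative naive Bayes over categorical page features.

    Assumption: features are treated as conditionally independent
    given the country. The source does not name the actual algorithm.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing strength
        self.country_counts = Counter()             # pages seen per country
        self.feature_counts = defaultdict(Counter)  # (country, feature) -> value counts
        self.feature_values = defaultdict(set)      # feature -> distinct values seen

    def fit(self, pages, countries):
        for features, country in zip(pages, countries):
            self.country_counts[country] += 1
            for name, value in features.items():
                self.feature_counts[(country, name)][value] += 1
                self.feature_values[name].add(value)
        return self

    def log_scores(self, features):
        """Return log P(country) + sum of log P(feature value | country)."""
        total = sum(self.country_counts.values())
        scores = {}
        for country, n in self.country_counts.items():
            score = math.log(n / total)
            for name, value in features.items():
                counts = self.feature_counts[(country, name)]
                # One extra slot reserves probability mass for unseen values.
                slots = len(self.feature_values[name]) + 1
                score += math.log(
                    (counts[value] + self.alpha) / (n + self.alpha * slots)
                )
            scores[country] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Toy, made-up training rows (not project data), labeled by country.
toy_pages = [
    {"language": "it", "charset": "iso-8859-1", "server_country": "IT"},
    {"language": "it", "charset": "utf-8", "server_country": "US"},
    {"language": "pt", "charset": "utf-8", "server_country": "BR"},
    {"language": "pt", "charset": "utf-8", "server_country": "US"},
]
toy_countries = ["IT", "IT", "BR", "BR"]

model = NaiveBayesCountry().fit(toy_pages, toy_countries)
guess = model.predict(
    {"language": "pt", "charset": "utf-8", "server_country": "US"}
)
```

In this toy setting a Portuguese-language page hosted in the United States is still attributed to Brazil, because the language feature outweighs the server location. A real model would need far more data and would have to account for correlations between features, which naive Bayes ignores.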

Results