Geography of the Web

Status: concluded
Period: January 2006 - December 2009
Executive summary: 

The Web Geography research activity focuses on estimating the country of origin of web content without relying on explicit geographical references in its text. To this end, the wide diffusion of web pages released under localized Creative Commons licenses was exploited, which made it possible to collect the necessary dataset automatically with a purpose-built web crawler.

Background: 

In recent years, the problem of inferring geographical information from web pages in order to determine the geographic context of their content has gained increasing attention. This information is useful in many fields, from localized market analysis and statistics on content production to more efficient content search and retrieval.

Objectives: 

Our research activity on Web Geography focuses on estimating the country of origin of web content without relying on explicit geographical references in its text. Estimation is performed by a Machine Learning algorithm that learns a probabilistic model of the correspondence between the country of origin of a web page and features such as the page language, the character encoding, and the physical location of the server hosting the web site.
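The description above does not name the learning algorithm, so the following is only an illustrative sketch of one possible probabilistic model: a categorical naive Bayes classifier with add-one smoothing. The class name `CategoricalNaiveBayes` and the feature keys `language`, `encoding` and `server_country` are assumptions made for this example, not the project's actual implementation.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Illustrative naive Bayes over categorical page features.

    This is an assumed stand-in for the (unspecified) model in the text.
    """

    def fit(self, samples, labels):
        # samples: list of dicts mapping feature name -> categorical value
        # labels: list of country codes, one per sample
        self.n_samples = len(labels)
        self.class_counts = Counter(labels)
        # value_counts[feature][country][value] -> number of occurrences
        self.value_counts = defaultdict(lambda: defaultdict(Counter))
        # Distinct values seen for each feature, needed for smoothing.
        self.vocab = defaultdict(set)
        for features, country in zip(samples, labels):
            for name, value in features.items():
                self.value_counts[name][country][value] += 1
                self.vocab[name].add(value)
        return self

    def predict_proba(self, features):
        """Return P(country | features) for every country seen in training."""
        log_scores = {}
        for country, count in self.class_counts.items():
            score = math.log(count / self.n_samples)  # log prior
            for name, value in features.items():
                seen = self.value_counts[name][country][value]
                # Add-one smoothing; the extra +1 reserves probability
                # mass for values never observed during training.
                denominator = count + len(self.vocab[name]) + 1
                score += math.log((seen + 1) / denominator)
            log_scores[country] = score
        # Normalize in log space to avoid numerical underflow.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def predict(self, features):
        """Return the most probable country for one page."""
        proba = self.predict_proba(features)
        return max(proba, key=proba.get)


# Toy, made-up training pages (not data from the project).
samples = [
    {"language": "it", "encoding": "iso-8859-1", "server_country": "IT"},
    {"language": "it", "encoding": "utf-8", "server_country": "US"},
    {"language": "de", "encoding": "iso-8859-1", "server_country": "DE"},
    {"language": "de", "encoding": "utf-8", "server_country": "DE"},
]
labels = ["IT", "IT", "DE", "DE"]
model = CategoricalNaiveBayes().fit(samples, labels)
```

Calling `model.predict({"language": "it", "encoding": "utf-8", "server_country": "US"})` on this toy data favours `IT`, because the page language outweighs the server location. The naive independence assumption keeps the model simple, at the cost of ignoring correlations between features such as language and encoding.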

Results: 

Learning a reliable model for classifying unseen pages requires a potentially huge training set of web pages labeled with their country of origin. Hand labeling such a dataset is infeasible, so we exploit the wide diffusion of web pages released under localized Creative Commons licenses, which makes it possible to collect the necessary dataset automatically with a purpose-built web crawler. To date, our models can determine the geographic context of a web page with an accuracy of about 81%. Related material is available at: http://nexa.polito.it/category/topic/web-geography
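How the crawler derives a label from a license is not described here. One plausible approach, sketched below under that assumption, relies on the fact that localized ("ported") Creative Commons license URLs end with a jurisdiction code, as in http://creativecommons.org/licenses/by-nc-sa/2.5/it/, and that pages commonly mark the license link with `rel="license"`. The helper names `jurisdiction_from_license_url`, `LicenseLinkParser` and `label_page` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches ported license URLs, e.g. .../licenses/by-nc-sa/2.5/it/
# Only two-letter jurisdiction codes are handled in this sketch.
LICENSE_URL = re.compile(
    r"^https?://(?:www\.)?creativecommons\.org/licenses/"
    r"(?P<terms>[a-z\-]+)/(?P<version>\d+\.\d+)/"
    r"(?P<jurisdiction>[a-z]{2})(?:/|$)",
    re.IGNORECASE,
)


def jurisdiction_from_license_url(url):
    """Return the upper-cased jurisdiction code, or None if the URL is
    not a localized Creative Commons license."""
    match = LICENSE_URL.match(url.strip())
    if match is None:
        return None
    return match.group("jurisdiction").upper()


class LicenseLinkParser(HTMLParser):
    """Collects href values of <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # Attribute values may be None, and rel may list several tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def label_page(html):
    """Return the jurisdiction of the first localized license found in
    the page, or None when the page cannot be labeled this way."""
    parser = LicenseLinkParser()
    parser.feed(html)
    for url in parser.license_urls:
        code = jurisdiction_from_license_url(url)
        if code is not None:
            return code
    return None
```

Pages under unported licenses yield `None` and would simply be left out of the training set. The extracted codes are Creative Commons jurisdiction codes, which do not always coincide with ISO country codes, so a real pipeline would likely need a mapping step.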