Source Geography Estimation for Web Pages

Proceedings of the 9th IEEE International Symposium on Signal Processing and Information Technology 2009, IEEE ISSPIT 2009, Ajman, UAE.
Davide Bardone, Elias S. G. Carotti and Juan Carlos De Martin
14-17 December 2009

The problem of inferring geographical information associated with web pages and identifying the geographic scope of their content is gaining increasing attention. However, geographic scope can be interpreted in many different ways, ranging from the intended target scope of a specific piece of content to the country where the content originated. The latter, although difficult to determine, is of great importance for many purposes, for example market research, or whenever estimates of content production in specific countries are needed. Search engines may also benefit from knowledge of the various kinds of geographic scope to better tune their responses to queries, e.g. according to (but not restricted to) geographic proximity to the user's location. However, that information is rarely available and must be inferred in the vast majority of cases. In this paper we propose a technique, grounded in machine learning theory, to estimate the source geography of web pages by means of a classifier learned on a specially constructed training set. The training set, consisting of a number of features extracted from web pages and the corresponding source-geography label (i.e. the country of origin of the web page), is built automatically by exploiting the large number of pages whose content is licensed under a localized Creative Commons (CC) license. The model thus learned is then used to classify unlabeled records; our tests showed a mean accuracy of 81% with a standard deviation of 0.9.
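
The abstract does not say how the source-geography label is read from a page. As a minimal sketch, assuming the label is taken from the two-letter jurisdiction code that ported CC license URLs carry (for example http://creativecommons.org/licenses/by-nc/2.5/it/), the labelling step might look like the following. The function name source_label and the regular expression are illustrative assumptions, not the authors' implementation, and longer jurisdiction codes (such as "scotland") are deliberately not handled.

```python
import re
from typing import Optional

# Assumed shape of a CC license URL: /licenses/<terms>/<version>/[<jurisdiction>/]
# Only two-letter jurisdiction codes are recognised in this sketch; the
# negative lookahead keeps "legalcode" or "deed" from being read as a code.
_CC_LICENSE = re.compile(
    r"https?://creativecommons\.org/licenses/"
    r"(?P<terms>[a-z-]+)/(?P<version>\d\.\d)"
    r"(?:/(?P<jurisdiction>[a-z]{2})(?![a-z]))?",
    re.IGNORECASE,
)


def source_label(html: str) -> Optional[str]:
    """Return the jurisdiction code of the first localized CC license
    link found in the page, or None if no localized license is present."""
    for match in _CC_LICENSE.finditer(html):
        code = match.group("jurisdiction")
        if code:
            return code.lower()
    return None


page = '<a rel="license" href="http://creativecommons.org/licenses/by-nc/2.5/it/">CC</a>'
label = source_label(page)  # "it" for this page
```

Pages for which source_label returns None (unported licenses or no license at all) would simply be left out of the training set under this assumption.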
