Estimation of Web Contents Geographic Provenience Exploiting Creative Commons Licensed Pages for Training Set Aggregation

Proceedings of the Free Culture Research Workshop 2009, Harvard University, Cambridge, USA
Davide Bardone, Elias S. G. Carotti and Juan Carlos De Martin
23 October 2009

Geographic scope estimation is a fairly recent problem that is attracting increasing attention because of its broad implications in many fields, from the development of better search engines to the need to assess content production on a geographical basis. However, geographic scope is a concept that can be interpreted in many different ways, ranging from the expected target scope of a specific piece of content to the country where the content originated. The latter, although difficult to address, is of great importance in cases such as market inquiries or whenever estimates of content production in specific countries are needed. Search engines may also benefit from knowledge of the various kinds of geographic scope to better tune their responses to queries, e.g. according to (but not restricted to) geographic proximity to the user's location. However, that information is rarely available and must be inferred in the vast majority of cases. In this paper we propose a technique, grounded in machine learning theory, to estimate the source geography of web pages by means of a classifier learned on a specially constructed training set. The training set, consisting of a number of features extracted from web pages and the corresponding source-geography label (i.e. the country of origin of the web page), is built automatically by exploiting the large number of pages whose contents are licensed under a localized Creative Commons (CC) license. The model thus learned is then used to classify unlabeled records, and our tests showed a mean accuracy of 81% with a standard deviation of 0.9.
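
The labeling step described in the abstract could be sketched roughly as follows. This is only an illustration, not the procedure used in the paper: it assumes that each page's license link has already been extracted, that localized licenses follow the URL layout /licenses/<license-code>/<version>/<jurisdiction>/ on creativecommons.org, and that only two-letter jurisdiction codes are of interest. The names jurisdiction_label and build_training_set are hypothetical, and the feature values are placeholders.

```python
import re
from typing import Any, Iterable, List, Optional, Tuple
from urllib.parse import urlparse

# Assumed URL layout for localized (ported) CC licenses:
#   /licenses/<license-code>/<version>/<jurisdiction>/
# Simplification: only two-letter jurisdiction codes are recognized.
_PORTED_PATH = re.compile(r"^/licenses/([a-z\-]+)/(\d+\.\d+)/([a-z]{2})(?:/|$)")


def jurisdiction_label(license_url: str) -> Optional[str]:
    """Return the two-letter jurisdiction code of a localized CC license URL,
    or None if the URL is not a CC license or carries no jurisdiction."""
    parsed = urlparse(license_url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "creativecommons.org":
        return None
    match = _PORTED_PATH.match(parsed.path.lower())
    return match.group(3) if match else None


def build_training_set(
    pages: Iterable[Tuple[Any, str]]
) -> List[Tuple[Any, str]]:
    """Keep only pages whose license URL yields a label.

    `pages` is an iterable of (features, license_url) pairs; how the
    features are extracted is outside the scope of this sketch.
    """
    rows = []
    for features, license_url in pages:
        label = jurisdiction_label(license_url)
        if label is not None:
            rows.append((features, label))
    return rows


if __name__ == "__main__":
    sample = [
        ({"tld": "it"}, "http://creativecommons.org/licenses/by-nc/2.5/it/"),
        ({"tld": "com"}, "http://creativecommons.org/licenses/by/3.0/"),
    ]
    # Only the first page carries a jurisdiction and is kept.
    print(build_training_set(sample))
```

Pages whose license has no jurisdiction component are simply dropped here, so they would remain unlabeled records for the learned classifier.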
