
P-TAG: Large Scale Automatic Generation of Personalized Annotation TAGs for the Web

by Paul Alexandru Chirita, Stefania Costache, Siegfried Handschuh, Wolfgang Nejdl

Abstract

The success of the Semantic Web depends on the availability of Web pages annotated with metadata. Free form metadata or tags, as used in social bookmarking and folksonomies, have become more and more popular and successful. Such tags are relevant keywords associated with or assigned to a piece of information (e.g., a Web page), describing the item and enabling keyword-based classification. In this paper we propose P-TAG, a method which automatically generates personalized tags for Web pages. Upon browsing a Web page, P-TAG produces keywords relevant both to its textual content and to the data residing on the surfer's Desktop, thus expressing a personalized viewpoint. Empirical evaluations with several algorithms pursuing this approach showed very promising results. We are therefore very confident that such a user-oriented automatic tagging approach can provide large-scale personalized metadata annotations as an important step towards realizing the Semantic Web.


Available from portal.acm.org
Paul-Alexandru Chirita¹*, Stefania Costache¹, Siegfried Handschuh², Wolfgang Nejdl¹

¹ L3S Research Center / University of Hannover, Appelstr. 9a, 30167 Hannover, Germany
{chirita,costache,nejdl}@l3s.de
² National University of Ireland / DERI, IDA Business Park, Lower Dangan, Galway, Ireland
Siegfried.Handschuh@deri.org

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing
H.3.5 [Information Storage and Retrieval]: On-line Information Services

General Terms
Algorithms, Experimentation, Design

Keywords
Web Annotations, Tagging, Personalization, User Desktop

1. INTRODUCTION

The World Wide Web has had a tremendous impact on society and business in recent years by making information instantly and ubiquitously available. The Semantic Web is seen as an extension of the WWW, a vision of a future Web of machine-understandable documents and data.
One of its main instruments is annotation, which enriches content with metadata in order to ease its automatic processing. The traditional paradigm of Semantic Web annotation (i.e., annotating Web sites with the help of external tools) has been established for a number of years by now, for example in the form of applications such as OntoMat [20] or tools based on Annotea [23], and the process continues to develop and improve. However, this paradigm is based on manual or semi-automatic annotation, which is a laborious, time-consuming task, requiring a lot of expert know-how, and thus only applicable to small-scale or Intranet collections. For the overall Web though, the growth of a Semantic Web overlay is restricted because of the lack of annotated Web pages.

* Part of this work was performed while the author was visiting Yahoo! Research, Barcelona, Spain.
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

At the same time, the tagging paradigm, which has its roots in social bookmarking and folksonomies, is becoming more and more popular. A tag is a relevant keyword associated with or assigned to a piece of information (e.g., a Web page), describing the item and enabling keyword-based classification of the information it is applied to. The successful application of the tagging paradigm can be seen as evidence that a lowercase semantic Web¹ could be easier to grasp for the millions of Web users and hence easier to introduce, exploit and benefit from. One can then build upon this lowercase semantic Web as a basis for the introduction of more semantics, thus advancing further towards the Web 2.0 ideas.

We argue that a successful and easily achievable approach is to automatically generate annotation tags for Web pages in a scalable fashion.
We use tags in their general sense, i.e., as a mechanism to indicate what a particular document is about [4], rather than, for example, to organize one's tasks (e.g., "todo"). Yet automatically generated tags have the drawback of presenting only a generic view, which does not necessarily reflect personal interests. For example, a person might categorize the home page of Anthony Jameson² with the tags "human computer interaction" and "mobile computing", because this reflects her research interests, while another would annotate it with the project names "Halo 2" and "MeMo", because she is more interested in research applications.

The crucial question is then how to automatically tag Web pages in a personalized way. In many environments, defining a user's viewpoint would rely on the definition of an interest profile. However, these profiles are laborious to create and need constant maintenance in order to reflect the changing interests of the user. Fortunately, we do have a rich source of user profiling information available: everything stored on her computer. This personal Desktop usually contains a very rich document corpus of personal information which can and should be exploited for user personalization! There is no need to maintain a dedicated interest profile, since the Desktop as such reflects all the trends and new interests of a user, while it also tracks her history.

¹ Lowercase semantic web refers to an evolutionary approach for the Semantic Web by adding simple meaning gradually into the documents and thus lowering the barriers for re-using information.
² http://www.dfki.de/~jameson

WWW 2007 / Track: Semantic Web Session: Semantic Web and Web 2.0

Based on this observation, we propose a novel approach for a scalable automatic generation of annotation tags for Web pages, personalized on each user's Desktop. We achieve this by aligning keyword candidates for a given Web page with keywords representing the personal Desktop documents and thus the subject's / author's personal interest, utilizing appropriate algorithms. The resulting personalized annotations can be added on the fly to any Web page browsed by the user.

The structure of the paper is as follows: In Section 2 we discuss previous and related work. Section 3 describes the core algorithmic approaches for personalized annotation of Web pages by exploiting Desktop documents. In Section 4 we present the setup and empirical results of our evaluation. Finally, prior to concluding, we briefly discuss possible applications of our approach in Section 5.

2. PREVIOUS WORK

This paper presents a novel approach to generate personalized annotation tags for Web pages by exploiting document similarity and keyword extraction algorithms. Though some blueprints do exist, to our knowledge there has been no prior explicit formulation of this approach, nor a concrete application or empirical evaluation, as presented in this paper. Nevertheless, a substantial amount of related work already exists concerning the general goal of creating annotations for Web pages, as well as keyword extraction. The following sections will discuss some of the most important works in the research areas of annotation, text mining for keyword extraction, and keyword association.

2.1 Generating Annotations for the Web

Brooks and Montanez [4] analyzed the effectiveness of tags for classifying blog entries and found that manual tags are less effective content descriptors than automated ones.
We see this as support for our work, especially since our evaluation in Section 4 shows that the tags we create do result in high precision for content description. They further showed that clustering algorithms can be used to construct a topical hierarchy amongst tags. We believe this finding could be a useful extension to our approach.

Cimiano et al. [10] proposed PANKOW (Pattern-based Annotation through Knowledge on the Web), a method which employs an unsupervised, pattern-oriented approach to categorize an instance with respect to a given ontology. Similar to P-TAG, the system is rather simple, effortless and intuitive to use for Web page annotation. However, it requires an input ontology and outputs instances of the ontological concepts, whereas we simply annotate Web pages with user-specific tags. Also, PANKOW exploits the Web by means of a generic statistical analysis, and thus its annotations reflect more common knowledge without considering context or personal preferences, whereas we create personalized tags. Finally, the major drawback of PANKOW is that it does not scale, since it issues an extremely large number of queries against one target Web search engine in order to collect its necessary data.

The work in [11] presents an enhanced version of PANKOW, namely C-PANKOW. The application downloads document abstracts and processes them off-line, thus overcoming several shortcomings of PANKOW. It also introduces the notion of context, based on the similarity between the document to be annotated and each of the downloaded abstracts. However, it is reported that annotating one page can take up to 20 minutes with C-PANKOW. Our system annotates Web pages on the fly in seconds. Note that the tasks are not entirely comparable though, since our system does not produce ontology-based annotations, but personalized annotation tags.
Further, our notion of context is much stronger, since we consider documents from the personal Desktop, which leads to highly personalized annotations. Finally, C-PANKOW uses the proper nouns of each Web page as annotation candidates, and thus annotation is always directly rooted in the text of the Web page. On the other hand, the algorithms we propose in this paper generate keywords that do not necessarily appear literally on the Web page, but are in its context, while also reflecting the personal interests of the user.

Dill et al. [14] present a platform for large-scale text analytics and automatic semantic tagging. The system spots known terms in a Web page and relates them to existing instances of a given ontology. The strength of the system is in the taxonomy-based disambiguation algorithm. In contrast, our system does not rely on such a handcrafted lexicon and extracts new keywords in a fully automatic fashion, while also supporting personalized annotations.

2.2 Text Mining for Keyword Extraction

Text data mining is one of the main technologies for discovering new facts and trends about the currently existing large text collections [21]. There exists a diverse range of approaches for extracting keywords from textual documents. In this section we review some of those techniques originating from the Semantic Web, Information Retrieval and Natural Language Processing environments, as they are closest to the algorithms described in this paper.

In Information Retrieval, most of these techniques were used for Relevance Feedback [29], a process in which the user query submitted to a search engine is expanded with additional keywords extracted from a set of relevant documents [32]. Some comprehensive comparisons and literature reviews of this area can be found in [17, 30]. Efthimiadis [17] for example proposed several simple methods to extract keywords based on term frequency, document frequency, etc.
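
As an illustration of the kind of term-frequency and document-frequency heuristics mentioned above, the following minimal Python sketch ranks the terms of one text against a small reference collection. It is not the method of [17], nor one of the P-TAG algorithms: the function names `tokenize` and `top_keywords`, the smoothed inverse-document-frequency weight, and the toy collection are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(target, collection, k=3):
    """Rank the terms of `target` by term frequency times a smoothed
    inverse document frequency computed over `collection`.

    The weighting is an illustrative assumption, not taken from the paper.
    """
    docs = [set(tokenize(doc)) for doc in collection]
    n_docs = len(docs)
    term_freq = Counter(tokenize(target))
    scores = {}
    for term, freq in term_freq.items():
        # Document frequency: number of collection documents containing the term.
        doc_freq = sum(1 for doc in docs if term in doc)
        # Smoothed IDF, so terms absent from the collection get a finite weight.
        idf = math.log((n_docs + 1) / (doc_freq + 1)) + 1.0
        scores[term] = freq * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    collection = [
        "the semantic web uses annotations",
        "the desktop stores personal documents",
        "the web page has tags",
    ]
    print(top_keywords("tags tags annotate the web page", collection))
```

Terms that are frequent in the target text but rare in the collection rise to the top, while ubiquitous words such as "the" are pushed down. Real systems would typically add stop-word removal and stemming, which are omitted here for brevity.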
We used some of these as inspiration for our Desktop-specific annotation process. Chang and Hsu [7] first applied a clustering algorithm over the input collection of documents, and then attempted to extract keywords as cluster digests. We moved this one step further, by investigating the possibilities to acquire such keywords using Latent Semantic Analysis [13], which results in higher-quality clusterings over textual collections. However, this turned out to require too many computational resources for the already large Desktop data sets.

The more precise the extracted information is, the closer we move to applying NLP algorithms. Lam and Jones [26] for example used summarization techniques to extract informative sentences from documents. Within the Semantic Web / Information Extraction area, we distinguish the advances achieved within the GATE system [12, 27], which allows not only for NLP-based entity recognition, but also for identifying relations between such entities. Its functionalities are exploited by several semantic annotation systems, either generally focused on extracting semantics from Web pages (as for example in KIM [25]), or guided by a specific purpose underlying ontology (as in Artequakt [1]).

2.3 Text Mining for Keyword Association

While not directly related to the actual generation of semantic entities, keyword association is useful for enriching already discovered annotations, for example with additional terms that describe them in more detail. Two generic techniques have been found useful for this purpose. First, such terms could be identified utilizing co-occurrence statistics over the entire document collection to annotate [24]. In fact, as this approach has been shown to yield good results, many subsequent metrics have been developed to best assess "term relationship" levels, either by narrowing the analysis for
