Harvesting and organizing knowledge from the Web

Abstract

Information organization and search on the Web are gaining structure, context awareness, and a more semantic flavor, for example in the form of faceted search, vertical search, entity search, and Deep-Web search. I envision another big leap forward by automatically harvesting and organizing knowledge from the Web, represented in terms of explicit entities and relations as well as ontological concepts. This will be made possible by the confluence of three strong trends: 1) rich Semantic-Web-style knowledge repositories like ontologies and taxonomies, 2) large-scale information extraction from high-quality text sources such as Wikipedia, and 3) social tagging in the spirit of Web 2.0. I refer to the three directions as Semantic Web, Statistical Web, and Social Web (at the risk of some oversimplification), and I briefly characterize each of them.

Semantic Web: Although the Semantic Web in its originally envisioned glorious form is still a very elusive goal, the vision itself has created significant momentum towards creating ontologies and representing knowledge in more rigorous formats than text (see, e.g., [5,7]). These include general-purpose ontologies and thesauri such as SUMO, OpenCyc, ConceptNet, or WordNet, as well as domain-specific ontologies and terminological taxonomies such as GeneOntology, SNOMED, or UMLS. While each of these collections alone may be viewed as fairly partial, connecting them and combining them with "softer" knowledge sources such as Wikipedia could be a powerful way of organizing more and more knowledge in rigorous representations that allow effective querying and reasoning. Richly annotated natural-language corpora such as multilingual thesauri, word-sense-tagged texts, or even representations in logic-based frames are becoming an interesting asset as well.

Statistical Web: Information-extraction (IE) technology, that is, entity recognition and the learning of relation patterns, has made enormous progress and become much more scalable in recent years [1,10] and also much less dependent on human supervision [3,4,8]. Much of this progress comes from major advances in the underlying fields of natural language processing (NLP) and statistical learning, but there is also a much better understanding of algorithmic efficiency and of how to engineer large-scale IE. To be clear, all these technologies will remain computationally expensive, but the gloomy picture of such issues being "AI-complete" and practically hopeless is gone.

Social Web: There is a growing amount of "low-hanging fruit" that allows us to harvest knowledge without any rocket science. Much of this comes from the Web 2.0 trends or, more specifically, from human contributions to the emerging Social Web (a.k.a. Human Semantic Web) in the form of tagging (and thus semantically annotating) Web pages, passages or phrases in pages, images, videos, etc., and creating so-called folksonomies (e.g., [6,10]). Another big contributor is the strong proliferation of high-quality knowledge repositories with some explicit structure that is suitable for entity, relation, and topic recognition. Wikipedia is probably the best example. Although it is still primarily hyperlinked text, the link structure, the thematic categories to which articles are manually assigned, and the templates that are used for authoring certain types of articles (e.g., about music bands) provide enormous benefits for semantic tagging. Several recent projects have made excellent use of Wikipedia and similar sources for building explicit knowledge bases and connecting these with other sources (e.g., [2,9]).

Each of the three directions (Semantic Web, Statistical Web, Social Web) poses interesting research themes. I believe that connecting these different kinds of implicit and explicit knowledge sources opens up synergies and great opportunities towards the vision of large-scale knowledge management and search. The talk will present various approaches in each of the three areas, discuss their strengths and weaknesses, and point out ideas on a combined methodology. © Springer-Verlag Berlin Heidelberg 2007.

Cite

Weikum, G. (2007). Harvesting and organizing knowledge from the Web. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4690 LNCS, pp. 12–13). Springer Verlag. https://doi.org/10.1007/978-3-540-75185-4_2