ProvenanceJS: Revealing the Provenance of Web Pages

by Paul Groth
International Provenance and Annotation Workshop IPAW10 (2010)

Abstract

Web pages are regularly constructed through combining content from multiple providers (e.g. photos from Flickr, quotes from the New York Times). As a result, it is often difficult for users and programmers to retrieve the provenance of a web page. Here, we present a JavaScript library, ProvenanceJS, that allows for the retrieval and visualization of the provenance information within a Web page and its embedded content. A key contribution is to demonstrate that provenance can be supported using widely deployed browser-based technologies.

Paul Groth
pgroth@few.vu.nl
VU University Amsterdam
De Boelelaan 1081a, 1081 HV, Amsterdam, The Netherlands
There has been a rapid proliferation of content sharing on the Web. Sites such
as Flickr, Slideshare.net, and YouTube make it easier to find and then integrate
images, video, and documents into web pages. Additionally, the culture of the
Web, in particular the blogosphere, thrives on quoting and re-quoting information.
Because of this mash-up culture and infrastructure, most web pages consist of
content originating from multiple sources. Thus, when viewing a web page it is
often difficult to determine where its content came from and how it was produced.
This lack of provenance is seen as a critical issue in both the provenance and
Web communities, as highlighted by the start of the W3C Provenance Incubator
Group and its recently produced report on requirements for provenance on the
Web [3]. In particular, provenance is one of the most important features users rely
on when determining whether to trust a Web page [4]. Indeed, Tim Berners-Lee
envisioned an "Oh, yeah?" button within Web browsers that, when clicked,
would produce reasons why the user should trust the web page based on its
provenance [1].
To move towards the realization of such an "Oh, yeah?" button that is widely
distributed, we have developed a library, ProvenanceJS, that allows for the
retrieval and visualization of the provenance of a web page. There are two key
contributions stemming from ProvenanceJS:
1. Browser-based technologies are capable of retrieving and rendering
provenance information without the need for additional software installation.
2. Embedding provenance information within content is a viable approach for
ensuring that the provenance information is available.
We now discuss ProvenanceJS in more detail, beginning with a use case.
Fig. 2. A visualization of the provenance of the Web page shown in Figure 1
Vocabulary [9]. This vocabulary is an RDF realization of the Open Provenance
Model (OPM) [8] with a number of extensions and is being actively developed
to help address the needs of data.gov.uk. Using this vocabulary, publishers can
mark up their data with explicit statements about the provenance of the various
parts of their page. For example, in the surf blog use case, the publishers can
identify where the quote originated and who selected it.
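
As a rough illustration of such explicit statements, the sketch below encodes the quote's provenance as plain subject-predicate-object triples and answers the two questions above. The `opmv:` terms are assumed from the OPM Vocabulary; the `ex:` identifiers, the `triples` array, and the helper functions are hypothetical. A real page would carry these statements in its markup rather than in a script.

```javascript
// Provenance of the quote as subject-predicate-object triples.
// Assumption: the ex: identifiers are invented for this example, and the
// opmv: property names follow the OPM Vocabulary as we understand it.
const triples = [
  ["ex:quote", "rdf:type", "opmv:Artifact"],
  ["ex:aggregation", "rdf:type", "opmv:Process"],
  ["ex:johnSmith", "rdf:type", "opmv:Agent"],
  ["ex:quote", "opmv:wasGeneratedBy", "ex:aggregation"],
  ["ex:aggregation", "opmv:used", "ex:sourceArticle"],
  ["ex:aggregation", "opmv:wasControlledBy", "ex:johnSmith"],
];

// Return every object o such that (subject, predicate, o) is asserted.
function objectsOf(subject, predicate) {
  return triples
    .filter(([s, p]) => s === subject && p === predicate)
    .map(([, , o]) => o);
}

// Follow wasGeneratedBy from an artifact to its generating process(es),
// then follow the given predicate from each process.
function viaGeneratingProcess(artifact, predicate) {
  return objectsOf(artifact, "opmv:wasGeneratedBy")
    .flatMap((proc) => objectsOf(proc, predicate));
}

// Where did the quote originate, and who selected it?
const quoteSources = viaGeneratingProcess("ex:quote", "opmv:used");
const quoteSelectors = viaGeneratingProcess("ex:quote", "opmv:wasControlledBy");
console.log(quoteSources, quoteSelectors);
```
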
While explicit provenance metadata within Web pages is advantageous, it is
often not practically feasible to provide it. In the surfing blog, for instance,
reproducing the provenance of the image within the web page would be time
consuming and would increase the size of the page. Additionally, if the image is
copied, its provenance, if only represented within the web page, would not be
carried with it. To address this concern, ProvenanceJS aims to extract provenance
metadata from the content within a Web page. At the time of writing, ProvenanceJS
can extract information from the EXIF metadata found within JPEG images.
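
A minimal sketch of this idea, assuming the EXIF tags have already been read into a plain object by some EXIF reader: the function below turns the standard Software and Copyright tags into simple provenance statements. The function name `exifToProvenance`, the statement shape, the relation labels, and the image URL are all hypothetical and are not taken from ProvenanceJS itself.

```javascript
// Map a few EXIF tags to provenance statements about an image.
// Assumption: `tags` is a plain object keyed by standard EXIF tag names;
// the relation labels below are illustrative, not a fixed vocabulary.
function exifToProvenance(imageUrl, tags) {
  const statements = [];
  const clean = (value) => String(value).trim();

  if (tags.Software) {
    // The Software tag names the program that last wrote the file.
    statements.push({
      subject: imageUrl,
      relation: "modifiedBy",
      object: clean(tags.Software),
      objectType: "process",
    });
  }
  if (tags.Copyright) {
    // The Copyright tag names the rights holder.
    statements.push({
      subject: imageUrl,
      relation: "copyrightHeldBy",
      object: clean(tags.Copyright),
      objectType: "agent",
    });
  }
  return statements;
}

// Hypothetical input for the image in the surfing blog use case.
const imageStatements = exifToProvenance("http://example.org/surf.jpg", {
  Software: "Adobe Photoshop ",
  Copyright: "Michael Dawes",
});
console.log(imageStatements);
```

Because the statements are derived from metadata stored inside the image file, they remain recoverable when the image is copied to another page.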
3 Implementation
ProvenanceJS is implemented entirely in JavaScript using the JavaScript InfoVis
Toolkit, rdfQuery, and exif.js. In addition to the extraction of the metadata
described above, it provides an API for building and manipulating OPM graphs
and for visualizing those graphs. A bookmarklet ('Reveal Provenance') is included,
which visualizes the current web page's provenance. The results of using the
bookmarklet on the Surf Blog page can be seen in Figure 2. Triangle nodes are
artifacts. Circle nodes are processes. Figure 2 shows how the page consists of
an image and a quote. It shows how the quote was generated by an aggregation
process controlled by John Smith. In addition, it depicts that the image was
modified by Adobe Photoshop and that the copyright of the image belongs to
Michael Dawes. The bookmarklet is a first step towards a true "Oh, yeah?"
button.
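
To make the graph in Figure 2 more concrete, here is a small sketch of an OPM-style graph structure with typed nodes, labeled edges, and the node shapes described above. The `OpmGraph` class, its methods, the node identifiers, and the relation labels are hypothetical. They illustrate the kind of API described in this section, not the actual ProvenanceJS interface, and the shape chosen for agents is our own assumption.

```javascript
// A tiny OPM-style graph: nodes are artifacts, processes, or agents;
// edges point from an effect to its cause.
class OpmGraph {
  constructor() {
    this.nodes = new Map();
    this.edges = [];
  }

  addNode(id, type) {
    if (!["artifact", "process", "agent"].includes(type)) {
      throw new Error("unknown node type: " + type);
    }
    this.nodes.set(id, { id, type });
    return this;
  }

  addEdge(from, relation, to) {
    if (!this.nodes.has(from) || !this.nodes.has(to)) {
      throw new Error("both endpoints must be added before the edge");
    }
    this.edges.push({ from, relation, to });
    return this;
  }

  // Triangles for artifacts and circles for processes, as in Figure 2.
  // The agent shape is an assumption made for this sketch.
  shapeOf(id) {
    const shapes = { artifact: "triangle", process: "circle", agent: "square" };
    return shapes[this.nodes.get(id).type];
  }

  // All nodes reachable from startId by following edges towards causes.
  lineage(startId) {
    const seen = new Set();
    const stack = [startId];
    while (stack.length > 0) {
      const current = stack.pop();
      for (const edge of this.edges) {
        if (edge.from === current && !seen.has(edge.to)) {
          seen.add(edge.to);
          stack.push(edge.to);
        }
      }
    }
    return [...seen];
  }
}

const graph = new OpmGraph()
  .addNode("page", "artifact")
  .addNode("image", "artifact")
  .addNode("quote", "artifact")
  .addNode("aggregation", "process")
  .addNode("Adobe Photoshop", "process")
  .addNode("John Smith", "agent")
  .addEdge("page", "wasDerivedFrom", "image")
  .addEdge("page", "wasDerivedFrom", "quote")
  .addEdge("quote", "wasGeneratedBy", "aggregation")
  .addEdge("aggregation", "wasControlledBy", "John Smith")
  .addEdge("image", "wasGeneratedBy", "Adobe Photoshop");

console.log(graph.lineage("page"));
```

A visualization layer would walk `graph.nodes` and `graph.edges` and use `shapeOf` to pick the glyph for each node.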
4 Related Work and Conclusion
Moreau provides an extensive review of the provenance literature from the
perspective of the Web [7]. A number of authors have considered provenance on the
Semantic Web. In particular, Bizer et al. present a Semantic Web based policy
framework for information quality [2]. It included an implementation of the "Oh,
yeah?" button. However, this implementation required a browser plug-in. We
see ProvenanceJS as building on top of such existing Semantic Web approaches.
Margo and Seltzer showed how, by treating user interaction with a Web browser
as provenance, novel search functionality could be realized [6]. The closest work
to ProvenanceJS is the Provenance-Embedding Document approach [5]. This
approach uses JavaScript to extract provenance from RDFa metadata. Our work
differs in that we support the extraction of provenance from embedded content
and use a community-driven provenance vocabulary.
Using a simple but representative use case, we demonstrated how ProvenanceJS
can be used to retrieve and visualize the provenance of a web page using
only browser-based technology, namely JavaScript. Additionally, we showed
how both provenance metadata from page markup and embedded content can
be integrated to provide a full view of provenance.
References
1. Berners-Lee, T.: Cleaning up the User Interface (1997),
http://www.w3.org/DesignIssues/UI.html
2. Bizer, C., Cyganiak, R.: Quality-driven information filtering using the WIQA policy
framework. Web Semantics: Science, Services and Agents on the World Wide Web
7(1), 1-10 (January 2009)
3. Cheney, J., Gil, Y., Groth, P.E., Miles, S.: Requirements for Provenance on the Web
(2010), http://www.w3.org/2005/Incubator/prov/wiki/User_Requirements
4. Gil, Y., Artz, D.: Towards content trust of web resources. Journal of Web Semantics
5(4), 227-239
5. Jones, H.C.: XHTML documents with inline, policy-aware provenance. M. eng.,
Massachusetts Institute of Technology (2007)
6. Margo, D.W., Seltzer, M.: The Case for Browser Provenance. In: 1st Workshop on
the Theory and Practice of Provenance (TaPP'09) (2009)
7. Moreau, L.: Foundations of Provenance on the Web. Foundations and Trends in
Web Science (Submitted) (2009)
8. Moreau, L., Plale, B., Miles, S., Goble, C., Missier, P., Barga, R., Simmhan, Y.,
Futrelle, J., McGrath, R., Myers, J., Paulson, P., Bowers, S., Ludaescher, B., Kwas-
nikowska, N., Van Den Bussche, J., Ellkvist, T., Freire, J., Groth, P.: Open Prove-
nance Model (v1.01), http://eprints.ecs.soton.ac.uk/16148/1/opm-v1.01.pdf
9. Zhao, J.: Guide to the Open Provenance Model Vocabulary (2010),
http://purl.org/net/opmv/guide
