SWI-Prolog and the Web
Under consideration for publication in Theory and Practice of Logic Programming

JAN WIELEMAKER
Human-Computer Studies laboratory, University of Amsterdam, Matrix I, Kruislaan 419, 1098 VA, Amsterdam, The Netherlands
(e-mail: email@example.com)

ZHISHENG HUANG, LOURENS VAN DER MEIJ
Computer Science Department, Vrije University Amsterdam, De Boelelaan 1081a, 1081 HV, Amsterdam, The Netherlands
(e-mail: huang,firstname.lastname@example.org)

submitted 28 April 2006; revised 23 August 2007; accepted 18 October 2007

Abstract

Prolog is an excellent tool for representing and manipulating data written in formal languages as well as natural language. Its safe semantics and automatic memory management make it a prime candidate for programming robust Web services. Where Prolog is commonly seen as a component in a Web application that is either embedded or communicates using a proprietary protocol, we propose an architecture where Prolog communicates with other components in a Web application using the standard HTTP protocol. By avoiding embedding in external Web servers, development and deployment become much easier. To support this architecture, in addition to the transfer protocol, we must also support parsing, representing and generating the key Web document types such as HTML, XML and RDF. This paper motivates the design decisions in the libraries and extensions to Prolog for handling Web documents and protocols. The design has been guided by the requirement to handle large documents efficiently. The described libraries support a wide range of Web applications ranging from HTML and XML documents to Semantic Web RDF processing. The benefits of using Prolog for Web related tasks are illustrated using three case studies.

KEYWORDS: Prolog, HTTP, HTML, XML, RDF, DOM, Semantic Web

1 Introduction

The Web is an exciting place offering new opportunities to artificial intelligence, natural language processing and Logic Programming.
Information extraction from the Web, reasoning in Web applications and the Semantic Web are just a few examples. We have deployed Prolog in Web related tasks over a long period. As most of the development on SWI-Prolog takes place in the context of projects that require new features, the system and its libraries provide extensive support for Web programming.

arXiv:0711.0917v1 [cs.PL] 6 Nov 2007

There are two views on deploying Prolog for Web related tasks. In the most commonly used view, Prolog acts as an embedded component in a general Web processing environment. In this role it generally provides reasoning tasks such as searching or configuration within constraints. Alternatively, Prolog itself can act as a stand-alone HTTP server, as also proposed by ECLiPSe (Leth et al. 1996). In this view it is a component that can be part of any of the layers of the popular three-tier architecture for Web applications. Components generally exchange XML if used as part of the backend or middleware services and HTML if used in the presentation layer.

In our view, the latter approach is more attractive. Using HTTP and XML over HTTP, the service is cleanly isolated using standard protocols rather than proprietary communication. When Prolog runs as a stand-alone application, its attractive interactive development style can be maintained much more easily than when it is embedded in a C, C++, Java or C# application. Using HTTP, automatic testing of the Prolog components can be done using any Web oriented test framework. HTTP allows Prolog to be deployed in any part of the service architecture, including the realisation of complete Web applications in one or more Prolog processes.

When deploying Prolog in a Web application using HTTP, we must not only implement the HTTP transfer protocol, but also support parsing, representing and generating the important document types used on the Web, especially HTML, XML and RDF. Note that, because these document types are widely used open standards, supporting them is also valuable outside the context of Web applications. This paper gives an overview of the Web infrastructure we have realised.
Given the range of libraries and Prolog extensions that facilitate Web applications, we cannot describe them in detail. Details on the library interfaces can be found in the manuals available from the SWI-Prolog Web site [1]. Details on the implementation are available in the source distribution. The aim of this paper is to give an overview of the infrastructure required to use Prolog for realizing Web applications, where we concentrate on scalability and performance. We describe our decisions for representing Web documents in Prolog and outline the interfaces provided by our libraries.

[1] http://www.swi-prolog.org

The benefits of using Prolog for Web related tasks are illustrated using three case studies: 1) SeRQL, an RDF query language for meta data management, retrieval and reasoning; 2) XDIG, an eXtended Description Logic interface, which provides ontology management and reasoning by processing DIG XML documents and communicating with external DL reasoners; and 3) a faceted browser on Semantic Web databases integrating meta-data from multiple collections of art-works. This last case study serves as a complete Semantic Web application serving the end-user.

This paper is organized as follows. Section 2 to section 4 describe reading, writing and representation of Web related documents. Section 5 describes our HTTP client and server libraries. Section 6 describes extensions to the Prolog language that facilitate use in Web applications. Section 7 to section 9 describe the case studies.

2 Parsing and representing XML and HTML documents

The core of the Web is formed by document standards and exchange protocols. Here we describe tree-structured documents transferred as SGML or XML. HTML, an SGML application, is the most commonly used document format on the Web. HTML represents documents as a tree using a fixed set of elements (tags), where the SGML DTD (Document Type Declaration) puts constraints on how elements can be nested. Each node in the hierarchy has a name (the element-name), a set of name-value pairs known as its attributes, and content: a sequence of sub-elements and text (data).

XML is a rationalisation of SGML using the same tree-model, but removing many rarely used features as well as abbreviations that were introduced in SGML to make the markup easier to type and read by humans. XML documents are used to represent text using custom application-oriented tags as well as a serialization format for arbitrary data exchange between computers. XHTML is HTML based on XML rather than SGML.

    <document>        ::= list-of <content>
    <content>         ::= <element> | <pi> | <cdata> | <sdata> | <ndata>
    <element>         ::= element(<tag>, list-of <attribute>, list-of <content>)
    <attribute>       ::= <name> = <value>
    <pi>              ::= pi(<atom>)
    <sdata>           ::= sdata(<atom>)
    <ndata>           ::= ndata(<atom>)
    <cdata>, <name>   ::= <atom>
    <value>           ::= <svalue> | list-of <svalue>
    <svalue>          ::= <atom> | <number>

Fig. 1. SGML/XML tree representation in Prolog. The notation list-of <x> describes a Prolog list of terms of type <x>.

The first SGML parser for SWI-Prolog was created by Anjo Anjewierden based on the SP parser [2].
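As an illustration of the grammar in Fig. 1, the following minimal sketch writes out by hand the term for the fragment `<p class="intro">Hello <b>Web</b></p>` and walks it to collect its text. The predicate names `example_dom/1` and `dom_text/2` are hypothetical and chosen for this example only; they are not part of the SWI-Prolog libraries.

```prolog
:- use_module(library(lists)).      % append/3

% example_dom(-DOM): a <document> per Fig. 1, i.e. a list of <content>.
example_dom([ element(p, [class=intro],
                      [ 'Hello ',
                        element(b, [], ['Web'])
                      ])
            ]).

% dom_text(+Content, -Atoms): collect all CDATA atoms in document order.
dom_text([], []).
dom_text([element(_Tag, _Attrs, Children)|T], Atoms) :- !,
    dom_text(Children, A0),
    dom_text(T, A1),
    append(A0, A1, Atoms).
dom_text([H|T], [H|Atoms]) :-
    atom(H), !,                     % <cdata> is a plain atom
    dom_text(T, Atoms).
dom_text([_|T], Atoms) :-           % skip pi/1, sdata/1 and ndata/1
    dom_text(T, Atoms).
```

Because elements, attributes and text are ordinary Prolog terms, such traversals need nothing beyond unification and list processing.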
A stable Prolog term-representation for SGML/XML trees plays a role similar to that of the DOM (Document Object Model) representation in the object-oriented world. The term-structure we use is described in figure 1. Some issues have been subject to debate.

[2] http://www.jclark.com/sp/

• Representing text as a Prolog atom is biased by our use of SWI-Prolog, which places no length limit on atoms and whose atoms can represent Unicode text, as motivated in section 6.2. At the same time SWI-Prolog stacks are limited to 128MB each. Using atoms, only the structure of the tree is represented on the stack, while the bulk of the data is stored on the unlimited heap. Using lists of character codes is another possibility, adopted by both PiLLoW (Gras and Hermenegildo 2001) and ECLiPSe (Leth et al. 1996). Two observations make lists less attractive: lists use two cells per character, and practical experience shows that text is frequently processed as a unit only. For (HTML) text-documents we profit from the compact representation of atoms. For XML documents representing serialized data-structures we profit from frequent repetition of the same value.
• Attribute values of multi-value attributes (e.g. NAMES) are returned as a Prolog list. This implies the DTD must be available to get unambiguous results. With SGML this is always true, but not with XML.
• Optionally, attribute values of type NUMBER or NUMBERS are mapped to Prolog numbers. In addition to the DTD issues mentioned above, this conversion also suffers from possible loss of information. Leading zeros and the particular floating point notation used are lost after conversion. Prolog systems with bounded arithmetic may also not be able to represent all values. Still, automatic conversion is useful in many applications, especially those involving serialized data-structures.
• Attribute values are represented as Name=Value. Using Name(Value) is an alternative. The Name=Value representation was chosen for its similarity to the SGML notation and because it avoids the need for univ (=..) for processing argument-lists.

Implementation

The SWI-Prolog SGML/XML parser is implemented as a C-library that has been built from scratch to create a lightweight parser. Total source is 11,835 lines. The parser provides two interfaces. Most natural to Prolog is load_structure(+Src, -DOM, +Options), which parses a Prolog stream into a term as described above. Alternatively, sgml_parse/2 provides an event-based parser making call-backs on Prolog for the SGML events.
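The term-creation interface can be exercised with a few lines of code. The sketch below assumes a recent SWI-Prolog (version 7 or later, which provides open_string/2 and therefore postdates this paper) and uses only the documented `dialect(xml)` and `space(remove)` options of load_structure/3. The wrapper name `parse_xml_text/2` is hypothetical.

```prolog
:- use_module(library(sgml)).       % load_structure/3

% parse_xml_text(+Text, -DOM): parse XML held in an atom or string
% into the term representation of Fig. 1.
parse_xml_text(Text, DOM) :-
    setup_call_cleanup(
        open_string(Text, In),      % assumption: SWI-Prolog 7 or later
        load_structure(stream(In), DOM,
                       [ dialect(xml),    % parse as XML, not SGML
                         space(remove)    % drop layout-only white space
                       ]),
        close(In)).
```

For example, the goal `parse_xml_text('<doc id="a1"><item>one</item></doc>', DOM)` binds `DOM` to `[element(doc, [id=a1], [element(item, [], [one])])]`.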
The call-back mode can deal with unbounded documents in streaming mode. It can be mixed with the term-creation mode, where the handler for begin calls the parser to create a term-representation for the content of the element. This feature is used to process long files with a repetitive record structure in limited memory. Section 4.1 describes how this is used to process RDF documents. Full documentation is available from http://www.swi-prolog.org/packages/sgml2pl.html. The SWI-Prolog SGML parser has been adopted by XSB Prolog.

3 Generating Web documents

There are many approaches to generating Web pages from programs in general and Prolog in particular. We believe the preferred choice depends on various aspects.

• How much of the document is generated from dynamic data and how much is static? Pages that are static except for a few strings are best generated from a template using variable substitution. Pages consisting of a table generated from dynamic data are best entirely generated from the program.
• For program generated pages we can choose between direct printing and generating using a language-native syntax, for example format('<b>bold</b>') or print_html(b(bold)). The second approach can guarantee well-formed output, while the first only requires the programmer to learn format/3.
• Documents that contain a significant static part are best represented in the markup language, where special constructs insert program-generated parts. A popular approach implemented by PHP [3] and ASP [4] is to add a reserved element such as <script> and use the SGML/XML processing instruction written as <?...?>. The obvious name PSP (Prolog Server Pages) is in use by various projects taking this approach [5]. Another approach is PWP [6] (Prolog Well-formed Pages). It is based on the principle that the source is well-formed XML and interacts with Prolog through additional attributes. Output is guaranteed to be well-formed XML. Our infrastructure does not yet include any of these approaches.
• Page transformation is realised by parsing the original document into its tree representation, managing the tree and writing a new document from the tree. Managing the source-text directly is not reliable: due to character encoding choice, entity usage and SGML abbreviations, many different source-texts represent the same tree. The load_structure/3 predicate described in section 2 together with output primitives from the library sgml_write.pl provide this functionality. The XDIG case study described in section 8 follows this approach.

3.1 Generating documents using DCG

The traditional method for creating Web documents is using print routines such as write/1 or format/2. Although simple and easily explained to novices, the approach has serious drawbacks from a software engineering point of view. In particular the user is responsible for HTML quoting, character encoding issues and proper nesting of HTML elements.
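To make the quoting burden of direct printing concrete, the following minimal sketch emits a bold element with format/2 and escapes the text by hand. The helper names `entity/2`, `escape_codes/2`, `escape_html/2` and `print_bold/1` are hypothetical and written for this illustration; they handle only the three characters shown and ignore character encoding altogether.

```prolog
:- use_module(library(lists)).      % append/3

% entity(+Code, -Codes): characters that must be quoted in HTML text.
entity(0'<, Codes) :- atom_codes('&lt;',  Codes).
entity(0'>, Codes) :- atom_codes('&gt;',  Codes).
entity(0'&, Codes) :- atom_codes('&amp;', Codes).

% escape_codes(+Codes, -Escaped): replace special characters by entities.
escape_codes([], []).
escape_codes([C|Cs], Out) :-
    entity(C, Ent), !,
    append(Ent, Rest, Out),
    escape_codes(Cs, Rest).
escape_codes([C|Cs], [C|Rest]) :-
    escape_codes(Cs, Rest).

% escape_html(+Text, -Escaped): atom-level wrapper around escape_codes/2.
escape_html(Text, Escaped) :-
    atom_codes(Text, Codes),
    escape_codes(Codes, EscCodes),
    atom_codes(Escaped, EscCodes).

% print_bold(+Text): direct printing; escaping and nesting are manual.
print_bold(Text) :-
    escape_html(Text, Escaped),
    format('<b>~w</b>', [Escaped]).
```

Calling `print_bold('a < b & c')` writes `<b>a &lt; b &amp; c</b>`. Nothing stops a caller from omitting the call to `escape_html/2` or from forgetting a closing tag, in which case malformed output is produced silently.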
Automated validation is virtually impossible using this approach.

[3] www.php.net
[4] www.microsoft.com
[5] http://www.prologonlinereference.org/psp.psp, http://www.benjaminjohnston.com.au/template.prolog?t=psp, http://www.ifcomputer.com/inap/inap2001/program/inap bartenstein.ps
[6] http://www.cs.otago.ac.nz/staffpriv/ok/pwp.pl

Alternatively we can produce a DOM term as described in section 2 and use the library sgml_write.pl to create the HTML or XML document. Such documents are guaranteed to use proper nesting of elements, escape sequences and character encoding. The terms, however, are big, deeply nested and hard to read and write. Prolog allows them to be built from skeletons containing variables. This approach is taken by PiLLoW (section 3.2) to control the complexity. In our opinion, the result is not optimal due to the unnatural order of statements, as illustrated in figure 2. PiLLoW has partly overcome this shortcoming by defining a large number of “utility