Biodiversity Informatics, 3, 2006, pp. 1-15

TAXONOMIC NAMES, METADATA, AND THE SEMANTIC WEB

RODERIC D. M. PAGE
Division of Environmental and Evolutionary Biology, Institute of Biomedical and Life Sciences, University of Glasgow, Glasgow G12 8QQ, Scotland
Email: firstname.lastname@example.org

Abstract. Life Science Identifiers (LSIDs) offer an attractive solution to the problem of globally unique identifiers for digital objects in biology. However, I suggest that in the context of taxonomic names, the most compelling benefit of adopting these identifiers comes from the metadata associated with each LSID. By using existing vocabularies wherever possible, and using a simple vocabulary for taxonomy-specific concepts, we can quickly capture the essential information about a taxonomic name in the Resource Description Framework (RDF) format. This opens up the prospect of using technologies developed for the Semantic Web to add "taxonomic intelligence" to biodiversity databases. This essay explores some of these ideas in the context of providing a taxonomic framework for the phylogenetic database TreeBASE.

Key words. Life Science Identifiers, metadata, taxonomic names, Semantic Web, RDF, triple stores.

Integrating diverse sources of digital information is a major challenge facing biodiversity informatics. Not only are we faced with numerous, disparate data providers, each with their own specific user communities, but the information in which we are interested is diverse, and includes taxonomic names and concepts, specimens in museum collections, scientific publications, genomic and phenotypic data, and images. Of course, this problem is not unique to biodiversity informatics: the wider bioinformatics community is keenly aware of it (Stein 2003), and indeed it is a major topic of discussion concerning the future direction of the World Wide Web ("Web 2.0").

My goal in this paper is to sketch some ideas on how we could create the infrastructure for a distributed system for querying information on biodiversity. My contention is that, thanks to efforts by the Semantic Web community [1], the elements we need are mostly already in place. The two key technologies I will advocate are the Resource Description Framework (RDF) [2] developed by the W3C, and the Life Science Identifier (LSID) technology developed by IBM [3]. It is easy to enthuse about a technology and contribute to the "hype" that surrounds it, so I will try to keep my feet on the ground by providing some background on the problem that led me to this conclusion, and by presenting working implementations wherever possible.

[1] http://www.w3.org/2001/sw/
[2] http://www.w3.org/RDF/
[3] http://lsid.sourceforge.net/

Motivation

Reconstructing the history of life on Earth (the "Tree of Life") is the holy grail of phylogenetics, yet we lack a comprehensive phylogenetic database that stores our efforts at reconstructing this tree. The most comprehensive phylogenetic database we currently have is TreeBASE [4] (Piel et al. 2002). As I have outlined elsewhere (Page 2004), a major limitation of this database is that it has no taxonomic "intelligence." Taxonomic names are entered into TreeBASE without being validated against any external database of names, hence many of the names are not proper scientific names.

[4] http://www.treebase.org/
[5] http://darwin.zoology.gla.ac.uk/~rpage/TreeBASE

Efforts to map names in TreeBASE to external databases rapidly run into problems. Around half the names in TreeBASE do not have an exact match in the NCBI's Taxonomy database. Using data cleaning tools (Herbert et al. 2004) or a combination of approximate string matching, regular expressions, and manual matching [5] can improve on this, but a significant fraction of names in TreeBASE still have no obvious counterpart in the NCBI's database. In some cases this is because no DNA
sequences have been (or indeed, can be) obtained from those taxa, in which case those names will not be in the NCBI database and hence matches may be sought in other taxonomic databases, such as the Integrated Taxonomic Information System (ITIS) [6], the International Plant Names Index (IPNI) [7], IndexFungorum [8], and the Universal Biological Indexer and Organizer (uBio) [9]. Whereas it is relatively easy to search NCBI's Taxonomy because the entire database can be downloaded, this is not the case for most other taxonomic databases.

A Taxonomic Search Engine

In 2004 I started to map TreeBASE names onto various taxonomic databases (results can be viewed online [10]). Querying these sources manually using their web interfaces is slow and tedious, so I developed a simple federated search engine that queries multiple taxonomic databases for information about a name (Page 2005). The Taxonomic Search Engine supports two basic queries, NameSearch and GetDataForID. The first query (NameSearch) searches a database for a name, and if the name is found returns the name and its identifier in that database. The second query (GetDataForID) "drills down" to get details about a single record in the source database.

Leaving aside the technical details of talking to databases that support very different query interfaces, I had two problems to deal with. The first was how to generate unique identifiers for names from each database. Given that most of the databases use integers as their primary keys, in many cases the same identifier will be used by different databases. As one of many possible examples, 101593 is the identifier for Odonata in NCBI's GenBank and for Dahlia australis in ITIS. Hence, if I were to store just the identifier it would not be clear what name that identifier referred to. An obvious solution is the idea of a "namespace" that specifies the context for a given identifier.
In this case, the identifier from NCBI could be distinguished from that in ITIS by adding prefixes corresponding to the domain names of the two databases, i.e., adding "ncbi.nlm.nih.gov" and "itis.usda.gov" to the respective identifiers.

[6] http://www.itis.usda.gov/
[7] http://www.ipni.org/
[8] http://www.indexfungorum.org/
[9] http://www.ubio.org/
[10] http://darwin.zoology.gla.ac.uk/~rpage/TreeBASE

The second problem is how to return information about a specific record in a database (e.g., the name, any synonyms, etc.). Given that each database has its own format for returning information (including delimited text, HTML, XML, and SOAP data structures), I transformed the result returned by each database into a common XML format that in turn could be transformed into HTML output for display in a web browser.

So, to facilitate mapping names in TreeBASE onto names in external databases we need (1) a mechanism for generating globally unique identifiers, and (2) a standard format for providing information about the object the identifier refers to. Before introducing one possible solution, let us first consider why names themselves are not enough.

Why Taxonomic Names Aren't Enough

The taxonomic name of an organism is a key link between different databases that store information on that organism. However, taxonomic names themselves have serious limitations as identifiers in databases (Kennedy 2003; Kennedy et al. 2005) due to the existence of multiple names (synonyms) for the same taxon, and the use of the same name to refer to different taxa. For example, the genus name Morus applies to both an animal (the gannet) and a plant (the mulberry tree). Even species names can be identical: a species of wasp and a species of conifer both share the name Agathis montana. Furthermore, there may be multiple names for the same taxon. Hence, using names alone to link different data sources can be prone to error.
As an example, at the time of writing NCBI's LinkOut feature mistakenly links the catfish genus Loricaria (NCBI tax_id = 52085) to the TreeBASE taxon Loricaria (TreeBASE TaxonID = 1305), which is a plant genus (family Compositae). This lack of uniqueness of names raises the issue of how to store taxonomic information in databases.

URIs, URLs, and URNs

Life Science Identifiers (LSIDs) are one solution to the problem of globally unique identifiers (Clark et al. 2004). At the risk of drowning the reader in alphabet soup, it is useful to distinguish between two different types of identifiers in use on the Internet, the Uniform Resource Locator (URL) and the Uniform Resource Name (URN). URNs and URLs are two possible kinds of Uniform Resource Identifier (URI).
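
The identifier collision described earlier (101593 denotes Odonata in NCBI and Dahlia australis in ITIS) shows why a globally unique identifier needs more than a database's local key. The following is a minimal Python sketch of the namespace remedy. The `QualifiedId` type and the `namespace:id` string form are illustrative assumptions, not part of the Taxonomic Search Engine or of any identifier standard.

```python
from typing import NamedTuple


class QualifiedId(NamedTuple):
    """A local identifier paired with the namespace that issued it.

    Hypothetical helper type: the namespace is the domain name of the
    source database, as suggested in the text.
    """
    namespace: str
    local_id: int

    def __str__(self) -> str:
        # The "namespace:id" rendering is an assumption made for display.
        return f"{self.namespace}:{self.local_id}"


# Bare integer keys: the second assignment silently overwrites the first,
# because both databases happen to use 101593 as a primary key.
by_bare_id = {}
by_bare_id[101593] = "Odonata"           # record from NCBI
by_bare_id[101593] = "Dahlia australis"  # record from ITIS

# Namespace-qualified keys keep the two records distinct.
by_qualified_id = {
    QualifiedId("ncbi.nlm.nih.gov", 101593): "Odonata",
    QualifiedId("itis.usda.gov", 101593): "Dahlia australis",
}

for key, name in by_qualified_id.items():
    print(f"{key} -> {name}")
```

Storing the pair in place of the bare integer means a lookup cannot confuse records that merely share a primary key.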

Most readers will be familiar with URLs, which specify the location of a resource on the Internet (e.g., http://www.ubio.org/SOAPbrowser/index.php?func=name_detail&ubioID=454488); that is, they "point" to it. They can, in principle, serve as unique identifiers; however, they are prone to breakage: if the resource being pointed to moves, the URL no longer points to the resource, leading to the dreaded "404 page not found" problem (Dellavalle et al. 2003). URNs, in contrast, provide a persistent name for a resource, but typically do not provide any information on how to access that resource.

A LSID is a Uniform Resource Name (URN). Digital Object Identifiers (DOIs) are another example of a URN, and are widely used in the publishing industry to identify electronic publications. If the resource moves (e.g., one publishing house acquires another, and moves the acquired company's digital resources to a new server) the resource still retains the original DOI. The utility of a URN is somewhat limited unless there is a mechanism to resolve the URN, that is, to retrieve the named resource. In the case of DOIs, the simplest way to see this mechanism in action is to append a DOI, such as 10.1145/1024694.1024703, to the URL http://dx.doi.org/, giving in this instance http://dx.doi.org/10.1145/1024694.1024703, and open the resulting URL in a web browser. In this example the DOI resolves to the electronic version of Herbert et al. (2004).

Life Science Identifiers

Figure 1. The components of a Life Science Identifier (LSID).

Figure 1 shows an example LSID. Each LSID is prefixed by "urn", indicating that the LSID is a URN; "lsid" indicates that the identifier is a LSID; then follow the authority, namespace, and identifier components. There may also be an optional revision component to indicate the version of the resource. The authority is a domain name that can be resolved by the
Internet DNS (typically a domain name owned by the data provider); the namespace and identifier are specific to the data source which provides the resource. In this case the LSID identifies a taxonomic name in the uBio database. The authority "ubio.org.lsid.zoology.gla.ac.uk" is a domain name of a server at the University of Glasgow that serves LSIDs for uBio records. If uBio itself served LSIDs, the domain name could be ubio.org. Note that the uniqueness of the LSID is in part guaranteed by the use of Internet domain names, which are globally unique. Provided that the data source ensures that each combination of namespace and identifier is unique within the data source, the LSID itself will be a globally unique identifier.

A LSID is intended to refer to one unchanging digital object. Hence, if two users retrieve data with the same LSID, they will have exactly the same data. This contrasts with URLs, where the content may change at any time (for example, if the author of the web page changes the layout). Different versions of a digital object can be identified using the revision part of the LSID. In addition to data there may be metadata associated with a LSID. The LSID standard doesn't require that the metadata remain unchanging.

The LSID standard specifies a mechanism for resolving a LSID and retrieving the data and/or metadata associated with that LSID. Because a LSID is not a URL, you can't simply paste a LSID into a web browser unless you have additional software installed, such as IBM's LSID Launchpad for Internet Explorer [13] (Figure 2) or the LSID extension for Firefox [14]. The BioPathways Consortium provides a web-based LSID resolver [15].

[13] http://lsid.sourceforge.net/
[14] http://lsid.mozdev.org/
[15] http://lsid.biopathways.org

The LSID shown in Figure 2 resolves to the IPNI