Semantic tagging of and semantic enhancements to systematics papers: ZooKeys working examples

Lyubomir Penev1, Donat Agosti2, Teodor Georgiev3, Terry Catapano2, Jeremy Miller4, Vladimir Blagoderov5, David Roberts5, Vincent S. Smith5, Irina Brake5, Simon Ryrcroft5, Ben Scott5, Norman F. Johnson6, Robert A. Morris7, Guido Sautter8, Vishwas Chavan9, Tim Robertson9, David Remsen9, Pavel Stoev10, Cynthia Parr11, Sandra Knapp5, W. John Kress12, F. Christian Thompson12, Terry Erwin12

1 Bulgarian Academy of Sciences & Pensoft Publishers, 13a Geo Milev Str., Sofia, Bulgaria
2 Plazi, Zinggstrasse 16, Bern, Switzerland
3 Pensoft Publishers, 13a Geo Milev Str., Sofia, Bulgaria
4 Nationaal Natuurhistorisch Museum Naturalis, Netherlands
5 The Natural History Museum, Cromwell Road, London, UK
6 The Ohio State University, Columbus, OH, USA
7 University of Massachusetts, Boston, USA & Plazi, Zinggstrasse 16, Bern, Switzerland
8 IPD Böhm, Karlsruhe Institute of Technology, Germany & Plazi, Zinggstrasse 16, Bern, Switzerland
9 Global Biodiversity Information Facility, Copenhagen, Denmark
10 National Museum of Natural History, 1 Tsar Osvoboditel blvd., Sofia, Bulgaria
11 Encyclopedia of Life, Washington, DC, USA
12 Smithsonian Institution, Washington, DC, USA

Corresponding author: Lyubomir Penev (info@pensoft.net)

Received 20 May 2010 | Accepted 22 June 2010 | Published 30 June 2010

Citation: Penev L, Agosti D, Georgiev T, Catapano T, Miller J, Blagoderov V, Roberts D, Smith VS, Brake I, Ryrcroft S, Scott B, Johnson NF, Morris RA, Sautter G, Chavan V, Robertson T, Remsen D, Stoev P, Parr C, Knapp S, Kress WJ, Thompson FC, Erwin T (2010) Semantic tagging of and semantic enhancements to systematics papers: ZooKeys working examples. ZooKeys 50: 1–16.
doi: 10.3897/zookeys.50.538

FORUM PAPER. ZooKeys 50: 1–16 (2010), a peer-reviewed open-access journal, www.pensoftonline.net/zookeys

Copyright Lyubomir Penev et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

The concept of semantic tagging and its potential for semantic enhancements to taxonomic papers is outlined and illustrated by four exemplar papers published in the present issue of ZooKeys. The four papers were created in different ways: (i) written in Microsoft Word and submitted as a non-tagged manuscript (doi: 10.3897/zookeys.50.504); (ii) generated from Scratchpads and submitted as XML-tagged manuscripts (doi: 10.3897/zookeys.50.505 and doi: 10.3897/zookeys.50.506); (iii) generated from an author's database (doi: 10.3897/zookeys.50.485) and submitted as an XML-tagged manuscript. XML tagging and semantic enhancements were implemented during the editorial process of ZooKeys using the Pensoft Mark Up Tool (PMT), specially designed for this purpose. The XML schema used was TaxPub, an extension to the Document Type Definitions (DTD) of the US National Library of Medicine Journal Archiving and Interchange Tag Suite (NLM). The following innovative methods of tagging, layout, publishing and disseminating the content were tested and implemented within the ZooKeys editorial workflow: (1) highly automated, fine-grained XML tagging based on TaxPub; (2) final XML output of the paper validated against the NLM DTD for archiving in PubMed Central; (3) bibliographic metadata embedded in the PDF through XMP (Extensible Metadata Platform); (4) PDF uploaded after publication to the Biodiversity Heritage Library (BHL); (5) taxon treatments supplied through XML to Plazi; (6) semantically enhanced HTML version of the paper encompassing numerous internal and external links and linkouts, such as: (i) visualisation of main tag elements within the text (e.g., taxon names, taxon treatments, localities, etc.); (ii) internal cross-linking between paper sections, citations, references, tables, and figures; (iii) mapping of localities listed in the whole paper or within separate taxon treatments; (v) taxon names autotagged, dynamically mapped and linked through the Pensoft Taxon Profile (PTP) to large international database services and indexers such as the Global Biodiversity Information Facility (GBIF), National Center for Biotechnology Information (NCBI), Barcode of Life (BOLD), Encyclopedia of Life (EOL), ZooBank, Wikipedia, Wikispecies, Wikimedia, and others; (vi) GenBank accession numbers autotagged and linked to NCBI; (vii) external links of taxon names to references in PubMed, Google Scholar, Biodiversity Heritage Library and other sources. With the launching of the working example, ZooKeys becomes the first taxonomic journal to provide a complete XML-based editorial, publication and dissemination workflow implemented as a routine and cost-efficient practice. It is anticipated that an XML-based workflow will also soon be implemented in botany through PhytoKeys, a forthcoming partner journal of ZooKeys.
The semantic markup and enhancements are expected to greatly extend and accelerate the way taxonomic information is published, disseminated and used.

Keywords

Semantic tagging, semantic enhancements, systematics, taxonomy

Introduction

"Adapt or die" is certainly one of the best-known fundamental principles of the theory of natural selection. If we want to paraphrase this principle so that it applies to the dynamic and challenging world of academic publishing, it seems that we have to progress from the recently popular "go online or die" to the rapidly emerging "link yourself or die". Within just the past few years, several important components of the Semantic Web, such as cross-linking, semantic tagging, data publication, data sharing, data aggregation, etc., have become ordinary components in the vocabulary of biodiversity scientists. Moreover, we already have several prototypes of the "articles of the future" published in the form of exemplar papers (e.g., Pyle et al. 2008, Johnson et al. 2008, Fisher et al. 2008, Shotton et al. 2009, Miller et al. 2009, Sharkey et al. 2009).

The history of semantic enhancements to biodiversity papers is short but dynamic, starting perhaps as far back as the beginning of the present decade, exemplified by the articles of Erwin and Johnson (2000), Page (2006), Shotton (2009) and others. Perhaps the first taxonomic article to show how embedded hyperlinks may bring vital additional information to a published taxonomic text (i.e., to enhance it) is the famous "Chromis article" of Pyle et al. (2008). Shortly after its publication, use of hyperlinks to external resources, such as ZooBank (http://www.zoobank.org), Morphbank (http://www.morphbank.org), GenBank (http://www.genbank.org), and others, started to become,
if not ordinary, a relatively unremarkable feature of taxonomic papers (e.g., Miller et al. 2009, Talamas et al. 2009, Mengual and Ghorpadé 2010). The hyperlinking of text strings has often been enriched through additional enhancements, such as publication of datasets (Costello 2009, Smith 2009, Chavan and Ingwersen 2009, Miller et al. 2009, Penev et al. 2009a) and interactive keys (Sharkey et al. 2009, Penev et al. 2009b).

Hyperlinking of text strings within a paper and links to external sources are useful and widely used methods; however, they can no longer be considered a "cutting edge" feature of text processing and publishing practices. A completely new world of data mining and processing of taxonomic texts through semantic XML markup has recently been advanced by the efforts of a group of enthusiasts around Plazi (http://www.plazi.org, see also http://en.wikipedia.org/wiki/Plazi and Agosti and Egloff 2009). Plazi articulated some truly innovative concepts and tools, such as an electronic form of the "taxon treatment" concept (Sautter et al. 2007, Agosti et al. 2007) and the TaxonX and TaxPub XML schemas, designed respectively for marking up legacy literature (http://www.taxonx.org, http://sourceforge.net/projects/taxonx) and for prospective publishing (http://sourceforge.net/projects/taxpub). A special software tool, GoldenGATE, was also developed by Plazi (together with IPD Böhm at the Karlsruhe Institute of Technology, Germany) to facilitate the marking up of published taxonomic works (http://plazi.org/?q=GoldenGATE). Major efforts in this direction were also invested by the Literature Working Group of TDWG (http://wiki.tdwg.org/Literature) to elaborate the TaXMLit schema as a future TDWG standard (see also http://www.sil.si.edu/digitalcollections/bca/documentation/taxmlitv1-3intro.pdf).
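To make the idea of an electronic "taxon treatment" more concrete, the following minimal sketch builds a small TaxPub-style XML fragment with Python's standard library. It is an illustration only: the element names are meant to mirror TaxPub conventions, but the namespace URI, the helper functions `build_treatment` and `taxon_names`, the placeholder taxon "Aus bus" and the locality string are assumptions made for this example, and the output has not been validated against the TaxPub DTD.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumed namespace URI for the TaxPub extension (verify against the schema in use).
TP_NS = "http://www.plazi.org/taxpub"
ET.register_namespace("tp", TP_NS)


def tp(tag: str) -> str:
    """Return a namespace-qualified, TaxPub-style tag name."""
    return "{%s}%s" % (TP_NS, tag)


def build_treatment(genus: str, species: str, localities: List[str]) -> ET.Element:
    """Build a minimal taxon treatment element (hypothetical helper)."""
    treatment = ET.Element(tp("taxon-treatment"))

    # Nomenclature block: the taxon name split into typed parts.
    nomenclature = ET.SubElement(treatment, tp("nomenclature"))
    name = ET.SubElement(nomenclature, tp("taxon-name"))
    for part_type, text in (("genus", genus), ("species", species)):
        part = ET.SubElement(
            name, tp("taxon-name-part"), {"taxon-name-part-type": part_type}
        )
        part.text = text

    # Distribution section: one paragraph per tagged locality.
    section = ET.SubElement(
        treatment, tp("treatment-sec"), {"sec-type": "distribution"}
    )
    for locality in localities:
        paragraph = ET.SubElement(section, "p")
        tagged = ET.SubElement(
            paragraph, "named-content", {"content-type": "locality"}
        )
        tagged.text = locality
    return treatment


def taxon_names(root: ET.Element) -> List[str]:
    """Collect every tagged taxon name as a plain string."""
    names = []
    for name in root.iter(tp("taxon-name")):
        parts = [part.text or "" for part in name.iter(tp("taxon-name-part"))]
        names.append(" ".join(parts))
    return names


if __name__ == "__main__":
    example = build_treatment("Aus", "bus", ["Hypothetical locality"])
    print(ET.tostring(example, encoding="unicode"))
    print(taxon_names(example))
```

Once names and localities are tagged in this way, extracting them becomes a simple tree traversal instead of a text-mining problem, which is the practical benefit of marking up treatments at the time of publication.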
The rapid development of bioinformatics, thanks mostly to the efforts of enthusiastic groups of people and organisations, e.g., the Taxonomic Databases Working Group, or TDWG (http://www.tdwg.org), the Global Biodiversity Information Facility, or GBIF (http://www.gbif.org), GenBank (http://www.genbank.org), ZooBank (http://www.zoobank.org), Morphbank (http://www.morphbank.org), Encyclopedia of Life, or EOL (http://www.eol.org), Biodiversity Heritage Library, or BHL (http://www.biodiversitylibrary.org), as well as of the so-called "bottom-up" initiatives, such as Wikipedia (http://www.wikipedia.org), Wikispecies (http://www.species.wikimedia.org), Wikimedia (http://www.wikimedia.org) and others, has led to some "technological lagging" by the publishing industry in applying new technologies. Publishers have not adapted as quickly as bioinformatics tools have developed. Nevertheless, during the last few years, some innovative exemplar papers have started to elucidate the essence of the next generation of journal articles in taxonomy. Two of them have greatly inspired the ZooKeys team to pursue new approaches to publication and dissemination and have had a substantial impact on the current paper: the "Neglected disease" semantically enhanced exemplar paper by Shotton et al. (2009) and the "Elsevier Grand Challenge" paper by Page (2010). Our model incorporates some elements from both. Other sources of inspiration include some web-based projects and tools, particularly uBio (http://www.ubio.org) and iSpecies (http://www.ispecies.org).

The aim of the present paper is to briefly describe semantic tagging and semantic enhancement concepts and their application to publishing in biological systematics.
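As a small illustration of what a semantic enhancement can look like in practice, the autotagging and linking of GenBank accession numbers mentioned in the abstract can be sketched in a few lines of Python. This is not the Pensoft implementation: the simplified accession pattern, the NCBI URL template, the CSS class name and the function `link_accessions` are assumptions chosen for the example, and the accession in the usage line is a made-up placeholder.

```python
import html
import re

# Simplified, assumed pattern: one letter + five digits, or two letters + six digits.
# Real accession formats are more varied than this.
ACCESSION_RE = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

# Assumed link target; the paper does not specify the URL scheme that is used.
NCBI_URL = "https://www.ncbi.nlm.nih.gov/nuccore/{accession}"


def link_accessions(text: str) -> str:
    """Escape plain text for HTML and wrap accession-like strings in links."""
    escaped = html.escape(text)

    def to_link(match: "re.Match[str]") -> str:
        accession = match.group(0)
        url = NCBI_URL.format(accession=accession)
        return '<a class="genbank-accession" href="%s">%s</a>' % (url, accession)

    return ACCESSION_RE.sub(to_link, escaped)


if __name__ == "__main__":
    # "AB123456" is a placeholder, not a reference to a specific record.
    print(link_accessions("The sequence was deposited under AB123456."))
```

A pattern-based approach like this is cheap to run over every manuscript, but it will also match strings that merely look like accession numbers, so in an editorial workflow the resulting tags would presumably still need a human check.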