The use of Web Ontology Language (OWL) to Combine Extant Controlled Vocabularies in Biodiversity Informatics Appears Redundant

  • Hyam R

Abstract

Implementation of PESI requires data to be combined from multiple source databases. Some of the shared fields in the source databases used different controlled vocabularies of terms. OWL DL was investigated as a mechanism to build an extensible, shared ontology of species occurrence terms that would permit the source databases to continue using and extending their own vocabularies whilst formally mapping to a more generic shared vocabulary. The merits of this approach were explored and it was concluded that building such a complex mapping ontology probably wasn't worthwhile. The semantic complexity involved outweighed the cost of simply imposing a flat list of well-defined terms on data suppliers. The main problem with existing vocabularies appears to be the overloading of terms. A candidate list of terms was proposed.

Introduction

Descriptive biology frequently states the nature of the occurrence of a species within a specified geographic region. At its simplest, this takes the form of a present/absent/unknown statement. For example, the flora of a country may list the provinces in which each species occurs. Many publications go further and use other phrases to associate taxa with regions, such as “Naturalised” or “Extinct”. This is an effective way of presenting summary information derived from numerous individual observations.

The Pan-European Species directories Infrastructure (PESI)[1] is a European Union-funded project to unite the authoritative species name registers and nomenclators (name databases) and associated expertise networks that underpin the management of biodiversity in Europe. In the first instance PESI involves combining three key source databases, Euro+Med PlantBase (EM)[2], European Register of Marine Species (ERMS)[3] and Fauna Europaea (FE)[4], to build a single, database-driven portal to the taxonomy of European species.
In the future there will be a requirement to incorporate further species databases from different regional focal points and taxonomic expert groups. The three initial databases use different controlled vocabularies (lists of terms) for their respective occurrence status fields. The combined database, which drives the portal, therefore needs a unified list that combines these terms in a logically coherent way. For example, a user searching for “Present” should find records scored as “Naturalised” in one database and “Present” in another, because “Naturalised” is treated as a subcategory of “Present” in the combined list of terms. All three databases need to maintain their original lists of terms, and possibly expand them, because they are domain specific. Furthermore, additional databases that may contain yet more terms will be added in the future.

[1] http://www.eu-nomen.eu/pesi/
[2] http://www.emplantbase.org/home.html
[3] http://www.marbef.org/data/erms.php
[4] http://www.faunaeur.org/

Nature Precedings : doi:10.1038/npre.2010.5168.1 : Posted 2 Nov 2010. Content Complete Draft for Comment.

To solve this problem in a logically rigorous manner it was decided to use the Description Logics dialect of the Web Ontology Language (OWL) to build an Extensible Species Occurrence Ontology (ESOO). This approach was chosen for two key reasons. Firstly, the primacy of the web in the integration of biodiversity informatics data suggested the use of a technology tightly bound to it. Secondly, although the problem appears simple for the initial list of terms, the complexity was expected to grow rapidly and it was felt that the use of inference would become important. It was also anticipated that the ontology itself might be useful in carrying out inference in combination with other ontologies.
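The subcategory search described above can be sketched in a few lines of Python. This is a minimal illustration only, not the OWL DL ontology the project investigated: a plain parent map stands in for the class hierarchy, and no reasoner is involved. The only relationship taken from the text is that “Naturalised” is a subcategory of “Present”; the names `BROADER`, `is_a`, `search` and `RECORDS`, the source labels and the sample records are all hypothetical.

```python
# Hypothetical parent links: each term points at its more generic term.
# Only "Naturalised is a subcategory of Present" comes from the text.
BROADER = {
    "Naturalised": "Present",
    "Present": None,
    "Absent": None,
}


def is_a(term, ancestor):
    """Return True if `term` equals `ancestor` or sits somewhere below it."""
    while term is not None:
        if term == ancestor:
            return True
        term = BROADER.get(term)  # unknown terms end the walk
    return False


def search(records, query):
    """Return the records whose status is the query term or a subcategory of it."""
    return [r for r in records if is_a(r["status"], query)]


# Invented sample records from two imaginary source databases.
RECORDS = [
    {"source": "db_a", "taxon": "Taxon one", "status": "Naturalised"},
    {"source": "db_b", "taxon": "Taxon two", "status": "Present"},
    {"source": "db_b", "taxon": "Taxon three", "status": "Absent"},
]

if __name__ == "__main__":
    for record in search(RECORDS, "Present"):
        print(record["source"], record["taxon"], record["status"])
```

A source database could extend its own vocabulary here by adding one entry to the parent map, which is the extensibility the shared ontology was meant to provide in a formally checkable way.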
This perceived need for inference ruled out the use of the Simple Knowledge Organisation System (SKOS)[5], which is expressed in the OWL Full dialect and is therefore not guaranteed to be computable.

Building ontologies of terms is a common problem, not only within biodiversity informatics but more generally. It is hoped that a practical implementation of the approach will be informative when combining data for other controlled vocabularies, notably habitat classifications, functional types and even geographic regions.

Terms Used In Source Databases

It is important to understand the terms used in the source databases fully before designing the ontology. Furthermore, the act of designing the ontology can itself help to clarify the subject domain.

The ERMS database does not record absence data. It only tracks records of species occurrence in a region, together with an indication of whether it considers each record to be valid and how certain it is about that validity. ERMS therefore recognises four possible states of presence (Table 1).

Table 1: ERMS Occurrence States

Validity | Certainty | Interpretation
Valid | Certain | This record should be considered as good evidence of presence of the taxon in the region.
Valid | Uncertain | There is evidence of presence but some doubt over the veracity of the information.
Invalid | Certain | This record asserts that the taxon is present in the region but the reviewing expert is certain it is wrong and it should not be used as evidence of presence. (This is not evidence of absence, only a negation of a single record.)
Invalid | Uncertain | This record asserts that the taxon is present in the region. The reviewing expert considers there is sufficient evidence to show that the record is probably wrong and should not be used as evidence of presence.

The Fauna Europaea database records presence and absence summaries for an area. There are four possible states for the occurrence field for a region (Table 2).
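Returning briefly to the ERMS states in Table 1, the two flags can be modelled as a pair of enumerations. The sketch below is an assumption-laden illustration rather than the ERMS schema: the type names, the function `summarise` and the three summary labels it returns are hypothetical, chosen only to show that an invalid record withdraws evidence of presence without asserting absence.

```python
from enum import Enum
from typing import NamedTuple


class Validity(Enum):
    VALID = "Valid"
    INVALID = "Invalid"


class Certainty(Enum):
    CERTAIN = "Certain"
    UNCERTAIN = "Uncertain"


class ErmsState(NamedTuple):
    """One of the four ERMS occurrence states from Table 1."""
    validity: Validity
    certainty: Certainty


def summarise(state):
    """Map an ERMS state to a hypothetical summary label.

    Invalid records are never turned into "absent": Table 1 says they
    negate a single record and are not evidence of absence.
    """
    if state.validity is Validity.INVALID:
        return "no evidence"
    if state.certainty is Certainty.CERTAIN:
        return "present"
    return "present (doubtful)"


if __name__ == "__main__":
    for validity in Validity:
        for certainty in Certainty:
            state = ErmsState(validity, certainty)
            print(validity.value, certainty.value, "->", summarise(state))
```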
Contributors have expressed a desire for more detailed codes to cover such things as migrants.

Table 2: Fauna Europaea Occurrence States

Term | Code | Interpretation
Present | P | There is at least one well documented record of the taxon's presence in the area since 1600.
Doubtful | P? | The taxon is scored as being present in the area but there is some doubt over the evidence. The doubt may be of different kinds, including taxonomic or geographic imprecision in the records.
Absent | A | The expert does not know of the existence of any records that assert the presence of the taxon in this area.
(none) | (none) | The null condition. This record has not been scored.

[5] http://otto.w3.org/TR/skos-reference/

The Euro+Med PlantBase database has a more complex schema that was originally based on the TDWG Plant Occurrence and Status Scheme (POSS)[6] standard. The standard defines seven fields, each with a series of possible single-letter values:

Field 1: Occurrence – Present (P), Assumed Present (S), Doubtfully Present (D), Extinct (E), Recorded as present in error (F).
Field 2: Native Status – Native (N), Assumed to be Native (S), Doubtfully Native (D), Formerly Native now extinct (E), Not Native (A), Recorded as Native in Error (F), No information (-), None of the above (U), Not Applicable (X).
Field 3: Introduction Status – Introduced (I), Assumed to be introduced (S), Doubtfully introduced (D), Formerly introduced now extinct (E), Not introduced (A), Recorded as introduced in error (F), No information (-), None of the above (U), Not applicable (X).
Field 4: Introduction Agency – Introduced by humans (M), Introduced by natural means (N), No Information (-), Not applicable (X).
Field 5: Cultivated Status – Cultivated outdoors (C), Cultivated indoors (I), Assumed to be cultivated (S), Doubtfully cultivated (D), Formerly cultivated now extinct (E), Not cultivated (A), Recorded as cultivated in error (F), No information (-), None of the above (U), Not applicable (X).
Field 6: Area Distribution Completeness – Distribution complete (C), Distribution incomplete (I), Not known whether distribution complete (U), Not applicable (X).
Field 7: World Distribution Completeness – Distribution complete (C), Distribution incomplete (I), Not known whether distribution complete (U).

E+M uses four fields, each with a series of codes (Tables 3, 4, 5 and 6). Some of these codes are deprecated: they are present for historical reasons and will eventually be replaced, but they still need to be accounted for. Many are little used. In addition, E+M exports the values to a single field (Table 7) that combines the fields, omitting endemism.

Table 3: E+M Native Status Field Values

Code | Interpretation | Usage
A | Present: alien (definitely not native) | <1%
D | Present: doubtfully native (perhaps introduced only) | 1%
E | Formerly native but presumably extinct | <1%
F | Absent but reported in error | 2%
N | Present: native | 82%
Q | Presence questionable | 1%

[6] http://www.tdwg.org/standards/106/

Table 4: E+M Introduction Status Field Values

Code | Interpretation | Usage
A | Definitely not introduced (deprecated) | <1%
D | Present: doubtfully introduced (perhaps cultivated only) | <1%
E | Formerly introduced but presumably extinct | <1%
F | Absent but reported in error | <1%
I | Introduced. If feasible, more precise categories are used such as | 9%
I(A) | Adventitious (casual) | 1%
I(N) | Naturalised | 2%
I(P) | Problematic (degree of naturalisation uncertain) | 1%
Q | Presence questionable | <1%

Table 5: E+M Outdoor Cultivated Field Values

Code | Interpretation | Usage
A | Definitely not cultivated (deprecated) | <1%
C | Present: cultivated | 3%
D | Present: doubtfully cultivated (deprecated) | <1%
E | Formerly cultivated but presumably extinct (deprecated) | <1%
F | Absent but reported in error | <1%
Q | Presence questionable (deprecated) | <1%

Table 6: E+M Endemism Field Values

Code | Interpretation | Usage
C | distribution in Euro+Med area complete (endemic) | 81%
I | distribution in Euro+Med area incomplete (not endemic) | 18%
U | unknown whether d

Cite

Hyam, R., & Hyam, R. (2010). The use of Web Ontology Language (OWL) to Combine Extant Controlled Vocabularies in Biodiversity Informatics Appears Redundant. Nature Precedings. https://doi.org/10.1038/npre.2010.5168.2
