TreeSnatcher: coding trees from i...
Vol. 23 no. 24 2007, pages 3384���3385 BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btm438 Phylogenetics TreeSnatcher: coding trees from images Thomas Laubach1,2,3,4 and Arndt von Haeseler1,2,3,4,* 1 Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, Dr-Bohr-Gasse 9/6, A-1030 Vienna, 2 University of Vienna, 3 Medical University of Vienna and 4 University of Veterinary Medicine Vienna, Vienna, Austria Received on July 3, 2007 revised on August 15, 2007 accepted on August 17, 2007 Advance Access publication September 24, 2007 Associate Editor: Keith Crandall ABSTRACT Summary: TreeSnatcher is a GUI-driven JAVA application for the semi-automatic recognition of multifurcating phylogenetic trees in pixel images. The program accepts an image file as input and analyzes the topology and the metrics of a tree depicted. The analysis is carried out in a multiple-stage process using algorithms from image analysis. In the end, TreeSnatcher produces a Newick tree code that represents the tree structure optionally including branch lengths. TreeSnatcher can process trees with 100 leaves or more in a few seconds. Availability: TreeSnatcher was developed in JAVA under Mac OS X and is executable on UNIX/Linux, Windows and Mac OS X systems. The application and its documentation can be freely downloaded from http://www.cibiv.at/software/treesnatcher. Contact: arndt.von.haeseler@univie.ac.at 1 INTRODUCTION Figures of phylogenetic trees are widely used in publications to illustrate the result of an evolutionary analysis. However, as one cannot effortlessly extract a machine-readable represen- tation, i.e. a Newick expression, of the phylogeny from such images, those are not suited for subsequent reanalysis or easy compilation of tree topologies. Therefore, a computer-readable representation of a published tree has either to be built completely by hand or by using special applications. The program TreeThief (Rambaut, 2000), presumably the first that meets this demand, allows the user to trace a given tree by clicking on each of its nodes in turn. However, TreeThief gets quite tedious if the number of taxa represented in the tree is large. Here, we present a conceptually advanced program, TreeSnatcher. It identifies the topology of a tree (e.g. a figure from a publication) and its branch lengths almost automatically with only minor user interaction. The application features a sophisticated graphical user interface that is based on the JAVA Swing API. 2 METHODS 2.1 Prerequisites The current version of TreeSnatcher opens tree-image files in the formats PNG, JPG/JPEG or GIF. The PDF format is currently not supported. The user has to provide an image that (a) shows a tree and (b) complies with the following requirements: The image has a light and homogeneous background that is clearly separated from a dark foreground. The tree along with species names and branch captions constitute the foreground. Lettering and branches must not overlap. The branches consist of evenly thin, plain, not necessarily straight continuous lines. The inner nodes are branching points between lines and have no circles, rectangles, etc. inscribed. The outer nodes do not feature leaves: the edges end at a well-defined location. If the image does not meet those requirements, the tree topology will not be identified correctly. An example of an admissible tree-image is displayed in Figure 1a. 2.2 Workflow Working with TreeSnatcher takes place along a mandatory succession of program stages. The program checks whether the user has performed the formal task(s) specified at any given stage and permits advancing to the next stage. On the other hand, the user controls all image manipulations and recognition tasks performed by the program. Figure 1b displays an early stage of the tree encoding production chain. The entire workflow is as follows: The program converts the specified RGB image into a grayscale image (c.f. ITU-R Recommendation BT.709-3, 1998). As color information is not needed for the tree recognition, all RGB values get expressed as shades of gray. The user trims and cuts the image at will. That is to say, the user can select sub-trees or a subset of taxa from the converted image. The user corrects the image by drawing and erasing pixels. If a line of known length is present in the image, i.e. a scale, the user can mark it and specify a unit length in a dialog box to scale the tree. The program first separates all foreground shades from the background shades in order to extract the tree from the back- ground. Then, it thins out the foreground to prepare the detection of tree nodes and edges. The user marks a location on an edge and instructs the program to color the foreground reachable from this location. *To whom correspondence should be addressed. �� 2007 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. at University of California, Davis on February 29, 2012 http://bioinformatics.oxfordjournals.org/ Downloaded from
The program carries out the detection of nodes and leaves based on the previously colored part of the image. The user rejects or adds tree nodes by drawing and erasing pixels. Then TreeSnatcher repeats recognition step 5 on demand. The program displays the inner nodes and edges of the tree. The user can provide individual edge lengths in a dialog box by clicking on the nodes that are uniquely assigned to an edge. This overrides a previous specification of a unit length for the whole tree. The user clicks on each leaf node in turn in order to type in the corresponding species name in a dialog box. The user either clicks on a tree node to designate it as the root of the tree, or clicks on a position on an edge to mark the root. The program constructs and displays the Newick expression that represents the tree structure. If the user has provided scaling information for the tree, the Newick string contains branch lengths. Figure 1c displays the Newick code for the tree in Figure 1a. A detailed description of the thinning (skeletonizing) algorithm used in step 4 is given in Zhang and Suen (1984). In steps 4 and 5, several variants on the flood-filling technique are used (Burger and Burge, 2005). 3 PERFORMANCE We carried out some authentic tree recognition tasks with TreeSnatcher in which we analyzed images of different size and resolution on a PowerPC equipped Apple iBook 800 MHz with 640 MB main memory. The JAVA SDK 1.5 Virtual Machine running on top of Mac OS X consumes a fairly high amount of heap memory and reserves a huge portion of virtual memory, leaving a comparatively small amount of physical memory for TreeSnatcher. The performance decrease had to be compensated by swapping out images to disk. The fourth and the fifth steps in the workflow are the most time consuming: The program performs binarizing and skeletonizing of a 300 400 RGB grayscale image in 2 s and of a 1400 1200 image in 8s. For the first image representing a tree with 20 species the node and edge detection is done within 2s. For the second image representing a tree with 100 species the latter is done in 12 s. The response time will be further improved by algorithmic means. 4 CONCLUSION We presented a program that automates the generation of a machine-readable representation of phylogenetic trees con- tained in images. TreeSnatcher uses established methods from image analysis to preprocess the source image and to detect the tree structure. The user supervises the semi-automatic recogni- tion process on every program stage and makes corrections to the image where necessary. This novel approach capitalizes on the cooperation between user and program and works well in practice. Furthermore, it accelerates inputting huge topologies from scanned images. This suggests that TreeSnatcher is well fitted to support the research on complex phylogenies. TreeSnatcher will be helpful if one wants to combine phylogenies based on a collection of already published trees that are not electronically available (c.f. Bininda-Emonds, 2004). ACKNOWLEDGEMENT We thank the Vienna Science and Technology Fund (WWTF) for generous financial support. Conflict of Interest: none declared. REFERENCES Bininda-Edmonds,O.R.P. (2004) Phylogenetic supertrees: combining information to reveal the tree of life. Comput. Biol. Ser., 4, 1���550. Burger,W. and Burge,M.J. (2005) Digitale Bildverarbeitung. Springer Verlag, Heidelberg, Berlin, pp. 196���200. International Telecommunications Union: ITU-R Recommendation BT.709-3 (1998) Basic Parameter Values for the HDTV Standard for the Studio and for International Programme Exchange. Rambaut,A. (2000) TreeThief: a tool for manual phylogenetic tree entry. http:// evolve.zoo.ox.ac.uk/software/TreeThief/main.html Zhang,T.Y. and Suen,C.Y. (1984) A fast parallel algorithm for thinning digital patterns. Image Process. Comput. Vis., 27, 236���239. Fig. 1. An admissible pixel image displaying a tree with arbitrary branch lengths is shown in (a), an early stage of the tree encoding production chain in (b), and the Newick code for the tree in (c). TreeSnatcher 3385 at University of California, Davis on February 29, 2012 http://bioinformatics.oxfordjournals.org/ Downloaded from