Computational representation of biological systems.

by Zach Frazier, Jason McDermott, Michal Guerquin, Ram Samudrala
Methods in Molecular Biology (Clifton, N.J.) (2009)

Abstract

Integration of large and diverse biological data sets is a daunting problem facing systems biology researchers. Exploring the complex issues of data validation, integration, and representation, we present a systematic approach for the management and analysis of large biological data sets based on data warehouses. Our system has been implemented in the Bioverse, a framework combining diverse protein information from a variety of knowledge areas such as molecular interactions, pathway localization, protein structure, and protein function.

Available from www.springerlink.com
Chapter 23
Key words: Bioverse, data integration, molecular interactions, protein structure, protein function,
data warehouse, database, bioinformatics.
1. Introduction
As high-throughput and other large data sets are generated, the
ability of researchers to organize and analyze these data will
determine the science that can be accomplished. Successful integration
of diverse data sources provides novel insight into biological
processes. For example, the combination of data sets has been used to
discover novel protein–protein interactions in the galactose
utilization pathways of yeast (1, 2). In the Bioverse, the application
described here, proteins have been annotated with functional
descriptions by combining the existing and predicted interaction
networks and the existing functional annotations (3).
Integrating biological resources poses many problems for
researchers. Resources are designed and developed with a specific
user community in mind and, with this specialization, have
developed a particular data focus, storage format, and query interface.
Jason McDermott et al. (eds.), Computational Systems Biology, vol. 541
© Humana Press, a part of Springer Science+Business Media, LLC 2009
DOI 10.1007/978-1-59745-243-4_23
Developing tools to utilize these resources demands both an
investment of time and often specific knowledge of the resource.
Objects of interest have different identifiers in different contexts,
complicating accurate integration. Independent projects collect
different information for similar data sets, and may use different
standards of measurement. The query interfaces provided for the
resource may be restrictive, not allowing for novel uses. For
example, using web sites for BLAST queries to find similar proteins is
reasonable for a handful of interesting proteins, but for a large
data set it is easier to perform the queries against a local database.
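
The identifier problem described above can be illustrated with a small cross-reference table held in a local database. This is a minimal sketch, not the Bioverse implementation: the table name `xref`, the source names `resource_a` and `resource_b`, and all identifiers are hypothetical placeholders chosen for illustration.

```python
import sqlite3


def build_xref(conn, rows):
    """Create a cross-reference table mapping (source, external_id) to one internal id."""
    conn.execute(
        "CREATE TABLE xref ("
        "source TEXT, external_id TEXT, internal_id INTEGER, "
        "PRIMARY KEY (source, external_id))"
    )
    conn.executemany("INSERT INTO xref VALUES (?, ?, ?)", rows)


def resolve(conn, source, external_id):
    """Return the internal id for an external identifier, or None if it is unknown."""
    row = conn.execute(
        "SELECT internal_id FROM xref WHERE source = ? AND external_id = ?",
        (source, external_id),
    ).fetchone()
    return row[0] if row else None


# Hypothetical data: two resources name the same protein differently.
conn = sqlite3.connect(":memory:")
build_xref(conn, [
    ("resource_a", "A0001", 1),
    ("resource_b", "geneX", 1),
    ("resource_a", "A0002", 2),
])
```

Because the mapping lives in a local table, resolving identifiers for a large data set is a bulk lookup rather than a series of requests against a remote query interface.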
The focus of many biological databases is necessarily narrow:
some cover a single organism, such as WormBase (4), while others
cover only structures (5, 6) or pathways (7). Manually
integrating the results from many data sources may be feasible
for focused questions or small studies, but is time-consuming for
large data sets. Several projects have attempted to solve this
problem by acting as intermediaries between databases; however,
since these often work through the interfaces the databases
provide, their throughput is limited. Services such as BioMoby (8),
REMORA (9), and the Bioinformatics Resource Manager (10)
successfully integrate a variety of data sources and bioinformatics
tools. These are excellent resources for small queries across many
different databases.
For larger projects, we instead integrate the entire resource.
We begin with the raw data provided by the resource maintainers,
and develop our own storage system integrated with other data
sources based on data warehousing principles.
Data warehousing is an approach to data integration and
management that is used in a variety of problem domains. In
addition to maintaining a highly flexible storage system for data,
data warehouses allow for the expression of complex relationships
and ease the construction and execution of complex queries.
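
To make the idea of expressing relationships and querying across them concrete, the following minimal sketch stores proteins, functional annotations, and interactions in one local relational store and asks which functions are annotated on a protein's interaction partners. It is an assumption-laden illustration, not the Bioverse schema: all table names, column names, and data values are hypothetical.

```python
import sqlite3

# Hypothetical layout: one table per kind of integrated data.
SCHEMA = """
CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE protein (protein_id INTEGER PRIMARY KEY,
                      organism_id INTEGER REFERENCES organism(organism_id),
                      name TEXT);
CREATE TABLE function_annotation (protein_id INTEGER REFERENCES protein(protein_id),
                                  term TEXT);
CREATE TABLE interaction (protein_a INTEGER REFERENCES protein(protein_id),
                          protein_b INTEGER REFERENCES protein(protein_id),
                          predicted INTEGER);
"""


def load_example(conn):
    """Create the tables and insert a few made-up rows."""
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organism VALUES (1, 'organism_1')")
    conn.executemany("INSERT INTO protein VALUES (?, 1, ?)",
                     [(1, "p1"), (2, "p2"), (3, "p3")])
    conn.executemany("INSERT INTO function_annotation VALUES (?, ?)",
                     [(1, "kinase"), (2, "transporter")])
    conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)",
                     [(1, 3, 0), (2, 3, 1), (1, 2, 0)])


def partner_functions(conn, protein_name):
    """Return the sorted, distinct function terms annotated on interaction partners."""
    query = """
    SELECT DISTINCT fa.term
    FROM protein p
    JOIN interaction i ON p.protein_id IN (i.protein_a, i.protein_b)
    JOIN function_annotation fa
      ON fa.protein_id = CASE WHEN i.protein_a = p.protein_id
                              THEN i.protein_b ELSE i.protein_a END
    WHERE p.name = ?
    ORDER BY fa.term
    """
    return [row[0] for row in conn.execute(query, (protein_name,))]


conn = sqlite3.connect(":memory:")
load_example(conn)
print(partner_functions(conn, "p3"))  # the unannotated protein in this toy data
```

Once the data sets share one store, a question that spans interactions and annotations becomes a single join instead of a manual merge of results from separate resources.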
The solutions developed in the Bioverse (11) integrate a wide
variety of biological data sources, allowing for exploration,
prediction, and analysis of functional, structural, and
sequence-based data.
2. Data Warehouses
Data warehouses organize data for analysis and data mining
applications. Although they are built on relational database technology,
data warehouses differ from traditional online transaction
processing (OLTP) databases: they are designed to support
online analytical processing (OLAP). OLTP systems typically
support many concurrent users inserting, deleting, and modifying
small amounts of data. OLAP systems provide management and