
Automated bioinformatic discovery: tools that tell you what you should know. A case-study in synthetic biology

by Ricardo Vidal
(2009)



Available from Ricardo Vidal's profile on Mendeley.
UNIVERSIDADE DO ALGARVE
Faculdade de Ciências e Tecnologia
Departamento de Ciências Biológicas e Bioengenharia





Automated bioinformatic discovery:
tools that tell you what you should know
A case-study in synthetic biology

Ricardo dos Santos Vidal


Dissertation submitted for the degree of
Master in Biological Engineering (Mestrado em Engenharia Biológica)


Dissertation supervised by:
Dra. Maria Emília Lima Costa
Dra. Reshma Shetty

2009








I hereby declare that the content and execution of all the work produced in this
dissertation is my own original work, unless otherwise disclosed.

__________________________________
Ricardo dos Santos Vidal
September 21, 2009



















“Scientists discover the world that exists;
Engineers create the world that never was”
Theodore von Kármán


Acknowledgments
This project could not have been made possible without the help of a large group
of people, from academics to friends to family.

To Tom Knight, thank you for the interesting idea of building a web application
that makes engineering biology simpler, in keeping with your view that working with
biology should be easier.

To Drew Endy and the lab members at MIT, for their support and patience in
teaching me all the wonderful things about synthetic biology. Special thanks to Reshma
Shetty, Jason Kelly and Barry Canton for their personal views and support.

To Maria Emília Costa, for agreeing to advise and support me during my dissertation
work. Thank you for your patience and time.

To my computer-savvy friends who gave me their time during this
programming journey, thank you Cláudio Gamboa, Tiago Rodrigues and Bill Flanagan.
Your help was much appreciated.

To my mother, who has always pushed me forward and supported me in all my
decisions during all these years, a special and loving thank you.

And finally, none of this would have been possible without the love and support of my
beautiful wife, Bárbara. Thank you!


Abstract
The large number of bioinformatics tools currently available for the analysis of
genetic sequences and the technical features that each tool presents make the biological
engineer's job simultaneously easier and more complex.
The emerging field of synthetic biology is seen as a new approach to engineering
biology, where foundational technologies and concepts generally applied to established
fields of engineering, such as electrical or software engineering, are implemented in
biology to enable the development of methodologies and standards to make engineering
biology a predictable, reproducible and efficient task.
Focusing on this new field of synthetic biology, the objective of this thesis was to
develop a compound bioinformatics web application to assist in the conception and
preparation of standardized biological parts, essential to the progress of this novel field.
Different assembly standards for biological parts, developed and implemented by
the community of biological engineers and researchers working in the field of synthetic
biology, were analyzed and incorporated into the web application.
The main programming language used for the development of this project was
Python, with particular use of the biological computation library provided by the
BioPython project. With the objective of producing a simple, single web
application, a selection of tools was integrated programmatically so as to generate a
comprehensive analysis report of features based on a single input - a DNA or RNA
sequence. This automated procedure made for a streamlined user interface with minimal
learning requirements or user interaction.
The results of the analysis performed by the web application are
presented in the form of a report with information regarding the input sequence. These
results include, for example, the identification of specific restriction
sites, thermodynamic stability and secondary structure.
These easily generated results provide biological engineers with information that
may allow decisions to be made regarding the eligibility of the initial input sequence to
be refined into a standard biological part. Such parts are, in turn, elementary components
in the development of synthetic biological devices or systems.
Key words:
Synthetic Biology; Biobricks; Standards; Biological engineering; Bioinformatics;
Python; Web application.
Summary
The large number of bioinformatics tools available for the analysis of genetic
sequences, and the technical characteristics of each of these tools, make the
biological engineer's work simultaneously easier and more complex.
Synthetic biology has emerged as a new approach to biological engineering, in which
rules applied in other areas of engineering, such as electronic or software engineering,
make it possible to develop methodologies and standards that render biological
engineering more predictable, reproducible and efficient.
With a special focus on the new field of synthetic biology, the objective of this work
was to develop a compound online bioinformatics application to assist in the conception
and preparation of standardized biological components, which are essential to the
progress of work in this field.
The different assembly standards created and implemented by the community of
biological engineers and researchers working in synthetic biology were analyzed for
incorporation into an online bioinformatics tool.
The main programming language used in the development of this work was Python,
drawing on the specialized code libraries for biological computation provided by the
BioPython project. Several tools were incorporated so as to create a single web
application that returns a set of information about a single input: a DNA or RNA
sequence. This automation made the tool extremely easy to use, with no special learning
required of the user.
The result of the analysis performed by this online bioinformatics tool is a report with
information about the sequence supplied by the user, such as the presence of particular
restriction enzyme sites, thermodynamic stability and secondary structure. These easily
obtained results give the biological engineer a set of data with which to decide whether
the sequence at hand is a sequence of interest, suitable for preparation as a
standardized biological component. Such a component is a key element in the
construction of new synthetic components or systems.
Key words:
Synthetic biology; Biobricks; Standards; Biological engineering; Bioinformatics;
Python; Web application.
3.1.1 PYTHON
3.1.2 BIOPYTHON
3.1.2.1 BioPython Modules
3.1.3 BLAST
3.1.4 UNAFOLD
3.1.5 EMBOSS
3.1.5.1 Backtranseq
3.1.6 DJANGO
3.1.7 JAVASCRIPT
3.1.8 GOOGLE CHARTS API
3.2 IMPLEMENTATIONS
3.2.1 Biological sequence input
3.2.2 Sequence normalization
3.2.3 Temporary sequence file
3.2.4 Nucleotide sequence analysis
3.2.5 Restriction analysis
3.2.6 Local BLAST against Parts Registry
3.2.7 Secondary Structure prediction with UNAFold
3.2.8 Bringing the application to the web
4 RESULTS AND DISCUSSION
4.1 BIOLOGICAL SEQUENCE INPUT INTERFACE
4.1.1 Biological sequence input
4.2 SEQUENCE ANALYSIS RESULTS
4.2.1 BIOLOGICAL SEQUENCE AND REVERSE COMPLEMENT
4.2.2 NUCLEOTIDE STATISTICS
4.2.3 RESTRICTION SITES AND COMPATIBILITY
4.2.4 LOCAL BLAST RESULTS
4.2.5 SECONDARY STRUCTURE
4.3 OVERVIEW OF RESULTS
5 FUTURE IMPLEMENTATIONS
6 BIBLIOGRAPHY
7 APPENDIX
7.1 BBA_F2620
7.2 RFC10
7.3 RFC21
7.4 RFC23
7.5 RFC25
7.6 PROGRAMMING CODE
7.6.1 myform.html
7.6.2 results.html
7.6.3 cleanbbfasta.py
7.6.4 forms.py
7.6.5 lblast.py
7.6.6 rfcs.py
7.6.7 seqchecker.py
7.6.8 views.py

Figures
Figure 1 - Synthetic Biology encompasses systems design and fabrication (Heinemann &
Panke, 2006)
Figure 2 - A typical component vector consists of a sequence of the following form.
Upstream flanking EcoRI and XbaI and downstream flanking SpeI and PstI restriction
sites. (Knight et al., 2003)
Figure 3 - Standard Biobricks Assembly illustrated with the assembly of a “blue” biobrick
with a “green” biobrick into a “blue-green” system biobrick (Rettberg, 2009)
Figure 4 - Number of parts added per year and available in the Registry of Standard
Biological Parts. (PartsRegistry.org, 2009)
Figure 5 - Abstraction Hierarchy in Synthetic Biology. Abstraction barriers (red) block
all exchange of information between abstraction levels. Interfaces (green) enable the
limited and principled exchange of information between levels. (Endy, 2005)
Figure 6 - Understanding natural systems through synthetic biology
Figure 7 - Overview of the web application work-flow. Colors represent the different
implementation steps: Grey: Reverse translation with EMBOSS; Orange: Nucleotide
sequence analysis; Blue: Restriction analysis; Magenta: Sequence comparison with
BLAST; Green: Sequence folding with UNAFold.
Figure 8 - Web application user interface form. Text area is for biological sequence
insertion. When sequence is protein, species selection options are required to perform
codon optimization.
Figure 9 - Information regarding the type of sequence originally submitted to the web
application and the layout of the DNA sequence and reverse complement sequence.
Figure 10 - Various nucleotide data presented in numerical form or via
dynamically generated pie charts.
Figure 11 - Restriction sites and standards compatibility
Figure 12 - Local BLAST results presented in a table with BioBrick part name linked to
the official Parts Registry page for the part, a short description of the part and the
e-value obtained from sequence comparison
Figure 13 - Estimated secondary structure for submitted sequence
Glossary
API – Application Programming Interface
BBF – BioBricks Foundation
Biobrick – Standardized biological part
BLAST – Basic Local Alignment Search Tool
DNA – Deoxyribonucleic acid
EMBOSS – European Molecular Biology Open Software Suite
iGEM – International Genetically Engineered Machine competition
MIT – Massachusetts Institute of Technology
MVC – Model-View-Controller
MTV – Model-Template-View
Parts Registry – The Registry of Standard Biological Parts
RFC – Request For Comments
RNA – Ribonucleic acid
UNAFold – Unified Nucleic Acid Folding software package
XML – Extensible Markup Language



1 Introduction

1.1 Biological Background
Engineering, in its most formal definition, provided by the Engineers' Council for
Professional Development (Anonymous 1941) in the United States of America, is
the application of “scientific principles to design or develop structures, machines,
apparatus, or manufacturing processes, or works utilizing them singly or in combination;
or to construct or operate the same with full cognizance of their design; or to forecast
their behavior under specific operating conditions; all as respects an intended function,
economics of operation and safety to life and property.”

The efficient development and replicability of such structures, machines or
processes depend heavily on core concepts such as standardization, decoupling and
abstraction, without which engineering would be technologically hindered.

The implementation of standards has enabled different fields of science and
engineering to progress at a faster and more efficient pace. One example is the
establishment of standards for screw threads by William Sellers at
the Franklin Institute (Sellers et al. 1864), which enabled rapid progress in fields such as
mechanical engineering during the American industrial revolution.

Decoupling is the separation of a complicated problem into smaller, simpler ones that
can be worked on independently and later combined into a functional whole.
It enables the separation of tasks: a building project, for example, can be worked on
independently by an architect, an engineer, a constructor, etc. Their combined expertise
comes together to produce a whole, and hopefully functional, project.


Figure 1 - Synthetic Biology encompasses systems design and fabrication (Heinemann &
Panke 2006)

The combination of foundational engineering technologies, such as de novo DNA
synthesis, advancements in computational and systems biology, standardization and
abstraction, has produced various achievements in the new field of synthetic biology.
Noteworthy applications already span areas as diverse as
the design of artificial gene networks (Sprinzak & Elowitz 2005), the refactoring of small
genomes (Chan et al. 2005), artificial mammalian oscillators (Tigges et al. 2009) and
problem solving with a bacterial computer (Baumgardner et al. 2009).

1.1.3 Biobricks, iGEM and the Parts Registry
The introduction of standardization in assembly techniques for DNA sequences
allows DNA assembly reactions to serve as characterized tools for addressing a defined
research topic, rather than becoming experiments in themselves. By replacing the current
experimental approach to genetic engineering, which is both time-consuming and ad hoc in
nature, with a set of standard, reproducible and reliable engineering mechanisms, it is
foreseen that some of the engineering challenges in biology can be overcome and
progress made at a more efficient rate.
end by SpeI and PstI restriction sites. Also, the component vector must not contain any of
the aforementioned restriction sites.
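
The requirement that a part contain none of the four assembly sites can be checked
mechanically. The following is a minimal Python sketch, not the code used in this
thesis: the helper name `find_forbidden_sites` is hypothetical, and the recognition
sequences are those shown in Figure 2. Because all four sites are palindromic, scanning
a single strand is sufficient.

```python
# Recognition sequences of the four BioBrick assembly enzymes (see Figure 2).
FORBIDDEN_SITES = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
}


def find_forbidden_sites(sequence):
    """Return {enzyme: [0-based start positions]} for every site found.

    Hypothetical helper, for illustration only.
    """
    seq = sequence.upper()
    hits = {}
    for enzyme, site in FORBIDDEN_SITES.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start)
            # Advance by one base so overlapping occurrences are not missed.
            start = seq.find(site, start + 1)
        if positions:
            hits[enzyme] = positions
    return hits


if __name__ == "__main__":
    # An EcoRI site at position 3 and an XbaI site at position 12.
    print(find_forbidden_sites("atgGAATTCccgtctagaTTT"))
```

An empty dictionary means that none of the four sites occurs in the sequence.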


5' --gca GAATTC GCGGCCGC T TCTAGA G --insert-- T ACTAGT A GCGGCCG CTGCAG gct-- 3'
3' --cgt CTTAAG CGCCGGCG A AGATCT C ---------- A TGATCA T CGCCGGC GACGTC cga-- 5'
         EcoRI  NotI       XbaI                  SpeI     NotI    PstI
Figure 2 - A typical component vector consists of a sequence of the following form.
Upstream flanking EcoRI and XbaI and downstream flanking SpeI and PstI restriction
sites. (Knight et al. 2003)
The bases between restriction sites were carefully chosen to avoid the
accidental generation of specific methylation sites, which could prevent enzyme cutting
in certain strains.
Each compliant component vector can be cut in four distinct ways, yielding four
distinct fragments. Cutting with EcoRI and SpeI yields a front insert (FI).
Cutting with XbaI and PstI yields a back insert (BI). Cutting with EcoRI and XbaI yields
a front vector (FV) and, finally, cutting with SpeI and PstI yields a back vector
(BV) (Knight et al. 2003).
Since XbaI and SpeI recognition sequences have compatible overhangs, it is
possible to ligate back inserts with back vectors to add components to the back of
existing constructs. The same applies to front inserts and front vectors for
adding components to the front of existing constructs. This ligation process results in a
mixed SpeI/XbaI site, a scar, that is not recognized by either of the restriction enzymes
and can no longer be cut.
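
The formation of the scar can be illustrated on the top strand alone. The following is
a minimal Python sketch under stated assumptions, not the code used in this thesis: the
function name `join_parts` is hypothetical, the flanking bases are those of Figure 2,
and the cut positions (T^CTAGA for XbaI, A^CTAGT for SpeI) are the usual ones for these
enzymes.

```python
XBAI = "TCTAGA"  # XbaI cuts after the first base: T^CTAGA
SPEI = "ACTAGT"  # SpeI cuts after the first base: A^CTAGT


def join_parts(front_insert, back_insert):
    """Join two inserts through a mixed SpeI/XbaI site (top strand only).

    Hypothetical helper, for illustration only.
    """
    # Upstream part with its SpeI-containing suffix, as in Figure 2.
    upstream = front_insert + "T" + SPEI
    # Downstream part with its XbaI-containing prefix, as in Figure 2.
    downstream = XBAI + "G" + back_insert

    # SpeI digestion: keep everything up to and including the A of ACTAGT.
    upstream_cut = upstream[: upstream.rindex(SPEI) + 1]
    # XbaI digestion: keep everything after the first T of TCTAGA.
    downstream_cut = downstream[downstream.index(XBAI) + 1:]

    # The compatible CTAG overhangs anneal and the fragments are ligated.
    return upstream_cut + downstream_cut


if __name__ == "__main__":
    joined = join_parts("AAAA", "CCCC")
    scar = joined[4:-4]
    print(joined, scar)
    # The mixed site matches neither recognition sequence.
    print(XBAI in scar, SPEI in scar)
```

In this sketch the two inserts end up separated by the bases TACTAGAG, which contain
the mixed site ACTAGA and neither of the original recognition sequences.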
The resulting construct is identical in form to the standard components from
which it was made. In other words, the resulting construct is flanked by the exact same
restriction sites as those flanking the initial “parent” components. This recursive behavior
makes it physically possible to “rinse-and-repeat” the process with the newly created
construct with any other parts or devices that follow the biobrick standard to form ever
more complex constructs or systems.


Figure 3- Standard Biobricks Assembly illustrated with the assembly of a “blue” biobrick
with a “green” biobrick into a “blue-green” system biobrick (Rettberg 2009a)

One immediate advantage introduced by the biobrick standard is the fact that once
a component is constructed, it can then be characterized and stored in a library or registry
for use by others that also work according to the same standards.

1.1.5 Standards, plural
There is a saying among engineers that goes something like this: “The nice
thing about standards is there are so many from which to choose”. Many fields of
engineering have several concurrent standards. One example is the set of coexisting
standards for wireless access points, and their subtypes, in electrical and
electronics engineering.
Similarly, various standards have been proposed for standard biological parts.
These standards are collected and published by the BioBricks Foundation, a
not-for-profit organization founded by engineers and scientists from MIT, Harvard, and
UCSF with significant experience in both non-profit and commercial biotechnology
research, which is dedicated to promoting and protecting the open development, sharing,
and reuse of BioBrick standard biological parts.

Taking inspiration from the Internet Engineering Task Force, which devised and
implemented the standard protocols that make the internet what it is today, the BioBricks
Foundation has implemented a Request for Comments process. Short structured
documents, each named a Request for Comments (RFC), define proposed
standards and are open to review and commentary by the
BioBricks community.
The list of RFCs put forward by the BioBricks community has been growing, with
new standards being proposed for different features and techniques relevant to research
and work within synthetic biology.

The previously described BioBrick standard for the physical composition of
biological parts, proposed by Knight, is described in RFC10 (see 7.2). It is currently the
most widely used standard. However, other RFCs offer alternative
standards for physical composition and assembly that address specific issues such as
fusion proteins.
The other main RFCs regarding physical composition and assembly are RFC23
(see 7.4) and RFC25 (see 7.5). Both are extensions of RFC10 and are therefore
compatible with parts that abide by the RFC10 standard.
The majority of parts available in the Registry are annotated with their
compatibility with each of these standards, and in many cases they are compatible with all
four.

Figure 4- Number of parts added per year and available in the Registry of Standard
Biological Parts. (PartsRegistry.org, 2009)

Abstraction levels are an important part of synthetic biology as an approach to
engineering biology. The terms part, device and system refer to standard levels
of hidden complexity at which synthetic biologists work (Figure 5).


Figure 5 - Abstraction Hierarchy in Synthetic Biology. Abstraction barriers (red) block
all exchange of information between abstraction levels. Interfaces (green) enable the
limited and principled exchange of information between levels. (Endy 2005)
Working at any one level of the abstraction hierarchy should not require one to
understand all the intricacies of the other levels. At the same time, there must exist a
means of exchanging information between levels, providing each level with
sufficient information to interact.

Parts and devices are available in the registry, allowing biological engineers to
combine them into more complex systems without having to start from
scratch. In other words, a researcher looking to produce a multi-component system does
not have to spend a large part of his or her time refining the biological parts from DNA.
By (re-)using standard-compliant, characterized components from the registry,
building multi-component biological systems proves to be faster, more efficient and,
above all, reproducible (Peccoud et al. 2008).

1.1.8 Characterization
Most parts and devices present in the registry have their sequences verified for
quality control and, in some cases, are also accompanied by qualitative and quantitative
data in the form of a data sheet. This information varies with the type of component but
tends to cover characteristics such as composition, mechanism,
function, specificity, compatibility and stability, among others.

There are various highly characterized components available in the Registry of
Standard Biological Parts. One such device is a genetically encoded receiver
labeled BBa_F2620 (Canton et al. 2008). This composite device was constructed
from five other standard biological parts. BBa_F2620 is a featured device within the
registry due to the level at which it has been characterized (see 7.1).

The characterization of BioBricks is important to enable the reuse of parts by
independent researchers across many laboratories. Although Biobricks are standardized
in a manner that allows parts to be physically assembled into multi-component systems, it

Finding the sequence of choice, performing the sequence analysis and
determining standard-compliant prefix and suffix primers are all made enormously easier
with the assistance of biological computational tools and databases.


1.2 Bioinformatics: benefits and obstacles
The goal of bioinformatics is to better understand the living cell and how it
functions at the molecular level. By analyzing raw molecular sequence and structural
data, bioinformatics research can generate new insights and provide a "global"
perspective of the cell.

Currently there exist a large number of bioinformatics tools that allow researchers
to analyze specific features of their biological data. If you are analyzing a DNA sequence,
you can retrieve such information as the GC content, repeats in your sequence, open
reading frames, among many other features. The same happens when analyzing amino
acid sequences, with a large number of tools available to retrieve different protein
specific data.
A system that combines a group of tools focused on specific features, such as those
pertaining to synthetic biology and the production of standard biological parts, would
be of great utility to biological engineers. Querying such a system
would not only retrieve the data directly related to the sequence but also return
information of interest suggested by the system in an automated way, further annotating
the engineer's findings.
Automated discovery related to a given sequence can provide biological engineers
with insight into details that would not be easily found without a thorough analysis using a
spectrum of different bioinformatics tools.

Access to biological information databases and the use of computational tools is
invaluable to anybody attempting to build new standard biological parts or devices. These
tools make it possible for biological engineers to analyze desired genetic sequences for
specific characteristics that may be of interest to their work.

In general, biological computational tools focus on a specific set of functions and
features. The choice of tools is usually related to what the researcher is looking to retrieve
or analyze. Whether it be DNA, RNA, proteins or any other biological component, there are
specialized computational tools for each and every one.
There are many great tools to help researchers obtain and analyze their biological
data. The wide range of tools available is beneficial but can also be a hindrance,
since researchers not only need to properly interpret the data they obtain from the
analysis of their biological target but also have to understand how to use each
software application.

Despite the enormous benefits that bioinformatics applications provide, there are
obstacles that researchers have to overcome while using them. These tools or applications
can be produced in a wide variety of programming languages, can be operating system
specific, require specific input file formats, produce results in proprietary file formats,
etc. The benefits, however, are far greater than the obstacles.

The interoperability of bioinformatics services has also been an issue in the past
and has made way for centralized web services that provide standardized ways to
overcome the differences between biological data-types, databases and data-formats
(Wilkinson 2002; Smedley et al. 2009; Pillai et al. 2005).



3 Material and Methods
The following sections enumerate the web application development technologies
and describe how they were implemented in the proposed
bioinformatics web application.
The first section gives insight into the programming languages, applications and
frameworks used during the development of this project. Each technology is
named and followed by a detailed description of its function and the reason
for its selection.
The second section describes how the previously mentioned
technologies were applied in developing the proposed web application. The internal
application work-flow is described sequentially, from the initial input sequence to the
finalized web page output and available files.

3.1 Development technologies
3.1.1 Python
The main programming language used to develop this project was Python.
Python is a very high-level, interpreted, object-oriented and extensible
programming language. It is free to use and has demonstrated enormous stability and
continuous development by a large and active community of contributing open source
developers (www.python.org).
Although various other programming languages have
characteristics similar to those just described, Python was selected as the main programming
language for this project for a few other reasons. Among these are its versatility
as a scripting language for web applications, its availability for all major operating
systems, its proven performance and, especially for this project, the existence of a freely
available library of tools and applications for biological computation aptly named
BioPython.

• Bio.Restriction
o Restriction enzymes play a very important role in genetic engineering.
Bio.Restriction provides the tools to perform restriction analysis on
biological sequences. This package includes facilities provided by
Rebase and contains information for over 600 restriction enzymes.
Given the importance of restriction enzymes in the assembly of standard
biological parts, this module provided numerous tools which facilitated
the restriction analysis of the input sequence.

• Bio.Alphabet.IUPAC
o Biological sequences are composed of nucleotides or amino
acids (proteins). The Bio.Alphabet.IUPAC module provides the means to
identify the constituent elements of a sequence according to the
IUPAC-defined nucleotide or protein alphabets.

• Bio.Blast.NCBIXML
o The NCBIXML module, included in BioPython under Bio.Blast, enables the XML
output of a BLAST search to be parsed. BLAST enables the comparison of
biological sequences with an appropriately formatted database, and the
results of said comparison are stored in an XML file or object, which can
then be parsed using this module.
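
As an illustration of the kind of restriction analysis described for Bio.Restriction above, the following is a minimal plain-Python sketch. It does not use Bio.Restriction itself; the function names find_sites and is_rfc10_compatible are hypothetical, and the site list is the set of non-allowed RFC10 sequences given in Appendix 7.2. All five sites are palindromic, so scanning one strand is sufficient here.

```python
# Recognition sequences that RFC10 (Appendix 7.2) does not allow inside a part.
RFC10_FORBIDDEN = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
    "NotI": "GCGGCCGC",
}


def find_sites(seq, sites=RFC10_FORBIDDEN):
    """Return {enzyme: [1-based start positions]} for every site found."""
    seq = seq.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + 1)
            # Step by one so overlapping occurrences are also reported.
            start = seq.find(site, start + 1)
        if positions:
            hits[enzyme] = positions
    return hits


def is_rfc10_compatible(seq):
    """True when none of the non-allowed RFC10 sites occurs in seq."""
    return not find_sites(seq)
```

A sequence that returns an empty dictionary from find_sites contains none of the five sites; any reported position would have to be removed before the part could comply with RFC10.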

3.1.3 BLAST
The Basic Local Alignment Search Tool, or BLAST, is an algorithm for
comparing primary biological sequence information. BLAST enables researchers to
compare a query sequence with a database or library of sequences and identify
similarities between them within a specified threshold.
Comparison is just one use of BLAST. Other uses
include the identification of species, establishing phylogeny, DNA mapping and locating
domains.

A useful indicator regarding the thermostability of a double stranded DNA
sequence is provided by the GC-content. The GC-content is the percentage of nitrogenous
bases in a DNA molecule which are either guanine (G) or cytosine (C).
The GC pair is bound by three hydrogen bonds, one more than the two that bind
the AT pair. Therefore, DNA that contains a higher percentage of GC-content is
generally considered to be more stable than DNA with low GC-content.

As already mentioned, GC-content is generally represented as a percentage and
can be calculated as:

$$\mathrm{GC\ content\ (\%)} = \frac{G + C}{A + C + G + T} \times 100$$

Equation 2

Despite the simplicity of the arithmetic, the Bio.SeqUtils module contains a
built-in gc() function which was used to obtain the desired statistic.
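
Equation 2 can also be written out directly. The sketch below is a plain-Python illustration of the arithmetic only, not the Bio.SeqUtils function used in the application; the name gc_content is hypothetical, and characters other than A, C, G and T are simply ignored (an assumption made here for simplicity).

```python
def gc_content(seq):
    """Return the GC-content of seq as a percentage (Equation 2)."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        # No A, C, G or T present: report 0 rather than divide by zero.
        return 0.0
    return 100.0 * (counts["G"] + counts["C"]) / total
```

For example, a sequence with one G, one C, one A and one T has a GC-content of 50 %.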

Melting temperature
Melting temperature (Tm) is defined as the temperature at which half of the DNA
strands are in the double-helical state and half are in the "random-coil" state. The
melting temperature depends on the length of the nucleotide sequence and on its
nucleotide composition.
The MeltingTemp submodule of Bio.SeqUtils provides a function
named Tm_staluc(), which calculates the thermodynamic melting temperature
of a nucleotide sequence according to the nearest neighbor method (SantaLucia 1998).
This method of calculating the melting temperature takes into account the
stacking energies between adjacent nucleotides along the double helix. Each combination
of two adjacent bases has an enthalpic (∆H) and an entropic (∆S) parameter. These
parameters, along with the concentrations of the strands (single C1 and complementary C2)
and the universal gas constant (R), come together in an equation which allows the
melting temperature (Tm) to be calculated as (SantaLucia 1998):
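
A commonly quoted form of this relation, written with the symbols defined above and assuming C1 ≥ C2 (an assumption, not a statement of what Tm_staluc() implements internally), is:

```latex
$$T_m = \frac{\Delta H}{\Delta S + R \,\ln\!\left(C_1 - \dfrac{C_2}{2}\right)}$$
```

Here ∆H and ∆S are the sums of the nearest-neighbor parameters over the whole sequence, with ∆H in cal/mol, ∆S in cal/(mol·K) and R ≈ 1.987 cal/(mol·K). The result is in kelvin, so 273.15 is subtracted to obtain degrees Celsius. Any salt correction is outside this expression.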
4 Results and Discussion
The following sections describe the proposed web application as a practical result
of the implemented development technologies. The sections are ordered according to their
interaction with and presentation to the user, and are discussed with regard to their function and
importance.
4.1 Biological sequence input interface

Figure 8 - Web application user interface form. Text area is for biological sequence
insertion. When sequence is protein, species selection options are required to perform
codon optimization.
4.1.1 Biological sequence input
Initial interaction with the web application comes through a simple form that
requests the introduction of a biological sequence in the form of DNA, RNA or Protein.
For each type of sequence introduced, the resulting sequence with which the web
application will perform analysis and calculations will be a DNA sequence.
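
For nucleotide input, reducing a pasted sequence to plain DNA can be sketched as follows. This is an assumption-laden illustration, not the application's own conversion routine: the name to_dna is hypothetical, it keeps every letter it finds, and it covers only DNA and RNA input (protein input requires reverse translation, which is handled separately).

```python
import re


def to_dna(raw):
    """Strip non-letter characters, upper-case, and map RNA's U to T."""
    # Position numbers, spaces and line breaks copied along with the
    # sequence are all removed by keeping letters only.
    letters = re.sub(r"[^A-Za-z]", "", raw).upper()
    return letters.replace("U", "T")
```

A DNA sequence passes through unchanged apart from the clean-up, while an RNA sequence is returned as its DNA equivalent.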
The input field allows biological sequences to be introduced with numbers, spaces,
multiple lines or any other characters that are generally copied along with the sequence of
interest from biological databases like GenBank (Benson et al. 2008). The non-

enabled links listed in the restriction sites and compatibility section of the results (see
4.2.3).

Figure 9 – Information regarding the type of sequence originally submitted to the web application and the
layout of the DNA sequence and reverse complement sequence. Non-desired restriction site marked in red.
4.2.2 Nucleotide statistics


Figure 10 - Nucleotide data presented
in numerical form or via dynamically generated pie-charts.

Simple pie-charts are generated on-the-fly using
the Google Chart API providing a visual representation
of the composition and the GC-content of the submitted
sequence.
Other data obtained through nucleotide analysis
are presented such as the sequence length, molecular
weight and melting temperature.

Data such as GC-content and melting temperature are
closely related and both influence DNA strand
thermostability (SantaLucia & Hicks 2004; SantaLucia 1998).
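
The pie-charts described above are requested through a URL whose query string carries the data. The sketch below builds such a URL for the base composition, using the parameters that appear in the results template in the appendix (chs, chd, chco, cht, chl). The function name base_composition_chart_url is hypothetical, rounding to one decimal place is an assumption, and the endpoint is reproduced from the template as-is; it may no longer be served.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the results template.
CHART_BASE = "http://chart.apis.google.com/chart"


def base_composition_chart_url(seq):
    """Build a pie-chart URL showing the percentage of A, C, G and T."""
    seq = seq.upper()
    total = len(seq)
    if total == 0:
        raise ValueError("empty sequence")
    percentages = [round(100.0 * seq.count(b) / total, 1) for b in "ACGT"]
    query = urlencode(
        {
            "chs": "150x100",                                     # chart size
            "chd": "t:" + ",".join(str(p) for p in percentages),  # data
            "chco": "336699",                                     # colour
            "cht": "p",                                           # pie chart
            "chl": "A|C|G|T",                                     # labels
        },
        safe=":,|",  # keep the separators readable in the URL
    )
    return CHART_BASE + "?" + query
```

Because the chart is described entirely by the URL, no image has to be generated or stored on the server.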

This program was developed as a proof-of-concept bioinformatics tool
demonstrating that a compound web application can provide a wide variety of data based
on limited input and interaction on behalf of the user.

The objective of this project was to develop an intuitive web application that
would require minimal user input to perform a task that would in other circumstances
require multiple tools, knowledge of how they work and in some cases programming
skills.
By focusing on a specific field of research, such as synthetic biology, it was
possible to assess the required tools and information that synthetic biologists would use
in their research. Refining natural biological parts into standardized ones is made easier
by using the proposed web application. It provides biological engineers with a one-stop
location to analyze biological sequences and obtain a set of results that not only
characterize the target sequence but also provide preliminary information regarding the
potential for said sequence to be made into a working Biobrick.
The information is provided via a single web form that runs the biological
sequence of interest through a series of Python functions and robust bioinformatics tools
such as BLAST, EMBOSS and UNAFold to provide biologically relevant information to
the synthetic biologist without any required programming or bioinformatics knowledge
whatsoever.


5 Future implementations
As with most software or web applications, there is always room for improvement and
for new functions or features.

A few future implementations that may be incorporated into the existing
web application are listed below. They may provide more information or tools for synthetic biologists
attempting to create novel standard biological parts:

• Generation of primers for each of the assembly standards provided, not only
for the BioBrick standard (RFC10).
o Since RFC23 and RFC25 assembly standards are compatible with the
widely used Biobrick standard, primer generation for these alternative
RFCs would assist many synthetic biologists working with fusion
proteins.

• Provide an estimated cost to synthesize the desired BioBrick and provide an
easy gateway to selected DNA synthesis enterprises.
o With the cost of nucleotide sequence synthesis rapidly decreasing, it is
now cost-effective to have sequences synthesized instead of constructing
them via PCR. Providing information and a gateway to the companies that
currently offer such services could prove useful for synthetic biologists.

• Implement export options to online locations such as OpenWetWare or the
Parts Registry.
o Once the results are displayed, information on the current web application
is lost upon closing of the browser window. All information is generated
on-the-fly, and files are made available for download if necessary. However,
an export option to a synthetic biology related wiki such as OpenWetWare
or directly to the Parts Registry as a potential new Biobrick would prove
to be useful.
6 Bibliography

Altschul, S.F. et al., 1990. Basic local alignment search tool. Journal of molecular
biology, 215(3), 403-10. Available at: http://www.ncbi.nlm.nih.gov/pubmed/2231712.
Anonymous, 1941. THE ENGINEERS' COUNCIL FOR PROFESSIONAL
DEVELOPMENT. Science (New York, N.Y.), 94(2446), 456. Available at:
http://www.ncbi.nlm.nih.gov/pubmed/17744800.
Baumgardner, J. et al., 2009. Solving a Hamiltonian Path Problem with a bacterial
computer. Journal of biological engineering, 3, 11. Available at:
http://www.ncbi.nlm.nih.gov/pubmed/19630940.
Benson, D.A. et al., 2008. GenBank. Nucleic acids research, 36(Database issue),
D25-30. Available at: http://www.ncbi.nlm.nih.gov/pubmed/18073190.
Canton, B., Labno, A. & Endy, D., 2008. Refinement and standardization of
synthetic biological parts and devices. Nature biotechnology, 26(7), 787-93. Available at:
http://www.nature.com/nbt/journal/v26/n7/abs/nbt1413.html.
Chan, L.Y., Kosuri, S. & Endy, D., 2005. Refactoring bacteriophage T7.
Molecular systems biology, 1, 2005.0018. Available at:
http://www.ncbi.nlm.nih.gov/pubmed/16729053.
Chandran, D., Bergmann, F.T. & Sauro, H.M., 2009. TinkerCell: Modular CAD
Tool for Synthetic Biology. DNA Sequence, 23. Available at:
http://arxiv.org/abs/0907.3976.
Cock, P.J. et al., 2009. Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics (Oxford, England), 25(11), 1422-3.
Available at: http://www.ncbi.nlm.nih.gov/pubmed/19304878.
Densmore, D. et al., 2009. A platform-based design environment for synthetic
biological systems. In TAPIA '09: The Fifth Richard Tapia Celebration of Diversity in
Computing Conference. Portland, Oregon: ACM, pp. 24-29. Available at:
http://portal.acm.org/citation.cfm?id=1565806.
Endy, D., 2005. Foundations for engineering biology. Nature, 438(7067), 449-53.
Available at: http://www.ncbi.nlm.nih.gov/pubmed/16306983.
Goler, J., 2004. Biojade: A design and simulation tool for synthetic biological
systems. DSpace. MIT, (May), 56. Available at: http://hdl.handle.net/1721.1/30475.
7 Appendix
7.1 BBa_F2620

Figure 15 - Data sheet for standard biological part BBa_F2620 (Canton et al. 2008)
7.2 RFC10
Draft Standard for Biobrick Biological Parts

Tom Knight
3 May 2007

This standard defines the required sequence properties for a
Biobrick(tm) standard biological part. It does not define any
functional characteristics of the parts, nor does it motivate any
aspect of these standards. All sequences defined herein are specified
in the 5' to 3' direction.

0. A Biobrick compatible standard biological part consists of a DNA
fragment potentially conveying informational or functional
properties to a composite structure assembled from multiple parts.
The current assembly process requires certain sequence properties
for the part and the surrounding DNA.


1. Allowed sequences within Biobrick parts include any DNA sequence
which does not contain the following subsequences:

EcoRI site: GAATTC
XbaI site: TCTAGA
SpeI site: ACTAGT
PstI site: CTGCAG
NotI site: GCGGCCGC

Additionally, there are a set of sites which, if convenient, should
also be eliminated. Parts containing these sites qualify as fully
Biobrick standard compliant, but future assembly and advanced uses
of the parts may be compromised. These include:

PvuII site: CAGCTG
XhoI site; CTCGAG
AvrII site: CCTAGG
NheI site: GCTAGC
SapI site: GCTCTTC and GAAGAGC


2. Biobrick Suffix

Each Biobrick part must contain precisely this sequence immediately
following the 3' end of the part:

T ACTAGT A GCGGCCG CTGCAG
(note: if constructing a primer, this sequence must be reverse
complemented.)


3. Biobrick Prefix:

To allow the construction of ribosomal binding sequences 5' of
coding regions, the prefix used for coding regions is distinguished


5. Strains

The registry maintains parts libraries in frozen bacterial
cultures, and encourages submissions in this form. The bacterial
strain must be a K-12 cloning strain (endA-). Under no
circumstances will the registry accept submissions in any strain
which is not BSL-1. We recommend strains such as Top10, DH10B, and
DH5a. We do not recommend submissions in MC4100, BL21 and similar
strains.


6. PCR contruction of Biobrick parts

Biobrick parts can be constructed by PCR from naturally occurring
coding regions or other long DNA sequences. The recommended primer
sequences for PCR of these fragments are:

Biobrick prefix:
GTTTCTTC GAATTC GCGGCCGC T TCTAGA G <18-24 bp of matching primer>

Coding region biobrick prefix:
GTTTCTTC GAATTC GCGGCCGC T TCTAG <18-24 bp of matching primer,
beginning with ATG>

Biobrick suffix:
GTTTCTTC CTGCAG CGGCCGC T ACTAGT A <18-24 bp of matching primer
(reverse complement)>

The reverse complement of this suffix sequence is:
3' T ACTAGT A GCGGCCG CTGCAG GAAGAAAC 5'


------------
Biobrick(tm) is a registered trademark of the Biobrick Foundation, Inc.


7.4 RFC23

        return;
    }
    var elm = document.getElementById(elmId);
    var lighted = buttonElm.getAttribute('lighted');

    if(lighted == null || lighted == 'false') {

        buttonElm.setAttribute('lighted', 'true');
        buttonElm.className = 'selected';

        var content = elm.innerHTML;

        // replace strMatch and put "\s?" between all chars
        var aStrMatch = strMatch.split('');
        var reStrMatch = aStrMatch.join('\\s?');

        var reFind = new RegExp("("+reStrMatch+")", "gm");

        elm.innerHTML = content.replace(reFind, "<span class=\""+classNameString+"\">$1</span>");
    } else {
        var aLightSpans = elm.getElementsByTagName('span');
        var spanContent = false;
        for(var i = (aLightSpans.length - 1); i >= 0; i--) {
            if(aLightSpans[i].className == classNameString) {
                spanContent = aLightSpans[i].childNodes[0];
                //alert(spanContent.textContent);

                aLightSpans[i].parentNode.replaceChild(spanContent, aLightSpans[i]);
            }
        }
        buttonElm.setAttribute('lighted', 'false');
        buttonElm.className = 'unselected';
    }

}

</script>
</head>
<body>
<div id="container">
<div id="header"><h1>Results</h1></div>
<div id="wrapper">
<div id="content">
<h3>Sequence</h3>
<p><strong>Sequence submitted:</strong> {{ params.seqtype }}</p>
<p>Sequence:</p>
<p id="seq">{{ params.fwd_codons }}</p>
<p>Reverse Complement (5'-3'):</p>
<p id="revseq">{{ params.rev_codons }}<br/><br /></p>
<div id="nucstats" style="float: left; width: 200px; text-align:center;">
<h3 style="background:#C0D5EB;" >Nucleotide Stats</h3>
<p><strong>Sequence length</strong>: {{ params.seqlen }} bp<br />
<strong>Molecular Weight:</strong> {{ params.seqmw }} g/mol</p>
<img src="http://chart.apis.google.com/chart?chs=150x100&amp;chd=t:{{ params.per_a }},{{ params.per_c }},{{ params.per_g }},{{ params.per_t }}&amp;chco=336699&amp;cht=p&amp;chl=A|C|G|T" alt="Bases (Percentage)"/>
<p><strong>A</strong>: {{ params.num_a }} bp ({{ params.per_a }}%)&nbsp;<br />
<strong>T:</strong> {{ params.num_t }} bp ({{ params.per_t }}%)&nbsp;<br />
<strong>C:</strong> {{ params.num_c }} bp ({{ params.per_c }}%)&nbsp;<br />
<strong>G:</strong> {{ params.num_g }} bp ({{ params.per_g }}%)</p>
<img src="http://chart.apis.google.com/chart?chs=150x100&amp;chd=t:{{ params.gc }},{{ params.at }}&amp;chco=336699&amp;cht=p&amp;chl=GC|AT" alt="GC/AT Content" />
<p><strong>GC content:</strong> {{ params.gc }} % &nbsp;
<strong>AT content:</strong> {{ params.at }} %</p>
<p><strong>Melting Temperature (°C)</strong>: {{ params.tm }}</p>
</div>
<div id="rsites" style="float: right; width: 500px;">
<h3>Restriction sites and compatibility</h3>
<h3>RFC compatibility</h3>
<p><strong>RFC10</strong>: {{ compat.rfc10 }} &nbsp;
<strong>RFC21</strong>: {{ compat.rfc21 }} &nbsp;
<strong>RFC23</strong>: {{ compat.rfc23 }} &nbsp;
<strong>RFC25</strong>: {{ compat.rfc25 }} </p>
<h3>RFC10</h3>
<p><strong>Fwd. primer</strong>: <code>{{ rfc10primers.fwd_primer }}</code><br />
<strong>Rev. primer</strong>: <code>{{ rfc10primers.rev_primer }}</code><br /></p>
<p><strong>Non-allowed restriction sites</strong><br />
{% for renz,cuts in cutsites.rfc10.items %}
<a href="#" class="unselected"
onclick="toggleHighLight(this, 'seq', '{{ cuts.rs_site }}',
'my_class_highlighted'); return false;">{{ renz }}</a>:
{% if not cuts.sites %}
NA
{% else %}
{% for pos in cuts.sites %}
<span class="bad">{{ pos }}, </span>
{% endfor %}
{% endif %};&nbsp;
{% endfor %}</p>

<h3>RFC21</h3>
<p><strong>Non-allowed restriction sites</strong><br />
{% for renz,cuts in cutsites.rfc21.items %}
<a href="#" class="unselected"
onclick="toggleHighLight(this, 'seq', '{{ cuts.rs_site }}',
'my_class_highlighted'); return false;">{{ renz }}</a>:
{% if not cuts.sites %}
NA
{% else %}
{% for pos in cuts.sites %}
<span class="bad">{{ pos }}, </span>
{% endfor %}
{% endif %};&nbsp;
{% endfor %}</p>

<h3>RFC23</h3>
<p><strong>Non-allowed restriction sites</strong><br />
{% for renz,cuts in cutsites.rfc23.items %}
<a href="#" class="unselected"
onclick="toggleHighLight(this, 'seq', '{{ cuts.rs_site }}',
'my_class_highlighted'); return false;">{{ renz }}</a>:
{% if not cuts.sites %}
NA
{% else %}
{% for pos in cuts.sites %}
<span class="bad">{{ pos }}, </span>
{% endfor %}
{% endif %};&nbsp;
{% endfor %}</p>

<h3>RFC25</h3>
<p><strong>Non-allowed restriction sites</strong><br />
{% for renz,cuts in cutsites.rfc25.items %}
<a href="#" class="unselected"
onclick="toggleHighLight(this, 'seq', '{{ cuts.rs_site }}',
'my_class_highlighted'); return false;">{{ renz }}</a>:
{% if not cuts.sites %}
NA
{% else %}
{% for pos in cuts.sites %}
<span class="bad">{{ pos }}, </span>
{% endfor %}
{% endif %};&nbsp;
{% endfor %}</p>
</div>
<div style="clear:both; width: 100%;">&nbsp;</div>
<h3>Local Blast: Results (against Parts Registry)</h3>
<p><table id="blasttable" width="100%">

<tr><td><strong>BioBrick</strong></td><td><strong>Description</strong></td><td>
<strong>e-value</strong></td></tr>
{% for item in params.blast_list %}
<tr><td><a href="http://partsregistry.org/Part:{{ item.biobrick }}" >{{ item.biobrick }}</a></td><td>{{ item.seqtitle }}</td><td>{{ item.e_value }}</td></tr>
{% endfor %}</table></p>
</div>
</div>
<div id="navigation">
<h3>Save for later:</h3>
<p style="text-align:justify;">The information generated on this
page is not stored permanently and will be lost after the window has been
closed.
Therefore, a few files have been generated so that you can "save
for later". Note, these files are wiped from the system at the end of each
week, so don't rely on these links to stay active. Save the files to your
computer.</p>
<p>Files available:<br /><a href="/media/seqfiles/{{ params.seqfile }}.jpg">Single Fold</a> (jpg)<br />
Full local BLAST <a href="/media/seqfiles/{{ params.seqfile }}.xml">results</a><br />
Original submitted <a href="/media/seqfiles/{{ params.seqfile }}.seq">sequence</a></p>
<h3>About</h3>
<p>This application was developed by Ricardo Vidal<br />
Source code can be found <a href="">here</a>.</p>
<p>Secondary structure created with UNAFold (Zuker <em>et al</em> 2008)</p>
<p>&nbsp;</p>
</div>
<div id="extra">
<h3>Secondary Structure</h3>
<p><img src="/media/seqfiles/{{ params.seqfile }}.jpg"
width="320px" /></p>
</div>
<div id="footer"><p>Powered by Python, Biopython, Django, Ubuntu,
UNAFold, blast2, and coffee. Lots of coffee.</p></div>
</div>
</body>
</html>

7.6.4 forms.py
from django import forms


class SeqForm(forms.Form):
    message = forms.CharField(label='Sequence', widget=forms.Textarea)
    species = forms.ChoiceField(
        choices=(('human', 'Human'), ('ecoli', 'E.coli'), ('yeast', 'Yeast')),
        widget=forms.RadioSelect)


7.6.5 lblast.py
# NCBIStandalone was part of BioPython when this was written; later
# releases may not ship it.
from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML

#my_blast_file = "/home/rvidal/NetBeansProjects/bbhelper/src/testseq.fasta"
SEQ_FOLDER = "/path/to/code/folder/bbhelper/media/seqfiles/"
my_blast_db = "/path/to/registry/cleanbb/forBLAST/clean_bbs.faa"
my_blast_exe = "/usr/bin/blastall"
E_VALUE_THRESH = 0.04


def local_blast(rfname):
    my_blast_file = SEQ_FOLDER+rfname+".seq"
    result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
                                                          "blastn",
                                                          my_blast_db,
                                                          my_blast_file)
    # Save results to xml file
    blast_results = result_handle.read()
    xmlfile = SEQ_FOLDER+rfname+'.xml'
    save_file = open(xmlfile, "w")
    save_file.write(blast_results)
    save_file.close()

    # Open the new xml file to be parsed
    result_handle = open(xmlfile)
    blast_records = NCBIXML.parse(result_handle)
    blast_record = next(blast_records)
    result_handle.close()

    results_list = []
    # Only the first three alignments are reported; an empty list means
    # there were no relevant results in the Registry.
    for alignment in blast_record.alignments[:3]:
        for hsp in alignment.hsps:
            # Skip high-scoring pairs whose e-value is out of range
            if hsp.expect < E_VALUE_THRESH:
                parts = alignment.title.split(' ')
                lb_result = {}
                lb_result['biobrick'] = parts[1]
                lb_result['seqtitle'] = ' '.join(parts[4:-1])
                lb_result['length'] = alignment.length
                lb_result['e_value'] = hsp.expect
                results_list.append(lb_result)
    return results_list


7.6.6 rfcs.py
from Bio.Restriction import EcoRI, XbaI, SpeI, PstI, NotI
from Bio.Seq import Seq

spacer = 'GTTTCTTC'


def overhang(seq):
    """ Select the first 20+ nucleotides from a DNA sequence,
    extending the chunk (up to position 30) until it ends in G or C
    """
    seq = str(Seq(seq))
    seq_chunk = seq[:20]

    for i in range(20, 30):
        base = seq[i:i+1]
        seq_chunk += base
        # Stop as soon as the chunk ends in C or G
        if base in ('C', 'G'):
            break
    return seq_chunk


def rfc10_primers(seq):
    """ Generate the forward and reverse primers according to the RFC10
    assembly standard
    http://openwetware.org/wiki/The_BioBricks_Foundation:BBFRFC10
    Coding sequence or non-coding sequence specific primers generated.
    """
    primers_rfc10 = {}
    fwd_chunk = overhang(str(seq))
    rev_chunk = overhang(str(seq.reverse_complement()))
    if str(seq[:3]) == 'ATG':
        # Coding region: shortened XbaI site (TCTAG) directly before ATG
        fwd_primer = (spacer + EcoRI.site + NotI.site + 'T' + XbaI.site[:-1]
                      + fwd_chunk)
    else:
        fwd_primer = (spacer + EcoRI.site + NotI.site + 'T' + XbaI.site + 'G'
                      + fwd_chunk)
    # The suffix primer is the same for coding and non-coding sequences
    rev_primer = (spacer + PstI.site + NotI.site[1:] + 'T' + SpeI.site + 'A'
                  + rev_chunk)
    primers_rfc10['fwd_primer'] = fwd_primer
    primers_rfc10['rev_primer'] = rev_primer
    return primers_rfc10


        seqfile.write(dna_seq)
        seqfile.close()
        self.rfname = rfname
        return rfname

    def single_fold(self):
        filename = self.makeSeqFile()
        seqfile = SEQ_FOLDER + filename
        resfile = UNAFOLD_DATA + filename
        # Run UNAFold
        command_string = ('UNAFold.pl --NA=DNA --max=1 --mode=bases --label=10 '
                          '--run-type=html ' + seqfile + '.seq')
        os.system(command_string)

        # Convert the .ps file to .jpg
        command_string = 'convert ' + resfile + '_1.ps ' + seqfile + '.jpg'
        os.system(command_string)

        # Remove the intermediate UNAFold output files
        command_string = 'rm ' + resfile + '*'
        os.system(command_string)
        return filename

    def rev_translate(self, sequence, species):
        """
        Reverse translation of amino acid sequence to DNA sequence
        using EMBOSS backtranseq (human, ecoli and yeast .cut files)
        """
        sequence = str(sequence)
        # Note: the species argument is overridden by the instance attribute.
        species = self.species

        tmp_aainput = "aainput.txt"
        tmp_aaoutput = "aaoutput.fasta"

        # Use the appropriate .cut file for EMBOSS backtranseq
        if species == "human":
            s_file = "Ehum.cut"
        elif species == "ecoli":
            s_file = "Eecoli.cut"
        elif species == "yeast":
            s_file = "Eyeast.cut"
        else:
            raise ValueError("unsupported species: %s" % species)

        tmp_inputfile_loc = SEQ_FOLDER + tmp_aainput
        tmp_outputfile_loc = SEQ_FOLDER + tmp_aaoutput

        tmp_f = open(tmp_inputfile_loc, 'w')
        tmp_f.write(sequence)
        tmp_f.close()

        # Run EMBOSS backtranseq
        emboss_cl = ('backtranseq -sequence ' + tmp_inputfile_loc + ' -cfile ' +
                     s_file + ' -outfile ' + tmp_outputfile_loc)
        os.system(emboss_cl)

        # Read the newly created file with the reverse translated sequence,
        # skipping the FASTA header line
        out_f = open(tmp_outputfile_loc, 'r')
        lines = out_f.readlines()[1:]
        out_f.close()
        return re.sub(r'[0-9><\n\t\r]+', '', ''.join(lines))

    def getParams(self):
        ''' Returns a dictionary with following data
        - GC content (percentage)
        - AT content (percentage)
        - Number of A, C, T, G (int)
        - Percentage of A, C, T, G (percentage)
        - Reverse Complement Sequence (string)
        - Sequence Molecular Weight (float)
        '''
        dna_seq = self.convToDNA()
        rfname = self.single_fold()
        fwd_codons = self.seq2codons(str(dna_seq))
        rev_codons = self.seq2codons(str(dna_seq.reverse_complement()))

        time.sleep(2)
        params = {}
        params['seqtype'] = self.seqType()
        params['gc'] = format(GC(dna_seq), '.3f')
        params['at'] = format(
            (float(dna_seq.count("A")) + float(dna_seq.count("T"))) * 100 / float(len(dna_seq)),
            ".3f")
        params['num_a'] = dna_seq.count("A")
        params['per_a'] = format(
            (float(dna_seq.count("A"))) * 100 / float(len(dna_seq)), ".2f")
        params['num_c'] = dna_seq.count("C")
        params['per_c'] = format(
            (float(dna_seq.count("C"))) * 100 / float(len(dna_seq)), ".2f")
        params['num_g'] = dna_seq.count("G")
        params['per_g'] = format(
            (float(dna_seq.count("G"))) * 100 / float(len(dna_seq)), ".2f")
        params['num_t'] = dna_seq.count("T")
        params['per_t'] = format(
            (float(dna_seq.count("T"))) * 100 / float(len(dna_seq)), ".2f")
        params['seqlen'] = len(dna_seq)
        params['seq'] = dna_seq
        params['fwd_codons'] = fwd_codons
        params['rev_codons'] = rev_codons
        params['seqcomp'] = dna_seq.complement()
        params['seqrevcomp'] = dna_seq.reverse_complement()
        params['seqmw'] = molecular_weight(dna_seq)
        params['seqfile'] = rfname
        params['rs_sites'] = rs_sites
        params['tm'] = format(Tm_staluc(str(dna_seq)), ".2f")
        return params


# Create a list of sorted dictionary keys
def sortDictKeys(adict):
    return sorted(adict)
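
The composition entries that getParams assembles (base counts, per-base percentages, GC and AT content) can be reproduced with plain string operations. The sketch below is a simplified illustration, assuming an unambiguous DNA string. The function name `base_composition` is hypothetical, and the GC value here is a plain G plus C count rather than a call to the Biopython `GC` helper used in the listing, so results may differ for sequences with ambiguity codes.

```python
def base_composition(seq):
    """Counts and percentages of A, C, G and T, plus GC and AT content.

    Percentages are returned as formatted strings, mirroring the
    'num_x' / 'per_x' / 'gc' / 'at' keys of the listing.
    """
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    report = {'seqlen': length}
    for base in 'ACGT':
        count = seq.count(base)
        report['num_' + base.lower()] = count
        report['per_' + base.lower()] = format(count * 100.0 / length, '.2f')
    report['gc'] = format((seq.count('G') + seq.count('C')) * 100.0 / length, '.3f')
    report['at'] = format((seq.count('A') + seq.count('T')) * 100.0 / length, '.3f')
    return report


report = base_composition('ATGCGCTA')
for key in sorted(report):
    print(key, report[key])
```

Computing the length once and looping over the four bases avoids repeating the same percentage expression for every key.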


7.6.8 views.py
from seqchecker import *
from forms import SeqForm
from django.shortcuts import render_to_response
from lblast import *
from rfcs import *
import re

def seqread(request):
    if request.method == 'POST':
        form = SeqForm(request.POST)
        if form.is_valid():
            msg = form.cleaned_data['message']
            species = form.cleaned_data['species']
            # Strip digits, angle brackets and whitespace, then upper-case
            msg = re.sub(r'[0-9><\n\t\r]+', '', msg.encode()).replace(' ', '').upper()
            s = SeqChecker(msg, species)
            params = s.getParams()
            rfc10primers = rfc10_primers(params['seq'])
            lblast_list = local_blast(params['seqfile'])
            params['blast_list'] = lblast_list

            rs_sites = params['rs_sites']
            cutsites = s.rfcCutSites(rfcdict)

            # Reorganize rfcCutSite dictionary to include R.Enz. sites
            all_cutsites = {}
            renz_list = {}
            renz_info = {}
            for rfcs in cutsites:
                renz_list = {}
                for renz in cutsites[rfcs]:
                    allcuts = []
                    for site in cutsites[rfcs][renz]:
                        allcuts.append(site)
                    renz_info = {}
                    renz_info['sites'] = allcuts
                    renz_info['rs_site'] = rs_sites[str(renz)]
                    renz_list[renz] = renz_info
                all_cutsites[rfcs] = renz_list

            compat = s.rfcCompatible(rfcdict)
            return render_to_response('results.html', {
                'params': params,
                'cutsites': all_cutsites,
                'compat': compat,
                'rfc10primers': rfc10primers,
            })
    else:
        form = SeqForm()

    return render_to_response('myform.html', {'form': form})

def serve(request, path, document_root, show_indexes=False):
    pass
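
The input clean-up performed in seqread removes digits, angle brackets and whitespace before the sequence is analysed. The sketch below isolates that step as a small function. `clean_sequence` follows the regular expression of the listing, while `drop_fasta_headers` is a hypothetical extra step that is not in the original code; it is included only to show that the regular expression removes the `>` character of a FASTA header but not the header text itself.

```python
import re


def clean_sequence(raw):
    """Strip digits, angle brackets and whitespace, then upper-case the result."""
    return re.sub(r'[0-9><\n\t\r ]+', '', raw).upper()


def drop_fasta_headers(raw):
    """Hypothetical pre-step (not in the listing): discard FASTA header lines."""
    return '\n'.join(line for line in raw.splitlines() if not line.startswith('>'))


numbered = "1 atgc cgta 10\n11 ggcc\r\n"
fasta = ">seq1\nATGC\nGGTA"

print(clean_sequence(numbered))
print(clean_sequence(fasta))
print(clean_sequence(drop_fasta_headers(fasta)))
```

Whether header lines should be dropped depends on what users are expected to paste into the form, so the second function is a design option rather than a correction.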
