Fundamentals of bioinformatics and computational biology. Methods and exercises in MATLAB (Q484312)

This book is an introductory textbook, aimed at students with a computing background, for the rapidly developing, interdisciplinary field of bioinformatics. It is structured in four parts covering background biological notions, approaches for information retrieval from biological databases, methods for biological sequence analysis, and notions of phylogenetic reconstruction of evolutionary trees. It also contains two appendices giving an overview of MATLAB and BioPerl commands.

The first part, assembled as an introduction to both bioinformatics and general molecular biology, consists of four chapters. The first sketches the boundaries of bioinformatics and motivates the field through the advances resulting from the Human Genome Project and the rapid growth of GenBank, which requires a significant computational effort for its management. The second chapter is a gentle introduction to basic notions of molecular biology such as cell structure and the central dogma. The author also introduces the quantification of gene expression and describes the main role that DNA sequencing plays. In the third chapter, the author presents commonly used databases such as GenBank (a nucleotide database) and protein databases such as Swiss-Prot, PIR, GenPept and UniProt. Another type of database, focusing on patterns, is presented in the third section (PROSITE and TRANSFAC are employed as examples). Next, the concept of a genome browser is introduced, followed by a description of the Gene Ontology database. The fourth chapter is built on examples of processing biological sequences with MATLAB. The basic operations on nucleotide sequences are also presented.

The second part of the book focuses on information retrieval from biological databases and consists of five chapters covering notions from sequence homology to single and multiple alignments. The fifth chapter commences with an example of information retrieval from the Entrez environment (using MATLAB).
Next, dot plots are introduced and their interpretation is given through examples of edit distances and dynamic programming. Examples for the Needleman-Wunsch, Smith-Waterman and BLAST algorithms are also included. The sixth chapter focuses on protein alignments. Following a description of commonly used scoring matrices (such as PAM or BLOSUM), the author discusses the criteria for choosing an appropriate scoring matrix for a given problem. The seventh chapter presents approaches and algorithms for multiple sequence alignment (including the mathematical formulation of the MSA problem and the dynamic programming-based solution). Methods based on progressive alignments, on guide trees and on profiles of aligned blocks are presented in detail. The eighth chapter focuses on alignment tools, with a particular emphasis on BLAST. The seeding, extension, evaluation and interpretation of the \(p\)-value are discussed in detail. The ninth chapter introduces biolinguistic approaches and includes methods for comparing \(k\)-mer profiles and weighted profiles. Examples of processing profiles in MATLAB are also included.

The third part of the book consists of three chapters on biological sequence analysis. First, in Chapter 10, sequence models are introduced. The approaches are based on Markov chain models and on matrix association regions for selecting statistically significant motifs. Chapter 11 focuses on models for patterns in subsequences and commences with a discussion of regular expressions and weight matrices. Position-dependent Markov models and hidden Markov models (HMM) for multiple alignments are discussed at length. As an example of the applicability of these approaches, the Pfam database is presented. In the twelfth chapter, approaches for the identification of gene models are presented.
These include the neural network-based Grail, the quadratic discriminant analysis-based Mzef, Genscan based on a probabilistic model, Veil based on HMM, Morgan based on a decision tree classifier, and the approaches GeneFinder, GeneParser, GeneLang and AAT (analysis and annotation tool). The chapter also includes a comparison of gene finding algorithms with an analysis of performance of the parameters and results.

The fourth part contains four chapters on phylogenetics, systems biology and an extra chapter on microarrays. Chapter 13 is built as a gentle introduction to phylogenetic reconstruction. Alongside the terminology and a brief description of the types of trees, the author presents the problems of counting and comparing phylogenetic trees. An example in MATLAB and BioPerl is also included. In Chapter 14, the author discusses sequence similarity and linkage analysis in the context of the UPGMA algorithm. In the fifteenth chapter, parsimony is introduced as a character-based method. For finding the maximum parsimony tree, the approaches of counting substitutions for a tree, computing branch lengths, and branch and bound optimization are also presented. Weighted parsimony algorithms are also described, with protein alignments used as an example. The sixteenth chapter focuses on a probabilistic method, maximum likelihood. First, the probabilistic models of evolution are discussed, followed by an analysis of the alignment of two sequences and the likelihood for ungapped alignments. The seventeenth chapter is built as a description of microarrays (focusing on Affymetrix microarrays). The steps from raw data to the gene data matrix and the identification of genes of interest are presented in some detail. An example in MATLAB is also included, alongside an introduction to a publicly available database of experiments, the Gene Expression Omnibus (GEO).
The book also includes two appendices with a basic description of MATLAB (A) and BioPerl (B) functions and commands that facilitate the understanding of the examples presented throughout the book. By ending each chapter with a summary of key concepts, a set of exercises and a list of papers for further reading, the author invites readers to extend the information presented in the book. Although written for students (undergraduate and postgraduate) with a background in computing, the book can also be adapted for students with a background in biology or chemistry. This book is a timely addition to the recently emerged and quickly developing field of computational biology.


reviewed by

Irina Ioana Mohorianu

