18.418: Topics in Computational Biology

February 2, 2011	Introduction to 18.418
	Bonnie Berger, Massachusetts Institute of Technology, Professor, Applied Mathematics. An introduction to the course.

February 7, 2011	Class

February 9, 2011

Discriminating coding and non-coding RNAs using comparative sequence analysis

Stefan Washietl
Massachusetts Institute of Technology
Postdoctoral Fellow, Kellis Group, CSAIL

In my talk, I will first briefly review challenges and the current state-of-the-art for genome-wide annotation of non-coding RNAs. To accurately locate non-coding RNAs in a genome it turned out to be critical to know what parts are actually coding. Although there are many sophisticated protein gene finders and very good annotations exist for most model organisms, there are also ambiguous and non-standard situations in which these programs fail. We have therefore developed a new algorithm called "RNAcode", a program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene finding software. Our algorithm combines evolutionary information from nucleotide substitution and gap patterns in a unified framework and also deals with real-life issues such as alignment and sequencing errors. It uses an explicit statistical model with no machine learning component and can therefore be applied "out of the box", without any training, to data from all domains of life. I will demonstrate how RNAcode was used in combination with mass spectrometry experiments to predict and confirm seven novel short peptides in E. coli that have evaded annotation so far. As another example of a typical application, I will show how RNAcode can be used together with the structural RNA gene finder RNAz to study ambiguous cases of dual function genes that function on both the RNA and protein level.

Washietl_etal_RNAcode_RNApreprint.pdf

February 14, 2011	Class

February 16, 2011

The evolution of eusociality

Corina Tarnita
Harvard University
Junior Fellow, Society of Fellows

Eusociality, in which some individuals reduce their lifetime reproductive potential to raise the offspring of others, underlies the most advanced forms of social organization and the ecologically dominant role of social insects. For the past four decades, kin selection theory, based on the concept of inclusive fitness, has been the major theoretical attempt to explain the evolution of eusociality. In this talk I propose that standard natural selection theory in the context of precise models of population structure represents a simpler and superior approach, allows the evaluation of multiple competing hypotheses, and provides an exact framework for interpreting empirical observations.

The evolution of eusociality.

February 21, 2011	Class

February 23, 2011

Probabilistic Graphical Model for Protein Structure Prediction

Jinbo Xu
Toyota Technological Institute at Chicago
Assistant Professor

If we know the primary sequence of a protein, can we predict its three-dimensional structure by computational methods? This is one of the most important and difficult problems in computational molecular biology and has tremendous implications for protein functional study and drug discovery. Existing computational methods for protein structure prediction can be broadly classified into two categories: template-based modeling (i.e., protein threading/homology modeling) and template-free modeling (i.e., ab initio folding). Template-based modeling predicts the structure of a protein using experimental structures in the Protein Data Bank (PDB) as templates, while template-free modeling predicts protein structure without depending on a template. This talk will present new probabilistic graphical models for knowledge-based protein structure prediction. In particular, this talk will present a regression-tree-based Conditional Random Fields (CRF) method for template-based modeling and a Conditional Random Fields/Conditional Neural Fields (CRF/CNF) method for template-free modeling. Experimental results indicate that our template-based method performs extremely well, especially on hard template-based modeling targets, and that our template-free method is also very promising for mainly-alpha proteins.
Short Bio: Dr. Jinbo Xu is currently an assistant professor at the Toyota Technological Institute at Chicago (a computer science institute on the campus of the University of Chicago). He is also a visiting scientist at the CSAIL of the Massachusetts Institute of Technology. Dr. Xu received his PhD in Computer Science from the University of Waterloo and then spent one year as a Postdoctoral Fellow in the Department of Mathematics, MIT. Dr. Xu's primary research interest is computational biology and bioinformatics, including analysis and modeling of biological sequences, structures and networks. His RaptorX/RAPTOR programs have been ranked at the very top in several CASP (Critical Assessment of Structure Prediction) events, the most well-known competitions in the field of protein structure prediction. Dr. Xu was also invited to speak at the CASP meetings and to publish papers in the CASP special issues.

February 28, 2011	Class

March 2, 2011

Ensemble Predictions of beta-sheet Protein Structures

Jerome Waldispuhl
McGill University
Assistant Professor

In this talk, I will describe my work in the area of protein structure prediction. I will introduce new ensemble modeling techniques which can analyze and predict an entire landscape of structural solutions, rather than simple single answer optimizations. This philosophy has a broad impact on our understanding of protein folding properties.
To describe our methods, I will start by illustrating how these techniques have been applied to transmembrane beta-barrel proteins. I will introduce a new family of algorithms for investigating this family of proteins based only on sequence information, broad investigator knowledge, and a statistical-mechanical approach using the Boltzmann partition function. This provides predictions of all possible structural conformations that might arise in vivo, along with their relative likelihood of occurrence. Using a parameterizable grammatical model, these algorithms incorporate high-level information, such as membrane thickness, with an energy function based on stacked amino-acid pair statistical potentials to predict ensemble properties, such as the likelihood of two residues pairing in a beta-sheet, or the per-residue X-ray crystal structure B-value.
In the second part of this talk, I will show how to generalize our methods for modeling ensembles of generic beta-sheet structures. From this ability to compute a realistic representation of the conformational landscape, we build a coarse-grained model of the energy landscape which is used to simulate folding processes. We illustrate our methods for dynamics prediction by applying them to the folding pathway of the well-studied Protein G. With relatively little computation time, we show that our program tFolder is able to reveal critical features of the folding pathways which were previously observed only through time-consuming molecular dynamics simulations and experimental studies.
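
As a minimal illustration of the Boltzmann-weighted ensemble idea mentioned in this abstract (not the speaker's algorithm), the sketch below converts a few conformation energies into relative likelihoods of occurrence via the partition function Z = sum over conformations of exp(-E/RT). The conformation labels, the energies, and the function name are hypothetical; RT is taken as roughly 0.593 kcal/mol, its value near 298 K.

```python
import math

def boltzmann_probabilities(energies, rt=0.593):
    """Return the Boltzmann probability of each conformation.

    energies: dict mapping a conformation label to an energy in kcal/mol.
    rt: gas constant times temperature (about 0.593 kcal/mol near 298 K).
    """
    # Shift by the minimum energy so the exponentials stay well-scaled;
    # the shift cancels when dividing by the partition function.
    e_min = min(energies.values())
    weights = {name: math.exp(-(e - e_min) / rt) for name, e in energies.items()}
    z = sum(weights.values())  # partition function (up to the constant shift)
    return {name: w / z for name, w in weights.items()}

# Hypothetical energies for three candidate conformations (assumed values).
toy_energies = {"conf_A": -12.0, "conf_B": -11.0, "conf_C": -8.5}
probs = boltzmann_probabilities(toy_energies)
for name, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {p:.3f}")
```

Lower-energy conformations receive higher probability, and the probabilities sum to one over the whole ensemble rather than singling out one optimum.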

March 9, 2011	Evolutionary dynamics of cancer
	Franziska Michor, Dana-Farber Cancer Institute and Harvard School of Public Health, Associate Professor

March 16, 2011

Information from Networks

Leonid Chindelevitch
Pfizer

The networks describing the interaction between different biological entities can yield a lot of interesting information if analyzed properly. This talk will describe the analysis of two kinds of networks: metabolic networks and causal regulatory networks. We will construct mathematical models to ask questions of each kind of network, describe the algorithms required to provide answers, and finally discuss the kind of biological insights that arise from this analysis.

March 23, 2011	Spring Break

March 30, 2011	How do cells pack their DNA, and why do we care about it
	Leonid Mirny

April 6, 2011

Evidence of abundant stop codon readthrough in Drosophila and other metazoa

Irwin Jungreis
Massachusetts Institute of Technology
Research Scientist, Kellis Lab

Abstract: When encountering the stop codons of certain genes, ribosomes will insert a standard amino acid and continue translating, instead of stopping. While such stop codon readthrough occurs in many viral genomes, it has been observed for only a handful of eukaryotic genes. In 2007, Mike Lin found comparative genomics evidence that for 149 Drosophila genes the open reading frame following the stop codon is protein-coding, hinting that stop codon readthrough might be common in Drosophila. We have applied a wealth of bioinformatics techniques and genome-wide data sets to:

Obtain further evidence of translation downstream of these stop codons.
Rule out explanations other than readthrough.
Find clues about the mechanism of readthrough.
Find readthrough in other species and determine the phylogenetic extent of abundant readthrough.
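
One basic quantity in this kind of analysis is the stretch of open reading frame that continues past an annotated stop codon up to the next in-frame stop. The sketch below is a toy illustration of that idea only, not the speaker's pipeline; the function name and the example sequence are hypothetical, and the input is assumed to be a DNA string on the coding strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def readthrough_region(seq, stop_index):
    """Return the in-frame sequence between two consecutive stop codons.

    seq: DNA string on the coding strand.
    stop_index: position of the first base of the annotated stop codon.
    Returns None if stop_index is not a stop codon or if no second
    in-frame stop codon is found downstream.
    """
    seq = seq.upper()
    if seq[stop_index:stop_index + 3] not in STOP_CODONS:
        return None
    start = stop_index + 3
    # Walk codon by codon in the frame defined by the annotated stop.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[start:i]
    return None

# Hypothetical sequence: ATG GCC TGA GGC AAA TTT TAA GG
toy = "ATGGCCTGAGGCAAATTTTAAGG"
print(readthrough_region(toy, 6))  # GGCAAATTT
```

A region found this way is only a candidate; whether it is actually translated is what the evidence listed above is meant to establish.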

Short Bio: Ph.D. in Mathematics, Harvard, 1988. Computer Aided Design software developer and executive 1987-2004. Founder and VP of Engineering of Revit Technology Corporation, maker of the leading software for architects 1998-2004. Research Scientist in Computational Biology, Kellis Lab, MIT 2009-.

April 13, 2011	Techniques for the analysis of ancient DNA
	Nick Patterson, Broad Institute. Two papers published last year described the analysis of DNA of Neandertals found in Vindija Cave, Croatia and DNA of a hominin from Denisova Cave, Siberia. I briefly describe the main results, but then go into more detail on the analysis, which uses some novel methodology.

April 20, 2011

Dimensionality reduction in the analysis of human genetics data

Petros Drineas
Rensselaer Polytechnic Institute

Dimensionality reduction algorithms (either deterministic or randomized) have been widely used for data analysis in numerous application domains, including the study of human genetics. For instance, linear dimensionality reduction techniques (such as Principal Components Analysis) have been extensively applied in population genetics. In this talk we will discuss such applications and their implications for human genetics, as well as the potential of applying non-linear or supervised dimensionality reduction techniques in this area.
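
To make the Principal Components Analysis step concrete, here is a minimal sketch that projects a genotype matrix onto its top principal components using a singular value decomposition. It is an illustration under stated assumptions, not the speaker's method: the data are simulated, the two populations and their allele frequencies are hypothetical, and the function name is invented for this example.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Project samples onto the top principal components.

    genotypes: array of shape (n_samples, n_snps) with allele counts 0, 1 or 2.
    Returns an array of shape (n_samples, n_components) of PC scores.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)  # center each SNP column
    # SVD of the centered matrix: columns of u scaled by s are the PC scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulated data (assumption): two populations of 20 samples each, 200 SNPs,
# with allele frequencies that differ slightly between the populations.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, 200))
pop_b = rng.binomial(2, freq_b, size=(20, 200))

pcs = genotype_pca(np.vstack([pop_a, pop_b]))
print(pcs.shape)  # (40, 2)
```

Plotting the first column of `pcs` against the second is the usual way to look for population structure in such data.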

April 27, 2011

Modeling Intrinsically Disordered Proteins

Collin Stultz
MIT

A number of neurodegenerative disorders such as Alzheimer’s disease and Parkinson’s disease involve the formation of protein aggregates. The primary constituent of these aggregates belongs to a unique class of heteropolymers called intrinsically disordered proteins (IDPs). While many proteins fold to a unique conformation that is determined by their amino acid sequence, IDPs do not adopt a single well-defined conformation in solution. Instead they populate a heterogeneous set of conformers under physiological conditions. Nevertheless, despite this intrinsic propensity for disorder, a number of these proteins can form ordered aggregates both in vitro and in vivo. As the formation of these aggregates may play an important role in disease pathogenesis, a detailed structural characterization of these proteins and their mechanism of aggregation is of critical importance. One problematic issue is that the characterization of intrinsically disordered proteins is quite challenging because accurate models of these systems require a description of both their thermally accessible conformers and the associated relative stabilities or weights. These structures and weights are typically chosen such that calculated ensemble averages agree with some set of prespecified experimental measurements; however, the large number of degrees of freedom in these systems typically leads to multiple conformational ensembles that are degenerate with respect to any given set of experimental observables. In this talk I will discuss a method for modeling these systems that is based on Bayesian statistics. A unique and powerful feature of the approach is that it provides a built-in error measure that allows one to assess the accuracy of the resulting ensemble. We apply the method to the intrinsically disordered proteins, tau protein and alpha synuclein, which have been implicated in the pathogenesis of Alzheimer’s disease and Parkinson’s disease, respectively. 
The models reveal specific patterns of long-range contacts that may play a role in the aggregation process.
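
The degeneracy problem described in this abstract can be seen in a toy calculation: two different sets of conformer weights can reproduce the same ensemble-averaged observable exactly. The sketch below is only an illustration of that point, not the Bayesian method of the talk; the observable values and both weight sets are hypothetical.

```python
def ensemble_average(values, weights):
    """Weighted average of a per-conformer observable."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Hypothetical observable computed for three conformers (assumed values).
observable = [2.0, 4.0, 6.0]

# Two different weightings of the same three conformers.
weights_1 = [0.25, 0.50, 0.25]
weights_2 = [0.50, 0.00, 0.50]

avg_1 = ensemble_average(observable, weights_1)
avg_2 = ensemble_average(observable, weights_2)
print(avg_1, avg_2)  # 4.0 4.0
```

Both ensembles agree with a measured average of 4.0, so the measurement alone cannot distinguish them; this is why an error measure on the inferred weights is useful.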

May 4, 2011

Liability threshold modeling increases power in case-control association studies

Alkes Price
Harvard University

Genetic case-control association studies often include data on covariates, such as body mass index (BMI) or age, that may modify the underlying genetic risk of case or control samples. For example, in type 2 diabetes, odds ratios estimated from low-BMI cases are larger than those estimated from high-BMI cases. An unanswered question is how to optimally use this information to maximize statistical power. In this study we show via simulation that our approach to fitting liability threshold models and computing association statistics, which accounts for disease prevalence and non-random ascertainment, can use this information to increase power. Our method outperforms standard case-control association tests, case-control tests with covariates, tests of gene x covariate interaction, and tests that restrict to a subset of samples. We investigate empirical case-control studies of type 2 diabetes, prostate cancer, breast cancer, rheumatoid arthritis, age-related macular degeneration, and end-stage kidney disease over a total of 78,256 samples. In these data sets, liability threshold modeling outperforms logistic regression for 104 of the 140 known associated variants investigated (p-value < 10^-9). The improvement varied across diseases with a 17% median increase in test statistics, corresponding to a greater than 25% increase in power. Application of liability threshold modeling to future case-control association studies of these diseases, or other diseases with analogous effects of covariates on genetic risk, will yield a substantial increase in power for disease gene discovery.
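
For readers unfamiliar with the liability threshold model, the sketch below shows its basic form: disease occurs when an unobserved, normally distributed liability exceeds a threshold fixed by the disease prevalence, and a covariate that shifts the mean liability changes the risk. This is a textbook-style illustration, not the speaker's fitting procedure; the 8% prevalence and the covariate shifts are assumed values, and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def liability_threshold(prevalence):
    """Threshold t such that P(liability > t) equals the prevalence
    when liability follows a standard normal distribution."""
    return STD_NORMAL.inv_cdf(1.0 - prevalence)

def disease_risk(mean_shift, threshold):
    """P(liability > threshold) when liability ~ Normal(mean_shift, 1)."""
    return 1.0 - STD_NORMAL.cdf(threshold - mean_shift)

# Assumed population prevalence of 8%.
t = liability_threshold(0.08)

# Hypothetical covariate effects on mean liability (for example, low vs. high BMI).
for shift in (-0.5, 0.0, 0.5):
    print(f"mean shift {shift:+.1f}: risk = {disease_risk(shift, t):.3f}")
```

With no shift, the risk equals the assumed prevalence; a positive shift raises it and a negative shift lowers it, which is how covariate information enters the model.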

May 11, 2011	Mona Singh