SPP 1530: Flowering Time Control - from Natural Variation to Crop Improvement

Gunnar Rätsch

Computational Methods for Accurate Transcriptome Reconstruction

With recent advances in sequencing technologies, reconstructing in silico copies of the transcriptomes of single cells, specific tissues or whole organisms has come within reach. However, accurate reconstruction from relatively short sequence fragments is computationally very challenging in the presence of sequencing biases, inaccurate and repetitive genome sequences, incomplete annotations and many alternative transcripts. We show that mapping RNA-seq reads to a genome, which is often considered a solved problem, is crucial for many subsequent analyses, including transcript identification, quantitation and differential expression analysis. We give examples where the accuracy of transcript identification varies drastically depending on the read mapping strategy used. We discuss ideas that we implemented in the tools of the PALMapper suite, including alignment filtering, multi-mapper resolution strategies and efficient alignment in the presence of known and unknown variations of the genome sequence. We show that a combination of these mapping strategies can significantly improve transcript identification and quantitation.

In the second part, we will present a novel algorithm for the simultaneous identification and quantification of transcripts from RNA-seq data. The approach, which we call SplAdder, is based on mixed integer programming. SplAdder identifies the optimal combination of at most N transcripts that jointly explain the observed read coverage. It first generates a possibly dense transcript graph containing all possible exonic segments and intron junctions. It then quantifies possible transcripts encoded in the graph while enforcing parsimony. In contrast to rQuant, which needs a predefined transcript set as input, SplAdder does not need to generate the many transcripts typically encoded by the transcript graph, but works directly with a compact representation of the relevant transcripts. This approach also differs from those used in Cufflinks (smallest number of transcripts) and Scripture (all transcripts), because it automatically determines the set of relevant transcripts that best fits the observed read coverage. Finally, we show that simultaneous transcript identification and quantitation using SplAdder, based on very accurate alignments produced by PALMapper, can significantly increase the accuracy of transcript prediction.
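The idea of choosing at most N transcripts from a transcript graph so that their summed abundances explain the observed coverage can be illustrated with a toy sketch. This is not the SplAdder implementation: it swaps the mixed integer program for a brute-force search over small subsets, it enumerates all paths explicitly (which the abstract says the real method avoids), and it assumes one coverage value per exonic segment. All function names, the `penalty` parameter and the example numbers are hypothetical.

```python
from itertools import combinations


def enumerate_paths(edges, source, sink):
    """Return all source-to-sink paths in a DAG given as {node: [successors]}."""
    paths, stack = [], [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for nxt in edges.get(node, []):
            stack.append((nxt, path + [nxt]))
    return paths


def fit_abundances(columns, coverage, sweeps=500):
    """Non-negative least squares by coordinate descent.

    columns[j][s] is 1.0 if transcript j contains segment s, else 0.0.
    Returns (abundances, squared residual).
    """
    x = [0.0] * len(columns)
    residual = list(coverage)  # residual = coverage - sum_j x[j] * columns[j]
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            norm = sum(c * c for c in col)
            if norm == 0:
                continue
            step = sum(c * r for c, r in zip(col, residual)) / norm
            new = max(0.0, x[j] + step)  # keep abundances non-negative
            delta = new - x[j]
            if delta:
                residual = [r - delta * c for r, c in zip(residual, col)]
                x[j] = new
    return x, sum(r * r for r in residual)


def select_transcripts(paths, n_segments, coverage, max_transcripts, penalty):
    """Exhaustively pick at most max_transcripts paths minimising
    squared residual + penalty * (number of transcripts used)."""
    best_cost, best = None, {}
    for k in range(1, max_transcripts + 1):
        for subset in combinations(paths, k):
            columns = [[1.0 if s in path else 0.0 for s in range(n_segments)]
                       for path in subset]
            x, loss = fit_abundances(columns, coverage)
            cost = loss + penalty * k  # parsimony term
            if best_cost is None or cost < best_cost - 1e-9:
                best_cost = cost
                best = {tuple(p): a for p, a in zip(subset, x)}
    return best


# Toy graph: segment 1 is a cassette exon that can be skipped (edge 0 -> 2).
edges = {0: [1, 2], 1: [2]}
paths = enumerate_paths(edges, source=0, sink=2)
coverage = [30.0, 10.0, 30.0]  # hypothetical mean coverage per segment
chosen = select_transcripts(paths, 3, coverage, max_transcripts=2, penalty=1.0)
for path, abundance in sorted(chosen.items()):
    print(path, round(abundance, 3))
```

Exhaustive search is only feasible for very small graphs; the number of candidate subsets grows combinatorially with the number of paths, which is the motivation for formulating the selection as an optimization problem instead.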

About the Speaker

Dr. Gunnar Rätsch studied computer science and physics and obtained his Ph.D. degree in computer science in 2001 for his work in machine learning at the Fraunhofer Institute FIRST in Berlin. He was a postdoctoral fellow at the Research School of Information Sciences and Engineering of the Australian National University in Canberra (Australia), at the Max Planck Institute for Biological Cybernetics in Tübingen (Germany), and at Fraunhofer FIRST in Berlin (Germany). In 2002, he received the Michelson award for his Ph.D. work, and in 2007 he was awarded the Olympus prize from the German Association for Pattern Recognition. Between 2005 and 2011, he led a research group at the Friedrich Miescher Laboratory of the Max Planck Society in Tübingen (Germany). In January 2012, he and his group moved to the Memorial Sloan-Kettering Cancer Center in New York City (USA). In their research, they analyze and model transcription and RNA processing, as well as their regulation, and contribute to the development of techniques ranging from machine learning, sequence analysis, and optimization to genetics, statistical testing, and image analysis. Since 2008, he and his group have been developing algorithms for RNA-seq data analysis, including rQuant, PALMapper, rDiff, SplAdder and mGene.ngs.


