This week’s BEACON Researchers at Work blog post is by University of Texas at Austin postdoc Daniel Deatherage.
My doctoral work focused on epigenetic changes in ovarian cancer in the lab of Dr. Tim Huang at The Ohio State University. A common theme of the journal clubs I attended was the rise of next-generation sequencing technologies, the enormous amounts of data they produced, and the questions they could answer that no other method could address. One thing that always struck me was how small variations in constructing sequencing libraries could have profound effects on the data a sequencing run would produce. Because of this, one of my primary interests became designing sequencing libraries capable of answering questions that are out of reach even for “standard” next-generation sequencing. Now, in my postdoctoral work with Dr. Jeffrey Barrick at The University of Texas at Austin, I have been using modified Illumina sequencing adapters to monitor how mutations spread within evolving populations of E. coli while they are still very rare (<0.01%), rather than having to wait until they reach 1% of the total population, as standard Illumina sequencing requires. Hopefully, by the time you finish reading this, you will have a new appreciation for just how powerful next-generation sequencing can be, even if you already use it in your own research.
Identifying mutations in next-generation sequencing data can be very difficult regardless of the analysis method you choose. This is compounded when looking at mixed populations where both mutant and wild-type sequences exist. If I were going to use this blog post as nothing but a bit of shameless self-promotion, the rest of it would likely talk about all the benefits of the breseq computational program that our lab has developed. I could go on and on about how well it automates the identification of mutations, particularly in organisms with good reference sequences and genomes smaller than 20 megabases; about how it’s being actively developed as a tool for the entire community and is already used by many BEACON and non-BEACON researchers alike; and about how it is freely available for both Linux and Mac operating systems, with tutorials explaining how to use it. At the end of this hypothetical self-promotion I would put in a hyperlink to the breseq page, where you can find installation instructions and tutorials, and say something about how I hope anyone using next-generation sequencing in their research considers using breseq and contacts our lab if they run into difficulties.
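For readers who want to try it anyway, here is a minimal sketch of driving breseq from Python. The file names are hypothetical; the -r flag supplies the reference genome, and -p turns on breseq’s polymorphism mode, which is what lets it report mutations present in only a fraction of a mixed population.

```python
import subprocess

# Hypothetical file names; substitute your own reference and reads.
reference = "REL606.gbk"               # GenBank reference sequence
reads = ["mixed_population.fastq.gz"]  # Illumina reads from a mixed population

# -p enables polymorphism prediction so mutations present in only part of
# the population are reported with estimated frequencies; -r supplies the
# reference. Results, including an HTML summary of predicted mutations,
# are written to an ./output directory by default.
subprocess.run(["breseq", "-p", "-r", reference, *reads], check=True)
```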
Instead, I’d like to talk about a key limitation of all standard next-generation sequencing experiments and how I am overcoming it in my research. Next-generation sequencing error rates can quite often be ignored entirely, particularly when sequencing a sample expected to contain only a single genotype (such as a bacterial culture grown up from a single colony), because the predominantly random distribution of sequencing errors is unlikely to yield false-positive mutations. When you begin sequencing mixed populations, however, the error rate sets a floor on the minimum mutation frequency you can confidently call, no matter how much sequencing you perform. In the case of standard Illumina sequencing, although looking at different subsets of the data can lower the error rate somewhat, the overall error rate is commonly reported to be ~1% [1]. Computer simulations show that only a small fraction of the total mutations in an evolving bacterial population ever rise above a frequency of 1%, meaning that a study which does not compensate for the sequencing error rate sees only the tip of the iceberg and ignores several orders of magnitude more mutations (see figure at left). As mentioned previously, because errors are randomly distributed among reads, methods such as duplex sequencing [2], which incorporate random nucleotides into the adapters as the first bases sequenced, allow reads derived from the same original fragment of DNA to be grouped together by this “molecular index,” so that a more accurate consensus sequence (with an error rate of <0.01%) can be generated for each read group. This means that mutations can be reliably detected while they are at least 100 times rarer, i.e., present in only a single cell out of more than 10,000.
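To make the molecular-index idea concrete, here is a minimal Python sketch of consensus calling within read families. It is illustrative only: the function name and data are hypothetical, sequences within a family are assumed to be pre-aligned and of equal length, and real duplex sequencing additionally pairs the two strands of each fragment and weighs base qualities.

```python
from collections import Counter, defaultdict

def consensus_by_molecular_index(reads, min_family_size=3):
    """Collapse reads sharing a molecular index into consensus sequences.

    `reads` is an iterable of (molecular_index, sequence) pairs; all
    sequences in a family are assumed aligned and of equal length.
    Simplified sketch: real duplex sequencing also pairs the two strands
    of each fragment and uses base qualities.
    """
    families = defaultdict(list)
    for index, seq in reads:
        families[index].append(seq)

    consensuses = {}
    for index, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few reads to reliably outvote a sequencing error
        consensus = []
        for column in zip(*seqs):  # walk the family position by position
            base, count = Counter(column).most_common(1)[0]
            # Require a clear majority so a random error cannot create a
            # false consensus base.
            consensus.append(base if count / len(seqs) > 0.5 else "N")
        consensuses[index] = "".join(consensus)
    return consensuses

# Example: three reads share index "ACGT"; the lone G->T error is outvoted.
reads = [("ACGT", "ACGGA"), ("ACGT", "ACGGA"), ("ACGT", "ACTGA")]
print(consensus_by_molecular_index(reads))  # {'ACGT': 'ACGGA'}
```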
Because this method of error reduction requires multiple reads from each fragment of DNA, an E. coli strain such as the thoroughly studied REL606, with its ~4.6-megabase genome, could easily require more than 1 billion Illumina reads to give 10,000-fold consensus coverage of the entire genome. While it is certainly possible to generate such a quantity of reads, it is not necessarily the wisest investment of money, particularly when so much is already known about the organism. The decades of research performed by Richard Lenski and colleagues on REL606 and its evolved descendants in the E. coli long-term evolution experiment (LTEE) have amassed a list of genes that can be mutated to provide a selective advantage. Using some of this knowledge, I designed IDT xGen biotinylated probes against several genes I expected to acquire beneficial mutations within 500 generations. These probes were hybridized with Illumina libraries containing molecular indexes, and the targeted genes were enriched with streptavidin beads. On average, ~70–80% of reads then mapped to the 8 genes of interest, which correspond to just ~0.7% of the genome, making it highly economical to deeply sequence numerous mixed populations.
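The read budget behind those numbers is simple arithmetic. Here is a back-of-envelope sketch, assuming an illustrative 100-base read length and ~3 reads per molecular-index family for consensus calling (both numbers are assumptions, not values from the study):

```python
# Back-of-envelope read budget for consensus (molecular-index) sequencing.
genome_size = 4.6e6      # REL606 genome, bases
target_fraction = 0.007  # 8 genes of interest, ~0.7% of the genome
depth = 10_000           # desired consensus coverage
read_length = 100        # assumed Illumina read length
reads_per_family = 3     # assumed reads needed per consensus sequence

whole_genome_reads = genome_size * depth * reads_per_family / read_length
targeted_reads = whole_genome_reads * target_fraction

print(f"whole genome: {whole_genome_reads:.2e} reads")  # ~1.4e9
print(f"targeted:     {targeted_reads:.2e} reads")      # ~9.7e6
```

Under these assumptions, targeting ~0.7% of the genome cuts the requirement from over a billion reads to under ten million, which is why the enrichment step makes deep sequencing of many populations affordable.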
Despite the enormous power of the “frozen fossil record” of the LTEE performed by Richard Lenski, populations have only been frozen every 500 generations, which is a very long interval when looking at rare mutations. To overcome this, I allowed six replicate populations of REL606 to evolve under nearly identical conditions to the LTEE for 500 generations, but froze each day’s culture, eventually taking up more than half of a large –80°C chest freezer, much to the annoyance of other lab members. Sequencing libraries were generated at ~13- to 25-generation increments over the course of the experiment for each of the populations and analyzed with breseq, revealing unprecedented insights into the beneficial mutational landscapes of individual genes and epistatic interactions. These results should be published soon, but to underscore just how powerful this approach has proven, I’ll share two highlights. First, more than 150 beneficial mutations were identified in just 3 genes, significantly more than have previously been reported for these genes. Second, the fitness effect of all 150+ mutations has been determined from sequencing data alone, and it agrees with conventional fitness assays in which clones verified to carry individual mutations were competed against the ancestor. These findings would not have been possible had we restricted our analysis to mutations that reach 1% frequency, because clonal interference quickly becomes the key force acting on the population: the majority of mutations are outcompeted, often by a single clone harboring multiple mutations, before they can reach 1% of the total population. As sequencing costs continue to fall, this type of analysis should make it possible to map the entire single-step beneficial mutational landscape available to an organism in a single experiment.
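As an illustration of how a fitness effect can be read out of a frequency trajectory, the sketch below fits the standard log-odds relation: while a beneficial lineage is rare and interference is negligible, ln(f/(1−f)) grows roughly linearly with time, with a slope equal to the selection coefficient per generation. The function and trajectory data are hypothetical, and the actual analysis in the paper may differ.

```python
import numpy as np

def selection_coefficient(generations, frequencies):
    """Estimate a mutation's selection coefficient from its trajectory.

    While a lineage grows against an otherwise uniform background, the
    log-odds of its frequency, ln(f / (1 - f)), increases approximately
    linearly with time; the slope is the selection coefficient per
    generation. Clonal interference bends real trajectories, so only the
    early, rare phase should be fit.
    """
    freqs = np.asarray(frequencies, dtype=float)
    log_odds = np.log(freqs / (1.0 - freqs))
    slope, _intercept = np.polyfit(generations, log_odds, 1)
    return slope

# Hypothetical trajectory: a mutation rising from 0.01% to ~1% frequency.
gens = [100, 125, 150, 175, 200]
freqs = [0.0001, 0.0004, 0.0012, 0.004, 0.011]
print(f"s ~ {selection_coefficient(gens, freqs):.3f} per generation")
```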
References:
- Lou DI, et al. (2013) High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc Natl Acad Sci USA 110:19872–19877.
- Schmitt MW, et al. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci USA 109:14508–14513.
For more information about Dan’s work, you can contact him at daniel dot deatherage at gmail dot com.