BEACON Researchers at Work: Males have no taste… at least if you are a Heliconius butterfly

This week’s BEACON Researchers at Work post is by BEACON Faculty Affiliate Adriana Briscoe, from University of California, Irvine.

Unlike their male counterparts, female Heliconius butterflies have taste receptors on their forelegs in order to pick non-toxic plants on which to lay their eggs

My research group in the Department of Ecology and Evolutionary Biology at the University of California, Irvine studies the sensory world of butterflies.  We primarily work on vision but have recently extended our scientific gaze to include olfaction and gustation with the publication of reference genomes for the passion-vine butterfly Heliconius melpomene and the monarch Danaus plexippus.

Aide Macias-Muñoz, PhD student

Nearly a year ago, I visited the Department of Zoology at the University of Cambridge, U.K. as an Overseas Visiting Scholar at St. John’s College.  From that lovely vantage point near the River Cam, I directed a collaborative project that brought together work by my first-year BEACON-sponsored PhD student Aide Macias-Muñoz and work by graduate students Simon Martin and Krzysztof Kozak and undergraduate Gabriel Jamie in Dr. Chris Jiggins’ lab. Our collaborative work was published on 11 July in the journal PLoS Genetics.

Here are some shamelessly lifted words from the press release about our work by the University of Cambridge, U.K.:

“Female Heliconius butterflies have taste receptors embedded in spikes on their legs in order to spear and ‘taste’ plants to find the most beneficial ones on which to lay their eggs, new research reveals. As male Heliconius butterflies do not lay eggs, they have no taste receptors on their [fore]legs.

H. melpomene female laying an egg on a passion-flower vine.

For the research, [we] studied the genes that code for the taste receptor proteins. Using new high-throughput sequencing methods [this was Aide’s work on the project], [we] were able to identify genes expressed at very low levels, including the great diversity of taste receptor genes unique to female Heliconius butterflies.

Because, unlike their parents, caterpillars cannot fly away to find a more suitable plant, it is imperative that the female butterflies choose a non-toxic host plant for their eggs. The proteins coded for by the taste receptor genes enable the female butterflies to identify non-toxic plants on which to lay their eggs.

It is a long-standing hypothesis that butterflies are so diverse partly because of the complicated co-evolutionary arms race with the plants that their larvae eat – as plants develop new ways to prevent being eaten, butterflies develop new ways to eat plants.

For example, Heliconius butterflies evolved in a way that allows them to feed on the highly toxic, cyanide-containing leaves of passion-flower vines.

The Heliconius butterflies have not only evolved to overcome the plant’s defences, but can now even synthesise their own cyanide-containing compounds that protect them from predators.”

When I visited BEACON on my way to Cambridge last September this project was just beginning to take shape.  I was inspired by a conversation I had with Danielle Whittaker about BEACON’s public outreach efforts.  This led me to approach a gifted cartoonist, Jay Hosler, to take on the job of translating our discoveries to the public. His gorgeous artwork was published as a supplementary figure to our PLoS Genetics paper and is reproduced below:

“For Bitter or Worse: A Tale of Sexual Dimorphism and Good Taste”, an original cartoon by Jay S. Hosler, author and illustrator of science-oriented comics.


Reference: 

Briscoe AD, Macias-Muñoz A, Kozak K, Yuan F, Walters JR, Jamie GA, Martin SH, Dasmahapatra KD, Ferguson LG, Mallet J, Jacquin-Joly E, Jiggins CD. 2013. Female behaviour drives expression and evolution of gustatory receptors in butterflies. PLoS Genetics 9: e1003620. DOI: 10.1371/journal.pgen.1003620

Featured in Scientific American blog Not Bad Science, 24 July 2013

http://blogs.scientificamerican.com/not-bad-science/2013/07/24/how-might-female-butterflies-gain-an-advantage-how-about-having-the-ability-to-taste-through-their-feet/

For more information on Adriana Briscoe’s work, you can contact her at abriscoe at uci dot edu.


BEACON Researchers at Work: Bioinformatics tools

This week’s BEACON Researchers at Work blog post is by University of Idaho graduate student Ilya Zhbannikov.

I graduated from the Moscow Aviation Institute (National Research University, Russia) with a Master’s degree in Information Systems in 2009. After a year of working as a software developer, I joined the University of Idaho (USA) and graduated with a Master’s degree in Computer Engineering (2012). Currently I am pursuing a PhD in Bioinformatics and Computational Biology at the same university and expect to finish in 2014. I have a wide range of research interests: high-throughput sequencing data analysis, parallel computing, biomedical text mining and phylogenetics.

The goal of my research in high-throughput sequencing data processing is to understand, analyze and improve the data in order to use it in subsequent stages of research. With newly developed next-generation sequencing technologies and increasing interest in gene discovery, DNA mapping, functional genomics and genome annotation, the amount of data produced by sequencing is growing exponentially, roughly doubling every month. Sequence data taken “as is” from automated sequencing machines should not be considered ready to use for analysis, due to various contaminants remaining after sequencing. An additional cleaning step to filter such remnants is needed to prepare reads for further analysis. To provide this service, I propose SeqyClean, a software tool to clean next-generation sequence data.
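To make the cleaning step concrete, here is a minimal sketch of the kind of filtering such a tool performs; the adapter sequence, quality cutoff and length cutoff below are illustrative assumptions, not SeqyClean’s actual parameters or algorithm:

```python
# Illustrative read cleaning only; not SeqyClean's actual algorithm.
ADAPTER = "AGATCGGAAGAGC"   # assumed Illumina-style adapter fragment
MIN_QUALITY = 20            # assumed Phred quality cutoff
MIN_LENGTH = 50             # assumed minimum length of a usable read

def clean_read(seq, quals):
    """Trim low-quality 3' bases and adapter contamination from one read;
    return None if too little of the read survives."""
    # Trim from the 3' end while the base quality is below the cutoff.
    while quals and quals[-1] < MIN_QUALITY:
        seq, quals = seq[:-1], quals[:-1]
    # Cut the read at the first adapter occurrence, if any.
    idx = seq.find(ADAPTER)
    if idx != -1:
        seq, quals = seq[:idx], quals[:idx]
    return (seq, quals) if len(seq) >= MIN_LENGTH else None
```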

Nowadays it has also become possible to sequence an entire genome quickly and inexpensively. However, in some experiments one only needs to extract and assemble a portion of the sequence reads, for example when performing transcriptome studies, sequencing mitochondrial genomes, or characterizing exomes. With the DNA library of a complete genome, one might think it would be no problem to identify the reads of interest. But it is not always easy to incorporate well-known tools such as BLAST, BLAT, Bowtie, and SOAP directly into a bioinformatics pipeline before the assembly stage, either due to incompatibility with the assembler’s file inputs, or because it is desirable to incorporate information that must be extracted separately. I am working on a tool, SlopMap, which can identify the reads of interest from a given DNA library.
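The underlying idea can be illustrated with simple k-mer matching: keep any read that shares enough short substrings with a reference of interest. This is a toy version of the concept, not SlopMap’s implementation, and the k-mer size and threshold are assumed values:

```python
# Toy k-mer-based read selection; not SlopMap's actual method.
K = 15           # assumed k-mer length
MIN_SHARED = 3   # assumed number of shared k-mers needed to keep a read

def kmers(seq, k=K):
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def select_reads(reads, reference):
    """Yield reads that share at least MIN_SHARED k-mers with reference."""
    ref_kmers = kmers(reference)
    for read in reads:
        if len(kmers(read) & ref_kmers) >= MIN_SHARED:
            yield read
```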

The goal of my research in biomedical text mining is to demonstrate how methods from Systems Biology, along with newly developed text mining techniques, can be applied to publication abstracts for the problem of discovering hidden relationships and features within microbial communities. Recently, microorganisms such as bacteria and whole bacterial communities have become models for Systems Biology and Bioinformatics. Recent studies of the vaginal microbiome show improvements in the learning and classification of microbiota but still lack a systematic approach to this problem. On the other hand, many researchers still use very general problem-solving techniques that consist of systematically enumerating all possible candidates for the solution. Such an approach can be ineffective and tends to delay future discoveries. To alleviate this bottleneck, I propose a tool intended to reduce the range of possible solutions and suggest hypotheses by taking advantage of previously published work along with newly developed text-mining algorithms and graph theory. I have developed a beta version of an application, BALMNet, which provides these services by constructing microbial interaction networks from a set of PubMed abstracts.
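As a toy illustration of the general approach (BALMNet’s text-mining and graph algorithms are far more sophisticated), a crude interaction network can be built by linking taxa that are mentioned together in the same abstracts; the edge-count threshold is an assumption:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(abstracts, taxa, min_count=2):
    """Link two taxa whenever both are mentioned in the same abstract;
    return edges weighted by co-occurrence count, keeping those seen at
    least min_count times."""
    edges = Counter()
    for text in abstracts:
        present = sorted(t for t in taxa if t.lower() in text.lower())
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return {pair: n for pair, n in edges.items() if n >= min_count}
```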

Many comprehensive phylogenetic hypotheses have already been introduced to solve the problem of combining trees computed from data from different loci into one “supertree.” A supertree contains all taxa, while smaller input trees contain only a small part of a phylogeny and are often incompatible with one another. Input trees may not incorporate enough data for a perfect supertree, which leads to multiple supertrees, making it impossible to distinguish among the constructed trees to choose the best. The concept of Phylogenetic Decisiveness presented by Steel and Sanderson (Steel M., Sanderson M., “Characterizing phylogenetically decisive taxon coverage,” Appl Math Letters 23:82–86) addresses this problem, employing a special criterion, decisiveness, to determine whether the given taxon coverage yields a unique phylogeny. I created the program decisivatoR, an R infrastructure package inspired by the work of Steel, Sanderson and Fischer. However, the problem of determining what to add to the data to produce a unique evolutionary tree remains open. The web version of decisivatoR (see figure below) is available here: http://glimmer.rstudio.com/izhbannikov/DecisivatoR/.

[Screenshot of the decisivatoR web application]
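To give a feel for what such a check involves, here is a brute-force sketch (in Python, for illustration only) of one simple sufficient condition from this literature: if every quadruple of taxa is sampled together for at least one locus, the coverage is decisive. decisivatoR implements the actual criteria more carefully and efficiently.

```python
from itertools import combinations

def covers_all_quadruples(all_taxa, loci_taxa):
    """Simple sufficient condition for phylogenetic decisiveness:
    every set of four taxa is sampled together for at least one locus.
    loci_taxa is a list of sets, one per locus, each giving the taxa
    sampled for that locus."""
    locus_sets = [frozenset(taxa) for taxa in loci_taxa]
    return all(
        any(set(quad) <= locus for locus in locus_sets)
        for quad in combinations(sorted(all_taxa), 4)
    )

# Toy example: four loci over five taxa.
taxa = {"A", "B", "C", "D", "E"}
loci = [{"A", "B", "C", "D"}, {"A", "B", "C", "E"},
        {"A", "B", "D", "E"}, {"A", "C", "D", "E"}]
print(covers_all_quadruples(taxa, loci))  # False: {B, C, D, E} never covered
```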

For more information please check my blog: http://bioalgo.blogspot.com/

 


Evolving ecosystems can change more than previously thought

Cross-posted from MSU grad student Randy Olson’s blog.

For decades, when ecology researchers used computer models to study how ecosystems change over time, they often assumed that the species in any given ecosystem are more-or-less fixed. The abundances of each species may change over time — and some species may even go extinct and be replaced by an existing species from another ecosystem — but once a stable ecosystem is established, new species aren’t going to evolve. In ecology research, such an ecosystem is said to have reached an ecological fixed point.

However, recent research suggests that ecosystems may not be as fixed as previously believed. Ecological communities may look like fixed points only because new species don’t evolve on timescales that are easily observed in our lifetime. If it takes thousands or even millions of years for a new species to evolve, then a community may look fixed on the scale of hundreds of years, but be quite fluid on the scale of thousands or millions of years. The new evidence suggests that natural ecosystems exist in a dynamic steady state when studied from the point of view of longer time scales, where new species are constantly evolving and replacing the incumbent species over extremely long time periods.

Of course, it’s difficult to study whether natural ecosystems actually exist in a dynamic steady state because it would take thousands of years to conduct such an experiment. That’s why we turned to the digital evolution platform called Avida — where we can simulate evolving ecosystems over thousands of digital years — to try and answer this important question. Below is a high-level summary of our findings from the BEACON class research project. If you want to get into the nitty-gritty details of the experiments, we’re publishing a report on the findings in the Proceedings of the European Conference on Artificial Life in September 2013 (preprint here).

Evolving digital ecosystems: dynamic steady state, not ecological fixed point

To study whether ecosystems exist in a dynamic steady state, we had to simulate an evolving ecosystem for thousands of digital years. In Avida terms, we evolved the ecosystems for 500,000 updates (roughly 60,000 generations), which is more than enough time for an ecosystem to reach a stable state in Avida. At update 500,000, we took a snapshot of the ecosystem and determined the species present at update 500,000. Following that, we evolved the ecosystem for another 500,000 updates, took another snapshot, and determined the species present at update 1,000,000. What we found was surprising: the species had drastically changed after 500,000 updates of evolution!

A simplified view of an evolving digital ecosystem: Many species in a stable ecosystem can change after thousands of years of evolution. For example, the “green circle” species evolved into a “yellow star” species that doesn’t resemble its ancestor at all. Similarly, the “blue triangle” species evolved into a “green triangle” species, which could mean that the species evolved to live off of a different food source. Finally, the “red square” species stayed the same after several thousand years of evolution; whether that was due to chance or the species was strongly selected to remain the same is a subject of future study.

If evolving ecosystems do indeed rest at an ecological fixed point, we should have seen the exact same species at updates 500,000 and 1,000,000. Instead, we saw ecosystems that consumed the same resources and had the same number of species, yet the individual species had changed after thousands of digital years of evolution. We interpreted this phenomenon to suggest that natural ecosystems exist in a dynamic steady state rather than a single ecological fixed point. This finding has far-reaching impacts in ecology research, especially with the recent findings that some species can evolve much faster than traditionally assumed.

Population bottlenecks aren’t so bad in the long run

Another open question affecting ecosystem dynamics over time is whether extreme events such as catastrophic population bottlenecks have a lasting impact on the long-term evolution of ecological communities. To investigate this question, we ran the same experiment as above, except we randomly killed all but a fixed number of organisms after we took the snapshot of the ecosystem at update 500,000. We know, it’s a little cruel to kill so many digital organisms en masse, but we don’t have to get IACUC approval for experiments with digital organisms (yet).

These experiments yielded another series of interesting and unexpected results. In the report linked above, we showed that the ecosystem recovers from the population bottleneck in the long term regardless of the size of the bottleneck, even if the ecosystem is reduced to a single organism. This result tells us that ecosystems are surprisingly robust to catastrophic events over long timescales.

Don’t worry about pandas going extinct; another panda-like species will evolve again in another million years. Photo: flickr/Chris Wieland

When we compared the species at update 1,000,000 to their ancestors from before the bottleneck, we found that they were also quite different. Surprisingly, when we compared how different the species were using a species similarity measure, we found that the species in ecosystems that experienced a population bottleneck were just as different from their ancestor population as the species in ecosystems that never experienced a bottleneck. This result again tells us that population bottlenecks have a negligible impact on the long-term evolution of ecosystems, and more importantly, hints that neutral evolution plays a much larger role in shaping ecosystems over long timescales than we previously thought.
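As a rough illustration of what a species similarity measure can look like (the measure used in the study may differ), a common choice is the Jaccard index between two sets of species:

```python
def jaccard(species_a, species_b):
    """Jaccard similarity between two species sets: |A & B| / |A | B|."""
    a, b = set(species_a), set(species_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Two snapshots sharing 2 of 6 total species score 2/6:
before = {"circle", "triangle", "square", "star"}
after = {"circle", "square", "hexagon", "diamond"}
print(jaccard(before, after))  # 0.333...
```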

This study teaches us yet another important lesson about how evolution shapes ecological communities, perhaps best captured in Bob Dylan’s famous hit song: for the ecosystems, they are a-changin’…


BEACON Researchers at Work: Expanding the Genetic Code

This week’s post is by UT Austin graduate student Michael Hammerling.

For as long as I can remember, I’ve been drawn to philosophical questions about the nature of life and its relationship to the physical world. While it became clear to me over time that empiricism and the scientific method are much more powerful tools than pure philosophy for the exploration of such questions, I continue to believe that the most luminary scientists are those who are also philosophers at heart. They challenge paradigms, abhor dogma, and are habitually aware and skeptical of the assumptions that form the basis of their beliefs. A quote credited to Einstein expresses this most concisely:  “We cannot solve our problems with the same kind of thinking we used when we created them.” While new, creative ways of thinking are necessary, scientifically addressing a problem also requires technological advances that permit direct empirical testing of hypotheses. Incredible advances in genetic engineering are currently allowing biologists to alter the most fundamental properties of living things, and as a corollary, to question some of our most fundamental assumptions.

The genetic code is one of the most ancient and revolutionary developments in the history of biological evolution.  It describes the rules by which the information contained within sequences of DNA – the primary molecule for storing and passing on genetic information – is translated into the proteins which perform most of the important functions within cells. The rules for translating this code into protein are simple in principle. There are four types of DNA bases, abbreviated A, T, C, and G.  In a DNA sequence, every set of three bases is called a codon and has a specified meaning: ATG signifies the START of a protein, three STOP codons (TAG, TAA, and TGA) indicate the end of a protein, and all other codons signify one of the twenty amino acids. The sequence of amino acids in a protein dictates its function. Since there are 64 possible three-base codons but only twenty natural amino acids, the code is redundant, meaning that many of the amino acids may be coded for by more than one codon.
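The lookup these rules describe is simple enough to sketch in code. Below is a toy translator covering only a handful of the 64 codons, purely for illustration:

```python
# Toy illustration of the triplet code: translate a DNA sequence into
# amino acids, starting at ATG and stopping at any STOP codon. Only a
# few codons are listed; note the redundancy (two codons both mean
# phenylalanine).
CODON_TABLE = {
    "ATG": "Met",            # START (also codes for methionine)
    "TTT": "Phe", "TTC": "Phe",
    "AAA": "Lys", "GGC": "Gly",
    "TAG": None, "TAA": None, "TGA": None,  # STOP codons
}

def translate(dna):
    start = dna.find("ATG")
    if start == -1:
        return []                  # no START codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "Xaa")  # Xaa: not in toy table
        if aa is None:             # STOP codon: end of the protein
            break
        protein.append(aa)
    return protein

print(translate("GGATGTTTAAAGGCTAGTTT"))  # ['Met', 'Phe', 'Lys', 'Gly']
```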

This triplet code is so fundamental to biological life that it is nearly universally conserved in all living things. Since its elucidation, the nature of the genetic code has puzzled some of the greatest minds in molecular biology and evolution. Is it a “frozen accident” of evolution, as Francis Crick proposed, which might have turned out differently if the history of life were replayed, but is now too deeply ingrained in biology to be capable of change? Or is the static nature of the genetic code explained by its being in some way optimal, providing the perfect combination of chemical diversity coupled with an ideal robustness against mutation? While the truth likely lies somewhere between these two extremes, the difficulty of engineering changes to the code has prevented the experimental exploration of this topic until now.

Expanded genetic code. Genetic engineering of the bacterium’s protein translation machinery allows incorporation of a 21st amino acid. By introducing new genetic elements, a codon that designates protein termination (STOP) in the natural genetic code now codes for the unnatural amino acid 3-iodotyrosine.

Recent advances in genetic engineering technology have allowed researchers to redefine the meaning of the rarely used amber (TAG) stop codon to specify a large variety of unique 21st amino acids. Many of these amino acids contain interesting chemical groups, and hold great potential for engineering proteins with improved functions. These systems also allow us to ask direct experimental questions about the optimality and flexibility of the genetic code. Will organisms with a newly expanded genetic code evolve to use the new amino acid in their proteins to improve their fitness, or will the new building block be avoided in favor of the canonical set of amino acids? What conditions favor assimilation of the new codon? Will organisms evolved for long periods of time come to require the 21st amino acid for survival? These are the questions I am exploring in my research.

The virus T7, which infects bacteria, was allowed to evolve freely for many generations by growing it on a host with an expanded genetic code. The mixed population of viruses that we obtained from this experiment was deep sequenced, with many interesting results.

As a preliminary foray into this area of study, we performed an evolution experiment with T7 bacteriophage, a virus which infects the model bacterium Escherichia coli. This virus has a small genome and grows rapidly, allowing many generations of evolution to be completed in a reasonable amount of time, and yielding interpretable results. It utilizes the host bacterium’s genetic code, allowing us to expand the genetic code of the host to incorporate the unnatural amino acid 3-iodotyrosine at the amber (TAG) codon, and observe the impact on the course of viral evolution. As expected, phage populations evolved to avoid the canonical function of the amber codon as a translation terminator.

In addition, and much to our surprise, multiple phages also evolved to use this unusual amino acid in important and essential genes. In one case, we showed that phages containing the unnatural amino acid in their type II holin protein were more fit than those with the original or any other amino acid at that position, which explained the high frequency this mutation reached in the population. This constitutes the first demonstration of a beneficial mutation enabled by unnatural amino acid incorporation in an organism.

These findings challenge the view of the genetic code as an optimal arrangement, and improve our understanding of one of life’s most basic properties. Under certain conditions, an expanded genetic code can increase the evolvability of organisms and proteins, opening new routes to increased fitness and unique functions. Further work will continue to explore the impact of a variety of unnatural amino acids on the evolution of different organisms and proteins.

 For more information about Michael’s work, you can contact him at mhammerling at gmail dot com.


BEACON Researchers at Work: The páramos – understanding a hyperdiverse ecosystem one genus at a time.

This week’s BEACON Researchers at Work post is by University of Idaho graduate student Simon Uribe-Convers.

I am always amazed by the huge diversity around us. We take it for granted; it seems humans only remember the world’s diversity when watching a BBC documentary. Nevertheless, the study of biodiversity has been one of the main foci for biologists and naturalists. As humans, we have the tendency to classify objects based on their characteristics and/or functions, and biological entities are treated in a similar way. To date, plant systematists have discovered and classified more than 300,000 species of flowering plants, based mostly on morphological characters. With the extensive use of DNA in the last couple of decades we have been able to corroborate and correct many of these classifications. However, conservative estimates of biodiversity suggest that we have only discovered and named ~10% of all the species found on the planet. This is particularly evident when we look at the high numbers of new species described from well-explored areas like North America, as well as important biodiversity hotspots around the world. I have the luxury of working in one of these biodiversity hotspots: the páramo.

The páramo ecosystem, found in the high-elevation regions of Andean South America, is considered the most biodiverse montane ecosystem in the world, with ~4,000 species of vascular plants, nearly 60% (~2,400 spp.) of which are endemic. It is thought that the current diversity found in the páramos is the result of different evolutionary processes involving colonization of species from the Amazon basin, the South American savannas, and temperate North America. Furthermore, the páramos are found only above 10,000 ft along the Andean cordillera and effectively act as islands, ‘sky islands’, surrounded by lowland valleys that prevent the dispersal of species, suggesting that in situ and/or allopatric speciation may have played a large role in shaping the diversity that we see today.

Unfortunately, it is impossible to study biodiversity as a whole in an ecosystem as rich as the páramo, and thus we study it by choosing groups of species that are representative of the environment. I have chosen the plant genus Bartsia L. for my dissertation, and I am studying the evolutionary history of this genus and how it came to be what we observe today.

Bartsia is a member of Orobanchaceae, the largest parasitic plant family among angiosperms, and is what botanists call a hemiparasite. Hemiparasitic plants still produce chlorophyll and are capable of photosynthesis. Nevertheless, they form connections with their hosts and obtain water and sometimes nutrients from them. In the case of Bartsia and other close relatives in Orobanchaceae, they do so by forming root-to-root connections through a structure known as a haustorium (plural: haustoria). The genus comprises ~50 species; as currently defined, it has two species in the mountains of northeastern Africa, one in the Mediterranean region, and one north temperate species. The remaining ~45 species are distributed throughout the páramos of Andean South America and are the main focus of my work.


The first question I had about Bartsia pertained to its strange and very disjunct distribution. I wanted to find out whether these species are each other’s closest relatives (in phylogenetic terms, whether they are monophyletic). By using molecular techniques and a few chloroplast and nuclear genes, I discovered that they are not monophyletic and that they actually form four distinct evolutionary lineages that correspond to their geographic distributions. This discovery has some taxonomic implications, the most important being the breakup of Bartsia into groups that better reflect their evolutionary, or phylogenetic, relationships.

I was also interested to know how and when Bartsia colonized the páramos. The Andes are a very young mountain range that started uplifting approximately 10 million years ago (Ma) in a geologic period known as the Miocene. However, it was not until about 5 Ma that the Andes were high enough to host alpine-like conditions, the ones that Bartsia thrives in. I used computer algorithms that make use of genetic data and time calibrations, like the uplift of the Andes or additional molecular dates, to infer that Bartsia colonized the páramos between 1.53 and 4.11 Ma. These dates fit well with the time when the alpine conditions, in other words the páramos, were appearing as a new environment. Now, imagine that a new empty niche is created and that the first groups to reach it have the opportunity to radiate into multiple species. This is what likely happened with the common ancestor of the current South American Bartsia. In fact, if we look at the genus’s diversification rate, that is, the rate at which new species are created (speciation) minus the rate at which species go extinct, we can see that the genus is diversifying twice as fast as its European relatives! Something similar has been shown in other páramo plant groups like lupines (Lupinus) and the valerian family (Valerianaceae).
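As a back-of-the-envelope illustration of this kind of comparison (a crude pure-birth estimator, not the method or the numbers used in my analyses), a clade’s net diversification rate can be approximated as ln(n)/t from its current species richness n and crown age t:

```python
import math

def net_diversification(n_species, crown_age_ma):
    """Crude pure-birth estimate of net diversification rate
    (speciation minus extinction), in events per million years."""
    return math.log(n_species) / crown_age_ma

# Placeholder numbers for illustration only, not the study's estimates:
print(net_diversification(45, 3.0))   # a young, species-rich Andean clade
print(net_diversification(10, 10.0))  # an older, less diverse relative
```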

Finally, I am currently working on understanding the relationships among the ~45 species in the páramos. Due to their young age and thus similar genetic content, it is hard to gather the necessary phylogenetic information from just a few genes. This is why I have designed PCR primers for the 48 most informative regions in the chloroplast, and I am in the process of designing primers for 48 independent nuclear genes. I will amplify all these regions using a new method called microfluidic PCR, which allows me to run ~2,300 amplification reactions simultaneously, ready to be sequenced on next-generation sequencing platforms. The data from the 48 chloroplast regions are ready and look good, and I am excited for the nuclear data!

With all of this, I expect to better understand how Bartsia diversified in the páramos and to be able to infer other evolutionary patterns in the group, like its biogeography. I also hope that my study has broader implications by aiding our understanding of how hyperdiverse ecosystems like the páramo were assembled, which is of general interest to biodiversity researchers as well as policy makers and conservation agencies. By understanding how certain groups evolved, we can make predictions about how other similar groups will react to changes in their habitat due to land use or climate change, which will aid in the creation of more informed policies and conservation strategies.

For more information about Simon’s work, you can contact him at uribe dot convers at gmail dot com.


BEACON Researchers at Work: Finding hidden flaws and features in evolutionary computing

This week’s BEACON Researchers at Work blog post is by MSU graduate student Brian Goldman.

For me, some of the most enjoyable moments in research are when I’m outsmarted by my own creation.  Anyone who’s spent enough time with Evolutionary Computing (EC) can probably tell you a story where this happened, but the example I remember learning in my introductory course has always been one of my favorites.  Researchers set out to evolve a vacuum-cleaning robot, selecting whichever robots could “suck up the most dirt.”  Seems like a good idea, right?  Everything started out okay, with robots cleaning the room decently well.  Then one of the robots realized, “If I dump all of my collected dirt on the floor, I can suck it up again!”  Sure enough, that strategy is very good at sucking up the most dirt, allowing robots to gather more dirt than exists in the room, all without having to move.

Maybe that example didn’t really happen, but it illustrates what I do.  My research involves deep theoretical and empirical analysis of artificial evolutionary systems, finding negative behaviors or unnecessary complexity, and proposing fixes.  Just like the faulty robot, many evolutionary systems used to solve real-world problems appear straightforward, built from simple rules.  Yet the impact of those rules may not be obvious at first glance.  Sometimes the degenerate behaviors that hinder optimization are just as hidden as their causes, meaning the user may not even be aware of the problems they are having.

Over the last year I have focused on Cartesian Genetic Programming (CGP), which is a powerful yet relatively simple way of evolving functions.  Some examples of what I mean by functions are circuits, neural networks, robot controllers, image classifiers, and system models (regression).  CGP has been used to evolve all of these with great success in the almost 15 years since its inception. For those familiar with genetic programming or other machine learning methods, CGP’s claim to fame is that it can evolve directed acyclic graphs, allowing for intermediate value reuse while still creating relatively comprehensible solutions.

So what degenerate behavior can still exist in such a widely effective and well-established evolutionary optimizer?  You would think by now all the bugs in a system as simple as CGP would have been found and eliminated.  It turns out that a number of “features” in CGP are detrimental to evolving useful solutions quickly, yet have received little attention or were unknown to the CGP community. The reason they managed to survive this long is that they have been hiding in otherwise beneficial behaviors.

CGP example individual which encodes the function (X+Y) / (X+Y+Z). The top portion is the phenotype and the bottom is the genotype. The grey portions of the genotype are inactive nodes.

It’s time for some concrete examples.  The two I’d like to discuss here involve a feature of CGP rarely seen in other EC systems: explicitly inactive genes.  These are portions of the genome that are identifiably not part of the encoded solution.  They may be incorporated in future offspring or may have been used by the individual’s ancestors.  In biological terms, these are genes that have no impact on an individual’s phenotype.  There is solid theory and a lot of evidence that having these inactive genes is helpful both to evolution in general and to CGP’s optimization in particular.

The first behavior I investigated involved the creation of actively identical offspring.  What I mean by “actively identical” is that these offspring have no mutations to any of the genes used to encode their solution.  They may still have mutations to their inactive genes.  These offspring, originally discussed but not investigated by CGP’s creator, have the potential to waste search evaluations.  In EC the goal is to obtain high-quality solutions using the fewest evaluations, so almost anything that reduces evaluations improves the algorithm.  My contribution was to determine the exact formula for how much waste was being generated, and to propose three simple ways of modifying CGP to avoid this waste.  These techniques improved CGP’s efficiency anywhere from statistically significant amounts to orders of magnitude, depending on configuration.  They also made CGP easier to apply to new problems by making search speed less sensitive to the mutation rate, which must be set by the user.
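To see why such wasted evaluations are common, note that under a simple model where each gene mutates independently, the chance that an offspring is actively identical is just the chance that none of its active genes mutate. The sketch below uses this simplified model (my illustration, not necessarily the exact formula from the analysis):

```python
def p_actively_identical(mutation_rate, n_active):
    """Probability that none of an offspring's active genes mutate,
    assuming each gene mutates independently (simplified model)."""
    return (1.0 - mutation_rate) ** n_active

# With a 1% per-gene mutation rate and only 20 active genes (CGP genomes
# are often mostly inactive), ~82% of offspring are actively identical:
print(p_actively_identical(0.01, 20))  # ~0.818
```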

Another touted feature of CGP related to inactive genes is its ability to resist bloat.  In the EC community, bloat refers to the tendency for solutions to become more complex with time, with little to no improvement in solution quality.  In CGP, evolved solutions almost never approach the maximum size allowed, with most genomes composed almost entirely of inactive nodes.  Through probability theory and some tailored test cases I was able to show that CGP’s resistance to bloat was due to a substantial bias in its search mechanisms.  I proposed two new techniques which broke free of this bias, improving CGP search efficiency even further as well as reducing solution complexity.

I am currently wrapping up my work with CGP, but I plan to continue performing deep analysis and improvement of evolutionary optimization systems.  This fall I plan to propose a few novel systems of my own.  For now, though, it’s back to whiteboards, formulas, and paper writing for me.

For more information about Brian’s work, you can contact him at goldma72@msu.edu.


BEACON Researchers at Work: Visualizing and understanding ‘context dependence’ in evolution

This week’s BEACON Researchers at Work blog post is by MSU graduate student Sudarshan Chari.

Have you ever wondered about the relative contribution of nature vs. nurture? Or why certain individuals are more susceptible to a disease, or respond better to a drug treatment than others? Intuitively we know that individuals are genetically unique and react differently to their environment (like temperature, diet, etc.). The broad theme of my research focuses on understanding such conditional influences of genes (genotypes) on traits (phenotypes) in an evolutionary context, using the humble fruit fly Drosophila melanogaster as a model system.

This idea of how genotype influences phenotype can be visualized using genotype-phenotype maps (figure at left). In this framework, one can combine all the gene/genomic components together and call it the genotype. If we plot all possible genotypes of an organism, we can represent it as a plane called the genotype space. Any point in this space represents the entire genetic makeup of an organism. We can also combine all possible phenotypes of an organism into an analogous phenotypic space, with the plane representing all possible combinations of traits. For any individual you can draw a set of lines from its location in genotype space to its location in phenotype space. These lines pass through many intermediate developmental spaces that describe how the genotype becomes the phenotype. The intermediate spaces include important processes like gene expression, biochemical pathways, and cellular and physiological aspects that shape the final form of an organism. In this framework you can imagine that if an individual gets a mutation, the genotype changes and is displaced to a new point in genotype space. Due to this displacement, the downstream cascade of development also shifts, resulting in a different final phenotype. Similarly, environmental variation, like a drastic change in temperature during development, can cause a shift in the intermediate spaces, so that two individuals that started from the same genotype may end up distinct in phenotype space. One can now begin to visualize how variation in the underlying genes and environment influences variation in the final phenotype of an organism.

This idea of context dependence is incredibly important for understanding complex traits. That a mutation which causes diabetes or cancer in one person may not cause harm in another fundamentally changes how we think about disease and treatment options. Context is also important from an evolutionary perspective. Imagine that a mutation occurs in an organism, which by itself has no phenotypic effects – that is, a neutral mutation. But what happens if this mutation happens to occur in an organism that has other interacting mutations that facilitate the expression of a novel beneficial trait? If beneficial, selection would favor this combination by increasing the survival and reproductive success of any individual carrying it, thereby preserving the mutation. Thus, apparently silent or even harmful mutations can make certain evolutionary routes or adaptive paths possible when combined with other mutations, or expressed in a different environment. Consequently, the evolutionary fate of a mutation can be conditional on the genetic background or the environment.

In the Dworkin lab, we use ‘evolution in action’ to understand possible evolutionary fates of mutant populations via the process of compensatory evolution. We know that organisms are constantly bombarded with harmful mutations that can cause phenotypic defects and diminish fitness (how they survive and reproduce). For example, if a mutation causes wings to be smaller in a bird, then it may not be able to fly as well, or could be an easier target for a predator. In such a case, these mutated birds may not survive or reproduce as well as non-mutant members of the population. Although such deleterious mutations are usually eliminated by selection, they sometimes can get fixed (in other words, 100% of the population has the mutation) due to processes like genetic drift. Context dependence can help here too! Sometimes a new mutation will arise that conditionally interacts with the deleterious mutation and compensates for the fitness defects: two negatives can make a positive!

We wanted to understand how compensatory mutations rescue populations with deleterious mutations. Do the necessary mutations already exist in natural populations? Or does evolution have to wait for a new mutation to occur? And in either case, how does the compensatory mutation solve the problem? By analogy, if a machine breaks down, you could either repair/replace the broken part to get it back to its original condition, or keep the broken part but change some other features in the machine to make it work again. In other words, you could fix the broken pipe, or just build a bypass around it.

We fixed a mutation that disrupts normal wing development in a large natural population of fruit flies. This mutation makes wings tiny and shriveled, which not only impairs flight but also reduces fitness, because males use their wings to sing to females during courtship. That is, wings are really important for mating and hence for reproductive success. Large natural populations have a lot of genetic variation throughout the genome that might interact with the mutation we introduced to provide compensatory effects. We experimentally evolved populations of mutant flies under two distinct selection regimes: one with natural selection, where the flies could choose their mates, and one with artificial selection, where we carefully chose the most ‘normal’ flies for mating every generation.


We see a rapid recovery of normal-looking wings in the artificial selection populations, indicating that there are naturally occurring alleles that can compensate for the loss of wings. However, the natural selection populations still look mutant, i.e., there is no recovery of wing form. Does this mean natural selection is ineffectual here? Not at all: these populations show increased mating behavior and also have better egg-to-adult survival, indicating an independent route to fitness recovery. The two selection regimes thus compensated for the same mutation in very different ways: while artificial selection recovered normal development of the wings, natural selection took the alternative routes of increased mating and survivability.

Many more questions are yet to be answered: can we sequence these populations and understand the underlying genetic basis of compensation? And if we repeated this process with a different mutation, would it yield similar results? More on these, next time…

For more information about Sudarshan’s work, you can contact him at charisud at msu dot edu. Special thanks to Amanda Charbonneau, whose comments and suggestions were critical and instrumental in shaping this post. 


BEACON Researchers at Work: Omics beyond model organisms

This week’s BEACON Researchers at Work blog post is by MSU graduate student Gaurav Moghe.

There are an estimated 9 million eukaryotic species on our planet, of which only 1.2 million (~15%) have been catalogued so far. Of these 1.2 million, only a few dozen are used as model organisms in modern science. In other words, most of the biological knowledge that mankind has is derived from <0.001% of species!

This is shocking, but there is a good reason for it. The model organisms are assumed to be representative of their taxa, a fair assumption given common descent. Saccharomyces cerevisiae is representative of fungi, Drosophila melanogaster is representative of invertebrates and Mus musculus is representative of mammals. Common descent also ensures that some of the principles learnt in yeasts are applicable to mammals and vice versa. This strategy has worked very well so far and has provided us with incredible insights into the mechanisms that keep life up and running. However, not all principles learnt in one species can be extrapolated to others. For example, plants belonging to the same genus can look very different and can have variations in their developmental, metabolic and stress response pathways. Even populations of the same species inhabiting different habitats can vary due to local adaptation or drift. A large proportion of the molecular mechanisms and evolutionary pathways responsible for the observed natural variation remain to be discovered in these un-sampled species; however, until very recently this was not possible, partly due to technological limitations.

Response of three different Arabidopsis thaliana accessions to 3 weeks of 500 mM salt stress. The molecular basis of such variation, how the genes influencing such variation evolved and the effect of such variation on fitness in the “real world” is poorly understood.

Over the past decade, the advent of new technologies has allowed us to explore beyond the model organisms. High-throughput assay technologies such as microarrays, Illumina sequencing and mass spectrometry, coupled with tremendous increases in computing power have allowed us to sample the genomes, transcriptomes and now even proteomes and metabolomes of a variety of organisms. The resultant data explosion has given us the opportunity to not only sample natural variation in the living world but also to understand what causes this variation using comparative “omic” approaches. When I joined the Shiu lab in 2008, I was fascinated by the diversity in the plant world and wanted to learn more about comparative genomics and molecular evolution. My research has made use of recently-available high-throughput data to understand the evolutionary principles associated with certain biological processes.

Studies in the Shiu Lab have focused on three themes – identifying novel genes in genomes, understanding the evolutionary patterns of duplicate genes and understanding how the expression of genes is regulated – using a combination of computational and experimental strategies in plants. In my own research, I have partly worked on addressing the nature of intergenic transcription in Arabidopsis thaliana. Spurred by the massive deluge of studies and popular news articles proclaiming the end of “junk DNA” in genomes, we sought to understand whether the claims of pervasive intergenic transcription and functionality of intergenic transcripts hold true when A. thaliana transcription is analyzed using Illumina sequencing data rather than lower-quality tiling array data. We also used comparative genomics to assess conservation of these intergenic transcripts across fifteen sequenced plant genomes and between 80 A. thaliana accessions, whose genomes are now available as part of the 1001 Arabidopsis Genome Project. We found that only 5-10% of the A. thaliana intergenic space is transcribed and that about the same percentage shows any signature of selection within or between species, suggesting a widespread prevalence of transcriptional noise. Using multiple characteristics of intergenic transcripts, such as their expression level, breadth of expression, distance from annotated genes, transcript length and degree of constraint, one can estimate the probability of an intergenic transcript being functional vs noise, thus aiding identification of non-canonical, novel genes in the intergenic space of different plant genomes.
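As a generic sketch of how several characteristics can be combined into a probability of being functional (illustrative only; the feature values and model here are not the study’s), one could train a simple classifier on known genes versus known noise and score intergenic transcripts with it:

```python
# Illustrative sketch: the features and values are made up, and this is
# not the model from the study.
from sklearn.linear_model import LogisticRegression

# Columns: expression level, breadth of expression, distance to nearest
# annotated gene (kb), transcript length (bp), conservation score.
known_genes = [[8.2, 0.9, 0.1, 1500, 0.8], [5.1, 0.7, 0.3, 900, 0.6]]
known_noise = [[0.4, 0.1, 12.0, 200, 0.05], [0.2, 0.05, 30.0, 150, 0.0]]

X = known_genes + known_noise
y = [1] * len(known_genes) + [0] * len(known_noise)
model = LogisticRegression(max_iter=1000).fit(X, y)

candidate = [[2.0, 0.4, 5.0, 400, 0.3]]      # one intergenic transcript
print(model.predict_proba(candidate)[0][1])  # estimated P(functional)
```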

Conservation of intergenic transcripts (A) compared to transcripts mapping to protein-coding genes (B) and RNA genes (C). X-axis represents genomes of different plant species and each row on Y-axis represents an individual feature. The color shades indicate the significance of the BLAST hit of the feature in the corresponding plant genome, with yellow being higher significance. This figure shows that intergenic transcripts are rapidly lost through evolutionary time, at a faster rate than transcripts mapping to protein-coding genes and RNA genes.

My research has also focused on understanding the evolutionary patterns of duplicate genes derived from whole genome duplication (WGD) in the Brassicaceae family of plants. WGD is a ubiquitous phenomenon among flowering plants and duplicate genes produced via WGD may aid in functional diversification and/or adaptation of the polyploid species. As part of my research, we sequenced and annotated the genome of wild radish (Raphanus raphanistrum) and compared the patterns of evolution of duplicate genes and pseudogenes between multiple Brassicaceae species. My results reveal a complex pattern of gene loss and retention post WGD.

The evolutionary biologist Theodosius Dobzhansky famously wrote, “Nothing in biology makes sense except in the light of evolution.” This statement is truer today than ever. Using high-throughput sequencing methods, we can now sequence transcripts present at a concentration of one transcript per 1,000 cells. We have the ability to rapidly sample all the metabolites being synthesized in cells and tissues. Population-level genome sequencing can reveal thousands to millions of polymorphisms between individuals. However, we still have an unclear idea of the significance of such massive molecular-level variation. How do we filter out truly functional features from random noise? How have such features evolved to their present state? With our newfound ability to sample populations and closely related species, we are beginning to better understand how genes and pathways evolve through time. Molecular evolutionary and comparative genomic approaches are also being used to pinpoint genes responsible for complex phenotypes such as intelligence, identify polymorphisms that may result in disease and define relationships between organisms. Such approaches, coupled with advancements in other technologies, are allowing us to better understand the diversity of the world we live in. It is truly an exciting time to be in evolutionary biology!

For more information about Gaurav’s work, you can contact him at moghegau at msu dot edu. 


Wiley Practice Prize awarded to BEACON's Multi-Criterion Decision Making team

Good news for BEACON and our Multi-Criterion Decision Making team (BEACON faculty Kalyan Deb and Erik Goodman, and BEACON collaborator Dr. Oliver Chikumbo of Scion, a New Zealand Crown Research Institute). In January, the team submitted a paper for the prestigious Wiley Practice Prize competition, held every two years at the Multi-Criteria Decision Making (MCDM) Conference.  This year’s MCDM Conference was in Málaga, Spain, June 17-21.

Dr. Chikumbo accepting the Prize.

The team’s paper had been selected as one of four finalists for presentation at the Conference, and was presented by Dr. Chikumbo, with Prof. Deb also in attendance.  The team learned yesterday, June 20, that they had been awarded the Wiley Practice Prize for 2013. The work has been conducted by Chikumbo, Deb and Goodman over the last two years, under BEACON sponsorship, with Dr. Chikumbo having spent one month in 2011 and one month in 2013 as a BEACON visitor.  The project arose from earlier collaborations between Goodman and Chikumbo beginning in the 1990s and continuing at a low level ever since. The work is continuing, and the team has been joined by Daniel Couvertier, a BEACON graduate student in CSE at MSU, and Mr. Hyungon Kim, a graduate student in the Human Interface Technology Lab at the University of Canterbury (New Zealand), supervised by Prof. Gun Lee.

The project depends heavily on being able to optimize a 14-objective problem, determining a Pareto set of optimal tradeoff solutions. The Evolutionary Multi-Objective Optimization (EMOO) search technology developed by the team complemented Prof. Deb’s R-NSGA-II algorithm with new epigenetic operators, which will be further studied by Daniel Couvertier.  The size of the search space of potential solutions is on the order of 10^600, and with 14 objectives to evaluate, the problem appears at first glance to defy any effort at optimization. However, the combination of heuristics and decision-making processes used by the team (nicknamed “WISDOM”) was able to find a useful Pareto set of solutions that bore up well under critical examination.  The team is also working with the University of Canterbury to develop virtual reality tools to help users comprehend the set of Pareto-optimal solutions.  The decision-making approach, given a set of optimal solutions, involves allowing individual stakeholders to first express their relative preferences among the 14 objectives, and then to rank four solutions (selected according to their preferences from among the optimal set).  These ranks are then combined using a scheme called the Analytic Hierarchy Process (AHP) to identify the solution most compatible with the preferences of all stakeholders.
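For readers unfamiliar with Pareto optimality, the core dominance idea can be sketched in a few lines (a toy minimization filter; the team’s R-NSGA-II-based machinery for 14 objectives is far more sophisticated):

```python
def dominates(a, b):
    """True if solution a is at least as good as b on every objective
    (minimization) and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(solutions):
    """Keep only the solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Toy example with 3 objectives (the real problem has 14):
print(pareto_set([(1, 5, 3), (2, 2, 2), (3, 1, 4), (2, 6, 5)]))
# -> [(1, 5, 3), (2, 2, 2), (3, 1, 4)]
```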

The WISDOM process is applicable to many “wicked” societal problems, and allows stakeholders to address simultaneously economic, environmental and social concerns, to satisfy the “Triple Bottom Line.”  The team plans to redevelop the platform for integrating the many distinct simulators used to calculate the 14 objective values for each solution into a sophisticated sensor and model integration framework in order to make it easier to generalize the approach for application to many problem domains, in partnership with a company with which negotiations are currently underway.

This project is a wonderful example of evolution’s practical value outside of the biological lab. For more information about the project, check out BEACON’s “International” project pages.


BEACON Researchers at Work: Using evolutionary computation to discover fakes

This week’s BEACON Researchers at Work blog post is by NC A&T undergraduate Joi Carter and graduate student Henry Williams.

Have you ever read a document that you thought was forged?  Perhaps you’ve received an email from your friend, but just knew somehow that they hadn’t actually written it.  I know I have read many social media posts that I just knew were the result of a prank and a stolen password, and as a student I have been on the receiving end of that prank a couple of times.  Wouldn’t it be nice if we had a system that could assure you that the proposed author of any text document was actually the author?  We can imagine a world where Facebook would automatically remove a status in which I’ve called myself “a pretty little pony,” because it knows the writing style of that post does not match my own and that my account was compromised.  This is a topic known as Author Identification.

Author Identification is a process by which an author can be recognized from a sample of text.  This process has a history in many fields.  Bosch and Smith (1998) worked on categorizing authorship of the Federalist Papers, attempting to settle a dispute over the 12 papers claimed by more than one author.  In the criminal field, a famous use of these techniques can be cited in the 1996 case against Ted Kaczynski (the Unabomber).  The FBI used author identification, and the help of Kaczynski’s family, to prove that he had written the manifesto detailing his plans.

The Federalist Papers and the Unabomber Manifesto

Author Identification is a form of biometric recognition in which many attributes of a person can be gleaned from their writing style, syntax, vocabulary and a nearly limitless list of other potential features.  The goal of our work is to determine which potential features are actually useful in this identification process.  Many researchers have proposed lists of features that they claim to be salient and that have proved effective at identifying the authors of a specific type and length of text sample.  Our interest in this topic stems from the methodology used by these researchers to test their features.  It seems to be a pattern for these researchers to test their feature sets on only a specific type of document.  Their test sets contain a limited number of documents, all of nearly identical size and produced by very few authors.

We have been working to find a feature set that can be productive at identifying the author of any possible written piece of text (documents as long as a book or as short as a tweet and even as unusual as programming code).  So far we have applied our methods to blog posts and HTML code, both of varying sizes and contexts.  Our work involves using Genetic and Evolutionary Feature Selection (GEFeS) along with our proposed Genetic Heuristic Development (GHD) to find these salient features.

The definition of a heuristic is ‘involving or serving as an aid to learning, discovery, or problem-solving by experimental and especially trial-and-error methods’, which is essentially what we have done with our research. We introduced a GHD process intended to improve the standard feature selection process.  The GHD uses a subset of features produced by GEFeS to create a high-performing feature mask for the recognition of unseen subjects. In this case the subset of features produced by the GHD is representative of the most salient style-based features for determining the author of a piece of work.

In this experiment, we start with feature selection using GEFeS as our genetic algorithm of choice. After running feature selection multiple times, we can then produce a heuristic. The development of our heuristic starts with collecting all the best feature masks from the validation set produced by GEFeS. We then create a frequency histogram from this data; each value in the histogram represents the percentage of times that a feature was used by GEFeS in the recognition process.

A Feature Frequency Histogram (FFH) can be generated using feature masks found during feature selection. Each value within this FFH represents the percentage of times a single feature was used in all of the best feature masks.

We then test our heuristic at 1% increments for the feature frequency threshold on the validation set to see which subset will give us the best recognition accuracy, i.e. trial-and-error. This experiment resulted in our Genetic Heuristic outperforming the baseline method by doubling the accuracy and by significantly reducing the number of features being used for recognition, which is the goal of any research we conduct in our group. We always aim to successfully increase the recognition accuracy and reduce the number of features required for recognition.
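In rough pseudocode-like Python, the histogram and threshold sweep look like this (a hedged sketch with illustrative names, not the exact GHD implementation):

```python
def feature_frequency_histogram(best_masks):
    """Fraction of best masks in which each feature was selected.
    Each mask is a list of 0/1 flags, one flag per feature."""
    n = len(best_masks)
    return [sum(column) / n for column in zip(*best_masks)]

def best_threshold(ffh, evaluate):
    """Sweep the frequency threshold in 1% increments, keeping features
    whose frequency meets the threshold; `evaluate` scores a feature
    subset on the validation set and is supplied by the caller."""
    best_score, best_t = -1.0, 0.0
    for t in range(101):
        subset = [i for i, freq in enumerate(ffh) if freq >= t / 100]
        if subset:
            score = evaluate(subset)
            if score > best_score:
                best_score, best_t = score, t / 100
    return best_t, best_score
```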

We have a lot more work to do with this research, but that’s the beauty of it. This branch has allowed us to explore areas where few people have ventured before, which makes the possibilities endless in terms of where we can go next. A heuristic can be applied to any feature selection problem; it doesn’t just limit us to Author Identification. It allows us to attempt to pinpoint which combinations of features are, relatively speaking, the best characteristics for identifying an author.  Let’s just say we’ll be very busy for a long while!

For more information about this work, you can contact Joi at jncarte1 at mail dot ncat dot edu or Henry at hwilliams18 at gmail dot com. 
