BEACON Researchers at Work: Using evolution to update old software

This week’s BEACON Researchers at Work blog post is by MSU graduate student Erik Fredericks. His first BEACON blog post is here.

Erik FredericksEngineers in industry are often given the somewhat daunting task of updating legacy software systems. Consider, for a moment, a piece of software used on a daily basis at a particular company, say, Initech. This software was programmed roughly twenty years ago by an engineer that has since left the company, leaving only cryptic comments and mostly unreadable software requirements. However, the importance of this software is paramount to Initech’s success, given its near-ubiquitous use throughout the corporation.

Times have changed over the past 20 years, Initech has gone through some changes, and our software package is quietly reaching the end of its usefulness. However, everybody still continues to use it, even though it is no longer as useful as it once was. Conveniently, a new engineer has been hired and is given his first task: update the software system, thereby extending its effective lifetime. Everybody is already used to it, the TPS reports it generates are spot-on, re-training all employees would be cost-prohibitive, and really, how hard can it be to update software?

After spending weeks gathering new requirements, analyzing the source code, and evaluating the overall state of the framework, our engineer has determined that the easiest way to fix the software system is to use software composition. Composition is a reuse technique for introducing new functionality into existing systems by reusing existing modules from other software systems. For instance, security can be added to a web application by composing in a security module that has been written for another application, where the composition is enabled by transforming the existing data into the format required by the new security module. Moreover, the data being returned must also be transformed into the format required by the legacy application.

Even though the code for the new module has already been written, there still remains many different ways in which to compose the module. For instance, how exactly should the legacy software system send data to the module? How should the data be returned? Is it even written in the same programming language? There are innumerable approaches for transforming data, and our software engineer would rather not spend the extra time and effort figuring out how to accomplish this task. Our engineer decides that an automated approach is best. Enter, context-free grammar-based genetic programming (CFG-GP).

CFG-GP is a search-based technique for automatically generating programs, similar in fashion to the standard genetic program. It comprises a population of candidate solutions that are evolved over time towards a particular goal or set of goals. CFG-GP differs from standard genetic programming in the solution representation. Particularly, each possible solution defined for the genetic program is specified by a context-free grammar. The grammar imposes a structure on generated solutions, effectively reducing the amount of invalid programs typically generated by a normal genetic program and also reducing wasted computing time. For composition, the grammar provides a structure for the many different ways in which data can be transformed for a particular programming language.

As with a typical genetic program, CFG-GP must also specify a fitness function to guide the search process towards an optimal solution. In our case, we leverage software preconditions and postconditions, along with common software engineering metrics, to guide the search process. Preconditions and postconditions define the conditions necessary for a module to be invoked within a program. For example, preconditions can define the data type and order of method parameters, and postconditions can define a method return value or the state of the program following invocation of the method. Together, preconditions and postconditions define what exactly is required for a method to be used by a program, enabling us to use it as part of the fitness function. In our case, candidate solutions must satisfy all preconditions and postconditions.

For example, we may wish to define preconditions and postconditions for a new security module, particularly for encrypting data to be sent to a server. A sample encryption module we wish to compose is defined as follows, in which the encryption function accepts two strings and a string pointer, and returns a boolean value:

<bool> openssl_private_encrypt(String data,
                               String &encrypted,
                               String private_key)

We now define the preconditions and postconditions necessary to compose this module. Particularly, the module preconditions represent the format the data is required to be in prior to module invocation. Essentially, we are interested in the parameters accepted by the module:

  • STRING: data_in
  • POINTER: data_out
  • STRING: key

Next, the postconditions for the module must be defined to represent the state of the data following invocation of the module. Here, we are interested in what the module returns back to the legacy software system:

  • BOOLEAN: return_value

Using these preconditions and postconditions, coupled with further definition of data available and state of the legacy software system prior to and following module invocation, we can successfully compose the security module.

Common software engineering techniques also help to guide the search process at a high level, enabling an engineer to fine-tune the set of optimal solutions. For instance, minimizing the amount of extra function calls, data dependencies, or amount of generated code may be required on systems that have limited memory available. Obfuscation of generated code may also be of interest for high-security systems. Satisfaction of preconditions and postconditions ensures that generated composition strategies are successful, and tailoring of solutions with software engineering techniques can tailor composition strategies for the needs of the engineer.

Thus far, our engineer has created a grammar detailing possible composition strategies, specified the preconditions and postconditions necessary to compose the new module, and executed SAGE to generate a set of optimal composition strategies. It seems that far more work has been incurred by setting up and running SAGE than would have been required to simply hand-code the composition. The benefit, however, is that SAGE can now be run for nearly any composition that the engineer may need to perform in the future. The defined grammar now forms a basis for any future compositions and can be easily extended to other domains that the engineer may encounter. Furthermore, SAGE can also automatically generate valid code in whichever language is required, enabling the engineer to simply copy and paste the generated composition strategy into the legacy framework.

For more information about Erik’s work, you can contact him at freder99 at msu dot edu.

Posted in BEACON Researchers at Work | Tagged , , , , | Comments Off

BEACON Announces Search for the John R. Koza Endowed Chair in Genetic Programming

A generous gift to BEACON and MSU by Dr. John R. Koza, the acknowledged founder of the field of genetic programming and author of four books on that subject, has endowed this chair, aimed at recruiting one of the leaders in the field as a new MSU faculty member and BEACON contributor. This endowed chair will help to build BEACON’s strength in evolutionary computation during its second five years as an NSF Science and Technology Center, and help to sustain BEACON after the STC funding is over. BEACON’s other endowed chair, the Koenig Chair, is held by Prof. Kalyanmoy Deb, an acknowledged leader in multiobjective evolutionary optimization, and we hope to fill the Koza Chair in GP with someone similarly distinguished in the field of GP. Together, these chairs will play an important role in strengthening BEACON’s long-term contributions to the field.

Full information about the position is available here.

BEACON’s leaders and members join in thanking Dr. John Koza for this marvelous sustaining gift!

Posted in Job Openings, Member Announcements | Tagged , , | Comments Off

BEACON Researchers at Work: Improving Search Quality in Genetic Programming

This week’s BEACON Researcher at Work blog post is by MSU graduate student Armand Burks.

About Me

PortraitAs a young “army brat,” I was always fascinated with learning how things worked. Trying to figure out how my toys worked, I often found myself amidst a pile of plastic pieces, wires, and motors, desperately trying to perform the magic trick of transforming those pieces back into one while my parents remained none the wiser. To this day, I am still not sure if I ever became more skilled at this trick or at making the toys disappear altogether. Either way, I became a better magician, and my interest in how things worked only grew.

It was this same fascination that led me to study Computer Science, as I could not only safely break things to understand them, but I could also create things from scratch. During my undergraduate studies at Alabama State University, I gained exposure to research through the Historically Black Colleges and Universities Undergraduate Program (HBCU-UP). This program ultimately led me to Michigan State University, where I first learned about Evolutionary Computation. From there, it was “love at first sight.” I was able to make things, break things, and let things evolve, all through the power of research!

My Current Research in Genetic Programming

My current research deals with combating premature convergence in Genetic Programming (GP). Before I delve into premature convergence, a brief summary of Genetic Programming is necessary. GP can be thought of as a special type evolutionary algorithm, where one of the main goals is to automatically create computer programs through evolutionary processes. GP evolves a population of computer programs, where programs are typically executed on an optimization problem and compared based on how close they are to solving that problem.

GP genotypeThe major difference between GP and most evolutionary algorithms is way in which the genotype (genetic makeup) of a program is represented in GP. In the most basic form of GP, genotypes are represented by a tree structure because it lends itself well to the creation of programs. As a concrete example, the figure at right shows the genotype of a program that encodes the mathematical expression: x2 – 1. When this tree’s genetic code is executed, the function nodes (* and -) operate on the terminal nodes (the input variable x and the constant value 1) to produce an output value. The evolutionary process breeds programs from the evolving population by combining their genetic material in order to discover new programs.

Premature Convergence in Genetic Programming

Premature convergence is one of the biggest problems in Genetic Programming, and evolutionary algorithms in general. To sum it up, premature convergence can be thought of as evolutionary stagnation. Since GP is often applied to optimization problems, one way to gain insight into the difficulty of such a problem for GP is to consider the enormity of the problem’s search space (the large number of possible computer programs that could be generated in attempt to solve the problem). Premature convergence takes place when the population quickly becomes so similar that the algorithm can no longer effectively explore the search space. At this point, it is often very difficult or even impossible for the population to improve in fitness (breed better programs). There are several reasons why this occurs, but the loss of genetic diversity is one of the most notable.

This is a particularly interesting problem in GP because of the tree-based genotype. Previous research has confirmed that not only do traditional GP populations very quickly lose genetic diversity in general, but a large contiguous portion of most of the trees become completely identical in a top-down fashion. This has a very profound and stifling impact on the effectiveness of finding solutions to difficult problems.

Combating Premature Convergence and Improving Search Quality

Fortunately, a lot of research has gone into combating premature convergence, and a lot of great strides have been made towards improving search quality. One example from the evolutionary computation literature is the Age-Layered Population Structure (ALPS). The main idea of ALPS is to use a special concept of age to divide the population into several layers. This allows new genetic material to be effectively inserted into the population and flow upwards from the bottom, while older genetic material evolves and improves as it makes its way towards the topmost layers. This helps to keep the population from becoming too genetically similar while it explores the search space. ALPS and other techniques have been successfully demonstrated in a number of problem domains.

Although there have been a lot of improvements that address premature convergence with varying degrees of success, I believe that there is still a lot of room to do better, and this is my current research focus. It is known that techniques designed to combat premature convergence often incur the cost of taking longer to find the optimal solution to a problem, although these approaches generally increase robustness (the reliability of finding optimal solutions). This is because these approaches avoid prematurely converging by aggressively exploring new areas of the very large search space instead of focusing in on one particular area. With this in mind, one of my research questions is: can we design robust algorithms that are capable of effectively avoiding premature convergence while still finding the optimal solution faster?

For more information about Armand’s work, you can contact him at burksarm at msu dot edu.

Posted in BEACON Researchers at Work | Tagged , , , , | Comments Off

BEACON Researchers at Work: Spatial dynamics of evolution

This week’s BEACON Researchers at Work blog post is by MSU graduate student Emily Dolson.

EmilyAll biological organisms must occupy a single location in physical space. This idea is so obvious that most people don’t give it much thought, but it has important consequences. The spatial location of an organism controls what other organisms it can interact with. The sets of organisms can mate with each other, for instance, has a big impact on the way mutations can spread through a population. Similarly, the sets of organisms that compete with each other is a fundamental component of the selection pressures that they experience. If you consider other types of interactions, such as parasitism, mutualism, and predation, the implications of spatial structure only continue to grow.

As an undergrad, I did a project on the spatial distribution of biodiversity in a forest. We didn’t end up having enough data to draw many conclusions, but I learned a lot about statistical techniques for describing spatial structure. More importantly, I learned that these techniques fascinate me! But perhaps the most important thing that I learned from that project was that there were benefits to collaborations between computer scientists and ecologists.

I was doing this project at the end of my sophomore year, the time when students at my college had to declare a major. And it just so happened that at this time, for the first time in many years, I was unsure what field I wanted to go into. Since high school, I had been sure that I wanted to be a quantitative ecologist. But during the first two years of college, I had taken some computer science classes and fallen in love with them. Double-majoring happened to fit fairly neatly into my schedule, but I worried about how I was going to choose one field over the other when it came time to apply to grad school. Through the collaborations that it encouraged, the spatial biodiversity project allayed these fears. It gave me an opportunity to apply my skills from both fields and get advice from professors in both departments. Suddenly, I not only felt that I didn’t have to choose between fields, I felt that I could do more for science as a whole if I wasn’t forced to.

Fast-forward a few years and now I’m a second year PhD student in the Ofria lab, pursuing a dual degree in Computer Science & Engineering and Ecology, Evolutionary Biology, & Behavior. I’m working on a number of projects, but the ones that are perhaps nearest and dearest to me are surprisingly similar to my undergrad project; they, too, deal with spatial structure and diversity. The biggest difference is that now I’m studying these phenomena in the context of evolution. I like studying the process of evolution in as general a case as possible, because it means I can do work that simultaneously has implications for both evolutionary computation and biological evolution.

The other big difference between my work now and my work as an undergrad is that now I do most of my research in computational systems. The system I most commonly use is the Avida Digital Evolution Platform. Basically, Avida consists of a grid of virtual CPUs running self-replicating sequences of assembly code (Avidians). Any new organisms generated over the maximum population size overwrite pre-existing organisms, resulting in pressure to replicate quickly. Because the instruction that Avidians use to copy their genomes has the potential to introduce mutations, evolution occurs. Of course, Avidians do not actually exist in physical space. For the sake of answering questions about spatial dynamics, however, we choose to treat the grid as if it represents their spatial proximity to each other. This is a neat and intuitive way of achieving a gradient of interactions between organisms. Having such a gradient is useful both because it has theoretically interesting properties and because these properties are also present in natural systems as a result of them existing in physical space.

This image shows an environment containing 16 evenly spaced circular patches of 8 different resources. Cooler colors represent locations in the grid where a greater number of resources are rewarded. As you can see, this one environment contains an intricate variety of spatial niches.

This image shows an environment containing 16 evenly spaced circular patches of 8 different resources. Cooler colors represent locations in the grid where a greater number of resources are rewarded. As you can see, this one environment contains an intricate variety of spatial niches.

So far, the primary spatial project that I’ve been working on is one that I started for the BEACON spring course last semester, with Samuel Perez and Audra Chaput. Ecologists have long debated the reason that we see so many different species in ecosystems that don’t seem to have very many different niches (i.e. potential roles for a species within in an ecosystem). It turns out that there are a lot of reasons that this occurs, but one that is particularly commonly suggested and yet uncommonly tested is the idea that spatial heterogeneity allows an ecosystem to support more species. Spatial heterogeneity is the idea that most ecosystems have areas within them that are different from each other. For instance, while an ecosystem as a whole might have two resources present in relatively equal amounts, there might be pockets within that ecosystem in which one of these resources is much more common than the other. This has the potential to create many additional niches. Since even quantifying spatial heterogeneity in nature is challenging, exploring its implications in Avida makes a lot of sense. In Avida, we are able to set up patches of different resources and overlap them in different ways, creating a wide variety of niches (see image above). We can then explore the way that diversity changes over time across this spatial environment, in comparison with more or less heterogeneous environments. Because spatial strategies could be used to maintain a diversity of solutions in evolutionary computation, this work has practical implications for ecology, evolutionary biology, and evolutionary computation.

A number of other BEACONites are also doing interesting work on spatial evolutionary dynamics, and I am fascinated to see where it leads.

For more information about Emily’s work, you can contact her at dolsonem at msu dot edu.

Posted in BEACON Researchers at Work | Tagged , , , , , , | Comments Off

BEACON Researchers at Work: Evolution of plasmid host range

This week’s BEACON Researchers at Work blog post is by University of Idaho postdoc Wesley Loftie-Eaton.

WesleyI stumbled into the world of plasmids at my alma mater, the University of Stellenbosch in South Africa. My advisor, Prof. D. E. Rawlings, asked me to determine the biological implications of a very specific mutation within the origin of replication of two isogenic plasmids (Loftie-Eaton and Rawlings, 2009; Loftie-Eaton and Rawlings 2010). What started as a very simple and specific question quickly enthralled me with its true complexity. During that time I learned that you can’t understand plasmids, or anything in biology for that matter, without the context of evolution; and what better place to study plasmid evolution than at the University of Idaho alongside Dr. Eva M. Top and under the umbrellas of IBEST and BEACON.

Plasmid

Plasmids are genetic elements that replicate separately from the bacterial chromosome. Many can transfer horizontally between diverse bacteria, often resulting in the spread of antibiotic resistance (red insert on plasmid) to pathogens (red bacterium).

In case you’re not a plasmid geek like me, plasmids are generally circular (linear forms also exist) DNA entities that replicate autonomously in bacteria. The smallest plasmids are but a few hundred base pairs while the largest are over a million base pairs long (Kado et al., 1998). Based on sequence information, about half of plasmids are either self-transferable by means of conjugation or can be transferred (mobilized) by conjugative plasmids, while the remainder appear to be non-transferable (Smillie et al., 2010). Many plasmids are cryptic or do not encode any non-essential genes while others may carry genes that provide benefit to the bacterial host, allowing it to occupy and proliferate in an otherwise hostile niche (Kado et al., 1998). Depending on whether a plasmid provides its host with benefit or cost, they can be likened to symbionts or parasites, respectively, and like symbionts and parasites, they have a host range that is either narrow or broad.

Projects I am currently working on are focused on a) elucidating the molecular mechanisms that evolve to permit plasmids to shift, contract or expand their host range and b) to understand how broad host range plasmids proliferate and spread in complex microbial communities. The latter will not be discussed here. Besides the fundamental nature of these questions, they are relevant to human health due to the rampant plasmid-mediated spread of antibiotic resistance amongst pathogens.

When a plasmid conjugates to a naïve host it may not necessarily persist unless there is selection for its maintenance in that host. Failure to persist could be due to inefficient replication, poor partitioning or segregational loss of the plasmid during cell division, or plasmid-containing cells can simply be outcompeted by plasmid-free cells if the plasmid imposes a cost on the host. We know, however, that plasmid and host can rapidly adapt to each other while under strong selection for plasmid maintenance, after which the plasmid can continue to persist for prolonged periods even when the strong selection is removed (Bouma and Lenski, 1988; Dahlberg and Chao, 2003; Sota et al., 2010). Panels A and B in the figure below demonstrate exactly this; an initially unstable plasmid-host relationship evolved a more stable phenotype in less than 200 generations and after ~600 generations the plasmid was highly stable in that host1. Not clear from this figure (due to me not showing the full assay data) was that in the absence of antibiotic selection the ancestral plasmid tended towards extinction, however, a persistent relationship in which the plasmid was maintained in ~10% of the population due to horizontal transfer evolved within the first 100 generations (panel C, below)! Together with collaborators Sam Hunter and Haruo Suzuki I am currently working on elucidating the genetic changes behind this increase in persistence and collaborator Jose Ponciano is working on means to quantify how easy or difficult it is to switch between trajectories of persistence and extinction.

Figure 2

While under antibiotic selection for plasmid maintenance, plasmid stability evolved rapidly and resulted in a persistent state wherein the plasmid was maintained even when the antibiotic selection was removed. (A) Plasmid stability measured over 10 days in the absence of antibiotic for evolving cultures sampled every 100 generations [G] over the course of the evolution experiment. (B) Summary of the endpoint [day 10] data for each stability assay for three replicate lineages. (C) A prediction of whether the plasmid will persist or go to extinction in the absence of antibiotic selection for its maintenance.

In another experiment (different plasmid and host)2 we found that a transposon encoded on a plasmid native to that host ‘jumped’ to the introduced plasmid, which was being maintained under antibiotic selection. The result was increased stability of the introduced plasmid, even in the absence of antibiotic selection, and loss of the native plasmid. Encoded on the transposon are a toxin-antitoxin (TA) system and a resolvase, both of which we have now shown to promote plasmid stability (Loftie-Eaton et al., in prep.). Even more significant, however, was that the evolved plasmid that acquired the transposon was completely stable in other beta- and gamma-proteobacteria in which its ancestor was unstable. Thus, acquisition of a transposon encoding stability functions resulted in an apparent expansion of the plasmid’s host range and more broadly, we demonstrated the interplay and fate of genes that could arguably be labeled as “selfish”.

In yet another system one of the outcomes was a deletion mutation within the plasmid’s origin of replication. Though much work remains to be done, this mutation has me extremely excited. The type and location of the mutation is the same as the mutation I set out to understand during my PhD. The exciting part is that the plasmids I studied then belong to a different family and evolved in the environment, whereas here the mutations occurred during experimental evolution in the lab. Preliminary results showed that this deletion abolished the plasmid’s ability to replicate in a previously permissive host, which already is novel information, however, if my hypothesis based on my previous research withhold scrutiny, then these results will demonstrate that what we observe during experimental evolution in the lab also occurs in nature, and vice versa. Validation!

Though far from complete, what we’ve learned thus far is that plasmid host-range can evolve quite rapidly, that such rapid changes tend to occur through mutations that result in gain or loss of function and that there are multiple molecular solutions that can lead to stable plasmid-host relationships. From the perspective of a plasmid geek this is fascinating, but from a medical perspective it’s concerning; once a multidrug resistance plasmid has established in a population it intends to stay, even if the antibiotics that initially selected for its maintenance are removed from the system! However, by accumulating more data of this kind we hope to define general patterns in the evolution of plasmid host range that can aid in the development of novel drugs to inhibit the spread and establishment of multidrug resistance plasmids.

Acknowledgements: 

1The experimental work was done by undergraduate Kelsie Bashford.

2Much of the experimental work was done by undergraduates Ryan Simmons and Stephen Burley as well as our lab manager Linda Rogers.

References

Loftie-Eaton, W., and D. E. Rawlings. 2009. Comparative biology of two natural variants of the IncQ-2 family plasmids, pRAS3.1 and pRAS3.2. J Bacteriol 191:6436-6446.

Loftie-Eaton, W., and D. E. Rawlings. 2010. Evolutionary competitiveness of two natural variants of the IncQ-like plasmids, pRAS3.1 and pRAS3.2. J Bacteriol 192:6182-6190.

Kado, C. I. 1998. Origin and evolution of plasmids. Antonie Van Leeuwenhoek 73:117-126.

Smillie, C., M. P. Garcillan-Barcia, M. V. Francia, E. P. Rocha, and F. de la Cruz. 2010. Mobility of plasmids. Microbiol Mol Biol Rev 74:434-452.

Stewart, F. M., and B. R. Levin. 1977. The population biology of bacterial plasmids: a priori conditions for the existence of conjugationally transmitted factors. Genetics 87:209-228.

Ponciano, J. M., L. De Gelder, E. M. Top, and P. Joyce. 2007. The population biology of bacterial plasmids: a hidden Markov model approach. Genetics 176:957-968.

Bergstrom, C. T., M. Lipsitch, and B. R. Levin. 2000. Natural selection, infectious transfer and the existence conditions for bacterial plasmids. Genetics 155:1505-1519.

Bouma, J. E., and R. E. Lenski. 1988. Evolution of a bacteria/plasmid association. Nature 335:351-352.

Sota, M., H. Yano, H. M, Julie, G. W. Daughdrill, Z. Abdo, L. J. Forney, and E. M. Top. 2010. Shifts in the host range of a promiscuous plasmid through parallel evolution of its replication initiation protein. ISME J 4:1568-1580.

Dahlberg, C., and L. Chao. 2003. Amelioration of the cost of conjugative plasmid carriage in Eschericha coli K12. Genetics 165:1641-1649.

Loftie-Eaton, W., Burleigh, S., Simmons, R., Rogers, L., Hunter, S., Settles, M., Ponciano, J.M. and Eva Top. Transposition of a toxin-antitoxin system and plasmid-host coevolution increase plasmid persistence and host range. In preparation.

For more information about Wesley’s work, you can contact him at wesleyl at uidaho dot edu.

Posted in BEACON Researchers at Work | Tagged , , , , , , | Comments Off