Determining functionality in a genome

This post is written by MSU grad student Bethany Moore

Bethany Moore, at work programming

Imagine you are traveling in space, searching for a hospitable planet. Not only does the planet have to have the elements present on Earth, but it has to be the right distance from a star to avoid extreme temperatures, and it has to have the correct proportions of water, oxygen, and carbon. There are millions of light years of empty space between stars and planets, and a planet can be hard to see because of its close proximity to a bright star. How can you know where to search for such a planet? A similar conundrum arises in finding out the function of the DNA in a genome. First, you have to find where a gene (or planet) is in the midst of a galaxy of DNA that includes many non-functional regions (empty space) and thousands of genes (potentially hospitable planets). As you might imagine, matching up genes to their function is tricky, to say the least.

Gene expression, or the measurement of the RNA a gene produces or “expresses”, is one way to determine function. Genes that are expressed at high levels might be doing something important, while genes expressed at the same time or under similar conditions (called co-expression) might be involved in the same kinds of processes. A more direct approach is gene knockout, an experimental procedure in which a gene is mutated to make it non-functional and the phenotype of the mutant is recorded. While this shows a more definitive relationship between the gene and its function, the process can take weeks or months for each gene in question.

The Shiu lab focuses on predicting gene function using computational approaches such as machine learning. Given a set of example inputs and desired outputs, a computer program “learns” the general rule by which you can get from the input to the output. When the program is “trained” in this way on inputs whose outputs are already known, the approach is called supervised learning. How can this approach be applied to finding gene function? If we train our program with inputs from genes whose function we know, the output tells us whether an unknown gene looks like our known genes. This approach can be efficient and accurate in predicting gene function and in narrowing down a set of candidate genes that can then be experimentally validated using more time-consuming techniques, such as making a gene knockout.
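To make the idea of supervised learning concrete, here is a minimal sketch in Python. It is not the lab's actual method: it uses a simple nearest-centroid classifier, and the two features (an expression level and a conservation score), their values, and the label names are all made up for illustration.

```python
from statistics import mean

def train_centroids(examples):
    """Learn one 'average' feature vector (centroid) per label.

    examples: list of (feature_vector, label) pairs with known labels.
    """
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    # Average each feature column separately within each label.
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_label.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))

# Hypothetical training data: [expression level, conservation score],
# both on an invented 0-1 scale.
training = [
    ([0.9, 0.8], "like_known_gene"),
    ([0.8, 0.9], "like_known_gene"),
    ([0.1, 0.2], "unlike_known_gene"),
    ([0.2, 0.1], "unlike_known_gene"),
]

centroids = train_centroids(training)

# An "unknown" gene is assigned the label of the closest centroid.
prediction = predict(centroids, [0.7, 0.75])
print(prediction)
```

Real projects typically use many more features and more flexible models, but the workflow is the same: train on labeled examples, then ask the trained model about unlabeled ones.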

Predicting Lethal Gene Phenotypes

A previous graduate student in our lab predicted essential genes in plants (Lloyd et al., 2015). Only a small proportion (15%) of genes in the well-annotated genome of the model plant A. thaliana have experimental evidence connecting them to a function in the plant. The goal of our project was to predict which genes are essential, or in other words cause a lethal phenotype when disrupted, in A. thaliana. Characteristics of known essential genes and non-functional genes (pseudogenes) were used to create a model capable of predicting the likelihood that an uncharacterized gene is functional. Characteristics such as mechanisms of gene duplication, gene expression, evolution and conservation, and gene networks were compared between lethal phenotype genes and pseudogenes. Using a supervised machine-learning approach, we combined these characteristics to model what lethal phenotype genes and pseudogenes look like. Finally, we applied the model to genes with unknown function, predicting 1,970 undocumented genes to have a lethal phenotype. Not only did this model enable us to document the functionality of genes without a known phenotype, but it can also help future research prioritize candidate genes for further study.

Predicting Gene Regulation

Cis-regulatory elements (CREs) important in predicting the up-regulation of genes under salt stress in the shoots of A. thaliana (from Uygun et al., 2017). The first and second columns are sequence logos, which represent the sequence of a CRE and its corresponding reverse complement sequence. The third column contains the sequence logo and transcription factor family of the best matching transcription factor binding site.

Some regions of DNA in the genome that are not genes can play a role in how and when genes are expressed. This is known as gene regulation, and can be thought of as turning genes on or off. Many genes can be described as “cryptic”, in that they are only turned on under certain conditions, for example during viral infection, so both the gene's function and how it is regulated can be difficult to detect unless a given stress is present. This sort of cryptic expression has allowed plants to adapt to many diverse environments around the globe, from deserts to alpine regions to marshes. What regions of a plant genome can actually respond to drought, or cold, or flooding? Does this have implications for our crop plants? What if we could grow crops that, under a particular stress like drought, turn on genes that increase the crop’s resistance to that stress?

To answer these questions, we looked at DNA sequences in specific regions that are frequently involved in regulation. These regions are adjacent to genes and are commonly known as cis regions. We asked if there was a cis-regulatory “code” that can turn on a gene under a given stress. Using gene expression data from plants under salt stress, we were able to determine important cis-elements that tend to regulate genes under this condition, as well as elements that responded in specific parts of the plant, including the root and the shoot (Uygun et al., 2017). We then used machine learning to predict how well our putative cis-regulatory codes explained plant gene expression under salt stress, and found that our putative cis-elements explained approximately 79% of expression. Currently the Shiu lab is working on finding cis-elements that regulate responses to wounding, heat, and drought stress.
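As a rough illustration of what searching for a cis-element can look like, the sketch below checks how often a short DNA motif, or its reverse complement, appears in the regions next to stress-responsive genes compared with non-responsive genes. This is not the pipeline from Uygun et al.; the sequences, the motif, and the function names are all invented for the example.

```python
def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

def has_element(promoter, motif):
    """True if the motif occurs on either strand of the promoter."""
    return motif in promoter or reverse_complement(motif) in promoter

def fraction_with_element(promoters, motif):
    """Fraction of promoter sequences that contain the motif."""
    return sum(has_element(p, motif) for p in promoters) / len(promoters)

# Hypothetical cis regions for genes that are, or are not, turned on
# under stress. Real regions would be far longer.
responsive = ["TTACGTGGCA", "GGCCACGTAA", "ACGTGTTTTT", "TTTTTTTTTT"]
nonresponsive = ["TTTTGGGGAA", "CCCCAAAATT", "ACGTGAAAAA", "GGGGTTTTCC"]

motif = "ACGTG"  # arbitrary candidate element chosen for illustration

responsive_fraction = fraction_with_element(responsive, motif)
nonresponsive_fraction = fraction_with_element(nonresponsive, motif)

# A motif found much more often near responsive genes is a candidate
# cis-element worth testing further.
print(responsive_fraction, nonresponsive_fraction)
```

In practice, presence or absence of many such candidate elements can serve as the input features for a supervised model that predicts whether a gene responds to the stress.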


When I first joined the Shiu lab in 2014, I had no previous computational experience, only a desire to learn more about the genes and genomes of plants. By learning how to program and how to deal with big datasets, I developed my skill set in bioinformatics, and with it the future job market looks a little brighter. Additionally, I have gained insight into the biology and complexity of what is happening in a given genome, and tools for parsing that complexity into meaningful data.

As a lab, we have found that computational methods, particularly machine learning, can be combined with gene expression data to make powerful predictions of gene function. If we can predict the function of a gene, or a genomic region, this can provide a starting point for experimental validation, reducing the amount of guesswork and time involved in the validation process. As many genes and functional regions of the genome remain unknown or uncharacterized, predicting gene function and functional regions is an important way to navigate the seas of genomic data.


This entry was posted in BEACON Researchers at Work.
