By: Megan Chan, Undergraduate Student, University of Texas – Austin
When I started college at The University of Texas at Austin a couple of years ago, I enrolled as a biochemistry/pre-pharmacy major. I didn’t know anything about computational biology back then but have since had the opportunity to participate in computational biology research under the guidance of Dr. Rebecca Young and Dr. Hans Hofmann in the Department of Integrative Biology at UT Austin. Over the last couple of years, I have grown more and more interested in the realm of data analytics, and my experience in hands-on research has completely changed my goals for the future. Because of this, I finally transferred majors last year to computational biology.
At the University of Texas, we have a program called the Freshman Research Initiative (FRI) that helps new students get experience in research labs. Although I originally applied just to get something interesting on my resume, I ended up gaining much more. As part of FRI, I joined a research stream called Big Data in Biology, led by Dhivya Arasappan. The goal of this stream was to introduce freshmen to concepts in genetics and to show how statistics and computer science are being used to study biological systems. I chose this stream over others I was interested in (like streams working on genetically engineering bacteria or chemical analysis of wine tannins) because I had really enjoyed a year of programming when I was in high school. I had never considered myself very knowledgeable about computers and often felt overwhelmed when around guys who had been writing code since middle school, but I found the challenge of solving problems and discovering something new exciting. In my sophomore year I realized that I wanted to continue exploring this field and completely changed my career focus from pharmacy to computational biology.
As part of FRI, I had the opportunity to join Dr. Young and Dr. Hofmann in an independent project adding evidence to a long-standing debate over the validity of what is commonly known as the hourglass model of vertebrate development. The hourglass model hypothesizes that the vertebrate body plan imposes a constraint on diversification of mid-embryonic development across vertebrate species. Early evidence for this theory was based on qualitative analysis of anatomical developmental variation, but in recent years gene expression data has been used as evidence both for and against the hourglass model. The part of this overall project that I have been working on focuses on describing patterns of similarity in developmental gene expression through embryogenesis among several vertebrate species. This has involved processing and analyzing over 150 open-source gene expression datasets representing developmental stages for six species. By comparing the similarity of gene expression between each combination of species at each time point in development, I can ask whether mid-embryonic stages are the most similar in gene expression across species.
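To make the comparison concrete, here is a minimal sketch of scoring every pair of species at each developmental time point and averaging the scores. Everything in it is an assumption made for illustration: the expression values are invented toy numbers, `species_c` is a placeholder name, the three stage groups are hypothetical, and Pearson correlation stands in for whatever similarity measure the project actually uses.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy data (invented): expression[species][stage_group] holds values for
# the same ordered set of shared genes in every species.
expression = {
    "chicken":   {"early": [1.0, 5.0, 2.0, 8.0], "mid": [3.0, 4.0, 6.0, 7.0], "late": [9.0, 1.0, 4.0, 2.0]},
    "frog":      {"early": [2.0, 4.0, 1.0, 9.0], "mid": [3.0, 5.0, 6.0, 8.0], "late": [2.0, 7.0, 5.0, 1.0]},
    "species_c": {"early": [6.0, 2.0, 3.0, 7.0], "mid": [2.0, 4.0, 7.0, 6.0], "late": [4.0, 4.0, 8.0, 3.0]},
}

def mean_similarity_by_stage(expression):
    """Average the pairwise similarity over all species pairs, per stage group."""
    stages = list(next(iter(expression.values())))
    result = {}
    for stage in stages:
        scores = [
            pearson(expression[a][stage], expression[b][stage])
            for a, b in combinations(expression, 2)  # every species pair once
        ]
        result[stage] = sum(scores) / len(scores)
    return result

similarity = mean_similarity_by_stage(expression)
for stage, score in similarity.items():
    print(f"{stage}: mean pairwise correlation = {score:.2f}")
```

Under the hourglass model, the middle stage groups would be expected to show the highest average similarity. The toy numbers here say nothing about whether that holds in real data.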
A major challenge in achieving this goal has been the lack of consistency in staging for different species. There is no common quantitative way to equate a particular stage of development in one species with a stage in another. To add to this problem, most of the species we have data for are represented by only a select set of stages, and the number of stages sequenced differs from species to species. For example, there are 8 out of 46 stages represented for chicken embryos and 24 out of a possible 44 stages for a species of frog (not including free-swimming tadpoles). To overcome this essential problem, I’ve turned to machine learning and to comparing qualitative descriptions of stages in order to group developmental time points within each species into comparable sets.
Of the various methods I integrated into my approach, the first I employed was K-means clustering. K-means is an unsupervised machine learning algorithm that iteratively computes the distance between each data point and a set of k centroids to determine which points cluster together around a mean, with k being the number of clusters to find. I tried this method first because it is a fairly common way of classifying data without pre-determining classes. To find the appropriate k, I generated an elbow plot visualizing the amount of variation that would be accounted for by several possible numbers of clusters and chose a k that accounted for a reasonable amount of variation without dividing the data into clusters that were too small. A known feature of K-means, however, is that it randomizes the initial centroids, which can result in some variation in cluster membership when the clusters are not robust. To strengthen this approach, I used partitioned hierarchical clustering, another form of unsupervised machine learning. Like K-means, this algorithm aims to group the data points into a predetermined number of clusters with similar values, but it starts by treating the entire dataset as one cluster and then partitions it into smaller pieces until it has reached the appropriate number of clusters. Hierarchical clustering, unlike K-means, tends to be consistent, and our results showed that, at an appropriate number of clusters found with the elbow method described above, it also conserved the order of the developmental stages. Further analysis showed that these clusters could be defined by at least some biological significance. We are now confronted with the challenge of aligning these clusters across species.
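For readers who have not seen K-means before, the sketch below shows the assign-and-update loop and the numbers behind an elbow plot. It is an illustration under stated assumptions, not the code used in this project: the two-dimensional points are randomly generated toy data rather than gene expression profiles, the helper names (`sq_dist`, `kmeans`, `elbow_curve`) are made up for this example, and the number of restarts is arbitrary. In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """One K-means run; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    # Inertia: total within-cluster sum of squared distances.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia

def elbow_curve(points, k_values, n_restarts=30):
    """Best (lowest) inertia over several random starts, for each k."""
    return {
        k: min(kmeans(points, k, seed=s)[2] for s in range(n_restarts))
        for k in k_values
    }

# Toy data (invented): three well-separated groups of ten 2-D points each.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in centers
    for _ in range(10)
]

curve = elbow_curve(points, range(1, 6))
for k, inertia in curve.items():
    print(f"k={k}: within-cluster sum of squares = {inertia:.1f}")
```

Plotting these values against k gives the elbow plot: the curve drops steeply until k matches the number of real groups and flattens afterward. Because each run starts from randomly chosen centroids, different seeds can give different cluster memberships, which is why the sketch keeps the best of several restarts.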
Now, my work has turned from heavy computation to intense reading. I’ve made it this far without having to know too much about the details of what all these stages mean, but I’ve come to face the fact that I will need some biological knowledge of vertebrate development in order to compare these stages in any reasonable way. The beauty of being in an interdisciplinary field.
The knowledge that I’ve gained while working on this project is invaluable to me as I start to pursue my own projects and begin exploring my future options as graduation slowly approaches. I’ve enjoyed the work I’ve done in this lab so much that last year I started analyzing data for fun; in one instance looking for patterns in word choice in a dataset of Russian disinformation tweets, and in another instance predicting the length of time a dog will stay in the local shelter based on its age. This research experience has also opened many doors for me, allowing me the opportunity to pursue positions analyzing data for other labs on campus and jobs mentoring new students in research, and giving me the tools I needed to land a software internship in biotech this summer. In my last year, I hope to publish results for this project and leave an impact on future research.