Browsing by Author "Mathieson, Sara"
Now showing 1 - 13 of 13
- Ancestral DNA Reconstruction Using Pedigrees (2021) Wiggers, Alton; Mathieson, Sara. In this thesis we look at the purpose, process, and application of DNA reconstruction using a known family tree, called a pedigree. First, we consider the applications of a pedigree with DNA sequences for every individual, since the ultimate goal of DNA reconstruction is to recreate the genomes of all members of the pedigree. We do this to better understand trait inheritance, particularly in humans. With a family tree complete with DNA, we can study topics such as heritable diseases. Then, to better understand how DNA reconstruction is possible, we look at how relatedness between people in a pedigree helps us understand the connections within their DNA. From that understanding, we see how DNA reconstruction can be performed on ancestors in a pedigree, given the genomes of their descendants. Focusing on an algorithm called "Thread," we see how DNA reconstruction is being applied today. For this work, we evaluated the performance of Thread against Merlin, an implementation of an older algorithm for DNA reconstruction. The experiments reconfirm Thread's much faster execution time but expose weaknesses in its performance under certain circumstances.
- Comparison of Identity by Descent Detection Algorithms and their Implementation with Pedigrees (2022) Johnson, Claire; Mathieson, Sara. In recent history, access to biological and genomic data (provided by DNA sequencing) has increased exponentially as more efficient technologies have been developed. Newer, more accurate algorithms for analyzing this data are in high demand, in particular for finding relatedness among individuals in a population in order to detect deleterious mutations that predispose one to certain diseases. Genome-wide association studies (GWAS) search for these disease-causing mutations, and the motivation for this thesis is to improve the identification and tracing of these mutations. Identity by descent (IBD), meaning segments of DNA shared between individuals who have a common ancestor, is a principle that can be used to detect genetic variants and estimate mutation rates (Zhou et al., 2020). Several algorithmic methods can be used to pinpoint regions of IBD in genomic data. GERMLINE (used for thread, an algorithm that can reconstruct the genomes of ancestors by tracing IBD segments back through a pedigree), hap-IBD, and the templated positional Burrows-Wheeler transform (TPBWT) are the main focal points of this thesis. These methods use different tactics to find IBD segments: hap-IBD and GERMLINE "seed-and-extend" IBS regions, which are small shared segments of nucleotides (Zhou et al., 2020), while TPBWT uses an improved PBWT and a heuristic algorithm to stitch together IBD segments. All methods struggle to control the error introduced by handling genomic data. False negative and false positive errors are nearly inescapable despite the best efforts of contemporary algorithms. Therefore, one might consider using a pedigree, a tree-like diagram depicting the relationships between several generations of family members, to reduce the likelihood of these errors.
These pedigrees, which show true relatedness, can be applied to algorithmically determined IBD segments to check whether the amount of IBD sharing between pairs of individuals is realistic given their genealogical relationship. A comparison between these IBD identification algorithms will provide insight into which methods will give consistent results in the long term as GWAS data continues to grow. The best of these algorithms could then be combined with pedigrees to further bolster the accuracy and efficiency of IBD detection. We endeavor to determine which method can reliably compute IBD segments shared between individuals and how this method might be enhanced with pedigrees.
- Comparison of Traditional and Machine Learning Methods in Population Genetics (2020) Griesman, Kendra; Mathieson, Sara. Population genetics focuses on understanding the evolutionary history of specific populations to gain insight into the evolutionary events leading to the variation observed in nature. Increasing our understanding of evolution and genetic differences within species and populations can deepen our knowledge about human health and diseases. Approximate Bayesian Computation (ABC) is one of the traditional statistical methods applied to population genetics problems. The goal of ABC is to estimate parameters of interest using likelihood-free inference on summary statistics from simulated datasets (Beaumont et al. 2002). However, ABC is limited by the need to use a small number of summary statistics due to model complexity and the curse of dimensionality. More recently, machine learning techniques have been applied to population genetics problems. Supervised machine learning uses ground-truth labeled data to learn a model and make predictions about unseen test data. Support vector machines (SVMs) are a machine learning method that has been applied to population genetics problems (Ronen et al. 2013). Deep learning in particular, which uses a network of layers between the input and output, has been shown to be a promising method for future population genetics research. Convolutional neural networks (CNNs) were designed to take images as input and have been adapted to take full genetic datasets as input in the form of DNA alignment matrices. CNNs are well suited to handle inputs with a large number of features and can learn the latent structure of the input. CNNs can be used to make predictions even if no analytical model exists. However, CNNs are often considered "black boxes" and until recently have not been very interpretable (Flagel et al. 2018; Chan et al. 2018).
With the development of machine learning methods applied to population genetics problems, a comparison between these new methods and traditional methods is necessary. Future research should focus on comparing the accuracy, difficulty, and applicability of each method to particular problems. Estimating the mutation rate in humans is an interesting problem in population genetics because of its potential relevance to human lives and the availability of human genetic data. We conduct a comparison between ABC and CNN methods on mutation rates in humans to gain insight into the practicability of machine learning in the field of population genetics.
- Deep Networks for Population Genetics Data Generation (2020) Wang, Zhanpeng; Mathieson, Sara. As population genetic databases such as the 1000 Genomes Dataset (Consortium et al. 2015) grow in size every day, it becomes more and more challenging to make sense of the large flow of genetic information. Recent work in population genetics involving machine learning relies heavily on data simulated from collected real population genetic data, because machine learning is very promising when dealing with large-scale data. The main motivation is that machine learning can help researchers identify the patterns associated with evolutionary processes (e.g. natural selection, mutation, migration, etc.) and thus broaden our knowledge about how to maintain biodiversity within and between species. However, researchers worry that the simulated data do not actually match the real data, so that the predictions of the trained machine learning model do not reflect and explain evolutionary events in the real world. To circumvent this problem, a new population genetic data simulation framework based on generative adversarial neural nets is introduced. Generative Adversarial Nets (GANs) are a framework for training generative models through an adversarial process, in which the generative model and the discriminator model are trained simultaneously. During each training session, the generative model improves its ability to simulate data resembling the real data and tries to fool the discriminator model into believing the simulated data are real, whereas the discriminator model improves its ability to distinguish between the real data and the simulated data.
In this work, we examine previous approaches to simulating real data, the use of GANs for generating data outside the field of population genetics, and how GANs can be applied to real population genetic data to generate data that is as close as possible to the real data.
- Generate better data for population genetics using generative adversarial nets (2020) Wang, Jiaping; Mathieson, Sara. In recent years, research in population genetics, a subfield of biology studying the variation of genes in a population with respect to space and time, has relied heavily on simulated data. That is to say, analysis of the relationship between gene patterns and evolutionary parameters (i.e., mutation rate, recombination rate, etc.) is based on simulated data, usually generated using software such as msprime. However, the problem with simulated data is that we're not sure how well it matches the real data, as these data are simulated with parameters configured by human beings. Furthermore, this methodology might have a greater effect on machine learning approaches, as our algorithms are trained on these simulated datasets. Under such circumstances, it is necessary to find a way to generate good data for research purposes. In this paper, we present some basics about machine learning and some possible approaches to achieving this goal using generative adversarial nets (GANs).
- Generative Adversarial Networks for Population Genetics (2022) Ali, Mohamed; Mathieson, Sara. Population genetics can be defined as the study of distributions and changes in the genetic data of populations through time. This field relies heavily on simulated data for validation. Simulating a population involves knowledge of demographic parameters such as mutation rate, recombination rate, population size, and the times when changes in population size have occurred. The paper (Wang et al., 2021) proposes an algorithm to estimate these demographic parameters and adapt to data from different populations: a Generative Adversarial Network that learns the probability distribution of a population so that it can estimate which parameters have led to such a configuration. There are two problems with the approach presented in the paper, as is the case with the classical GAN approach from the original paper (Goodfellow et al., 2014). The first is that the learning process of a classical GAN is not stable and relies heavily on hyperparameters; the second is that the loss function is not useful for understanding the development of the model. Another problem that arises when trying to classify special parameters such as natural selection is mode collapse, a situation where a generator can only generate a single aspect or type of output, without fully exploring the probability space. The last problem we address in this paper is evaluation. It is very difficult to evaluate the accuracy of a model if you can't "view" the final product. This is mostly the case in population genetics, where parameters and simulations might not be enough to validate a model.
- Genetic Reconstruction in Pedigrees to Study Disease Inheritance (2021) Spano, Lizzie; Mathieson, Sara. In this paper we look at ways to reconstruct DNA in a pedigree. We start by discussing what a pedigree is, how it can be used to learn more about a population, and some challenges of reconstructing its genetic information. We then review some of the relevant literature on the topic. First, we motivate pedigree reconstruction with a full bird pedigree being used to study migration. We then look at GERMLINE, an algorithm that finds IBD segments, which are a useful tool for reconstruction. Next we examine thread, an algorithm that uses IBD segments to reconstruct a pedigree. Then we look at HMM, an algorithm that reconstructs a pedigree one generation at a time. Lastly we consider RABBIT, an algorithm that expands on HMM. After the literature review, we propose our work that expands on this research by studying thread's accuracy. We would do so by running thread on the full bird pedigree and comparing it to the real data, and by running thread and HMM on the same dataset to compare the two algorithms. We conclude by considering the future work in this field, including looking at reconstructed DNA for patterns about disease inheritance.
- Improvement on the interpretability of convolutional neural network for population genetics (2021) Xu, Yongxin; Mathieson, Sara. Population genetics is the study of genetic variation in populations and the evolutionary forces that explain this variation. Relevant studies are usually based on simulated genomic data in matrix form. Many existing methods, such as statistical likelihood inference and SVMs, can only deal with summary statistics of the simulated matrices, and so suffer a loss of information and reduced accuracy. CNNs, with their ability to process raw genomic data and their stable performance, have outperformed the existing methods on many population genetic problems such as detecting natural selection. However, since the inner architecture of CNNs is complex, it is usually difficult for researchers to understand what their models are learning and why the models make certain decisions on the given inputs. To enhance the interpretability of CNN models, we look into two techniques that have been successfully applied in other fields, where the application of CNNs is more mature than in population genetics. One technique is an intrinsically interpretable CNN design from speech recognition called SincNet, which uses band-pass filters to limit the number of learnable parameters and thereby improve interpretability. The other is a post-hoc interpretation technique known as saliency maps, which visualize the importance of each input unit to the final decision and have been widely applied in computer vision and natural language processing. In the end, we propose two approaches for fitting these two techniques into studies of natural selection.
- Interpreting Machine Learning Models used in Population Genetics (2020) Thiel, Pablo; Mathieson, Sara. Many problems in population genetics are well suited to supervised machine learning (ML) methods, which can leverage characteristics like high input dimensionality to result in considerable performance gains over traditional statistical models. However, ML has yet to be fully embraced in this field, due in large part to the lack of intelligible explanations for many models' predictions, particularly when compared with highly interpretable statistical approaches. Improving interpretability for ML models used in population genetics work not only would help strengthen community trust by making it easier to examine why a faulty model makes incorrect conclusions, but it could also bring new insights to the field by providing clearer explanations of how successful models arrive at correct conclusions. With this in mind, this thesis explores the effectiveness of ML methods when used in conjunction with population genomic data, investigates techniques for making ML models more interpretable using either model-agnostic or model-specific approaches, and discusses the obstacles present in applying these techniques within the context of population genetics. Additionally, this thesis proposes a new framework using decision trees to interpret ML models used in population genetics work by means of leveraging preexisting, widely-used summary statistics. It further reports on experimental results investigating the effectiveness of this method for creating explanations which are both intuitively understandable and which remain loyal to the underlying ML model when applied on new data.
- Learning Natural Selection with Convolutional Neural Networks (2021) Zhang, Mingxuan (Kira); Mathieson, Sara. Convolutional Neural Networks (CNNs) are one of the most efficient approaches for analyzing population genetics data and drawing conclusions about a species' evolution. Population genetics is one approach people take to learn about biological evolution at the genetic level. Genetic differences between populations encode a great deal of information, and decoding all of it can be computationally costly. Fortunately, machine learning is known for its ability to process large data sets, especially high-dimensional ones, efficiently and to draw insightful conclusions. The key to applying CNNs to genetic data is to treat the data (alignments of DNA sequences from different individuals in the same population) as images and identify patterns in the images. Modifications have been made to the original CNN architecture to exploit unique properties of genetic data such as exchangeability. The second half of this thesis is devoted to modifying the current CNN design to study the tomato. The process of plant domestication is often challenging and takes a long time. The process still involves many unclear and under-explored intermediate stages with potentially important information about how domestication traits evolved. Our research is based on previous CNN designs that have already been shown to work on population genetics data. Our model is adapted from the OnePop model by Sara Mathieson. Starting from an initial idea of the design, we experimented with different data-processing approaches and trained the model using various combinations of parameters.
- Limitations of Genomic Analysis on Novel Species (2021) Contreras-Orendain, Luis; Mathieson, Sara. For widely studied species such as humans, fruit flies, and mice, there are many sequenced genomes, but for novel species, only a few processed genomes, or a single one, are available. Being able to study novel species is important for understanding their environmental impact (in the case of invasive species) or their genetic relation to other species, but they are the most difficult to study. Genetic sequences are created using assemblers. Assembling the genome of a diploid species is a computationally complex task, which is why diploid assemblers create a phased collapsed genome that contains similar genetic information. Applying Pairwise Sequentially Markovian Coalescent (PSMC) modeling for population size inference and phylogenetic tree generation for building a species family tree becomes more difficult with novel species, and it is not clear how to proceed. Other tools, such as read mapping and NCBI's BLAST, provide the initial steps for these two analyses but are not well integrated with them. As a proof of concept, we applied read mapping with PSMC analysis and BLAST with phylogenetic tree construction to a novel species, the Spotted Lanternfly, to investigate the feasibility of these tools on a phased genome. At least for this genome, our analysis shows significant limitations due to computational run time and the processed output of our pipeline. More work is needed to better integrate the various tools used to analyze novel species.
- On Big Data in Population Genetics: Comparison of Methods (2020) Gerhard, Russell; Mathieson, Sara. I review commonly used methods for approximating model parameters within the setting of population genetics and compare approximate Bayesian computation to a convolutional neural network. Results from population genetics are meaningful in many fields including anthropology and medicine, as its study often concerns the evolution of the human genome with respect to physical migration and genetic disease. Thus, the appropriate use of methods to implement population genetic studies is extremely important. Within the umbrella of traditionally used methods, there exist trade-offs between two types of models: those that use summary statistics and those that make full use of the data. I will use approximate Bayesian computation (ABC) as a concrete example of the former and Markov chain Monte Carlo (MCMC) simulation as an example of the latter. Here, the main trade-off is between computability and accuracy. Because of the simplification inherent in the use of summary statistics, ABC is more feasible computationally; however, MCMC, when tractable, typically results in higher accuracy. Interestingly, both of these methods suffer from the curse of dimensionality in their own way: ABC is intractable if too many summary statistics are used, and MCMC becomes increasingly intractable as the number of parameters to estimate increases. This commonality among traditional methods presents an opportunity for new machine learning methods to be useful. For instance, newer techniques like support vector machines (SVMs) and convolutional neural networks (CNNs) often work better with higher-dimensional data. While their use in fields like image processing, computer vision, and natural language processing has become widespread, they have yet to permeate population genetics.
Hopefully, comparing machine learning results with those of traditional methods on a canonical problem will open the door to increased interest in and use of machine learning in population genetics.
- Understanding Convolutional Neural Networks Applied to COVID-19 (2021) Wang, Yuchen; Mathieson, Sara. With the great advances in DNA sequencing at the end of the 20th century, researchers have been provided with larger and far more complicated genetic datasets to study. They have tried to infer facts about human evolution, such as natural selection, migration, and mutation, from these complex data. However, traditional models that use only a few aspects of the data as summaries may ignore important information in today's large datasets. These methods also usually involve likelihood calculations, which require individual analysis for each problem and focus on a single aspect of the data. To make better use of these population-scale genetic datasets, we introduce a new way of analyzing genetic data using supervised machine learning, more specifically, Convolutional Neural Networks (CNNs). In this literature review, we look at two papers in depth, each with a different CNN architecture. Both studies showed great potential for applying CNNs in genetics, with accuracies comparable to traditional methods and great scalability to all kinds of related problems. However, as this is an emerging field of study, there are still open problems and gaps to fill. First, even though the results showed that CNNs are highly capable on genetic datasets, we still don't have a thorough understanding of what and how CNNs learn from these datasets. Also, current research is generally focused on applying CNNs to simulated human evolutionary data. I propose that we could also apply CNNs to other organisms, such as viruses with shorter generation times, so that there would be realistic data for us to test and verify on. A good choice would be COVID-19, considering its wide spread across the world and the urgent need for development of COVID-19 vaccines.
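
Several of the abstracts above (Griesman; Gerhard) describe approximate Bayesian computation (ABC) as estimating a parameter by simulating datasets and comparing summary statistics. As an illustration only, and not the method used in any of the listed theses, the basic rejection form of ABC might be sketched as follows. The simulator here is a toy standard-coalescent model for the number of segregating sites, and all function names, the uniform prior, and the tolerance are assumptions made for this example.

```python
import math
import random


def simulate_segregating_sites(theta, n_samples, rng):
    """Toy simulator (illustrative assumption): number of segregating sites
    for n_samples sequences under a constant-size coalescent."""
    # Total branch length of the coalescent tree: while k lineages remain,
    # the waiting time is exponential with rate k(k-1)/2 and adds k * time.
    total_length = 0.0
    for k in range(2, n_samples + 1):
        total_length += k * rng.expovariate(k * (k - 1) / 2)
    # Mutations fall on the tree as a Poisson process with rate theta / 2.
    mean = theta / 2 * total_length
    # Knuth's multiplication method for drawing a Poisson variate.
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def abc_rejection(observed, n_samples, n_draws, tolerance, prior_max, seed=0):
    """Rejection ABC: keep prior draws whose simulated summary statistic
    lands within `tolerance` of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, prior_max)  # draw from a uniform prior
        simulated = simulate_segregating_sites(theta, n_samples, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    posterior = abc_rejection(observed=20, n_samples=10, n_draws=2000,
                              tolerance=2, prior_max=20.0, seed=1)
    print(len(posterior), "accepted draws")
    print("posterior mean estimate:", sum(posterior) / len(posterior))
```

The accepted draws approximate the posterior of the scaled mutation rate. With a single summary statistic the acceptance rate stays workable, but adding more statistics shrinks it quickly, which is the dimensionality limitation the abstracts mention.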