Comparison of Traditional and Machine Learning Methods in Population Genetics

Date
2020
Department
Haverford College. Department of Computer Science
Type
Thesis
Language
eng
Access Restrictions
Haverford users only until 2030-01-01, afterwards Open Access
Abstract
Population genetics focuses on understanding the evolutionary history of specific populations to gain insight into the evolutionary events leading to the variation observed in nature. Increasing our understanding of evolution and of genetic differences within species and populations can deepen our knowledge about human health and disease. Approximate Bayesian Computation (ABC) is one of the traditional statistical methods applied to population genetics problems. The goal of ABC is to estimate parameters of interest using likelihood-free inference on summary statistics from simulated datasets (Beaumont et al. 2002). However, ABC is limited by the need to use a small number of summary statistics due to model complexity and the curse of dimensionality. More recently, machine learning techniques have been applied to population genetics problems. Supervised machine learning uses ground-truth labeled data to learn a model and make predictions about unseen test data. Support vector machines (SVMs) are a machine learning method that has been applied to population genetics problems (Ronen et al. 2013). Deep learning in particular, which uses a network of layers between the input and output, has been shown to be a promising method for future population genetics research. Convolutional neural networks (CNNs) were designed to take images as input and have been adapted to take full genetic datasets as input in the form of DNA alignment matrices. CNNs are well suited to handle inputs with a large number of features and can learn the latent structure of the input. CNNs can be used to make predictions even if no analytical model exists. However, CNNs are often considered "black boxes" and until recently have not been very interpretable (Flagel et al. 2018; Chan et al. 2018). With the development of machine learning methods applied to population genetics problems, a comparison between these new methods and traditional methods is necessary.
Future research should focus on comparing the accuracy, difficulty, and ability of each method to be applied to certain problems. Estimating the mutation rate in humans is an interesting problem in population genetics because of its potential applicability to human lives and the availability of human genetic data. We conduct a comparison between ABC and CNN methods for estimating mutation rates in humans to gain insight into the practicability of machine learning in the field of population genetics.