On Big Data in Population Genetics: Comparison of Methods

Date
2020
Department
Haverford College. Department of Computer Science
Type
Thesis
Language
eng
Access Restrictions
Dark Archive until 2025-01-01, afterwards Open Access
Abstract
I review commonly used methods for approximating model parameters in population genetics and compare approximate Bayesian computation to a convolutional neural network. Results from population genetics are meaningful in many fields, including anthropology and medicine, because the field often studies the evolution of the human genome with respect to physical migration and genetic disease. Thus, the appropriate use of methods in population genetic studies is extremely important. Among traditionally used methods, there is a trade-off between two types of models: those that use summary statistics and those that make full use of the data. I use approximate Bayesian computation (ABC) as a concrete example of the former and Markov chain Monte Carlo (MCMC) simulation as an example of the latter. Here, the main trade-off is between computability and accuracy. Because of the simplification inherent in summary statistics, ABC is more feasible computationally; however, MCMC, when tractable, typically yields higher accuracy. Interestingly, both methods suffer from the curse of dimensionality, each in its own way: ABC becomes intractable if too many summary statistics are used, and MCMC becomes increasingly intractable as the number of parameters to be estimated grows. This shared weakness of traditional methods presents an opportunity for newer machine learning methods. For instance, techniques like support vector machines (SVMs) and convolutional neural networks (CNNs) often work better with higher-dimensional data. While their use in fields like image processing, computer vision, and natural language processing has become widespread, they have yet to permeate population genetics. Hopefully, comparing machine learning results with those of traditional methods on a canonical problem will open the door to increased interest in and use of machine learning in population genetics.
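To make the summary-statistic idea concrete, the following is a minimal sketch of ABC rejection sampling on a toy problem. It is not the method or model used in the thesis: the Normal(theta, 1) simulator, the uniform prior on [-5, 5], the sample mean as the single summary statistic, and the names `simulate`, `abc_rejection`, and `epsilon` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator (an assumption): n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def abc_rejection(observed, n_draws=5000, epsilon=0.1, seed=0):
    """Keep prior draws whose simulated summary lands near the observed one."""
    rng = random.Random(seed)
    obs_summary = statistics.fmean(observed)  # single summary statistic
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # draw from the (assumed) uniform prior
        sim_summary = statistics.fmean(simulate(theta, len(observed), rng))
        # Accept theta only if the summaries agree within the tolerance.
        if abs(sim_summary - obs_summary) <= epsilon:
            accepted.append(theta)
    return accepted


# Synthetic "observed" data generated with a known parameter, theta = 2.0.
observed = simulate(2.0, 50, random.Random(42))
posterior = abc_rejection(observed)
posterior_mean = statistics.fmean(posterior)
print(len(posterior), round(posterior_mean, 2))
```

The accepted values approximate the posterior over theta. With one summary statistic, a reasonable fraction of draws falls within the tolerance; each additional summary statistic must also match, so the acceptance rate shrinks quickly, which is the dimensionality problem described above.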