Abstract:
I review commonly used methods for approximating model parameters in population genetics and compare approximate Bayesian computation to a convolutional neural network. Results from population genetics are meaningful in many fields, including anthropology and medicine, as its study often concerns the evolution of the human genome with respect to physical migration and genetic disease. Thus, the appropriate use of methods to implement population genetic studies is extremely important. Within the umbrella of traditionally used methods, there exists a trade-off between two types of models: those that use summary statistics and those that make full use of the data. I will use approximate Bayesian computation (ABC) as a concrete example of the former and Markov chain Monte Carlo (MCMC) simulation as an example of the latter. Here, the main trade-off is between computability and accuracy. Because of the simplification inherent in the use of summary statistics, ABC is more feasible computationally; however, MCMC, when tractable, typically yields higher accuracy. Interestingly, both of these methods suffer from the curse of dimensionality in their own unique way: ABC becomes intractable if too many summary statistics are used, and MCMC becomes increasingly intractable as the number of parameters to estimate grows. This shared limitation of traditional methods presents an opportunity for machine-learning methods. For instance, techniques like support vector machines (SVMs) and convolutional neural networks (CNNs) often cope better with higher-dimensional data. While their use in fields like image processing, computer vision, and natural language processing has become widespread, they have yet to permeate population genetics. Hopefully, comparing machine-learning results with those of traditional methods on a canonical problem will open the door to increased interest in and use of machine learning in population genetics.
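To make the summary-statistic trade-off concrete, here is a minimal sketch of ABC rejection sampling on a hypothetical toy model (estimating the mean of a normal distribution, using the sample mean as the summary statistic); this is an illustration of the general technique, not the pipeline used in this paper:

```python
import random

def abc_rejection(observed_summary, simulate, summarize,
                  prior_sample, tolerance, n_draws=10000):
    """Basic ABC rejection sampler: keep prior draws whose simulated
    summary statistic lands within `tolerance` of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()          # draw candidate parameter from prior
        data = simulate(theta)          # simulate a dataset under theta
        if abs(summarize(data) - observed_summary) <= tolerance:
            accepted.append(theta)      # summary close enough: accept theta
    return accepted

# Toy example: recover the mean of a normal with known sd = 1.
random.seed(0)
true_theta = 2.0
obs = [random.gauss(true_theta, 1.0) for _ in range(100)]
obs_summary = sum(obs) / len(obs)

posterior = abc_rejection(
    observed_summary=obs_summary,
    simulate=lambda t: [random.gauss(t, 1.0) for _ in range(100)],
    summarize=lambda d: sum(d) / len(d),
    prior_sample=lambda: random.uniform(-5.0, 5.0),  # flat prior on the mean
    tolerance=0.1,
)
estimate = sum(posterior) / len(posterior)
```

Note how the cost scales with the rejection step: adding more summary statistics shrinks the acceptance region, which is the dimensionality problem described above.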