Deep Networks for Population Genetics Data Generation
Haverford College. Department of Computer Science
As population genetic databases such as the 1000 Genomes dataset (Consortium et al. 2015) grow in size every day, it becomes increasingly challenging to make sense of the large volume of genetic information. Recent work applying machine learning to population genetics relies heavily on simulated data modeled on collected real data, since machine learning is promising for handling data at this scale. The main motivation is that machine learning can help researchers identify the patterns underlying evolutionary events (e.g., natural selection, mutation, and migration) and thus broaden our knowledge of how to maintain biodiversity within and between species. However, researchers worry that the simulated data do not actually match the real data, so that the predictions of a machine learning model trained on them may not reflect or explain evolutionary events in the real world. To address this problem, we introduce a new framework for simulating population genetic data based on generative adversarial networks. Generative Adversarial Nets (GANs) are a framework for training generative models through an adversarial process in which a generative model and a discriminator model are trained simultaneously. During each training iteration, the generative model improves its ability to simulate data resembling the real data and tries to fool the discriminator into believing that the simulated data are real, whereas the discriminator improves its ability to distinguish real data from simulated data. In this work, we examine previous approaches to simulating real data, the use of GANs to generate data outside the field of population genetics, and how GANs can be applied to real population genetic data to generate data that closely resemble the real data.
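The adversarial training loop described above can be illustrated with a minimal sketch. This is not the framework proposed in this work: it is a toy one-dimensional example, written in plain Python, in which the "real data" are assumed to be draws from a normal distribution rather than genetic data. The function name `train_toy_gan`, the linear generator, the logistic discriminator, the non-saturating generator loss, and all hyperparameters are assumptions chosen only for illustration.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def mean(values):
    return sum(values) / len(values)


def train_toy_gan(steps=200, batch=128, lr=0.05, real_mean=4.0, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data.

    Generator (hypothetical):      G(z) = a * z + b,  z ~ N(0, 1)
    Discriminator (hypothetical):  D(x) = sigmoid(w * x + c),
                                   read as the probability that x is real.
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters

    for _ in range(steps):
        # Toy stand-in for real data: samples from N(real_mean, 1).
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        z = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * zi + b for zi in z]

        # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
        d_real = [sigmoid(w * x + c) for x in real]
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_w = (mean([(d - 1.0) * x for d, x in zip(d_real, real)])
                  + mean([d * x for d, x in zip(d_fake, fake)]))
        grad_c = mean([d - 1.0 for d in d_real]) + mean(d_fake)
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimise -log D(fake) (non-saturating loss),
        # evaluated against the freshly updated discriminator.
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_a = mean([(d - 1.0) * w * zi for d, zi in zip(d_fake, z)])
        grad_b = mean([(d - 1.0) * w for d in d_fake])
        a -= lr * grad_a
        b -= lr * grad_b

    return a, b, w, c
```

Calling `train_toy_gan()` returns the final generator parameters `(a, b)` and discriminator parameters `(w, c)`. In this toy setting the discriminator first learns that larger values are more likely to be real, which in turn pushes the generator's offset `b` toward the real mean. Adversarial training of this kind can oscillate rather than settle, so the sketch should be read as an illustration of the two alternating updates, not as a stable training recipe.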