Abstract:
Many problems in population genetics are well suited to supervised machine learning (ML) methods, which can exploit characteristics such as high input dimensionality to achieve considerable performance gains over traditional statistical models. However, ML has yet to be fully embraced in this field, due in large part to the lack of intelligible explanations for many models' predictions, particularly when compared with highly interpretable statistical approaches. Improving the interpretability of ML models used in population genetics would not only help strengthen community trust by making it easier to examine why a faulty model reaches incorrect conclusions, but could also bring new insights to the field by providing clearer explanations of how successful models arrive at correct conclusions. With this in mind, this thesis explores the effectiveness of ML methods applied to population genomic data, investigates model-agnostic and model-specific techniques for making ML models more interpretable, and discusses the obstacles to applying these techniques in the context of population genetics. Additionally, this thesis proposes a new framework that uses decision trees to interpret ML models used in population genetics by leveraging preexisting, widely used summary statistics. It further reports experimental results on the effectiveness of this method for creating explanations that are both intuitively understandable and faithful to the underlying ML model when applied to new data.