Abstract:
Authorship attribution is the process of identifying the author of a given work. This thesis surveys the history and foundations of authorship attribution and then analyzes several machine learning methods that are frequently used in the field. In the classic authorship attribution problem, a text of unknown authorship is assigned to one author from a set of candidates, each of whom is represented by documents of undisputed authorship. Prior to the 1960s, authorship attribution was a linguistics-focused field in which linguistic experts determined the authors of unknown texts. In 1964, Mosteller and Wallace's analysis of ‘The Federalist Papers’ introduced the first statistically driven approach to authorship attribution, marking the beginning of authorship attribution as a computational field rather than a purely linguistic one. The modern approach involves selecting a set of linguistic features from the texts at hand and then applying a machine learning method to that feature set to classify authorship. Among the methods analyzed here, Principal Components Analysis (PCA) is a popular unsupervised learning method that treats each text’s feature set as a vector in a multivariate vector space and projects these vectors onto the directions of greatest variance; it has had success in authorship attribution. Support Vector Machines (SVMs) are a powerful supervised learning technique that constructs a linear classifier used to attribute authorship. SVMs have outperformed all other analytical techniques used in authorship attribution. Because of the abundance of electronic texts now available, authorship attribution has extensive applications in many different fields, and current research focuses primarily on developing application-specific methodologies.
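
As a rough illustration of the two-step approach described above (extract linguistic features, then classify), the following minimal Python sketch may help. It rests on several assumptions that are not taken from the thesis: the short function-word list, the toy texts, and the names `features`, `centroid`, and `attribute` are all hypothetical, and the classifier is a simple nearest-centroid rule used as a stand-in. It is neither PCA nor an SVM.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of English function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "upon", "by", "while", "whilst")


def features(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def attribute(unknown_text, known_texts):
    """Return the candidate whose mean feature vector is nearest.

    known_texts maps each candidate author to a list of texts of
    undisputed authorship. Nearest-centroid is a stand-in classifier,
    not one of the methods analyzed in the thesis.
    """
    target = features(unknown_text)
    centroids = {
        author: centroid([features(t) for t in texts])
        for author, texts in known_texts.items()
    }
    return min(centroids, key=lambda a: math.dist(target, centroids[a]))


# Toy data, invented purely for illustration.
known = {
    "A": ["upon the hill and upon the road", "upon the whole of the matter"],
    "B": ["while the rain fell by the road", "by the river while the sun set"],
}
print(attribute("upon the bridge and upon the shore", known))  # prints "A"
```

In practice the feature vectors produced by a step like `features` would be passed to a method such as PCA or an SVM rather than to the nearest-centroid rule used here.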