Authorship Attribution of Song Lyrics
Date
2017
Department
Swarthmore College. Dept. of Linguistics
Type
Thesis (B.A.)
Language
en_US
Terms of Use
Full copyright to this work is retained by the student author. It may only be used for non-commercial, research, and educational purposes. All other uses are restricted.
Abstract
Authorship attribution is a common application of forensic linguistics that can be
performed on a variety of data types. The goal of authorship attribution is to predict the
creator of a piece of linguistic data by analyzing the quirks and patterns of a text or audio
sample and comparing them to a set of potential authors to determine the best match. In
this paper, I apply this technique to a large database of song lyrics scraped from the
Internet as I attempt to train a computational model to predict the performing artist of a
given song. A key element of this project is to find a list of relevant features, or
calculable information, that best distinguishes the songs of a certain artist from the songs
of all other artists. For example, the most obvious difference between the line "In chilly
sub-depth railways, the weathered concrete stairways provide me with a means of getting
home" (from Owl City's "Early Birdie") and the line "So get out, get out, get out of my
head / And fall into my arms instead" (from One Direction's "One Thing") is the
presence of more unusual words in the former example than in the latter. Therefore, my
model uses the inverse document frequency to determine the rareness of each word in the
song and uses it to help find a matching artist. The entire feature set I discuss in this
paper contains various types of linguistic information, although syntax is the most
difficult to manage because the syntax of lyrics is strongly constrained by the meter of
the song.
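The rareness feature described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes the plain definition idf(w) = log(N / df(w)), a crude regex tokenizer, and a per-song score equal to the mean IDF of its tokens. The function names and the three-line toy corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def inverse_document_frequency(songs):
    """Map each word to log(N / df), where df is the number of songs containing it."""
    n_songs = len(songs)
    doc_freq = Counter()
    for lyrics in songs:
        # Count each word once per song, no matter how often it repeats.
        doc_freq.update(set(tokenize(lyrics)))
    return {word: math.log(n_songs / df) for word, df in doc_freq.items()}


def mean_rareness(lyrics, idf):
    """Average IDF of a song's tokens; higher means rarer vocabulary."""
    # Assumption: words missing from the IDF table are skipped, not guessed at.
    known = [idf[t] for t in tokenize(lyrics) if t in idf]
    return sum(known) / len(known) if known else 0.0


# Hypothetical toy corpus for illustration only; not data from the thesis.
toy_songs = [
    "get out of my head",
    "get out of my arms",
    "weathered concrete stairways of my railways",
]
idf = inverse_document_frequency(toy_songs)
scores = [mean_rareness(song, idf) for song in toy_songs]
```

In this toy corpus, words that occur in every song ("of", "my") receive an IDF of zero, so the third song, whose remaining words each appear only once, gets the highest rareness score. A score like this could then be passed to a classifier as one feature among several.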
Tarlin 2
This topic is inherently susceptible to a data sparsity problem: the number of
words in a single song may not be enough to effectively perform the statistical
component of the model. In fact, the reason that I choose to define the author as the
performing artist rather than the lyricist is that there is not enough lyricist information
available. In many cases, the song's metadata lists zero or multiple composers, both of
which are incompatible with the machine learning algorithms I use from Python's scikit-learn
package. However, I claim that predicting the performing artist is still a worthwhile
task because bands will choose to record songs that have similar styles, both in terms of
the music and the lyrics.
Though my model does not predict the correct artist for the majority of songs, it does
perform significantly better than chance, meaning that the selected features do give some
indication of the performing artist. Although the success of the classifier is more visible
with a smaller number of possible authors, the ratio between its accuracy and chance is
maintained even when applied to a larger data set.
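
The comparison between accuracy and chance can be made concrete with a small sketch. It assumes that "chance" means uniform random guessing over the candidate artists, i.e. 1 / (number of artists); the thesis may define the baseline differently (for example, always guessing the most frequent artist). The function names and the labels below are hypothetical.

```python
def accuracy(true_artists, predicted_artists):
    """Fraction of songs whose predicted artist matches the true artist."""
    correct = sum(t == p for t, p in zip(true_artists, predicted_artists))
    return correct / len(true_artists)


def chance_accuracy(true_artists):
    """Expected accuracy of uniform random guessing over the distinct artists."""
    # Assumption: every candidate artist is equally likely to be guessed.
    return 1.0 / len(set(true_artists))


def ratio_to_chance(true_artists, predicted_artists):
    """How many times better than uniform guessing the classifier performs."""
    return accuracy(true_artists, predicted_artists) / chance_accuracy(true_artists)


# Hypothetical labels for illustration only; not results from the thesis.
true_labels = ["a", "a", "b", "b", "c", "c", "d", "d"]
predictions = ["a", "b", "b", "b", "c", "a", "d", "c"]
ratio = ratio_to_chance(true_labels, predictions)
```

Reporting this ratio rather than raw accuracy makes runs with different numbers of candidate artists comparable, since the chance baseline shrinks as more artists are added.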