Abstract:
This thesis explores n-gram-based gender classification using various n-gram
types, sizes, and feature sets. The study expands on previous research by including
a non-binary gender category. First, a state-of-the-art n-gram analysis using a simple
dissimilarity measure was replicated, reaching a peak accuracy of 71%. To improve on
this result, a formal feature selection and extraction process was then performed.
This secondary analysis yielded a lower overall peak accuracy of 61%, but non-binary
and female-specific accuracy reached 99–100%. Both results are comparable to findings
from previous research.