Browsing by Author "Friedler, Sorelle"
Now showing 1 - 20 of 33
- A Comparison of Fairness-Aware Machine Learning Algorithms (2018) Roth, Derek; Friedler, Sorelle. Fairness, as it applies to algorithms, implies that the decisions made by an algorithm do not discriminate against individuals or groups in labeled data sets. In this paper, I summarize the relevant findings by computer scientists regarding fair algorithms, discuss three techniques used to reduce or remove bias in algorithms, and examine three case studies that clearly demonstrate the importance of this field of study. The goal of this thesis is to determine the best practices for case studies on this topic and to discover ways of developing algorithms that are unbiased. I apply existing algorithms to four data sets and compare their results in order to determine which are the most useful in a specific situation.
- A Rule Learning Approach to Discovering Contexts of Discrimination (2017) Tionney, Nix; Friedler, Sorelle. In fairness-aware data mining, discrimination discovery refers to determining whether social discrimination against certain individuals or groups of individuals exists in labeled data sets or in learned models. In this thesis, I focus on the problem of discovering contexts, or niches, of discrimination in data sets, i.e. revealing groups of features in a given data set that, when considered together, have a greater degree of discriminatory influence in the data than any one feature examined individually. Our approach to this problem uses the CN2 and CN2-SD rule learning algorithms to identify groups in the data that have significant predictive ability, and then uses the Gradient Feature Algorithm to quantify and examine each group's discrimination potential. This approach shows promising results for identifying contexts of discrimination.
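As a rough illustration of scoring rule-defined "contexts" for discrimination (not the thesis's CN2-SD plus Gradient Feature pipeline), the sketch below measures, for each subgroup picked out by a rule, the gap in positive-outcome rates across a protected attribute. The file and column names are hypothetical.

```python
# Simplified stand-in for scoring candidate contexts of discrimination:
# for each rule-defined subgroup, compute the gap in positive-outcome
# rates across a protected attribute. Column names are hypothetical.
import pandas as pd

def outcome_gap(df, mask, protected_col, outcome_col):
    """Spread of positive-outcome rates inside the subgroup selected by mask."""
    sub = df[mask]
    rates = sub.groupby(protected_col)[outcome_col].mean()
    return rates.max() - rates.min() if len(rates) > 1 else 0.0

df = pd.read_csv("labeled_data.csv")                  # hypothetical dataset
rules = {                                             # each rule defines a context
    "young & low-income": (df["age"] < 25) & (df["income"] < 30000),
    "urban renters":      (df["urban"] == 1) & (df["renter"] == 1),
}
scores = {name: outcome_gap(df, mask, "race", "positive_outcome")
          for name, mask in rules.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # largest gaps first
```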
- Active Meta-Learning (2020) Nicholas, Gareth; Friedler, Sorelle. This thesis outlines the methods used in machine learning to generate models that are effective on a variety of tasks. We begin with a quick overview of the field of machine learning, covering the topics necessary to understand more complex learning algorithms. We then discuss the field of transfer learning and the common strategies used to train models so that they can function in different domains and solve different tasks. We continue with current research in the field of meta-learning, which aims to find optimal solutions to the problem of learning how to learn. We introduce model-agnostic meta-learning, or MAML, as an algorithm that addresses the difficulty of few-shot learning on a variety of tasks. We then consider PLATIPUS as a tool for reasoning about model uncertainty in meta-learned models. Using this uncertainty, we explore active learning as a means of selecting the optimal examples to train our models. Using PLATIPUS and active learning, we seek to address the problem of exploring the space of chemical reactions efficiently.
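To make the inner/outer-loop structure of meta-learning concrete, here is a minimal first-order MAML (FOMAML) sketch in PyTorch. It is not the thesis's MAML/PLATIPUS implementation (those add second-order terms and uncertainty estimates); `sample_task` is a hypothetical function returning support and query batches.

```python
# Minimal first-order MAML sketch: adapt a copy of the model on a task's
# support set, then accumulate the query-set gradient into the meta-update.
import torch, copy

def fomaml_step(model, loss_fn, sample_task, meta_opt, inner_lr=0.01, n_tasks=4):
    meta_opt.zero_grad()
    for _ in range(n_tasks):
        (xs, ys), (xq, yq) = sample_task()           # hypothetical task sampler
        fast = copy.deepcopy(model)                  # task-specific copy of the model
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        loss_fn(fast(xs), ys).backward()             # one inner adaptation step
        inner_opt.step()
        inner_opt.zero_grad()
        loss_fn(fast(xq), yq).backward()             # query loss on adapted parameters
        for p, fp in zip(model.parameters(), fast.parameters()):
            p.grad = fp.grad.clone() if p.grad is None else p.grad + fp.grad
    meta_opt.step()                                  # first-order meta-update
```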
- Analyzing Energy Efficiency in Neural Networks (2020) Susai, Silvia; Friedler, Sorelle. Recent advances in deep learning have led to the development of state-of-the-art models with remarkable accuracy; however, previous work has shown that these results incur a high environmental cost due to their significant energy usage. Nevertheless, accuracy remains the predominant evaluation criterion for neural network performance, so much so that computationally expensive techniques such as neural architecture search are often employed with only moderate success. What is more, the relationship between energy usage and accuracy has been shown to be non-linear, so an increase in energy usage may not necessarily lead to an increase in accuracy. This thesis surveys the current literature on energy efficiency in deep learning and proposes that future work should treat energy usage as a distinct trade-off when evaluating neural network models.
- ANALYZING THE COMPAS ALGORITHM IN CRIMINAL DEFENDANT RISK ASSESSMENT (2019) Ayad, Yasmine; Friedler, Sorelle. For my thesis, I analyzed the COMPAS recidivism prediction tool made by Equivant, which aims to predict how likely a defendant charged with a crime is to re-offend. The tool assigns a score from 1 to 10, where 1 indicates lowest risk and 10 indicates highest risk, and is used by many states in the country. ProPublica's dataset consists of re-arrest data and COMPAS predictions for 6,172 people made between 2012 and 2014, from which ProPublica showed that COMPAS was more likely to falsely label African-American defendants as high risk than White defendants, and more likely to falsely label White defendants as low risk than African-American defendants. I examine ProPublica's dataset along with Jai Nimgaonkar's dataset, which extends ProPublica's data with whether these people were convicted of a crime, to determine whether bias persists when moving from re-arrest data to conviction data, considering the intersection of sex and race as well as different fairness-aware algorithms.
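The error-rate disaggregation underlying this kind of analysis can be sketched in a few lines of pandas: compare false-positive and false-negative rates by race. The file name and the "race", "decile_score", and "re_arrested" columns below are hypothetical placeholders, and the score threshold is one common choice, not necessarily the one used in the thesis.

```python
# Sketch: disaggregate false-positive and false-negative rates by race.
import pandas as pd

df = pd.read_csv("compas_scores.csv")          # hypothetical file
df["high_risk"] = df["decile_score"] >= 5      # one common thresholding of the 1-10 score

for race, grp in df.groupby("race"):
    neg = grp[grp["re_arrested"] == 0]
    pos = grp[grp["re_arrested"] == 1]
    fpr = neg["high_risk"].mean()              # labeled high risk but not re-arrested
    fnr = (~pos["high_risk"]).mean()           # labeled low risk but re-arrested
    print(f"{race}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```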
- Auditing Deep Neural Networks and Other Black-box Models (2016) Falk, Casey; Friedler, Sorelle. In this era of self-driving cars, smart watches, and voice-commanded speakers, machine learning is ubiquitous. Recently, deep learning has shown impressive success in solving many machine learning problems related to image data and sequential data, with the result that people are impacted by deep learning models on a daily basis. However, how do we judge whether these models are fair, and how do we discover what information is important when making a decision? And as APIs become ever more common, how do we determine this information if we do not have access to the model itself? We developed a novel technique called Gradient Feature Auditing, which gradually obscures information from a data set and evaluates how a model's predictions change as more of that information is obscured. This allows a deeper investigation of what information and features are actually used by machine learning models when making predictions. Throughout our experiments, we apply Gradient Feature Auditing on multiple data sets using several popular modeling techniques (linear SVMs, C4.5 decision trees, and shallow feed-forward neural networks) to provide evidence that Gradient Feature Auditing indeed affords deeper insight into what information a model is using.
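The "gradually obscure a feature and re-test" idea can be illustrated with a simplified sketch on synthetic data. Note that the published Gradient Feature Auditing obscures features via a partial repair of conditional distributions; the version below merely randomizes an increasing fraction of one feature's values as a rough proxy.

```python
# Simplified auditing sketch: obscure a growing fraction of one feature's
# values and watch how the model's held-out accuracy changes.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def audit(model, X_test, y_test, feature, levels=(0.0, 0.25, 0.5, 0.75, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    scores = {}
    for frac in levels:                                   # fraction of values obscured
        X_obs = X_test.copy()
        mask = rng.random(len(X_obs)) < frac
        X_obs[mask, feature] = rng.choice(X_test[:, feature], size=mask.sum())
        scores[frac] = model.score(X_obs, y_test)         # accuracy with feature obscured
    return scores

# Synthetic usage: accuracy should fall as the informative feature 0 is obscured.
X = np.random.default_rng(1).normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print(audit(clf, X_te, y_te, feature=0))
```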
- Auditing Deep Neural Networks to Understand Recidivism Predictions (2016) Smith, Brandon; Friedler, Sorelle. In recent years, deep neural network models have proven to be incredibly accurate on many classification benchmarks. Due to this high accuracy, many non-technical fields are interested in using these models to assist in decision-making processes. However, this curiosity is generally tempered by the realization that it is difficult to understand what features of the data contribute to the prediction. We present a method to evaluate the effect of each feature in a data set on the predictions of a model, which we refer to as gradient feature auditing (GFA). To test this method, we trained four models (a deep neural network, SVM, SLIM, and decision tree) on recidivism data and then applied GFA to each model. The experimental portion verified the ability of GFA to obtain a ranked ordering of features. Next, we attempted to use methods from interpretable learning to validate our procedure. Overall, GFA allows domain experts to use the most effective model of their data in the decision-making process, while also retaining the ability to explain how those decisions are being made.
- Benchmarking Four Approaches to Fairness-Aware Machine Learning (2017) Hamilton, Evan; Friedler, Sorelle. Benchmarking fairness-aware machine learning.
- Community Detection in Multidimensional Social Networks (2014) Moll, Karl; Friedler, Sorelle. Information about interactions between human actors, and attributes of the actors in those networks, has become increasingly abundant in computer systems over the last decade. Multidimensional social networks are an increasingly common representation of interactions in markets, political networks, social networking sites, and more. The problem of detecting communities based on this information is of emerging interest in a variety of fields. Traditional clustering techniques, however, are not suited for dealing with the hybrid network of attribute information and structural relationships. Algorithms for extracting communities based on multidimensional relationships are the focus of this paper. The topic of multidimensional community detection has many applications. One such application is personalizing the web, since many web services are using less sophisticated models on high-value, high-dimension data. There are also implications for improving research in other fields, especially Sociology and Social Movement Theory. Generally, from social media to advertising, these methods can lead to a more connected world, with more information passing, and could allow people to connect along dimensions of similarity that aren't their most obvious feature (e.g. internet communities not limited by geolocation).
- Computational Fairness: Preventing Machine-Learned Discrimination (2015) Feldman, Michael; Friedler, Sorelle. Machine learning algorithms called classifiers make discrete predictions about new data by training on old data. These predictions may be hiring or not hiring, good or bad credit, and so on. The training data may contain patterns such as a higher rate of good outcomes for members of certain groups (e.g. racial groups) and a lower rate of good outcomes for other groups. This is quantified by the "80% rule" of disparate impact, which is a legal measure and definition of bias. It is ethically and legally undesirable for a classifier to learn these biases from the data. We propose two methods of modifying data, called Combinatorial and Geometric repair. We test our repairs on three data sets. Experiments show that our repairs perform favorably in terms of training classifiers that are both accurate and unbiased.
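The "80% rule" compares positive-outcome rates between groups: the rate for the unprivileged group divided by the rate for the privileged group should be at least 0.8. A minimal check, with a small made-up table and hypothetical column names:

```python
# Minimal "80% rule" disparate impact check on a toy hiring table.
import pandas as pd

def disparate_impact(df, group_col, outcome_col, privileged, unprivileged):
    p_priv = df.loc[df[group_col] == privileged, outcome_col].mean()
    p_unpriv = df.loc[df[group_col] == unprivileged, outcome_col].mean()
    return p_unpriv / p_priv

df = pd.DataFrame({"group": ["A"] * 60 + ["B"] * 40,
                   "hired": [1] * 30 + [0] * 30 + [1] * 10 + [0] * 30})
ratio = disparate_impact(df, "group", "hired", privileged="A", unprivileged="B")
print(f"DI ratio = {ratio:.2f} ({'fails' if ratio < 0.8 else 'passes'} the 80% rule)")
```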
- Computerized Redistricting: Examining the Weighted Points Version of the Capacitated K-Center Problem (2014) Levin, Harry; Friedler, Sorelle. Every ten years, when states are forced to redraw their congressional districts, the process is intensely partisan, and the outcome is rarely fair and democratic. In the last few decades, the growing capabilities of computers have offered the promise of objective, computerized redistricting. Unfortunately, the redistricting problem can be shown to be NP-Complete, but there are a number of approximation algorithms and heuristics that are effective. I focus on an approximation algorithm for the capacitated k-center problem. I revise this algorithm so that it can be applied to the redistricting problem, and I show through experimental testing that the algorithm is effective. My results demonstrate that computers can facilitate the process by giving mapmakers access to more options.
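For context, the classic greedy farthest-point algorithm gives a 2-approximation for the plain (uncapacitated) k-center problem; the capacitated, weighted-points variant studied in the thesis adds per-center capacity and point-weight constraints that are omitted in this illustrative sketch.

```python
# Greedy farthest-point 2-approximation for the uncapacitated k-center problem.
import numpy as np

def k_center_greedy(points, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [rng.integers(len(points))]               # arbitrary first center
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                      # farthest point from current centers
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers, dist.max()                          # chosen centers and covering radius

pts = np.random.default_rng(1).uniform(size=(200, 2))   # e.g. population point locations
centers, radius = k_center_greedy(pts, k=5)
print(centers, round(radius, 3))
```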
- Dark Reactions: Recommender Guided Materials Discovery (2014) Raccuglia, Paul; Friedler, Sorelle. We present an exploration of data mining and machine learning techniques applied to a materials science dataset, with the goal of improving a lab's efficiency when running experiments. The primary product of our work is two tools to help chemists better explore the space of possible reactions: a recommender system which we hope will increase the serendipitous discovery of interesting reactions that the chemists would not have thought to explore, and a seed-based ranking system which helps chemists prioritize which reactions to run, and with what parameters. We present a number of different techniques for tuning our recommender system, as well as an automated approach to evaluating recommender systems in contexts where labels are expensive to obtain (in time, resources, and equipment). Reactions are given a label in {1, 2, 3, 4}, where 4 corresponds to successful formation of a crystalline product, 3 corresponds to mostly successful formation of a crystalline product, and 1 and 2 correspond to different failure cases. Using an SVM, we achieve 65% accuracy on the 4-category classification on a held-out test set of 30% of our data set. Preliminary empirical results suggest a significant improvement in efficiency: the observed rate of a 3 or 4 outcome increased from 65% (n=5486) without our system to 86% (n=190) using recommendations from our system. Our system is available at http://darkreactions.haverford.edu/.
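The evaluation setup described (a 4-class SVM with a 30% held-out test set) looks roughly like the following sketch; the file name and the "outcome" label column are hypothetical stand-ins for the Dark Reactions descriptors and labels.

```python
# Sketch: train an SVM on 70% of the reactions and report held-out accuracy
# on the remaining 30%, for labels in {1, 2, 3, 4}.
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

df = pd.read_csv("reactions.csv")                      # hypothetical export
X = df.drop(columns=["outcome"])                       # reaction descriptors
y = df["outcome"]                                      # labels in {1, 2, 3, 4}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                           random_state=0, stratify=y)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2%}")
```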
- Expert-Assisted Transfer Reinforcement Learning (2019) Slack, Dylan; Friedler, Sorelle. Reinforcement Learning is concerned with developing machine learning approaches to answer the question: "What should I do?" Transfer Learning attempts to use previously trained Reinforcement Learning models to achieve better performance in new domains. There is relatively little work on using expert advice to achieve better performance when transferring reinforcement learning models across domains, and even less work concerning the use of expert advice in transferring deep reinforcement learning models. This thesis presents a method that gives experts a route to incorporate their domain knowledge into the transfer learning process in deep reinforcement learning scenarios by presenting them with a decision set that they can edit. The decision set is then included in the learning process in the target domain. This work describes relevant background on both Reinforcement and Transfer Learning, provides the implementation of the proposed method, and suggests its usefulness and limitations through a series of applications.
- Explaining Active Learning Queries (2017) Chang, Kyu Hyun; Friedler, Sorelle. In contrast to traditional supervised machine learning, which takes a set of labeled data and builds a model that best fits the given data, active learning selects the instances from which it will learn. In a typical setting, active learning starts with some labeled instances and queries the unlabeled instance from which it can learn the most. The queried instance is then labeled by an oracle, and the learner re-trains the model and continues the learning cycle. By selecting the most informative instances, active learning attempts to find an optimal set of training data. Often, the oracle is a human annotator: for speech data, for example, the annotator may be a trained linguist. In a typical active learning setting, an annotator's role is to provide a label for the instance that the active learner asks about. In this setting, it is difficult for the annotator to understand why the queried instance is important, and the annotator takes a passive role in the sense that he or she merely provides the label to the active learner. In this paper, I propose a technique that explains active learning queries, along with an expert-aided active learning procedure in which experts are more involved in the learning cycle. The technique was applied to Haverford's Dark Reactions Project dataset, which consists of organically-templated metal oxide synthesis reactions. The explanations of queries were provided to a chemist, who was able to interpret them and found them helpful for identifying chemical space that is poorly understood by the model. Moreover, the expert-aided active learning showed performance commensurate with standard active learning.
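The query loop described above can be sketched as plain uncertainty sampling on synthetic data; the explanation step the thesis adds on top of each query is not shown.

```python
# Minimal uncertainty-sampling active learning loop on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)                # hidden "oracle" labels

# Start with a small labeled pool containing both classes.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(20):                                    # 20 query rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    uncertainty = 1 - probs.max(axis=1)                # least-confident sampling
    query = pool.pop(int(np.argmax(uncertainty)))      # ask the oracle for this label
    labeled.append(query)

clf = LogisticRegression().fit(X[labeled], y[labeled])
print("accuracy after 20 queries:", clf.score(X, y))
```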
- Fairness and Information Access Clustering in Social Networks (2020) Beilinson, Hannah; Friedler, Sorelle. My thesis focuses on strategies to analyze fairness in information spread in social networks. Building off the field of influence maximization, I examine how the spread of information in a social network advantages some individuals over others. I review how others have handled fairness analysis in influence maximization and propose information access clustering as a new method to examine fairness. I formalize information access disparity by clustering individuals in social networks into groups based on their level of information access. I then show that these information access clusters correlate with existing measures of information access, using a coauthorship dataset as an example. I also explore variations on the information access clustering algorithm.
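One simplified way to realize this idea (a stand-in for the thesis's procedure, with illustrative parameters): estimate each node's probability of receiving information under repeated independent-cascade simulations from random seed sets, then cluster nodes by those access estimates.

```python
# Sketch of information access clustering: estimate per-node access
# probabilities via independent-cascade simulations, then cluster them.
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

def simulate_access(G, p=0.1, n_seeds=5, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    index = {v: i for i, v in enumerate(nodes)}
    reached = np.zeros(len(nodes))
    for _ in range(trials):
        active = set(rng.choice(nodes, size=n_seeds, replace=False))
        frontier = set(active)
        while frontier:                                 # independent cascade spread
            nxt = set()
            for u in frontier:
                for v in G.neighbors(u):
                    if v not in active and rng.random() < p:
                        nxt.add(v)
            active |= nxt
            frontier = nxt
        for v in active:
            reached[index[v]] += 1
    return reached / trials                             # per-node access probability

G = nx.karate_club_graph()
access = simulate_access(G).reshape(-1, 1)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(access)
print(dict(zip(G.nodes(), clusters)))
```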
- Fairness in Information Access: Emphasizing the Network (2023) Rousseau, Jade; Friedler, Sorelle. Social networks, systems of interconnected people, are fundamental to our being in the world, and one's belonging and positionality within networks greatly determine one's exposure to resources and to harms. Information, understood in a broad sense, is conveyed through(out) social networks, and machine learning algorithms are increasingly involved in determining access to information. The use of machine learning algorithms for information spread has been shown to reproduce, perpetuate, and exacerbate existing inequalities. Efforts have thus been made to create 'fairness-aware' machine learning algorithms. But these algorithms have tended to focus on individuals and demographics, at the expense of looking at the network itself. I argue that because the very structure of a network encodes sensitive attributes such as demographics, and because network belonging and positionality can themselves be sources of harm as well as privilege, the network should be thought of as an agent. I focus on the need for 'fair' machine learning algorithms to take into account the ways in which harm operates not merely at the individual level but always at the network level, by developing an experimental framework around collateral consequences: second-order effects that I model by positing a well-being for each individual in a social network.
- Identifying the Relationship Between Evolutionary Distance and the Accuracy of Cis-Regulatory Module Predictions (2014) Cueto, Paulina; Friedler, Sorelle. Cis-regulatory modules (CRMs) are the portions of DNA that initiate gene expression, the process through which an organism turns the information in its DNA into functional products and cells. In this paper I build upon a program, MultiModule, created by Zhou and Wong (2007), that utilizes hidden Markov models and multiple sequence alignments to determine novel cis-regulatory modules. I use the program to determine whether there is a relationship between the evolutionary distance between species and the ability to identify CRMs based on multiple alignments. The results indicate that there is a higher prediction rate between the closest species, and that the greater the variety in evolutionary distance, the more precise the predictions are.
- Identity and Computer Science: A Mismatch? (2021) Lee, Steve; Friedler, Sorelle. At the bachelor's level, female students, Black students, and Indigenous students pursue computer science degrees at disproportionately lower rates. For example, in recent years approximately 20% of bachelor's degrees in Computer and Information Science have been awarded to women, whereas approximately 57% of bachelor's degrees across all fields were awarded to women. Why does this disparity exist for underrepresented students in computer science, and how can we do better? This is the central question of this literature review. In this paper, I explore the existing literature for some possible reasons that may explain why this is the case (such as stereotypes, representation, and accessibility) and how we could do better. This literature review describes the methodology of each study and culminates in a possible future study to learn about and improve the experiences in computer science of marginalized students, including but not limited to students who are women or gender-nonconforming, Black and Indigenous, queer, first-generation and low-income, and students with disabilities.
- Inclusivity and Transparency in Machine Learning Model Auditing (2021) Byars, Monique; Friedler, Sorelle. The goal of this literature review is to highlight the importance of inclusivity in auditing algorithms. Machine learning (ML) models affect many aspects of our lives, such as providing us with relevant ads or predicting our movie preferences. Thus, auditing and critiquing them is integral to ensuring they are built in a holistic manner. Many papers that discuss and research auditing have good intentions, and there is an important focus on gender bias in these models in the world of Human-Computer Interaction (HCI). However, many of these papers fail to take transgender and non-binary people into account. Many methods used to determine whether a model is biased end up using biased methods themselves: they often have no access to self-reported gender and therefore default to outdated methods such as visual cues and stereotypes. This paper highlights the importance of transparency on the part of ML models, as transparency aids these audits, and discusses how the audits themselves can be improved in an inclusive way.
- Indirect Discrimination in the Age of Big Data (2016) Rybeck, Gabriel; Friedler, Sorelle; Binder, Carola Conces. Rapid advancements in the use of big data to make automated decisions may result in indirect discrimination. For example, Larson et al. (2015) find that the Princeton Review charges different prices by zip code, resulting in disproportionately high prices in predominantly Asian zip codes. In an economic sense, this "indirect discrimination" has no inherent unfairness, as we expect the firm to maximize profits by exploiting consumers' willingness to pay, even if the outcome disproportionately affects certain groups. However, Larson et al. and the press coverage of their findings suggest that Princeton Review prices exhibit an unfair level of indirect discrimination. My regression analysis finds that the level of indirect discrimination against the Asian population in the Princeton Review prices is not significantly different from that of retail gasoline prices, even though gasoline price variation is not typically perceived as unfair. I use behavioral economic theory to explain this result. In the second part of this thesis, I further develop and apply a technique that serves both to remove indirect discrimination and to identify the most important features in machine learning models of automated decisions. I use the technique to further explore the differences between Princeton Review prices and retail gasoline prices. I also contribute an analysis of the technique on recidivism data, in which all the features are categorical, and on sexual offender data, in which surname and address are used to predict race to simulate the problem of hidden data.
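The kind of regression comparison described above can be sketched as regressing price on a zip code's Asian population share separately for each product and comparing the estimated coefficients; this is a rough stand-in for the thesis's analysis, with hypothetical file and column names.

```python
# Sketch: compare how price varies with a zip code's Asian population share
# across products (e.g. test prep vs. gasoline). Column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

def price_vs_share(df):
    X = sm.add_constant(df["asian_share"])              # intercept + demographic share
    return sm.OLS(df["price"], X).fit()

prices = pd.read_csv("prices_by_zip.csv")               # hypothetical merged data
for product, grp in prices.groupby("product"):
    res = price_vs_share(grp)
    print(product,
          round(res.params["asian_share"], 2),          # estimated price gradient
          round(res.pvalues["asian_share"], 3))         # its p-value
```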