Data Locality Practices of MapReduce and Spark: Efficiency and Effectiveness
Date
2018
Department
Haverford College. Department of Computer Science
Type
Thesis
Language
eng
Access Restrictions
Open Access
Abstract
The parallel computing community has produced a variety of cluster computing frameworks to process the immense volumes of data generated today. Two in particular, Hadoop MapReduce and Spark, have become the most popular because of their performance and accessibility. The performance of each depends heavily on its data locality practices: how and where it stores input data relative to the unit of computation. We sought to identify which specific practices give rise to the strengths and weaknesses of MapReduce and Spark by examining an assortment of performance comparisons between the two. The performance tests run Word Count, Sort, K-Means Clustering, and PageRank on datasets of various sizes. MapReduce’s overlapping map and shuffle stages help it outperform Spark when sorting. However, MapReduce does not outperform Spark when running Word Count, K-Means, and PageRank. This is due to MapReduce’s use of HDFS (Hadoop Distributed File System), which primarily saves data on the distant hard disk, and to its inability to reuse intermediate results. Spark’s ability to keep intermediate data in memory and reuse it enables the framework to complete iterative algorithms much faster than MapReduce. Conversely, Spark executes shuffle operations more slowly than MapReduce because of its reliance on the OS buffer cache to reuse data and because of its RDD recovery system.
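
The effect of reusing in-memory data on an iterative algorithm can be illustrated with a small toy model. The sketch below is not Spark or Hadoop code and does not come from the thesis: `SimulatedDisk` and `kmeans_1d` are hypothetical names invented for illustration, and the "disk" is just a counter around a Python list. It runs the same one-dimensional K-Means loop twice, once reloading the input on every iteration and once reading it a single time and reusing the in-memory copy.

```python
from typing import Callable, List


class SimulatedDisk:
    """Hypothetical stand-in for disk-backed storage; counts full reads."""

    def __init__(self, points: List[float]) -> None:
        self._points = list(points)
        self.reads = 0

    def load(self) -> List[float]:
        # Every call models one full pass over the stored input.
        self.reads += 1
        return list(self._points)


def kmeans_1d(get_points: Callable[[], List[float]],
              centers: List[float],
              iterations: int) -> List[float]:
    """Toy 1-D K-Means that fetches its input once per iteration."""
    for _ in range(iterations):
        points = get_points()
        clusters: List[List[float]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]

# Reload-per-iteration: the input is fetched from "disk" every time.
disk_a = SimulatedDisk(data)
centers_a = kmeans_1d(disk_a.load, [0.0, 5.0], iterations=5)

# Load once, then reuse the in-memory copy on every iteration.
disk_b = SimulatedDisk(data)
cached = disk_b.load()
centers_b = kmeans_1d(lambda: cached, [0.0, 5.0], iterations=5)

print(disk_a.reads, disk_b.reads)
```

Both runs arrive at the same centers, but the first performs one simulated disk read per iteration while the second performs only one in total. Real frameworks add scheduling, serialization, and fault-tolerance costs that this sketch ignores.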