Data Locality Practices of MapReduce and Spark: Efficiency and Effectiveness

dc.contributor.advisor: Dougherty, John P.
dc.contributor.author: Snow, Alex
dc.date.accessioned: 2019-04-08T14:49:10Z
dc.date.available: 2019-04-08T14:49:10Z
dc.date.issued: 2018
dc.description.abstract: The parallel computing community has produced various cluster computing frameworks to process the immense amounts of data we generate in the modern age. Two in particular, Hadoop MapReduce and Spark, have risen above the rest as the most popular frameworks because of their performance and accessibility. Their performance depends heavily on how and where they store input data relative to the computation unit, a set of choices known as data locality practices. We sought to identify which specific practices give rise to the strengths and weaknesses of MapReduce and Spark by examining an assortment of performance comparisons between the two. The performance tests include running Word Count, Sorting, K-Means Clustering, and PageRank on datasets of various sizes. MapReduce’s overlapping map and shuffle stages help it outperform Spark when sorting. However, MapReduce fails to outperform Spark when running Word Count, K-Means, and PageRank. This is due to MapReduce’s use of HDFS (Hadoop Distributed File System), which primarily stores data on the distant hard disk, and to its inability to reuse intermediate results. Spark’s ability to keep intermediate data in memory and reuse it enables the framework to complete iterative algorithms much faster than MapReduce. Conversely, Spark executes shuffle operations more slowly than MapReduce because of its reliance on the OS buffer cache to reuse data and on its RDD recovery system.
dc.description.sponsorship: Haverford College. Department of Computer Science
dc.identifier.uri: http://hdl.handle.net/10066/20805
dc.language.iso: eng
dc.rights.access: Open Access
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/
dc.title: Data Locality Practices of MapReduce and Spark: Efficiency and Effectiveness
dc.type: Thesis
Files
Original bundle
Name: 2018SnowA.pdf
Size: 884.77 KB
Format: Adobe Portable Document Format
Description: Thesis
Name: 2018SnowA_release.pdf
Size: 164.09 KB
Format: Adobe Portable Document Format
Description: ** Archive Staff Only **