Data Locality Practices of MapReduce and Spark: Efficiency and Effectiveness
dc.contributor.advisor | Dougherty, John P. | |
dc.contributor.author | Snow, Alex | |
dc.date.accessioned | 2019-04-08T14:49:10Z | |
dc.date.available | 2019-04-08T14:49:10Z | |
dc.date.issued | 2018 | |
dc.description.abstract | The parallel computing community has produced various cluster computing frameworks to process the immense amounts of data generated in the modern age. Two in particular, Hadoop MapReduce and Spark, have become the most popular frameworks because of their performance and accessibility. Their performance depends heavily on their data locality practices: how and where they store input data in relation to the computation unit. We sought to identify which specific practices give rise to MapReduce and Spark’s strengths and weaknesses by examining an assortment of performance comparisons between the two. The performance tests include running Word Count, Sorting, K-Means Clustering, and PageRank on datasets of various sizes. MapReduce’s overlapping map and shuffle stages help it outperform Spark when sorting. However, MapReduce fails to outperform Spark when running Word Count, K-Means, and PageRank. This is due to MapReduce’s use of HDFS (Hadoop Distributed File System), which primarily saves data on the distant hard disk, and its inability to reuse intermediate results. Spark’s ability to save and reuse intermediate data from memory enables the framework to complete iterative algorithms much faster than MapReduce. Conversely, Spark executes shuffling calls more slowly than MapReduce because of its reliance on the OS buffer cache to reuse data and on its RDD recovery system. | |
dc.description.sponsorship | Haverford College. Department of Computer Science | |
dc.identifier.uri | http://hdl.handle.net/10066/20805 | |
dc.language.iso | eng | |
dc.rights.access | Open Access | |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/4.0/ | |
dc.title | Data Locality Practices of MapReduce and Spark: Efficiency and Effectiveness | |
dc.type | Thesis |