Performance Tuning Of Apache Spark Framework In Big Data Processing with Respect To Block Size And Replication Factor

Brijesh Y Joshi
Poornashankar .
Deepali Sawai

Abstract

Apache Spark has recently become the most popular big data analytics framework, and it ships with default configurations. The Hadoop Distributed File System (HDFS) physically stores large files on multiple nodes in a distributed fashion. The block size determines how a large file is split and distributed, while the replication factor determines how reliably it is stored: if there is only one copy of each block of a file and the node holding a block fails, the data in that file becomes unreadable. Both the block size and the replication factor are configurable per file. This paper describes the results and analysis of an experimental study of how effectively tuning these settings reduces the execution time of Apache Spark applications compared with the default values. Based on a large number of studies, we employed a trial-and-error strategy to fine-tune these values. We chose two workloads, Wordcount and Terasort, to test the Apache Spark framework for comparative analysis, and we used elapsed time as the evaluation metric.
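
To make the role of the two parameters concrete, the following is a minimal Python sketch of the arithmetic involved, not the authors' experimental code. The function name `hdfs_footprint` and the example sizes are hypothetical; the 128 MiB block size and replication factor of 3 used in the example are the commonly documented HDFS defaults and are assumed here, not taken from the paper.

```python
from typing import NamedTuple


class Footprint(NamedTuple):
    blocks: int                 # number of HDFS blocks the file is split into
    block_replicas: int         # total block copies stored across the cluster
    raw_bytes: int              # total bytes stored, counting every replica
    tolerated_failures: int     # replicas of a block that can be lost safely


def hdfs_footprint(file_size: int, block_size: int, replication: int) -> Footprint:
    """Estimate how a file is laid out for a given block size and replication factor.

    Illustrative helper (hypothetical name). Sizes are in bytes.
    """
    if file_size < 0 or block_size <= 0 or replication < 1:
        raise ValueError("file_size >= 0, block_size > 0, replication >= 1 required")
    # Ceiling division: the last block may be only partially filled.
    blocks = (file_size + block_size - 1) // block_size
    return Footprint(
        blocks=blocks,
        block_replicas=blocks * replication,
        # A partially filled last block only occupies its actual data size,
        # so raw storage scales with the file size, not with the block count.
        raw_bytes=file_size * replication,
        # With r copies of each block, r - 1 of them can be lost.
        tolerated_failures=replication - 1,
    )


if __name__ == "__main__":
    MiB = 1024 ** 2
    GiB = 1024 ** 3
    # Assumed example: a 1 GiB input file under three block sizes.
    for block_mib in (64, 128, 256):
        print(block_mib, hdfs_footprint(1 * GiB, block_mib * MiB, 3))
```

A smaller block size yields more blocks, and therefore typically more input partitions for a Spark job reading the file, while a replication factor of 1 tolerates no node failure, which matches the unreadable-file case described above.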

Article Details

How to Cite
Joshi B, . P, Sawai D. Performance Tuning Of Apache Spark Framework In Big Data Processing with Respect To Block Size And Replication Factor. sms [Internet]. 30 Jun. 2022 [cited 29 Sep. 2022];14(02):152-8. Available from: https://smsjournals.com/index.php/SAMRIDDHI/article/view/2719
Section
Research Articles