Performance Tuning Of Apache Spark Framework In Big Data Processing with Respect To Block Size And Replication Factor
Abstract
Apache Spark has recently become the most popular big data analytics framework, and it ships with default configuration values. The Hadoop Distributed File System (HDFS) stores large files physically across multiple nodes in a distributed fashion. The block size determines how a large file is split and distributed, while the replication factor determines how reliably it is stored: if only one copy of each block exists and the node holding it fails, the data in the file becomes unreadable. Both the block size and the replication factor are configurable per file. This paper presents the results and analysis of an experimental study that measures how tuning these Apache Spark settings reduces application execution time compared with the default values. Drawing on a large body of prior studies, we employed a trial-and-error strategy to fine-tune these parameters. For comparative analysis we chose two workloads, Wordcount and Terasort, and used elapsed execution time as the evaluation metric.
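The block size and replication factor discussed above correspond to the standard HDFS properties dfs.blocksize and dfs.replication. The following is a minimal sketch, not taken from the paper, of how a Spark application might override these two settings per job instead of relying on the cluster-wide defaults; the paths and numeric values are illustrative assumptions only.

```scala
import org.apache.spark.sql.SparkSession

object BlockSizeReplicationTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordcountTuningExample")
      .getOrCreate()

    // Override the HDFS defaults (typically 128 MB blocks, replication 3)
    // for files written by this job; the values below are illustrative.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("dfs.blocksize", (256L * 1024 * 1024).toString) // 256 MB blocks
    hadoopConf.set("dfs.replication", "2")                         // 2 copies per block

    // A simple Wordcount over an HDFS input path (placeholder paths).
    val counts = spark.sparkContext
      .textFile("hdfs:///data/wordcount/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/wordcount/output")
    spark.stop()
  }
}
```

Setting the properties on the job's Hadoop configuration affects only files written by that application, which is one way a per-workload tuning experiment of this kind can vary block size and replication without changing the cluster defaults.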
Article Details
How to Cite
Joshi B, . P, Sawai D. Performance Tuning Of Apache Spark Framework In Big Data Processing with Respect To Block Size And Replication Factor. sms [Internet]. 30 Jun. 2022 [cited 18 May 2025];14(02):152-8. Available from: https://smsjournals.com/index.php/SAMRIDDHI/article/view/2719
Section
Research Article

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.