Apache Spark - Tuning and Resource Allocation - I

                          
Spark is a cluster computing system that is typically much faster than comparable disk-based tools because of its in-memory processing model.

That said, because of its in-memory nature, any resource in a Spark cluster can become a bottleneck or create one for other components, be it memory, CPU, or network bandwidth.

Since processing is largely in-memory, disk I/O plays a smaller role (it still has an impact), but for now we will concentrate on the memory and network side of things.

Although Spark ships with its own standalone scheduler that can execute Spark jobs, we will discuss the most commonly used framework, i.e. YARN, for deploying Spark applications.

Resource Allocation: 

 
Spark applications can be submitted to YARN in two deploy modes:

1) yarn-client

2) yarn-cluster

In yarn-client mode the driver runs on the client machine and the Application Master (AM) only requests resources from YARN. In yarn-cluster mode the client initiates the application and can disconnect; the Spark driver then runs inside the Application Master process managed by YARN. Cores determine the number of tasks that can run concurrently inside an executor; we will discuss executors and cores later in this section. A minimal sketch of both submissions is shown below.
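For illustration, this is roughly how the two modes are selected on the command line (the application class and jar names here are placeholders):

spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar    # driver runs on the client
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar   # driver runs inside the YARN AM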

Application Master in Client Mode:

 
Assume we have below settings:

- spark.yarn.am.cores 4
- spark.yarn.am.memory 2g

Let's first calculate spark.yarn.am.memoryOverhead. The default value of this parameter is AM memory * 0.10 or 384 MB, whichever is greater.

Let's calculate based on the above settings:

AM Memory = 2048 MB

Overhead = max(2048 * 0.10, 384) = max(204.8, 384) = 384 MB, so the total memory allocated to the AM container is 2048 + 384 = 2432 MB.
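The same arithmetic can be written as a small Python helper. This is just an illustrative sketch of the formula above; the function is ours, not part of any Spark or YARN API:

def yarn_container_mb(requested_mb, overhead_factor=0.10, min_overhead_mb=384):
    # container = requested memory + max(factor * requested, minimum overhead)
    overhead = max(int(requested_mb * overhead_factor), min_overhead_mb)
    return requested_mb + overhead

print(yarn_container_mb(2048))  # 2432 -> AM container in client mode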

Application Master in Cluster Mode:

 
In this mode the Spark driver runs inside the YARN AM. Assume we have the below settings:

- spark.driver.cores (--driver-cores) 4
- spark.driver.memory (--driver-memory) 2g
- spark.yarn.driver.memoryOverhead = driver memory * 0.10, with a minimum of 384 MB.

Now assume that we call spark-submit with the below flags (command-line flags override the settings above):

--driver-memory 1G
--driver-cores 2

Calculation:

Overhead = max(1024 * 0.10, 384) = 384 MB, hence
AM container = 1024 + 384 = 1408 MB, with a Java heap of 1 GB and 2 cores allocated inside the AM.

This overhead tends to grow with the container size, typically to 6-10% of it.
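Using the same illustrative helper as before (again, ours and not a Spark API), the cluster-mode driver container works out as:

def yarn_container_mb(requested_mb, overhead_factor=0.10, min_overhead_mb=384):
    overhead = max(int(requested_mb * overhead_factor), min_overhead_mb)
    return requested_mb + overhead

print(yarn_container_mb(1024))  # 1408 -> driver/AM container in cluster mode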

Spark Executors

 
Let's take the following configuration and understand how the allocation works for Spark executors:
 
- spark.executor.cores (--executor-cores) 4
- spark.executor.instances (--num-executors) 2 (the default)
- spark.executor.memory (--executor-memory) 2g
- spark.yarn.executor.memoryOverhead = executor memory * 0.10, with a minimum of 384 MB.
 
With the above settings, Spark will start 2 executors, each able to run 4 concurrent tasks (one per core), with each executor allocated a 2432 MB container (2048 + 384) and a Java heap size of 2 GB. The total footprint is sketched below.
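As one more illustrative sketch with the same assumed helper (not a Spark API), the per-executor and total executor footprint:

def yarn_container_mb(requested_mb, overhead_factor=0.10, min_overhead_mb=384):
    overhead = max(int(requested_mb * overhead_factor), min_overhead_mb)
    return requested_mb + overhead

executors = 2                         # spark.executor.instances (default)
per_executor = yarn_container_mb(2048)
print(per_executor)                   # 2432 -> MB per executor container
print(executors * per_executor)       # 4864 -> MB requested from YARN for executors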
 

                                                                                                                 More to come in part II ....

 
