Spark has been adopted by tech giants to bring intelligence to their applications. Predictive analytics and machine learning, along with traditional data warehousing, use Spark as the execution engine behind the scenes.
I have been exploring Spark since its incubation, and I have found Spark Core to be an effective replacement for MapReduce. Optimizing jobs in Spark is a tricky area, as there are not many ways to do it. I have tried to optimize performance by sequencing code differently, but because Spark uses lazy evaluation and the DAG is built before execution, there is little room to alter it that way.
This blog shares some information about optimizing Spark jobs, both programmatically (i.e. writing better code) and by tuning the hardware resources given to a job.
Before getting into the details, let us get familiar with how Spark runs a job through YARN.
Now let us have a look at the configurable properties.
spark.executor.cores — number of cores to use for each executor process
spark.executor.memory — amount of memory to use per executor process
spark.executor.instances — number of executor processes

For further explanation, I am using a cluster of 20 nodes, each with 16 cores and 64 GB of RAM (the numbers used in the calculations below).
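Before working through the sizing, here is a minimal sketch of where these three properties are set, assuming a Scala application; the values below are placeholders, not the recommendation (that is derived later in the post).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values; the post derives concrete numbers below.
val conf = new SparkConf()
  .setAppName("ExecutorSizingExample")
  .set("spark.executor.cores", "5")      // cores per executor process
  .set("spark.executor.memory", "10g")   // heap memory per executor process
  .set("spark.executor.instances", "10") // number of executor processes

val sc = new SparkContext(conf)
```

Equivalently, the same settings can be passed to spark-submit as --executor-cores, --executor-memory and --num-executors.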
So before deciding how much RAM we need per executor, let us look at YARN's memory needs, which are configured by spark.yarn.executor.memoryOverhead:
The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%). The default is executorMemory * 0.10, with a minimum of 384.
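A rough sketch of that default formula, just to make the arithmetic concrete (the function name is mine, not a Spark API):

```scala
// Default overhead quoted above: max(executorMemory * 0.10, 384 MB)
def defaultMemoryOverheadMB(executorMemoryMB: Long): Long =
  math.max((executorMemoryMB * 0.10).toLong, 384L)

defaultMemoryOverheadMB(20480)  // a 20 GB executor reserves roughly 2048 MB extra
```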
So YARN needs, say, 6-10%, i.e. roughly 4-6 GB of RAM. YARN also needs 1 core for its own processing, which leaves 15 cores per node with roughly 60 GB of usable RAM per node.
But 15 cores per executor would lead to poor HDFS I/O throughput; 5-6 cores per executor is closer to optimal considering HDFS I/O needs.
So the cluster has 20 nodes * 15 cores (after taking out 1 core per node for YARN) = 300 cores.
Number of executors = 300 cores / 6 cores per executor = 50 executors with 6 cores each.
So each node runs 50 / 20 = 2.5, i.e. roughly 3 executors.
Memory per executor will be 60 GB / 3 = 20 GB, multiplied by (1 - 0.06) to leave room for memory overhead, i.e. roughly 19 GB of RAM per executor.
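Putting the arithmetic together, here is a minimal Scala sketch (assuming the 20-node, 16-core, 64 GB cluster used in this post) that reproduces these numbers and prints the equivalent spark-submit flags:

```scala
// Re-deriving the sizing above for a 20-node cluster (16 cores, 64 GB RAM per node).
val nodes            = 20
val coresPerNode     = 16 - 1      // 1 core per node reserved for YARN
val usableMemPerNode = 60.0        // GB left per node after YARN's memory needs

val coresPerExecutor = 6           // 5-6 cores keeps HDFS I/O healthy
val totalCores       = nodes * coresPerNode                      // 300
val numExecutors     = totalCores / coresPerExecutor             // 50
val executorsPerNode = math.ceil(numExecutors.toDouble / nodes)  // 2.5 -> 3
val memPerExecutorGB = math.round(usableMemPerNode / executorsPerNode * (1 - 0.06))  // ~19

println(s"--num-executors $numExecutors --executor-cores $coresPerExecutor " +
        s"--executor-memory ${memPerExecutorGB}g")
```

The same values can of course be set through spark.executor.instances, spark.executor.cores and spark.executor.memory, as shown earlier.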