Thin vs Thick vs Balanced Executors
Choose the right config for your Spark application
In the world of Apache Spark, executors come in three main configurations: thin, thick, and balanced. Each has its own strengths and weaknesses, so it's important to choose the right one for your application to get the best performance.
Thin Executor
Thin executors are small and lightweight, making them perfect for CPU-intensive tasks. They have a small memory footprint, which allows for more of them to be provisioned on a single node. This can lead to better parallelism and faster execution times for CPU-bound applications.
Key Features of Thin Executors:
1. Small memory footprint
2. High CPU utilization
3. Good for CPU-intensive tasks
4. Batch processing
5. Quickly provisioned
6. Quickly deprovisioned
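As a rough illustration of the first two features, here is a back-of-the-envelope sketch (the node specs are hypothetical, not from any particular cluster) of how one-core thin executors divide up a node:

```python
# Back-of-the-envelope thin-executor sizing (hypothetical node specs).
# A thin executor gets exactly 1 core, so a node hosts as many
# executors as it has cores, and the node's RAM is split evenly.

node_cores = 16      # hypothetical cores on one node
node_ram_gb = 64     # hypothetical RAM on one node

executors_per_node = node_cores                            # 1 core each -> 16 executors
memory_per_executor_gb = node_ram_gb / executors_per_node  # 4.0 GB each

print(executors_per_node, memory_per_executor_gb)
```

More executors per node means more parallel tasks, at the cost of a small memory slice for each.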
Let's take an in-depth dive and understand each feature through examples:
Small Memory Footprint:
Application Example: Real-Time Sensor Data Aggregation
Use Case:
Imagine a fleet of IoT devices (e.g., temperature sensors, humidity sensors) deployed across a large area (e.g., a smart city).
These sensors continuously collect data and transmit it to a central processing system.
Tasks:
Receive and process data from multiple sensors simultaneously.
Aggregate sensor readings (e.g., average temperature, humidity levels) over specific time intervals.
Store aggregated data for further analysis (e.g., trend analysis, anomaly detection).
Justification:
Thin executors, with their small memory footprint, are ideal for this scenario.
Each executor can handle data from multiple sensors without excessive memory overhead.
Since sensor data is relatively lightweight, thin executors efficiently process and aggregate it.
High CPU Utilization:
Application Example: Real-Time Fraud Detection
Use Case:
A financial institution processes a continuous stream of transaction data from credit card transactions.
The goal is to detect fraudulent transactions in real time.
Tasks:
Analyze transaction patterns (e.g., spending behavior, transaction frequency).
Apply machine learning models to identify anomalies.
Flag suspicious transactions for further investigation.
Justification:
Thin executors focus on efficient CPU utilization.
Fraud detection involves complex computations (e.g., anomaly detection algorithms).
Thin executors can efficiently process and classify transactions, maximizing CPU resources.
Good for CPU-Intensive Tasks:
Application Example: Genome Sequencing
Use Case:
A research institute analyzes DNA sequences to identify genetic variations.
The goal is to find associations between specific genes and diseases.
Tasks:
Align and compare DNA sequences.
Identify mutations and variations.
Perform statistical analyses on large genomic datasets.
Justification:
Genome sequencing tasks are highly CPU-intensive.
Thin executors can efficiently parallelize sequence alignment and variant calling.
Their small memory footprint allows for more parallel tasks per node.
Batch Processing:
Application Example: Log File Analysis
Use Case:
An e-commerce platform generates log files containing user interactions, product views, and purchases.
The platform wants to analyze user behavior patterns.
Tasks:
Parse and process log files.
Extract relevant information (e.g., popular products, user sessions).
Generate reports or visualizations.
Justification:
Log file analysis involves batch processing of large volumes of data.
Thin executors can efficiently handle log file parsing and aggregation.
Their quick provisioning and deprovisioning adapt to varying log file sizes.
Quickly Provisioned:
Application Example: Ad Hoc Data Exploration
Use Case:
Data analysts need to explore a new dataset to understand its structure and identify potential insights.
The dataset may be large and unfamiliar.
Tasks:
Load data.
Run exploratory queries (e.g., aggregations, joins).
Visualize preliminary results.
Justification:
Thin executors can be provisioned rapidly for ad hoc tasks.
Analysts can quickly explore data without waiting for resource allocation.
Once exploration is complete, thin executors can be released.
Quickly Deprovisioned:
Application Example: Seasonal Demand Forecasting
Use Case:
A retail chain needs to forecast demand for various products during holiday seasons.
Demand patterns change dynamically.
Tasks:
Analyze historical sales data.
Apply time series models (e.g., ARIMA, exponential smoothing).
Generate forecasts.
Justification:
Thin executors can be deprovisioned promptly after peak demand periods.
During non-peak times, fewer executors are needed.
Efficient resource management ensures cost-effectiveness.
Thick Executor
Thick executors are the opposite of thin executors: they're large and memory-hungry, making them ideal for memory-bound tasks. They have a large memory footprint, which allows them to cache large datasets in memory. This can lead to faster execution times for memory-bound applications.
Key Features of Thick Executors:
1. Large memory footprint
2. Low CPU utilization
3. Good for memory-bound tasks
4. Streaming
5. Slowly provisioned
6. Slowly deprovisioned
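By the same back-of-the-envelope logic (again with hypothetical node specs), a thick configuration collapses each node into a single large executor:

```python
# Thick-executor sizing sketch (hypothetical node specs).
# One executor per node: it owns all of that node's cores and RAM.

num_nodes = 4
node_cores = 16
node_ram_gb = 64

total_executors = num_nodes           # 1 executor per node
cores_per_executor = node_cores       # every core on the node
memory_per_executor_gb = node_ram_gb  # the node's full RAM

print(total_executors, cores_per_executor, memory_per_executor_gb)
```

Fewer, larger executors mean bigger in-memory caches, but less scheduling flexibility.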
Let's take an in-depth dive and understand each feature through examples:
Large Memory Footprint:
Application Example: Real-Time Image Processing and Feature Extraction
Use Case:
- A system processes real-time image data from surveillance cameras or satellite imagery.
Tasks:
Detect objects (e.g., vehicles, pedestrians) in images.
Compute image descriptors (e.g., color histograms, texture features).
Store processed data for further analysis.
Justification:
Thick executors with substantial memory allocation are necessary because image processing often involves loading and manipulating large image datasets.
High-resolution images require significant memory to hold intermediate results during feature extraction and object detection.
Low CPU Utilization:
Application Example: Log Aggregation and Monitoring
Use Case:
- A log aggregation system collects logs from various services (e.g., web servers, application servers, network devices).
Tasks:
Parse logs (extract relevant data).
Aggregate metrics (e.g., counting requests, calculating response times).
Detect anomalies (unusual patterns in log entries).
Justification:
Thick executors are suitable for low CPU utilization tasks like log aggregation because they can handle log parsing and aggregation efficiently.
Log data processing primarily involves I/O operations and minimal CPU-intensive computations.
Good for Memory-Bound Tasks:
Application Example: Large-Scale Data Ingestion and Transformation
Use Case:
- Ingest massive amounts of raw data (e.g., sensor data, logs, IoT data).
Tasks:
Transform data (filtering, sorting, aggregating).
Store processed data in a data warehouse or data lake.
Justification:
Thick executors efficiently handle memory-intensive data transformation steps.
Data ingestion and transformation often involve buffering and caching intermediate results, which benefit from larger memory allocations.
Streaming:
Application Example: Real-Time Social Media Sentiment Analysis
Use Case:
- Analyze social media posts (tweets, Facebook updates) to determine sentiment (positive, negative, neutral).
Tasks:
Process real-time social media data streams.
Perform sentiment scoring.
Justification:
Thick executors are well-suited for handling large text datasets efficiently in real-time.
Sentiment analysis involves processing a continuous stream of textual data, which benefits from memory resources.
Slowly Provisioned:
Application Example: Long-Running Batch Processing with Dynamic Scaling
Use Case:
- Process large historical datasets (e.g., ETL jobs, data cleansing, feature extraction).
Tasks:
Gradually provision new executors as needed.
Retain executors for a certain duration after completing tasks.
Justification:
Thick executors provisioned slowly help avoid sudden resource spikes.
Long-running batch processing tasks benefit from stable and gradual resource allocation.
Slowly Deprovisioned:
Application Example: Machine Learning Model Training
Use Case:
- Train complex machine learning models (e.g., deep neural networks, gradient-boosted trees).
Tasks:
Slowly deprovision executors after training completion.
Retain executors for subsequent training tasks.
Justification:
Thick executors remain active to accommodate future training iterations.
Machine learning model training requires stability and continuity in resource availability.
Balanced Executors
Balanced executors strike a middle ground between thin and thick executor configurations. They aim to optimize resource utilization by balancing memory allocation and CPU cores per executor. Here are the key features of balanced executors:
Key Features of Balanced Executors:
Moderate Memory Footprint:
Balanced executors allocate memory that is neither too small nor too large.
They efficiently use memory for caching intermediate results and handling data processing tasks.
Suitable for workloads that require a reasonable amount of memory.
Reasonable CPU Utilization:
Balanced executors distribute CPU cores effectively.
They can handle moderately CPU-intensive tasks without excessive overhead.
Well-suited for a wide range of computational workloads.
Adaptability:
Balanced executors can adjust to varying workload demands.
They provide flexibility in resource allocation based on task requirements.
Ideal for scenarios where both memory and CPU play crucial roles.
Stability and Efficiency:
By maintaining a balance, they avoid resource contention and underutilization.
Stable provisioning and deprovisioning ensure efficient cluster management.
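A common rule of thumb for arriving at a balanced configuration (a heuristic often cited in Spark tuning guides, not an official formula) is: reserve one core and 1 GB per node for the OS and daemons, target about 5 cores per executor, and keep roughly 10% of each executor's memory aside for off-heap/YARN overhead. A sketch with hypothetical node specs:

```python
# Balanced-executor sizing heuristic (rule of thumb, hypothetical specs):
# 1) reserve 1 core + 1 GB per node for the OS and daemons,
# 2) target ~5 cores per executor,
# 3) keep ~10% of executor memory aside for overhead.

num_nodes, node_cores, node_ram_gb = 6, 16, 64

usable_cores = node_cores - 1            # 15 cores left for Spark
usable_ram_gb = node_ram_gb - 1          # 63 GB left for Spark

cores_per_executor = 5
executors_per_node = usable_cores // cores_per_executor  # 3
total_executors = executors_per_node * num_nodes         # 18

raw_mem_gb = usable_ram_gb / executors_per_node          # 21.0
executor_memory_gb = int(raw_mem_gb * 0.9)               # ~18 GB after overhead

print(total_executors, cores_per_executor, executor_memory_gb)
```

The exact reservations vary by cluster manager; the point is that a balanced layout deliberately leaves headroom rather than carving the node to the last core and byte.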
Problems
Now that we are familiar with the terms and their possible uses, let us apply our knowledge to a resource allocation problem.
Question: Create a spark-submit command for each executor configuration using the resources below.
Nodes: 6
Cores (per node): 15
RAM (per node): 64 GB
1. Thin Executor [1 core per executor]
Total executors: 90 [6 nodes × 15 executors]
Executor memory: ~4.2GB [64GB ÷ 15 executors]
Total cores: 90 [90 executors × 1 core]
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 90 \
--executor-cores 1 \
--executor-memory 4g \
--conf spark.sql.shuffle.partitions=100 \
my_spark_app.jar \
input_data_path output_data_path
2. Thick Executor [all cores in 1 executor]
Total executors: 6 [6 nodes × 1 executor]
Executor memory: 64GB [64GB ÷ 1 executor]
Cores per executor: 15 [15 cores ÷ 1 executor]
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 6 \
--executor-cores 15 \
--executor-memory 64g \
--conf spark.sql.shuffle.partitions=100 \
my_spark_app.jar \
input_data_path output_data_path
3. Balanced Executor [5 cores per executor]
Total executors: 18 [6 nodes × 3 executors]
Executor memory: ~21GB [64GB ÷ 3 executors]
Total cores: 90 [18 executors × 5 cores]
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 18 \
--executor-cores 5 \
--executor-memory 21g \
--conf spark.sql.shuffle.partitions=100 \
my_spark_app.jar \
input_data_path output_data_path
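The three allocations above can be double-checked with a few lines of arithmetic, using the cluster figures from the problem statement:

```python
# Re-derive the three executor layouts for the stated cluster:
# 6 nodes, 15 cores per node, 64 GB RAM per node.
nodes, cores_per_node, ram_gb_per_node = 6, 15, 64

# 1. Thin: 1 core per executor
thin_executors = nodes * cores_per_node              # 90
thin_memory_gb = ram_gb_per_node / cores_per_node    # ~4.27 GB each

# 2. Thick: 1 executor per node
thick_executors = nodes                              # 6
thick_cores = cores_per_node                         # 15 cores each
thick_memory_gb = ram_gb_per_node                    # 64 GB each

# 3. Balanced: 5 cores per executor
balanced_per_node = cores_per_node // 5              # 3 executors per node
balanced_executors = nodes * balanced_per_node       # 18
balanced_memory_gb = ram_gb_per_node / balanced_per_node  # ~21.3 GB each

print(thin_executors, thick_executors, balanced_executors)
```

Note that all three layouts use the same 90 cores and 384 GB of RAM; they differ only in how those resources are sliced into executors.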