Pinterest Engineering · February 17, 2026

Pinterest's Auto Memory Retries for Apache Spark OOMs

Pinterest engineered "Auto Memory Retries" to mitigate out-of-memory (OOM) errors in their large-scale Apache Spark deployment, enhancing resource efficiency and reliability. This system automatically identifies Spark tasks with high memory demands and retries them on executors with larger memory profiles, dynamically adjusting resource allocation. The solution involves extending core Spark classes to support task-level resource profiles and a hybrid retry strategy, showcasing a practical approach to optimizing distributed data processing.


Pinterest operates a massive Apache Spark infrastructure, processing over 90,000 jobs daily on tens of thousands of compute nodes. A significant challenge they faced was frequent Out-Of-Memory (OOM) errors, accounting for over 4.6% of job failures. These OOMs led to substantial compute waste, increased on-call burden, and delayed downstream jobs. The root cause was often small executor sizes, but simply increasing them globally wasn't feasible due to cluster memory constraints and the difficulty of manual, per-job tuning.

The Challenge of Dynamic Resource Allocation in Spark

The core problem lies in Spark's default resource allocation model, where all tasks within a TaskSet share a uniform resource profile. Data skew can cause individual tasks within the same stage to have vastly different memory requirements. Predicting these demands upfront is nearly impossible, making static configuration inefficient. Manual tuning is time-consuming and often ineffective across diverse job stages and dynamic data characteristics. This necessitated an elastic approach to executor sizing.
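A toy example (with assumed numbers) makes the skew problem concrete: if every task in a stage gets the same memory budget, one hot partition can blow past the limit while its siblings fit comfortably, and sizing all executors for the worst case would waste most of that memory.

```python
# Toy illustration of uniform per-task budgets under data skew.
# The partition sizes and budget below are assumptions for illustration,
# not figures from Pinterest's workload.

partition_mb = [300, 280, 310, 295, 9000]  # one heavily skewed partition
uniform_task_budget_mb = 1024              # same budget for every task in the stage

# Tasks whose input exceeds the shared budget will OOM.
oom_tasks = [i for i, mb in enumerate(partition_mb) if mb > uniform_task_budget_mb]

# Sizing every executor for the 9 GB outlier would leave ~8x headroom
# unused on the typical tasks; sizing for the typical case OOMs the outlier.
```

Only the skewed task fails here, which is exactly the case static, stage-wide configuration cannot handle efficiently.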

Auto Memory Retries: A Hybrid Approach

Pinterest's solution, "Auto Memory Retries," dynamically adjusts resources for OOM-failed tasks. It employs a hybrid strategy:

1. Increase CPU per task: On the first OOM, if the executor has multiple cores, the failing task's `cpus per task` property is doubled. The task then runs on an existing, default-sized executor while sharing its memory with fewer neighboring tasks, effectively gaining more dedicated memory without provisioning new hardware. This is a fast, cheap first step.
2. Launch a larger executor: If the OOM persists, a new, physically larger executor is launched for the retrying task. Pre-defined immutable retry resource profiles (2x, 3x, 4x memory) are applied sequentially, ensuring tasks with genuinely higher memory demands get the resources they need.

Architectural Modifications to Apache Spark

Implementing this required extending several core Apache Spark classes, showing how custom logic can be embedded into a complex distributed framework:

* Task: Modified to hold an optional `taskRpId` for tasks that deviate from their TaskSet's resource profile.
* TaskSetManager: Tracks tasks with deviating profiles and automatically assigns the next larger retry profile upon OOM failure.
* TaskSchedulerImpl: Decides which resource profile to schedule, allowing tasks with an increased `cpus per task` to run on default executors for efficiency.
* ExecutorAllocationManager: Notified of resource profile updates; dynamically launches physically larger executors when tasks with retry profiles are pending.
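The TaskSetManager's role, tracking which tasks have deviated from the TaskSet's profile and bumping them to the next retry profile on OOM, can be sketched as bookkeeping over a per-task index. This is a hypothetical Python sketch; the class and method names are illustrative, not the actual Spark (Scala) internals.

```python
class TaskSetManagerSketch:
    """Illustrative bookkeeping for per-task retry resource profiles."""

    def __init__(self, retry_profile_ids):
        # Ordered ids of the pre-defined retry profiles (e.g. 2x, 3x, 4x memory).
        self.retry_profile_ids = list(retry_profile_ids)
        # task_id -> index into retry_profile_ids; absent means the task
        # still uses the TaskSet's default profile (no taskRpId set).
        self.task_rp_index = {}

    def profile_for(self, task_id):
        """Profile id to schedule this task with (None = TaskSet default)."""
        idx = self.task_rp_index.get(task_id)
        return None if idx is None else self.retry_profile_ids[idx]

    def on_oom_failure(self, task_id):
        """Assign the next larger retry profile before the task is re-queued."""
        idx = self.task_rp_index.get(task_id, -1) + 1
        if idx >= len(self.retry_profile_ids):
            raise RuntimeError(f"task {task_id} exhausted all retry profiles")
        self.task_rp_index[task_id] = idx
```

In this sketch the scheduler (the TaskSchedulerImpl role) would call `profile_for` when placing a task, and the allocation manager would see pending tasks with non-default profiles and provision larger executors accordingly.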


System Design Takeaway: Extending Open Source Frameworks

When facing limitations in off-the-shelf distributed frameworks, consider whether extending or customizing core components is a viable solution. This often provides finer control and better integration than external listeners or wrapper approaches, but requires deep understanding of the framework's internals and careful management of compatibility with future upgrades.

Apache Spark · Resource Management · Out-of-Memory · Distributed Computing · Pinterest Engineering · Data Infrastructure · Kubernetes · Scalability
