Pinterest engineered a significant upgrade to their ads lightweight ranking system by migrating two-tower models to GPU serving. This shift enabled the adoption of a more complex MMOE-DCN architecture, improving prediction accuracy and efficiency. The article details the architectural evolution, optimizations for GPU training, and the observed performance gains in both offline and online metrics.
Pinterest's ads recommendation system utilizes a lightweight ranking stage to efficiently filter candidate ads before more complex downstream models process them. This stage is critical for balancing prediction accuracy with serving latency. Historically, their two-tower models for engagement prediction, which compute Pin (ad) embeddings offline and query (user) embeddings in real-time, were served entirely on CPUs. The recent migration to GPU serving marks a significant evolution in their machine learning infrastructure.
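The two-tower split described above can be sketched as follows. This is a minimal NumPy illustration of the serving pattern, not Pinterest's actual implementation: the tower functions, feature dimensions, and corpus size are all hypothetical, but the key idea is faithful to the article — the ad tower runs offline in batch, while only the cheap user tower and a dot product run at request time.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 64  # shared embedding dimension for both towers (illustrative)

def pin_tower(pin_features: np.ndarray) -> np.ndarray:
    """Stand-in for the offline ad (Pin) tower: project features to EMB_DIM."""
    w = rng.standard_normal((pin_features.shape[-1], EMB_DIM)) * 0.1
    return np.tanh(pin_features @ w)

def query_tower(user_features: np.ndarray) -> np.ndarray:
    """Stand-in for the real-time query (user) tower."""
    w = rng.standard_normal((user_features.shape[-1], EMB_DIM)) * 0.1
    return np.tanh(user_features @ w)

# Offline: embed the whole candidate-ad corpus once and cache the result.
ad_corpus = rng.standard_normal((1000, 128))
pin_embeddings = pin_tower(ad_corpus)          # shape (1000, EMB_DIM)

# Online: embed the incoming request, then score every candidate with a
# single matrix multiply -- no heavy tower runs over the corpus at serving time.
user = rng.standard_normal((1, 96))
query_embedding = query_tower(user)            # shape (1, EMB_DIM)
scores = query_embedding @ pin_embeddings.T    # shape (1, 1000)
top_k = np.argsort(-scores[0])[:10]            # candidates passed downstream
```

Because candidate scoring reduces to a dot product against precomputed embeddings, the lightweight ranking stage stays cheap regardless of how complex either tower becomes.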
The core of the architectural upgrade involves transitioning from a Multi-Task Multi-Domain (MTMD) model to a more sophisticated Multi-gate Mixture-of-Experts (MMOE) with Deep & Cross Networks (DCN) design. The MTMD model relied on domain-specific modules, whereas the MMOE architecture effectively handles multi-domain and multi-task challenges without explicit domain modules by employing multiple 'experts' with MLP gating. Each expert within their MMOE model incorporates both full-rank and low-rank DCN layers, allowing for deeper feature interactions while managing model complexity.
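The DCN layers inside each expert can be sketched with the standard cross-layer recurrence. The code below is a hypothetical NumPy illustration (dimensions and initialization are made up): the full-rank layer computes x_next = x0 * (W x + b) + x, and the low-rank variant factors W as U Vᵀ, trading some expressiveness for far fewer parameters, which is how the article describes managing model complexity while keeping deep feature interactions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, R = 32, 4  # feature dimension and low rank (illustrative values)

def cross_layer(x0, x, W, b):
    # Full-rank cross layer: explicit feature interaction with the input x0.
    return x0 * (W @ x + b) + x

def low_rank_cross_layer(x0, x, U, V, b):
    # Low-rank variant: W approximated as U @ V.T, shrinking the parameter
    # count for this layer from D*D to 2*D*R.
    return x0 * (U @ (V.T @ x) + b) + x

x0 = rng.standard_normal(D)                 # layer input (e.g. expert input)
W = rng.standard_normal((D, D)) * 0.1       # full-rank weights
U = rng.standard_normal((D, R)) * 0.1       # low-rank factors
V = rng.standard_normal((D, R)) * 0.1
b = np.zeros(D)

# Stack one full-rank and one low-rank cross layer, as in a mixed-rank expert.
x1 = cross_layer(x0, x0, W, b)
x2 = low_rank_cross_layer(x0, x1, U, V, b)
```

The residual term (`+ x`) keeps each layer a refinement of the previous one, so cross layers can be stacked for higher-order interactions without losing the original signal.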
Why MMOE-DCN?
MMOE is particularly effective in multi-task learning scenarios where different tasks (e.g., click prediction, conversion prediction) share some underlying features but also require task-specific modeling. By using multiple experts and a gating mechanism, it allows the model to learn both shared and task-specific patterns more effectively than a single-expert approach. DCN layers, on the other hand, are designed to capture explicit and implicit feature interactions, which are crucial for the high-dimensional sparse data typical of recommendation systems.
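The gating mechanism described above can be illustrated with a small NumPy sketch. This is an assumed, simplified MMOE with plain MLP experts (the article's experts additionally contain DCN layers); all dimensions and weights are illustrative. Each task has its own gate, which produces a softmax over the shared experts, so tasks like click and conversion prediction mix the same expert pool in different proportions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H, N_EXPERTS, N_TASKS = 32, 16, 4, 2  # e.g. 2 tasks: click, conversion

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Shared experts: each a tiny two-layer MLP (illustrative stand-in).
expert_w1 = rng.standard_normal((N_EXPERTS, D, H)) * 0.1
expert_w2 = rng.standard_normal((N_EXPERTS, H, H)) * 0.1
# One MLP gate per task, emitting a softmax distribution over experts.
gate_w = rng.standard_normal((N_TASKS, D, N_EXPERTS)) * 0.1

def mmoe(x):
    # Run every shared expert on the same input: shape (N_EXPERTS, H).
    experts = np.stack([
        np.maximum(x @ expert_w1[e], 0) @ expert_w2[e]
        for e in range(N_EXPERTS)
    ])
    outputs = []
    for t in range(N_TASKS):
        gate = softmax(x @ gate_w[t])   # per-task mixing weights over experts
        outputs.append(gate @ experts)  # weighted expert mix for this task
    return outputs  # one H-dim representation per task head

task_reprs = mmoe(rng.standard_normal(D))
```

Because the gates are learned per task, no hand-built domain modules are needed: the gating itself routes standard and shopping traffic, or click and conversion objectives, to the experts that serve them best.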
The increased complexity and size of the new MMOE-DCN model, coupled with large training datasets, necessitated significant optimizations to keep training efficient.
These optimizations were crucial in achieving the reported 5-10% reduction in offline loss and substantial improvements in online metrics like Cost-Per-Click (CPC) and Click-Through Rate (CTR). The segmentation of standard and shopping ad scenarios, along with training on relevant data, further reduced loss and doubled model iteration speed, highlighting the importance of data strategy alongside model and infrastructure improvements.