Dropbox Dash uses a feature store to manage and deliver data signals ('features') to its real-time AI ranking models. The system is critical for quickly surfacing relevant documents, images, and conversations from tens of thousands of candidate work items. The core design challenge was to serve features quickly, adapt to changing user behavior, and support rapid development cycles for machine learning engineers.
Architectural Goals and Challenges
- <b>Hybrid Infrastructure:</b> Bridging an on-premises ecosystem for low-latency service communication with a Spark-native cloud environment for data processing.
- <b>Massive Parallel Reads:</b> Handling thousands of feature lookups per user query across diverse data points (interaction history, metadata, real-time signals).
- <b>Strict Latency Budgets:</b> Achieving sub-100ms end-to-end latency for feature retrieval.
- <b>Real-time Freshness:</b> Ingesting user behavior signals (e.g., opening a document) within seconds to reflect in subsequent searches.
- <b>Unified Computation:</b> Supporting both real-time streaming and batch processing patterns within a single, consistent framework to reduce engineering cognitive load.
Hybrid Feature Store Design
Dropbox adopted a hybrid architecture centered around Feast for orchestration and serving APIs. Key components include:
- <b>Feast Core:</b> Used for its clear separation of feature definitions and infrastructure concerns, allowing ML engineers to focus on PySpark transformations. Its modularity and adapter ecosystem facilitated integration.
- <b>Custom Go Serving Layer:</b> Replaced Feast's Python online serving path to meet stringent concurrency and latency requirements. This Go service leverages goroutines, shared memory, and faster JSON parsing, achieving p95 latencies of 25-35ms.
- <b>Dynovault:</b> Dropbox's in-house DynamoDB-compatible storage solution, co-located with inference workloads in their hybrid cloud. Dynovault provides ~20ms client-side latency by avoiding public internet calls, balancing cost and scalability.
- <b>Spark Jobs & Cloud Storage:</b> Responsible for offline indexing, large-scale feature computation, and ingestion.
To maintain ranking quality, the feature store employs a three-part ingestion system balancing freshness, reliability, and scale:
- <b>Batch Ingestion:</b> Handles complex, high-volume transformations using a medallion architecture. Intelligent change detection reduces write volumes from hundreds of millions to under one million records per run, cutting update times from over an hour to under five minutes.
- <b>Streaming Ingestion:</b> Processes fast-moving signals (e.g., collaboration activity) in near real-time to ensure features align with current user behavior.
- <b>Direct Writes:</b> Bypasses batch pipelines for lightweight or precomputed features (e.g., LLM relevance scores), writing directly to the online store in seconds.
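The batch-path change detection can be approximated as a fingerprint diff: hash each candidate row, compare against the fingerprint stored for the previous run, and write only the rows that differ. The sketch below is hypothetical (`fingerprint`, `changedRows`, and the key layout are illustrative names, and Dropbox runs this kind of logic inside Spark, not a standalone Go program), but it shows why write volume drops from the full table size to just the delta.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes a serialized feature row so two runs can be compared
// without storing full previous values.
func fingerprint(row string) string {
	sum := sha256.Sum256([]byte(row))
	return hex.EncodeToString(sum[:])
}

// changedRows returns only rows whose fingerprints differ from the previous
// run; prev maps entity key -> last written fingerprint (illustrative layout).
func changedRows(rows, prev map[string]string) map[string]string {
	delta := make(map[string]string)
	for key, row := range rows {
		if prev[key] != fingerprint(row) {
			delta[key] = row
		}
	}
	return delta
}

func main() {
	prev := map[string]string{
		"user:1": fingerprint(`{"open_count": 4}`),
		"user:2": fingerprint(`{"open_count": 9}`),
	}
	current := map[string]string{
		"user:1": `{"open_count": 4}`,  // unchanged -> skipped
		"user:2": `{"open_count": 10}`, // changed   -> written
	}
	fmt.Println(len(changedRows(current, prev))) // only the delta is written
}
```

The same idea scales to the numbers in the text: if most of a multi-hundred-million-row table is unchanged between runs, only the small changed fraction ever reaches the online store.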
💡 Key Learning: Python vs. Go for High-Concurrency Serving
A critical lesson learned was that Python's Global Interpreter Lock (GIL) and JSON parsing overhead became significant bottlenecks for high-throughput, mixed CPU and I/O workloads. Rewriting the serving layer in Go offered a more predictable scaling path for concurrency.
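A minimal sketch of why Go's model scales this kind of mixed workload predictably: a worker pool sized to the machine decodes JSON payloads on all cores at once, with no interpreter lock serializing the CPU-bound portion. The struct and payloads are illustrative, not Dropbox's schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
	"sync"
)

// featureRow is an illustrative payload shape, not Dropbox's actual schema.
type featureRow struct {
	Key   string  `json:"key"`
	Value float64 `json:"value"`
}

// decodeAll parses payloads on a worker pool sized to the CPU count; the
// CPU-bound Unmarshal calls genuinely run in parallel, unlike under a GIL.
func decodeAll(payloads []string) []featureRow {
	jobs := make(chan int)
	out := make([]featureRow, len(payloads))
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each worker writes to its own slice index, so no lock
				// is needed around the results.
				_ = json.Unmarshal([]byte(payloads[i]), &out[i])
			}
		}()
	}
	for i := range payloads {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	payloads := []string{
		`{"key":"click_rate","value":0.42}`,
		`{"key":"open_count","value":7}`,
	}
	rows := decodeAll(payloads)
	fmt.Println(rows[0].Key, rows[1].Value)
}
```

The same pool structure also absorbs I/O waits cheaply, since a blocked goroutine frees its OS thread for other work, which is what makes throughput under mixed CPU and I/O load more predictable than a GIL-bound process.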