🍃MongoDB Blog·December 18, 2025

Optimizing Embedding Inference for Short Queries with Token-Count Batching

This article discusses an architectural approach to improving the efficiency and reducing the latency of embedding-model inference for the short, bursty requests common in search and recommendation systems. It focuses on token-count-based batching combined with padding removal, showing how the two together improve GPU utilization and reduce operational cost. The core challenge is the inefficiency of processing many short, memory-bound requests sequentially; the architectural solutions center on intelligent queue design and inference-engine capabilities.


The Challenge: Inefficient Short Query Inference

Embedding model inference for 'queries' (short requests like search terms) presents unique challenges. These queries are typically very short, have skewed token-length distributions, and require low latency (100–300 ms). Due to their brevity, inference becomes memory-bound rather than compute-bound. Furthermore, query traffic is often spiky, making traditional autoscaling inefficient. Processing these short requests sequentially leads to high overheads and underutilization of GPU resources.

Key Techniques for Efficient Batching

1. Padding Removal

Traditional inference engines often pad every sequence in a batch to the longest length, producing a (B, S) tensor, where B is the batch size and S is the maximum sequence length. Inference latency then scales with B × S, wasting compute and memory on padding tokens. Modern inference engines such as vLLM and SGLang support padding removal and variable-length processing: they concatenate all active sequences into one 'super sequence' of length T = Σ token_count_i, so inference time tracks the actual token count T rather than the padded size, aligning GPU work with useful computation.
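A minimal sketch of the padding overhead described above: with max-length padding the GPU processes B × S token slots, but only T = Σ token_count_i of them carry real tokens. The token lengths below are illustrative, not measurements from the article.

```python
def padded_slots(lengths):
    # Padded batch: every sequence is padded to the longest one, B x S slots.
    return len(lengths) * max(lengths)

def actual_tokens(lengths):
    # Padding-free "super sequence": T = sum of real token counts.
    return sum(lengths)

# A skewed short-query distribution: three short queries, one long outlier.
lengths = [6, 9, 12, 64]
waste = 1 - actual_tokens(lengths) / padded_slots(lengths)
print(f"padded slots: {padded_slots(lengths)}")   # 4 * 64 = 256
print(f"real tokens:  {actual_tokens(lengths)}")  # 91
print(f"wasted work:  {waste:.0%}")               # 64%
```

The skew matters: one long sequence forces every short query in the batch to be padded to its length, which is exactly the waste that padding removal eliminates.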

2. Token-Count-Based Batching

Voyage AI by MongoDB introduced token-count-based batching. Unlike time-window or request-count batching, which can lead to oscillating utilization due to bursty traffic, this method groups queries based on their *total token count* (Σtoken_count_i) within a batch. This strategy aligns the batch size directly with the actual compute required, moving inference from a memory-bound to a compute-bound regime. By aiming for an 'optimal batch size' that corresponds to the GPU's saturation point, fixed per-request overheads are amortized, reducing latency and increasing throughput.
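The grouping step can be sketched as a greedy claim against a token budget. This is an illustrative simplification (the queue contents and the budget are made up, and a production system would enforce atomicity, as discussed below), but it shows how batch boundaries follow Σ token_count_i rather than request count.

```python
from collections import deque

def take_batch(queue, token_budget):
    """Claim queries from the front of the queue until adding the next
    one would exceed the token budget (the GPU saturation point)."""
    batch, total = [], 0
    while queue and total + queue[0][0] <= token_budget:
        tokens, payload = queue.popleft()
        batch.append(payload)
        total += tokens
    return batch, total

# (token_count, query) pairs with a skewed length distribution.
pending = deque([(7, "q1"), (5, "q2"), (120, "q3"), (9, "q4")])
batch, total = take_batch(pending, token_budget=128)
print(batch, total)  # ['q1', 'q2'] 12 -- q3 would overfill, so the batch closes
```

Note that the claim stops at the first query that would overfill the budget instead of skipping past it, preserving FIFO ordering at the cost of an occasional small batch.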

💡

Optimal Batch Size

The optimal batch size for token-count batching is often at the 'saturation point' where inference latency transitions from being nearly flat (dominated by fixed overheads) to linearly scaling with token count. This point balances latency and throughput/MFU.
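One hedged way to locate that saturation point is to profile latency at increasing total token counts and take the largest count that is still in the flat regime. The measurements and the 10% tolerance below are synthetic, chosen only to illustrate the shape of the curve.

```python
def saturation_point(samples, tol=0.10):
    """samples: (total_tokens, latency_ms) pairs, sorted by token count.
    Returns the largest token count whose latency is still within `tol`
    of the smallest observed latency, i.e. still overhead-dominated."""
    floor = min(lat for _, lat in samples)
    flat = [t for t, lat in samples if lat <= floor * (1 + tol)]
    return max(flat)

# Synthetic profile: nearly flat up to ~1024 tokens, then linear growth.
samples = [(256, 20.1), (512, 20.3), (1024, 20.9), (2048, 24.0), (4096, 41.5)]
print(saturation_point(samples))  # 1024 -- a candidate optimal batch size
```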

Architectural Implications: Queue Design

Implementing token-count-based batching requires a specialized queueing system. Generic brokers like Kafka or RabbitMQ, which typically batch by message count or bytes, are insufficient because they lack the ability to 'peek' across requests and atomically claim a subset based on a cumulative token count. The system needs to estimate `token_count` for each request, inspect multiple pending requests, and then atomically group them until the optimal total token count for a batch is reached. Solutions include a lightweight aggregator in front of traditional brokers or, as implemented by Voyage AI, using a store like Redis with Lua scripting for atomic `peek + conditional batching` operations.

```lua
-- Atomically claim queued requests until the optimal total token
-- count is reached. Items are encoded as "token_count::payload".
-- KEYS[1] = queue key, ARGV[1] = optimal batch size (in tokens).
local queue_key = KEYS[1]
local OPTIMAL_BATCH_SIZE = tonumber(ARGV[1])

local total_tokens = 0
local batch_requests = {}

while total_tokens < OPTIMAL_BATCH_SIZE do
    local item = redis.call('LPOP', queue_key)
    if not item then break end

    -- Parse the token count from the "token_count::payload" prefix.
    local token_count = tonumber(string.match(item, '^([^:]+)::'))

    if token_count and total_tokens + token_count <= OPTIMAL_BATCH_SIZE then
        table.insert(batch_requests, item)
        total_tokens = total_tokens + token_count
    else
        -- The next item would overfill the batch: push it back and stop.
        redis.call('LPUSH', queue_key, item)
        break
    end
end

return batch_requests
```
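The behavior of the Lua script can be mirrored in plain Python for testing, assuming the same hypothetical `token_count::payload` item encoding. In a real deployment this logic runs inside Redis (via `EVAL`), which is what makes the peek-and-claim atomic across workers; the simulation below has no such guarantee.

```python
from collections import deque

def parse_tokens(item):
    # Mirrors the Lua pattern '^([^:]+)::' -- the text before "::".
    return int(item.split("::", 1)[0])

def claim_batch(queue, optimal_batch_size):
    """Pop items from the front until the token budget is reached; an
    item that would overfill the batch is pushed back for the next batch."""
    total, batch = 0, []
    while total < optimal_batch_size and queue:
        item = queue.popleft()            # LPOP
        tokens = parse_tokens(item)
        if total + tokens <= optimal_batch_size:
            batch.append(item)
            total += tokens
        else:
            queue.appendleft(item)        # LPUSH the overfilling item back
            break
    return batch

q = deque(["7::how to shard", "5::ttl index", "120::long paste", "9::atlas cli"])
print(claim_batch(q, optimal_batch_size=16))
# ['7::how to shard', '5::ttl index']
```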
Tags: embedding inference, batching, GPU optimization, low latency, queue design, Redis, vLLM, resource utilization
