Datadog Blog · February 23, 2026

Evaluating AI Guard: LLM Observability for Safe AI Agents

Datadog leveraged LLM Observability to develop and test its internal "AI Guard" application. The system is designed to safeguard Bits AI Agents by detecting and blocking unsafe Large Language Model (LLM) behavior, and it highlights the importance of robust monitoring and evaluation for both quality and cost control in AI system development.

Read original on Datadog Blog

The article describes Datadog's internal process of building and evaluating "AI Guard," an application designed to protect their Bits AI Agents from generating unsafe content. This process heavily relies on LLM Observability, which is a crucial system design consideration for any application integrating Large Language Models (LLMs). The core challenge is to ensure the quality and safety of AI agent outputs while managing the operational costs associated with LLM usage.

LLM Observability in Action

Implementing effective LLM Observability is paramount for developing reliable AI systems. It involves capturing, monitoring, and analyzing various metrics related to LLM interactions, including prompt and response content, latency, token usage, and model-specific evaluations. This data allows developers to identify performance bottlenecks, detect model drift, and, in the case of AI Guard, pinpoint and mitigate unsafe behaviors.

💡 Key Aspects of LLM Observability

When designing systems with LLMs, consider logging and monitoring:

1. Prompt and response pairs for debugging and auditing.
2. Token usage and cost for financial tracking and optimization.
3. Latency and error rates for performance monitoring.
4. Semantic evaluation metrics for content quality and safety.
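The four aspects above can be captured in a single telemetry record per LLM call. The sketch below is a minimal illustration, not Datadog's actual schema; the class name, fields, and per-million-token pricing rates are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical per-call telemetry record covering the four aspects above:
# prompt/response pairs, token usage and cost, latency, and evaluation scores.
@dataclass
class LLMCallRecord:
    prompt: str
    response: str = ""
    model: str = "unknown"
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    eval_scores: dict = field(default_factory=dict)  # e.g. {"safety": 0.97}
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def cost_usd(self, in_rate: float, out_rate: float) -> float:
        # Rates are USD per million tokens; values are illustrative, not real pricing.
        return (self.input_tokens * in_rate + self.output_tokens * out_rate) / 1e6
```

Emitting one such record per call gives the observability platform everything it needs for debugging, cost tracking, and safety evaluation in one place.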

Architectural Implications for AI Guard

The architecture of AI Guard likely involves an interceptor or proxy layer that sits between the Bits AI Agents and the LLMs. This layer would be responsible for routing prompts, analyzing responses for compliance and safety using pre-defined rules or secondary AI models, and potentially rewriting or blocking responses before they reach the end-user. The data gathered during these interactions feeds back into the observability platform for continuous evaluation and improvement.
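The interceptor pattern described above can be sketched as a thin wrapper around the LLM call. This is a minimal illustration of the idea, not AI Guard's implementation; the function names, the "safe"/"unsafe" verdict labels, and the blocked-message text are all assumptions.

```python
from typing import Callable

# Hypothetical guard layer: sits between an agent and the LLM, classifies the
# response (via rules or a secondary model), and blocks or passes it through.
def guarded_call(
    llm: Callable[[str], str],
    classify: Callable[[str], str],
    prompt: str,
    blocked_message: str = "[blocked by AI Guard]",
) -> tuple[str, str]:
    response = llm(prompt)
    verdict = classify(response)       # e.g. "safe" or "unsafe"
    if verdict == "unsafe":
        return blocked_message, verdict  # block before it reaches the end-user
    return response, verdict
```

In practice, both the verdict and the original response would also be logged to the observability platform, closing the feedback loop the article describes.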

From a system design perspective, this highlights the need for a robust data pipeline to handle LLM telemetry, a scalable storage solution for logs and metrics, and a powerful analytics engine to derive actionable insights. The feedback loop between observability and application improvement is critical for evolving AI systems effectively.

  • Prompt/response logging and storage for audit and analysis.
  • Metrics collection for cost, performance, and usage tracking.
  • Evaluation pipelines to assess safety and quality of LLM outputs.
  • Alerting mechanisms for detecting anomalous or unsafe behavior.
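The alerting item above could be realized as a sliding-window check over guard verdicts. The sketch below is a hypothetical illustration; the class name, window size, and threshold are assumptions, not values from the article.

```python
from collections import deque

# Hypothetical alerting helper: tracks a sliding window of guard verdicts
# and fires when the unsafe rate crosses a threshold.
class UnsafeRateAlert:
    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.window = deque(maxlen=window)  # most recent verdicts only
        self.threshold = threshold

    def record(self, verdict: str) -> bool:
        """Record one verdict; return True if the alert should fire."""
        self.window.append(verdict)
        unsafe = sum(1 for v in self.window if v == "unsafe")
        return unsafe / len(self.window) > self.threshold
```

A real deployment would route the fired alert into an incident or paging system rather than returning a boolean, but the windowed-rate logic is the same.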
Tags: LLM Observability · AI Guardrails · MLOps · Monitoring · Evaluation · Cost Optimization · Safety · Datadog
