Course/Data Storage/Time-Series Databases

Time-Series Databases

Purpose-built databases for time-stamped data: InfluxDB, TimescaleDB. Write optimization, retention policies, and downsampling.

10 min read

What Is Time-Series Data?

Time-series data is a sequence of measurements indexed by time. Every data point has a timestamp, a metric name, and a value — plus optional tags (metadata). Examples include server CPU usage every 10 seconds, stock prices every millisecond, IoT sensor readings every minute, and application request latency tracked over time. The defining characteristic: writes always append new timestamps; old data is rarely updated or deleted.

Why Not Use a General-Purpose Database?

You *can* store time-series data in PostgreSQL, but at scale it becomes painful. Consider a monitoring system writing 1 million metrics per second. In PostgreSQL: B-tree index maintenance on a rapidly growing table becomes expensive, range scans over time windows degrade, and storing billions of rows requires constant maintenance (VACUUM, index bloat). Time-series databases address these problems with purpose-built storage engines.

Key Time-Series Databases

Database	Model	Best For	Notes
InfluxDB	Custom TSM engine	DevOps metrics, IoT, real-time analytics	Flux query language; native retention policies
TimescaleDB	PostgreSQL extension (hypertables)	SQL familiarity, existing PG stack	Full SQL, JOINs to relational data, automatic partitioning
Prometheus	Pull-based, in-memory TSDB	Kubernetes / infrastructure monitoring	Short retention; pair with Thanos/Cortex for long-term storage
Apache Druid	Columnar, pre-aggregated	Sub-second analytics on massive event streams	Used by Lyft, Airbnb for user-facing analytics
Amazon Timestream	Serverless AWS TSDB	AWS-native IoT/ops monitoring	Automatic tiering from memory → SSD → S3

Internal Architecture: Why TSDBs Are Fast

Time-series databases are optimized for sequential writes (always appending to the latest time window) and time-range reads (give me all CPU readings for the last hour). They achieve this through:

Time-partitioned storage: Data is stored in time-bucketed chunks (e.g., 1-hour or 1-day chunks). Queries over a time range only touch the relevant chunks.
Columnar compression: Consecutive timestamps differ by small deltas — highly compressible with delta encoding. Values (floats) compress well with XOR encoding (Gorilla algorithm, used by Prometheus).
Automatic chunk expiration: Old chunks can be dropped atomically (one file delete) instead of row-by-row deletes, which would cause index fragmentation.

Loading diagram...

Time-series chunk architecture: data is partitioned by time, expired chunks drop atomically

Retention Policies and Downsampling

Raw time-series data grows unboundedly. A monitoring system collecting 1M metrics/second generates 86 billion data points per day. Retention policies automatically expire data older than N days. Downsampling (or rollup) aggregates fine-grained data into coarser granularity before expiring the raw data — keeping 30 days of raw 10-second data, 1 year of 1-minute averages, and 5 years of 1-hour averages.

sql

-- TimescaleDB: create a hypertable partitioned by time
SELECT create_hypertable('cpu_metrics', 'time');

-- Continuous aggregate: pre-compute 1-minute averages
CREATE MATERIALIZED VIEW cpu_1min
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 minute', time) AS bucket,
  host,
  AVG(cpu_percent) AS avg_cpu,
  MAX(cpu_percent) AS max_cpu
FROM cpu_metrics
GROUP BY bucket, host;

-- Retention policy: drop raw data older than 30 days
SELECT add_retention_policy('cpu_metrics', INTERVAL '30 days');

💡

TimescaleDB for SQL Teams

If your team already uses PostgreSQL, TimescaleDB is the lowest-friction time-series solution. It's a PostgreSQL extension — same connection strings, same SQL, same tooling. Hypertables partition automatically by time, and continuous aggregates pre-compute rollups. You get 10-100x better time-series performance with zero new infrastructure.

💡

Interview Tip

Time-series databases rarely appear as the central topic in system design interviews, but they show up as a component. If asked to design a metrics platform, monitoring system, or IoT data pipeline, mention: 'For the time-series data, I'd use InfluxDB or TimescaleDB with a 30-day raw retention policy and continuous aggregates for historical trends.' This shows architectural awareness without getting lost in a low-weight topic.

SQL vs NoSQL Decision Framework

Data Lakes & Data Warehouses