
Data Lakes & Data Warehouses

Structured vs unstructured data storage: data warehouses (Redshift, BigQuery) vs data lakes (S3 + Spark). ETL, ELT, and the lakehouse pattern.


OLTP vs OLAP: Two Different Worlds

Production databases (PostgreSQL, MySQL, DynamoDB) are OLTP — Online Transaction Processing. They're optimized for low-latency, high-concurrency reads and writes of individual rows. Analytics systems are OLAP — Online Analytical Processing. They run complex queries scanning billions of rows to answer business questions like 'What was our revenue by region last quarter?' These two workloads have fundamentally incompatible access patterns and require separate systems.
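The two access patterns can be seen side by side in a few lines of SQL. This is a toy sketch using Python's built-in sqlite3 with a hypothetical `orders` table; a real deployment would run the point lookup against the OLTP database and the aggregation against the warehouse.

```python
import sqlite3

# In-memory demo of the two access patterns (hypothetical 'orders' schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 40.0), (4, "APAC", 60.0)],
)

# OLTP: low-latency lookup of a single row by primary key.
row = conn.execute("SELECT revenue FROM orders WHERE id = 2").fetchone()

# OLAP: aggregation that scans every row, grouped by a dimension --
# the 'revenue by region' question from above.
by_region = dict(
    conn.execute("SELECT region, SUM(revenue) FROM orders GROUP BY region")
)
print(row[0])  # 80.0
```

The point lookup touches one row via an index; the aggregation must read every row, which is exactly the workload that contends with user traffic if run on the production database.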

Dimension      | OLTP                         | OLAP
Optimized for  | Low-latency row reads/writes | High-throughput column scans
Query type     | Simple lookups, inserts      | Complex aggregations, JOINs
Data volume    | GB to TB                     | TB to PB
Storage layout | Row-oriented                 | Column-oriented
Examples       | PostgreSQL, MySQL, DynamoDB  | Redshift, BigQuery, Snowflake

Data Warehouses

A data warehouse stores structured, schema-on-write data from multiple OLTP sources. Data is cleaned, transformed, and loaded via ETL (Extract, Transform, Load) pipelines. Warehouses use columnar storage — a query scanning only the `revenue` column reads only that column's data block, skipping all other columns. This gives 10-100x speedups on analytical queries vs row-oriented databases.
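The column-pruning idea is easy to illustrate in plain Python. This is a toy model of the two layouts, not a real storage engine: in a columnar system each column lives in its own contiguous block on disk, so a query over `revenue` never reads the other blocks.

```python
# Toy illustration of row- vs column-oriented layout (not a real engine).
rows = [
    {"id": 1, "region": "EU", "revenue": 120.0},
    {"id": 2, "region": "US", "revenue": 80.0},
    {"id": 3, "region": "EU", "revenue": 40.0},
]

# Row-oriented: summing revenue touches every field of every record.
total_row_layout = sum(r["revenue"] for r in rows)

# Column-oriented: each column is stored contiguously; the query reads
# only the 'revenue' block and skips 'id' and 'region' entirely.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [120.0, 80.0, 40.0],
}
total_col_layout = sum(columns["revenue"])

print(total_row_layout == total_col_layout)  # True
```

Same answer either way; the difference is how many bytes had to be read to get it, which is where the 10-100x speedup comes from on wide tables.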

  • Amazon Redshift: Columnar MPP (Massively Parallel Processing) database. Nodes distribute query execution. Sort keys and distribution keys control co-location.
  • Google BigQuery: Serverless; you pay per query by bytes scanned. Dremel query engine. Excellent for ad-hoc analytics on petabyte datasets.
  • Snowflake: Cloud-agnostic, separates compute from storage. Scale compute up/down independently. Popular for cross-cloud and multi-region analytics.

Data Lakes

A data lake stores raw, unprocessed data in its native format (JSON logs, Parquet files, CSV exports, images, videos) in cheap object storage (S3, GCS, Azure Data Lake). Unlike warehouses, data lakes use schema-on-read — structure is applied when the data is queried, not when it's stored. This enables storing data you don't know how to use yet.
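Schema-on-read can be sketched in a few lines. The event records and field names below are hypothetical; the point is that the raw lines land unchanged, and the schema is applied only at query time, so extra fields are ignored and missing ones get defaults.

```python
import json

# Raw JSON lines as they might land in a data lake (hypothetical events).
raw = [
    '{"user": "a", "event": "click", "ts": 1}',
    '{"user": "b", "event": "purchase", "ts": 2, "amount": 9.99}',
]

# Schema-on-read: structure is imposed when the data is queried, not
# when it is stored. Unknown fields are ignored; absent ones default.
def read_with_schema(line, schema):
    record = json.loads(line)
    return {field: record.get(field, default) for field, default in schema.items()}

schema = {"user": None, "event": None, "amount": 0.0}
parsed = [read_with_schema(line, schema) for line in raw]
print(parsed[1]["amount"])  # 9.99
print(parsed[0]["amount"])  # 0.0
```

A different consumer could apply a different schema to the same raw files, which is what makes it safe to store data you don't yet know how to use.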

[Diagram] Data lake architecture: raw data lands in object storage, then flows to the warehouse or is queried directly.

ETL vs ELT

ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. This was necessary when warehouses were expensive and storage was limited — you only loaded clean, useful data. ELT (Extract, Load, Transform) loads raw data first, then transforms it using the warehouse's own compute. Modern cloud warehouses are powerful and cheap enough that ELT is now preferred — you retain raw data for future reprocessing.
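The ordering difference is the whole story. A minimal sketch, with hypothetical extract and transform steps: in ETL, rows rejected by the transform are gone before anything is loaded; in ELT, the raw copy lives in the warehouse and can be re-transformed later.

```python
# Minimal sketch of ETL vs ELT ordering (hypothetical pipeline steps).
def extract():
    return [{"amount": "12.50"}, {"amount": "bad"}, {"amount": "3.00"}]

def transform(records):
    # Clean: keep only rows whose amount parses as a number.
    out = []
    for r in records:
        try:
            out.append({"amount": float(r["amount"])})
        except ValueError:
            pass
    return out

warehouse = {}

# ETL: transform runs before load -- the rejected raw row is discarded
# forever; only the 2 clean rows ever reach the warehouse.
warehouse["etl"] = transform(extract())

# ELT: all 3 raw rows are loaded first, then transformed using the
# warehouse's own compute; the raw copy is kept for reprocessing.
warehouse["raw"] = extract()
warehouse["elt"] = transform(warehouse["raw"])

print(len(warehouse["raw"]), len(warehouse["elt"]))  # 3 2
```

If the cleaning rule later changes (say, "bad" becomes parseable), ELT can rebuild the clean table from `raw`; the ETL pipeline would have to re-extract from the source.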

ℹ️

dbt: The ELT Standard

dbt (Data Build Tool) is the industry-standard ELT transformation layer. Data engineers write SQL transformation models that dbt compiles and runs inside the data warehouse. It handles dependency ordering, testing, documentation, and incremental builds. If someone mentions 'the modern data stack', dbt is almost always part of it.
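The dependency ordering dbt performs is a topological sort of the model DAG. A sketch of the idea using Python's standard-library graphlib, with hypothetical model names (this is not dbt's actual implementation, which infers the graph from `ref()` calls in SQL):

```python
from graphlib import TopologicalSorter

# Each model maps to the set of models it depends on (hypothetical DAG).
deps = {
    "stg_orders": set(),                            # staging model on raw data
    "stg_customers": set(),
    "fct_revenue": {"stg_orders"},                  # fact table from staging
    "revenue_by_region": {"fct_revenue", "stg_customers"},
}

# static_order() yields each model only after all its dependencies.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order[-1])  # revenue_by_region
```

dbt does this at scale, plus testing, documentation, and incremental rebuilds of only the models whose inputs changed.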

The Lakehouse Pattern

Data lakes and warehouses each have weaknesses: lakes lack ACID transactions and query optimization; warehouses are expensive and rigid. The lakehouse pattern combines both — open table formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID transactions, schema evolution, and time travel to data lake storage. Compute engines such as Spark (and platforms built on it, like Databricks) query these tables with warehouse-like performance.
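The "time travel" feature is simpler than it sounds: each commit produces an immutable snapshot, and readers can pin any historical version. A toy sketch of that idea (the class and method names are illustrative, not the Delta Lake or Iceberg API):

```python
# Toy sketch of time travel in lakehouse table formats: every commit
# appends an immutable snapshot, so old versions stay readable.
class VersionedTable:
    def __init__(self):
        self.snapshots = []  # one entry per committed version

    def commit(self, rows):
        # Copy so later mutation of 'rows' cannot alter history.
        self.snapshots.append(list(rows))

    def read(self, version=-1):
        # Default: latest version; any past version is also readable.
        return self.snapshots[version]

table = VersionedTable()
table.commit([{"id": 1}])                # version 0
table.commit([{"id": 1}, {"id": 2}])     # version 1

print(len(table.read()))   # 2 rows at the latest version
print(len(table.read(0)))  # 1 row when reading version 0
```

Real table formats store snapshots as metadata files listing the data files in each version, so a "copy" is cheap; the principle is the same.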

Pattern        | Strength                                       | Weakness
Data Warehouse | Fast SQL queries, governed schema, BI-friendly | Expensive at scale, rigid schema, poor for ML raw data
Data Lake      | Cheap storage, raw data, ML-friendly           | No ACID, slow queries without optimization, schema chaos
Lakehouse      | ACID on cheap storage, unified for BI + ML     | Newer ecosystem, more operational complexity

💡

Interview Tip

Analytics architecture questions come up in senior system design interviews, especially for data platform or analytics roles. The key point to communicate: 'I would not run analytics on the production OLTP database — that would contend with user traffic. I'd set up a data warehouse fed by nightly ETL or real-time CDC (Change Data Capture) from the primary database, so analysts query the warehouse without impacting production.' This is the answer interviewers want to hear.
