
Pipes and Filters Pattern

Decompose processing into a pipeline of independent filters: composability, parallel execution, error handling, and real-world data pipeline examples.


What Are Pipes and Filters?

The Pipes and Filters pattern decomposes a complex processing task into a sequence of discrete, independent processing steps called filters, connected by data channels called pipes. Each filter receives data from its input pipe, transforms it in some way, and writes results to its output pipe — unaware of what came before or after it in the pipeline. This architectural pattern is the foundation of Unix shell pipelines, Apache Kafka Streams, AWS Glue ETL jobs, and modern ML feature pipelines.

Core Components

| Component | Role | Example |
| --- | --- | --- |
| Source | Produces input data to the pipeline | Kafka topic, S3 bucket, HTTP webhook |
| Filter | Transforms, validates, enriches, or routes data | Deduplication step, schema validator, ML classifier |
| Pipe | Transports data between filters | Kafka topic between each step, in-memory queue, gRPC stream |
| Sink | Receives final output from the last filter | Database write, downstream API, output Kafka topic |
[Diagram: a data enrichment pipeline, where each filter is independent and can be developed, tested, and scaled separately]

Key Properties

  • Composability: Filters are reusable building blocks. The schema validation filter can be used in multiple pipelines without modification.
  • Parallelism: Independent filters can run concurrently. If enrichment is slow, add more worker instances of just the enrichment filter.
  • Replaceability: Swap out any filter with a new implementation without changing the rest of the pipeline.
  • Testability: Each filter can be unit tested in complete isolation — feed it input data, verify its output.
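The testability point is worth making concrete: because a filter is just a function from input to output (or null), it can be exercised with plain assertions and no pipeline machinery at all. A minimal sketch with an illustrative normalization filter (the `normalizeType` name is invented for this example):

```typescript
// A filter is just a function: input in, transformed output (or null) out.
type Filter<TIn, TOut> = (input: TIn) => TOut | null;

// Illustrative filter: normalizes an event type string, drops blank ones.
const normalizeType: Filter<string, string> = (type) =>
  type.trim() === "" ? null : type.trim().toLowerCase();

// Unit tests need no broker, no pipeline runner, no mocks:
console.assert(normalizeType("  Click ") === "click");
console.assert(normalizeType("   ") === null);
```

The same property is what makes filters safe to swap: any replacement that passes the same input/output assertions can be dropped into the pipeline.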

Synchronous vs Asynchronous Pipes

Pipes can be either synchronous (in-process function calls, forming a processing chain within a single service) or asynchronous (message queues or Kafka topics between services). Asynchronous pipes enable each filter to be an independent, horizontally scalable service and provide natural buffering when filters have different throughput rates. The trade-off is added latency (broker round-trips) and complexity (at-least-once delivery, idempotency).

typescript
// Synchronous pipeline with typed filters
type Filter<TIn, TOut> = (input: TIn) => TOut | null; // null means drop the message

interface User {
  name: string;
}

interface Event {
  id: string;
  userId: string;
  type: string;
  user?: User;
}

function createPipeline<T>(...filters: Filter<T, T>[]): Filter<T, T> {
  return (input: T): T | null => {
    let current: T | null = input;
    for (const filter of filters) {
      if (current === null) return null; // a filter dropped it; stop early
      current = filter(current);
    }
    return current;
  };
}

// Shared state and dependencies used by the filters
const seenIds = new Set<string>();         // ids already processed (deduplication)
const userCache = new Map<string, User>(); // pre-warmed user lookup

// Define individual filters
const validateSchema: Filter<Event, Event> = (event) =>
  event.userId && event.type ? event : null; // drop events missing required fields

const deduplicate: Filter<Event, Event> = (event) => {
  if (seenIds.has(event.id)) return null; // already seen: drop
  seenIds.add(event.id);
  return event;
};

const enrichWithUser: Filter<Event, Event> = (event) => ({
  ...event,
  user: userCache.get(event.userId), // undefined on cache miss
});

// Compose the pipeline
const pipeline = createPipeline(validateSchema, deduplicate, enrichWithUser);

// Process events (eventStream is the source, sink is the final destination)
declare const eventStream: Iterable<Event>;
declare const sink: { write(event: Event): void };

for (const rawEvent of eventStream) {
  const result = pipeline(rawEvent);
  if (result !== null) {
    sink.write(result);
  }
}
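The asynchronous variant can be sketched in-process with async iterables standing in for broker-backed pipes; in a real deployment each stage would read from and write to something like a Kafka topic. The `pipeStage` and `composeAsync` helpers below are illustrative, not any specific library's API:

```typescript
// An asynchronous filter may await I/O (lookups, RPCs) per item.
type AsyncFilter<T> = (input: T) => Promise<T | null>;

// One pipeline stage: consume an async stream, apply the filter,
// and yield the results, dropping items mapped to null.
async function* pipeStage<T>(
  source: AsyncIterable<T>,
  filter: AsyncFilter<T>
): AsyncIterable<T> {
  for await (const item of source) {
    const result = await filter(item);
    if (result !== null) yield result;
  }
}

// Chain stages into a pipeline; in production each stage could be
// a separate service with a broker topic as the pipe between them.
function composeAsync<T>(
  source: AsyncIterable<T>,
  ...filters: AsyncFilter<T>[]
): AsyncIterable<T> {
  return filters.reduce((stream, f) => pipeStage(stream, f), source);
}
```

Because each stage pulls from the previous one on demand, a slow filter naturally back-pressures the stages upstream of it, which is the in-process analogue of a broker buffering messages between services.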

Error Handling in Pipelines

Error handling is the trickiest part of pipeline design. A filter that throws an uncaught exception can bring down the entire pipeline. Common strategies:

  • Dead Letter Queue (DLQ): Messages that fail a filter after N retries are routed to a DLQ for manual inspection rather than dropped silently.
  • Error sink: A separate output pipe that receives failed records, allowing downstream analysis and reprocessing.
  • Skip and log: For non-critical enrichment (e.g., user data lookup), continue with partial data rather than blocking the whole pipeline.
  • Checkpoint and replay: Record processing position (e.g., Kafka offset) so the pipeline can restart from the last successful point after a crash.
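A retry-then-DLQ policy can be expressed as a wrapper that makes any filter fault-tolerant without touching its logic. A minimal sketch, assuming the synchronous filter shape used earlier; the `deadLetters` array stands in for a real dead letter queue:

```typescript
type Filter<TIn, TOut> = (input: TIn) => TOut | null;

// Failed records plus the error that caused them, kept for inspection/replay.
const deadLetters: { input: unknown; error: unknown }[] = [];

// Wrap a filter: retry up to maxRetries times, then route the record
// to the DLQ instead of letting the exception kill the pipeline.
function withDlq<T>(filter: Filter<T, T>, maxRetries = 3): Filter<T, T> {
  return (input: T): T | null => {
    let lastError: unknown;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return filter(input);
      } catch (error) {
        lastError = error;
      }
    }
    deadLetters.push({ input, error: lastError });
    return null; // dropped from the main pipeline; the DLQ has it
  };
}
```

Note that this only helps with transient failures; a record that fails deterministically will burn all its retries every time, which is exactly why DLQ contents need human (or automated) inspection before replay.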
📌 Real World: Apache Kafka Streams

Kafka Streams implements the Pipes and Filters pattern natively. Each `KStream.map()`, `.filter()`, `.flatMap()`, or `.join()` call is a filter. Topics are the pipes. The topology is compiled into a stateful processing graph deployed across partitions. Netflix uses Kafka Streams for real-time recommendation signal processing, with dozens of filter stages.

💡 Interview Tip

When designing a data processing system (ETL pipeline, event processing, media transcoding), proposing a Pipes and Filters architecture demonstrates solid engineering judgment. Emphasize independent scalability of each filter, the testability advantage, and how asynchronous pipes (Kafka topics) provide buffering and allow each stage to scale independently. Then address the error handling question before the interviewer asks.
