Error Handling Patterns That Keep Systems Reliable

How a program handles failure is often more important than what it does under normal conditions. The patterns here — result types, retries, circuit breakers, error boundaries — are the difference between a system that degrades gracefully and one that silently corrupts data or falls over in production.

Published June 22, 2026

Errors come in two broad families: expected failures (a user enters an invalid email address, a file doesn't exist, a network request times out) and unexpected failures (a null pointer dereference, an integer overflow in input you didn't validate, running out of disk space on a server you assumed had capacity). The patterns for handling each differ significantly, and conflating them produces code that is either too defensive or not defensive enough.

Exceptions vs result types

Many mainstream languages use exceptions as the primary error mechanism, but in most of them this has a significant problem: a function's signature doesn't tell you that it can fail, or what it fails with. When you call a function that can throw, you have to read its documentation (or source) to know which exceptions to expect. If you forget to catch one, nothing warns you at compile time; the exception propagates up the stack at runtime and, if nothing catches it, terminates the program.

Result types (also called Either types) make errors part of the return type, forcing the caller to handle them. Rust uses Result<T, E>:

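use std::{fs, io};

// Config (assumed to derive serde::Deserialize) and start_server are
// defined elsewhere in the application.
fn read_config(path: &str) -> Result<Config, io::Error> {
    let content = fs::read_to_string(path)?;
    // serde_json::Error converts into io::Error, so `?` works here too.
    let config: Config = serde_json::from_str(&content)?;
    Ok(config)
}

fn main() {
    match read_config("config.json") {
        Ok(cfg) => start_server(cfg),
        Err(e) => eprintln!("Failed to load config: {}", e),
    }
}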

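The ? operator returns early with the Err if the call failed, converting the error type where a From implementation exists. The caller receives either a Config or an io::Error and has to deal with both to get at the value: Result is marked #[must_use], so discarding it outright produces a compiler warning, and the Config inside cannot be reached without matching on, unwrapping, or propagating the Err case.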

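Go takes a related approach with multiple return values, returning the error as an ordinary value alongside the result. Unlike Rust, the Go compiler does not force the caller to check it, so the discipline rests on convention and linters: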

import (
    "encoding/json"
    "fmt"
    "os"
)

// Config is the application's own configuration struct, defined elsewhere.
func readConfig(path string) (Config, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return Config{}, fmt.Errorf("reading config: %w", err)
    }
    var cfg Config
    if err := json.Unmarshal(data, &cfg); err != nil {
        return Config{}, fmt.Errorf("parsing config: %w", err)
    }
    return cfg, nil
}

Note fmt.Errorf("...: %w", err): the %w verb wraps the original error so callers can use errors.Is() or errors.As() to inspect the underlying cause without losing context. Each layer that wraps the error adds its own prefix, so a missing file surfaces from readConfig as "reading config: open config.json: no such file or directory", and a caller that wraps it again prepends its own context. That chain is much more useful for diagnosis than a bare "file not found."

Checked vs unchecked exceptions (Java)

Java tried to solve the "silent failures" problem with checked exceptions: if a method can throw a checked exception, it must declare it in its signature with throws, and callers must either handle it or declare it themselves. Unchecked exceptions (subclasses of RuntimeException or Error) are exempt from this requirement.

In practice, checked exceptions work well for recoverable failures where the caller genuinely has a useful recovery action (file not found, network timeout). They would only create friction for failures that usually indicate a bug (null pointer, index out of bounds), which is why those are unchecked. The anti-pattern to avoid is catching a checked exception and either swallowing it silently or rethrowing a generic RuntimeException without the original as its cause, just to escape the signature requirement. Swallowing loses the error entirely; rethrowing without the cause discards the original exception type and stack trace.

Retrying transient failures

Network requests fail transiently: a brief packet loss, a server that's momentarily overloaded, a DNS lookup that times out. Retrying with a fixed delay is usually wrong because if many clients all retry at the same interval, they produce a synchronized storm of retries that overwhelms the recovering server.

Exponential backoff with jitter is the standard approach: each retry waits twice as long as the previous one, with a random offset to desynchronize clients:

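import random
import time


class TransientError(Exception):
    """Placeholder for whatever your client raises on retryable failures."""


def call_with_retry(fn, max_retries=5, base_delay=0.5):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error to the caller
            # Double the wait on each attempt, plus a random offset so
            # clients that failed together do not retry together.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)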

Critically: only retry errors that are safe to retry. Idempotent operations (GET requests, reads, operations you can repeat without side effects) are safe. Non-idempotent operations (a payment charge, an email send) are not safe to retry without deduplication — retrying an already-succeeded charge double-charges the customer. Use idempotency keys to make non-idempotent operations safe to retry: send a unique key per logical operation, and the server returns the same response for duplicates.
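The idempotency-key idea can be sketched with a toy in-memory server. This is only an illustration under stated assumptions: PaymentServer, charge, and the response fields are hypothetical names, and a real service would persist the keys and expire them after some window.

```python
import uuid


class PaymentServer:
    """Hypothetical in-memory server that deduplicates by idempotency key."""

    def __init__(self):
        self._responses = {}   # idempotency key -> stored response
        self.charges_made = 0  # number of real charges performed

    def charge(self, idempotency_key, amount_cents):
        # A repeated key means the client is retrying: replay the stored
        # response instead of charging a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._responses[idempotency_key] = response
        return response


server = PaymentServer()
key = str(uuid.uuid4())  # one key per logical operation, reused on every retry

first = server.charge(key, 1999)
retry = server.charge(key, 1999)  # e.g. the first response was lost in transit
```

The client generates the key once, before the first attempt, and sends the same key on each retry; generating a fresh key per attempt would defeat the deduplication.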

Circuit breakers

When a downstream service is genuinely down, retrying every request ties up your threads and piles load onto a service that is trying to recover. A circuit breaker wraps calls to the downstream service and counts failures. After a threshold of consecutive failures (some implementations use a failure rate over a sliding window instead), it "opens": requests fail immediately without hitting the downstream service, sparing both systems from load. After a timeout, the circuit enters "half-open" mode and allows one test request. If it succeeds, the circuit closes; if it fails, the circuit reopens and the timeout starts again.

The three states map directly to the operating state of the downstream service:

Closed: normal operation; requests go through; failures are counted.
Open: downstream is failing; requests short-circuit immediately with a fast failure or fallback response.
Half-open: downstream may have recovered; a probe request determines which state to enter.
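The three states can be sketched as a small class. This is a minimal, single-threaded illustration and not any library's API: CircuitBreaker, CircuitOpenError, and the parameter names are assumptions, the breaker counts consecutive failures, and a production version would need locking and would limit half-open mode to a single in-flight probe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, sleep_window=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.sleep_window = sleep_window  # seconds to stay open before probing
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0                 # consecutive failures seen so far
        self.opened_at = None             # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.sleep_window:
            return "half-open"
        return "open"

    def call(self, fn):
        state = self.state
        if state == "open":
            raise CircuitOpenError("downstream marked unavailable")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe reopens the circuit; in the closed state the
            # circuit opens once the threshold is reached.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Usage with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, sleep_window=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("down")


for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
```

After the two failures the breaker is open and rejects calls with CircuitOpenError; advancing the fake clock past the sleep window puts it in half-open, where the next call acts as the probe.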

Libraries like Netflix Hystrix (now in maintenance mode but widely documented), Resilience4j (Java), and circuitbreaker (Go) implement this pattern. The important configuration parameters are the failure threshold (how many failures before opening), the sleep window (how long the circuit stays open before testing), and the slow-call threshold (how long a call may take before it counts as a failure, so that timeouts trip the breaker as well as outright network errors).

Error boundaries and bulkheads

In monolithic applications, an unhandled exception in one request can kill the process unless there's a top-level handler. Web frameworks provide this: Express.js's error middleware, Django's exception middleware, and similar constructs catch unhandled exceptions, log them, and return a 500 response without crashing the server process.
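What such a top-level handler does can be sketched as a plain WSGI middleware. This is an illustration rather than how any particular framework implements it: catch_unhandled and the logger name are hypothetical, and the sketch only covers exceptions raised while the application is called, not ones raised later while the response body is being iterated.

```python
import logging
import sys

logger = logging.getLogger("app")


def catch_unhandled(app):
    """WSGI middleware: turn any unhandled exception into a logged 500."""

    def wrapper(environ, start_response):
        try:
            return app(environ, start_response)
        except Exception:
            # Log the full traceback for developers.
            logger.exception("Unhandled error for %s", environ.get("PATH_INFO"))
            # Passing exc_info lets the server re-raise if headers were
            # already sent, as the WSGI specification requires.
            start_response(
                "500 Internal Server Error",
                [("Content-Type", "text/plain")],
                sys.exc_info(),
            )
            return [b"Internal Server Error"]

    return wrapper
```

Wrapping the application once at startup (for example `application = catch_unhandled(application)`) means a bug in one request handler produces a 500 for that request while the process keeps serving others.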

In distributed systems, the equivalent concept is the bulkhead: isolate components so a failure in one doesn't exhaust resources in another. A common implementation is a separate thread pool per downstream dependency. If all threads in pool A are blocked waiting for a slow database query, pool B (handling a different service) keeps running. Without bulkheads, a single slow downstream can exhaust the shared thread pool and make the entire application unresponsive.
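The thread-pool-per-dependency idea can be sketched with the standard library. The pool names, pool sizes, and the two stand-in functions are assumptions for illustration; a threading.Event simulates a database query that hangs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One small pool per downstream dependency (names are hypothetical).
database_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="database")
search_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="search")

release = threading.Event()


def slow_database_query():
    release.wait(timeout=5)  # stands in for a query that hangs
    return "rows"


def search():
    return "results"


# Saturate the database pool: both of its workers are now blocked.
stuck = [database_pool.submit(slow_database_query) for _ in range(2)]

# The search pool has its own threads, so this still completes promptly.
search_result = search_pool.submit(search).result(timeout=1)

release.set()  # let the blocked queries finish
database_results = [f.result(timeout=5) for f in stuck]
database_pool.shutdown()
search_pool.shutdown()
```

Had both kinds of work shared one two-worker pool, the search task would have queued behind the stuck queries and its one-second timeout would have expired.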

In React (frontend JavaScript), Error Boundaries are components that catch rendering errors in their subtree and display a fallback UI instead of unmounting the entire component tree. The same principle applies: contain the blast radius of a failure so the rest of the system continues to function.

Designing error messages for two audiences

Every error exists in two contexts: the user who sees it and the developer who investigates it. Conflating the two produces errors that reveal internals to users (a stack trace with database schema details) or that are useless to developers ("an error occurred").

For user-facing errors: describe what went wrong in terms the user can act on, without implementation details. "Your session has expired. Please log in again." is actionable. "NullPointerException in AuthFilter.java:47" is not.

For developer-facing errors: include everything useful for diagnosis. Log the full stack trace, the input that triggered the error, an opaque user ID (enough to correlate events, with no personal data beyond that), the request ID, and the timestamp. Structure logs as JSON so they're machine-parseable by your log aggregation system. If you're using distributed tracing (OpenTelemetry, Jaeger), attach the trace context to the error record so you can follow the request path across services.

A good pattern is to generate an error reference ID at the point of failure, log all diagnostic details keyed to that ID, and surface only the ID to the user: "Something went wrong. If you contact support, reference error ID: d4e8f2a1." The user has something concrete to share; the developer can look up the full context.
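A minimal sketch of the reference-ID pattern, using only the standard library. The function name report_error, the logger name, and the record fields are illustrative assumptions; a real system would likely add trace context and route the JSON through its own log pipeline.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("app.errors")


def report_error(exc, request_id, user_id):
    """Log full diagnostics as JSON and return a user-safe message."""
    error_id = uuid.uuid4().hex[:8]  # short reference the user can quote
    record = {
        "error_id": error_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "user_id": user_id,
        "error_type": type(exc).__name__,
        "message": str(exc),
    }
    # exc_info attaches the stack trace to the log record for developers.
    logger.error(json.dumps(record), exc_info=exc)
    # Only the reference ID reaches the user; no internals are exposed.
    return (
        "Something went wrong. If you contact support, "
        f"reference error ID: {error_id}."
    )


try:
    {}["missing"]
except KeyError as exc:
    user_message = report_error(exc, request_id="req-123", user_id="u-42")
```

The user-facing string carries nothing but the ID, while the log entry keyed to that ID holds the exception type, message, stack trace, and correlation fields.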