Concurrency vs Parallelism: What Every Developer Should Know

These two words appear in job listings, framework docs, and architecture reviews as though they are synonyms. They are not. Getting the distinction right shapes how you design systems, choose data structures, and debug the failures that only happen under load.

Published June 22, 2026

Rob Pike, one of Go's creators, put it cleanly: "Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once." The difference is about structure versus execution. A concurrent program is structured to manage multiple tasks that may overlap in time. A parallel program actually executes multiple computations simultaneously, using more than one CPU core. You can have one without the other, and conflating them leads to wrong tool choices and unexpected behavior.

Concurrency: interleaved progress, not simultaneous execution

A single-core processor running a web server handles hundreds of connections "concurrently" by rapidly switching between them. When connection A is waiting on a database response, the CPU services connection B. From each connection's point of view, it makes progress; from the hardware's point of view, only one thread is ever executing at any instant. This is concurrency without parallelism.

The classic model is the operating system scheduler. The OS divides CPU time into small slices (typically a few milliseconds) and assigns each slice to a runnable thread. With enough threads competing, the machine appears to do many things simultaneously even on one core. The programmer must reason about interleaving — the fact that another thread may execute any time the current thread is preempted — but does not need multiple physical cores.

Event loops take a different approach to concurrency. Rather than threads, they use a single thread that never blocks. I/O operations are kicked off asynchronously; the event loop processes the results via callbacks or coroutines when the OS signals completion. Node.js and Python's asyncio use this model:

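import asyncio
from urllib.parse import urlsplit

async def fetch(url):
    # Connect to the host named in the URL (plain HTTP, port 80 by default)
    parts = urlsplit(url)
    reader, writer = await asyncio.open_connection(parts.hostname, parts.port or 80)
    request = f'GET {parts.path or "/"} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n'
    writer.write(request.encode())
    await writer.drain()
    data = await reader.read(1024)  # first chunk of the response only
    writer.close()
    await writer.wait_closed()
    return data

async def main():
    # gather() runs both coroutines concurrently on the one event loop thread
    results = await asyncio.gather(
        fetch('http://example.com'),
        fetch('http://example.org'),
    )
    for data in results:
        print(len(data), 'bytes received')

asyncio.run(main())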

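Both fetches are in flight at the same time at the network level, yet the program uses a single Python thread that never blocks on either one. While one fetch waits on the OS, the event loop runs other ready coroutines and callbacks. This enables enormous I/O concurrency from a single thread, and it removes most of the need for locks on shared data, because only one coroutine runs at a time and control changes hands only at await points. State that must stay consistent across an await can still be modified by another coroutine in between, so that case still needs care.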
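To see the overlap without touching the network, here is a minimal sketch that uses asyncio.sleep as a stand-in for an I/O wait. The names fake_io and run_all are hypothetical, chosen only for illustration. Three simulated waits of 0.1 s each should finish in roughly 0.1 s in total rather than 0.3 s, because the waits overlap on one thread.

```python
import asyncio
import time

async def fake_io(name, delay):
    # asyncio.sleep stands in for a network or disk wait (an assumption for
    # illustration); it suspends this coroutine and frees the event loop.
    await asyncio.sleep(delay)
    return name

async def run_all():
    start = time.perf_counter()
    # gather() schedules all three coroutines on the same event loop thread
    # and returns their results in argument order.
    names = await asyncio.gather(
        fake_io('a', 0.1),
        fake_io('b', 0.1),
        fake_io('c', 0.1),
    )
    elapsed = time.perf_counter() - start
    return names, elapsed

if __name__ == '__main__':
    names, elapsed = asyncio.run(run_all())
    print(names, f'{elapsed:.2f}s')
```

The exact elapsed time depends on the machine and timer resolution; the point is only that it tracks the longest wait, not the sum of the waits.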

Parallelism: multiple computations running simultaneously

Parallelism requires hardware that does more than one thing at the same moment: multiple CPU cores or GPU shader cores executing separate instruction streams, or SIMD units applying one instruction to many data elements at once. Sorting a large array by splitting it into four chunks and sorting each on a separate core is parallelism. Rendering a video frame by distributing rows to GPU cores is parallelism. On a single-core machine, no amount of threading produces true parallelism; on a 32-core machine, 32 threads performing independent computation can all run at once.

On a multi-core machine, operating system threads run in parallel whenever more than one of them is runnable. Python is the well-known exception. Its threading module creates real OS threads, but in the default build of CPython the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. Python threads are therefore concurrent (interleaved) but not parallel for CPU-bound Python code. They still help with I/O-bound work, because the GIL is released while a thread waits on I/O. The usual workaround for CPU-bound parallelism in Python is the multiprocessing module, which spawns separate processes, each with its own interpreter and its own GIL:

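from multiprocessing import Pool
import math

def compute(n):
    # CPU-bound: sum of the square roots of 0..n-1
    return sum(math.sqrt(i) for i in range(n))

if __name__ == '__main__':
    # The guard is required wherever worker processes are started by
    # re-importing this module (the "spawn" start method).
    with Pool(processes=4) as pool:
        results = pool.map(compute, [10_000_000] * 4)
    print(sum(results))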

This runs four worker processes, which the OS can schedule on four cores at once if that many are free. The trade-offs are higher overhead (starting a process is slower than starting a thread, and arguments and results are pickled to cross the process boundary) and explicit data sharing: processes exchange data through queues, pipes, or shared memory rather than simple variable access.

When concurrency helps and when parallelism is needed

The distinction matters because the bottleneck determines the right tool. I/O-bound work — network requests, disk reads, database queries — spends most of its time waiting. While waiting, the CPU is idle. Adding more threads or using an async event loop fills that idle time with other work. Even a single core can saturate a network connection by interleaving many in-flight requests. This is the success story of async web servers: a Node.js process on one core handles thousands of HTTP connections because each connection spends most of its time waiting, not computing.
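The effect of filling idle wait time can be sketched with a thread pool. This is an illustrative example, not a benchmark: fake_request, run_sequential, and run_threaded are hypothetical names, and time.sleep stands in for a blocking network call. Because each simulated request spends its time waiting rather than computing, eight of them issued from eight threads should take about as long as one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    # time.sleep stands in for a blocking network call (an assumption for
    # illustration). A sleeping thread does not hold the GIL.
    time.sleep(0.05)
    return i * 2

def run_sequential(n):
    # One request at a time: total time is roughly n * 0.05 s.
    start = time.perf_counter()
    results = [fake_request(i) for i in range(n)]
    return results, time.perf_counter() - start

def run_threaded(n):
    # One thread per request: the waits overlap.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(fake_request, range(n)))  # input order kept
    return results, time.perf_counter() - start

if __name__ == '__main__':
    _, seq = run_sequential(8)
    _, thr = run_threaded(8)
    print(f'sequential {seq:.2f}s, threaded {thr:.2f}s')
```

Replacing the sleep with a pure-Python computation would remove the advantage, which is the I/O-bound versus CPU-bound distinction in miniature.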

CPU-bound work (image processing, cryptography, machine learning inference, compiling code) does not wait for external resources; it continuously consumes CPU cycles. Concurrency without additional cores produces no speedup here because the CPU is already fully occupied. Only parallelism helps: distributing the work across multiple cores divides the wall-clock time by at most the number of cores, and by less when part of the work is inherently serial (Amdahl's law). Choosing an event loop for CPU-bound work adds overhead without benefit.
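Amdahl's law makes the limit concrete: if a fraction p of the work can be parallelized and the rest is serial, the best possible speedup on n cores is 1 / ((1 - p) + p / n). The helper below is a small sketch of that formula; the function name amdahl_speedup and its parameter names are chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when `parallel_fraction` of the work
    (between 0.0 and 1.0) is spread evenly across `cores` cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError('parallel_fraction must be between 0 and 1')
    if cores < 1:
        raise ValueError('cores must be at least 1')
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

if __name__ == '__main__':
    for n in (1, 4, 32):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the work parallelizable, 4 cores give a bound of about 3.08x and 32 cores about 7.8x; no number of cores can exceed 10x, because the serial 10% always remains.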

Go's model: concurrency as a first-class design tool

Go was designed with concurrency as a core language feature. Goroutines are cheap, lightweight threads managed by the Go runtime, which multiplexes them onto a smaller number of OS threads. GOMAXPROCS, by default tied to the number of available CPUs, limits how many OS threads execute Go code at the same time; the runtime may create additional threads for goroutines blocked in system calls. Goroutines therefore give both concurrency and parallelism: thousands of goroutines can be in progress concurrently, and the runtime runs independent goroutines in parallel when cores are free.

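// Item, Result, and process are assumed to be defined elsewhere.
func processItems(items []Item) []Result {
    // Buffered to len(items) so no goroutine blocks on send.
    results := make(chan Result, len(items))
    for _, item := range items {
        go func(it Item) {
            results <- process(it)
        }(item)
    }
    // Receive exactly one result per item. Results arrive in
    // completion order, not input order.
    out := make([]Result, 0, len(items))
    for range items {
        out = append(out, <-results)
    }
    return out
}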

Goroutines yield to the scheduler at blocking operations such as channel operations, I/O, and system calls. Since Go 1.14 the runtime can also preempt a running goroutine asynchronously, so a goroutine in a tight loop does not starve others. The result is a programming model where you write concurrent code naturally (spawn a goroutine per task, communicate over channels) and the runtime delivers parallelism automatically where hardware allows.

Shared state, race conditions, and the price of coordination

Both concurrency and parallelism introduce the problem of shared state. If two threads (or goroutines, or coroutines running on different threads) access the same memory without synchronization and at least one of the accesses is a write, a data race exists. Data races produce incorrect results that are hard to reproduce because the outcome depends on scheduling order.

The standard tools are mutexes (allowing only one thread to hold a lock at a time), read-write locks (allowing multiple concurrent readers but exclusive writers), and channels (communicating data between concurrent tasks without sharing memory). Every coordination mechanism adds latency and code complexity, which is why the rule of thumb is: design your concurrent system to minimize shared mutable state. Prefer message-passing over shared memory where possible, and when you do share memory, keep the critical section as small as you can.
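The two styles can be compared in a short sketch. Both functions below count to the same total from several threads; the names count_with_lock and count_with_queue are hypothetical. The first shares one counter and guards each update with a mutex, keeping the critical section to a single statement. The second shares nothing mutable: each worker keeps a private count and sends it as a message, with Python's queue.Queue standing in for a channel.

```python
import queue
import threading

def count_with_lock(num_threads, increments):
    # Shared memory: every worker updates the same counter under a mutex.
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:  # critical section: one read-modify-write
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

def count_with_queue(num_threads, increments):
    # Message passing: workers own their state and send only the result.
    results = queue.Queue()

    def worker():
        local = 0
        for _ in range(increments):
            local += 1
        results.put(local)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(num_threads))

if __name__ == '__main__':
    print(count_with_lock(4, 10_000), count_with_queue(4, 10_000))
```

The queue version takes the lock-protected path once per worker rather than once per increment, which is the practical payoff of minimizing shared mutable state.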

Async single-threaded models like asyncio sidestep most locking concerns because only one coroutine runs between await points. But they introduce their own hazard: code that accidentally blocks (a time.sleep() instead of await asyncio.sleep()) freezes the entire event loop, starving every other coroutine. Understanding which operations are truly asynchronous in your framework is essential.
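The starvation effect can be observed directly. In this sketch (all function names are hypothetical), a heartbeat coroutine records a tick every 10 ms while some other piece of work runs for 0.2 s. When that work calls time.sleep, the heartbeat gets almost no ticks; when it awaits asyncio.sleep, or pushes the blocking call to a worker thread with asyncio.to_thread (available from Python 3.9), the heartbeat keeps running.

```python
import asyncio
import time

async def heartbeat(ticks, stop):
    # Records a tick every 10 ms for as long as the event loop lets it run.
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)

async def count_ticks(work):
    # Runs `work` next to a heartbeat and reports how many ticks it got.
    ticks = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0)  # yield once so the heartbeat starts
    await work()
    stop.set()
    await hb
    return len(ticks)

async def blocking_work():
    time.sleep(0.2)  # blocks the only thread: nothing else can run

async def cooperative_work():
    await asyncio.sleep(0.2)  # suspends and hands control to the loop

async def offloaded_work():
    # Runs the blocking call in a worker thread; the loop stays free.
    await asyncio.to_thread(time.sleep, 0.2)

if __name__ == '__main__':
    for work in (blocking_work, cooperative_work, offloaded_work):
        print(work.__name__, asyncio.run(count_ticks(work)))
```

Exact tick counts vary with the platform's timer resolution, but the blocking variant should report far fewer than the other two.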

The practical upshot: identify whether your bottleneck is I/O or CPU, choose concurrency for I/O-bound work and parallelism for CPU-bound work, and design data access patterns before reaching for synchronization primitives. Most performance problems that developers attribute to "needing more threads" are actually I/O-bound problems solvable with async I/O, or CPU-bound problems that need process-level parallelism rather than more threads sharing the same GIL.