Unit Testing Fundamentals: What to Test and How

A unit test verifies a small, isolated piece of behavior and fails for exactly one reason. That precision is what makes a test suite a safety net rather than a drag — but achieving it requires deliberate choices about what counts as a unit, when to use mocks, and which assertions actually catch bugs.

Published June 25, 2026

Automated tests exist to let you change code with confidence. A test suite that achieves this has two properties: it catches real bugs before they reach production, and it does not break when you refactor internals without changing behavior. Many test suites have the first property but not the second, making every refactor a slog through broken assertions. The difference almost always comes down to what the tests are coupled to.

This article covers the mechanics and the judgment calls: the structure of a good test, the difference between mocks and stubs, the limits of coverage metrics, and the patterns that make a suite maintainable over years rather than months.

What is a unit?

The term "unit test" is contested. In the solitary or "London school" (mockist) definition, a unit is a single class or function, and a unit test exercises it in complete isolation, replacing every dependency with a test double. In the sociable or "classical" (Detroit school) definition, a unit is a cluster of objects that collaborate to deliver a behavior, and only external dependencies (databases, network calls, the clock) are replaced.

The solitary definition maximizes isolation but produces tests so granular that they break whenever you reorganize code, even when the external behavior is identical. The sociable definition produces tests that are more resilient to refactoring but harder to write for complex graphs of objects.

In practice, a good heuristic is: isolate at the process boundary. Replace anything that crosses a process boundary (database, filesystem, external HTTP) or that is nondeterministic (clock, random number generator), but let in-process collaborators interact normally. This gives you fast, deterministic tests that still reflect how the components actually work together.
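
As a minimal sketch of this heuristic, the test below runs an in-process collaborator for real and replaces only the dependency that would otherwise make a network call. The names `TaxPolicy`, `StubRateClient`, and `InvoiceCalculator` are hypothetical and exist only for illustration.

```python
from decimal import Decimal


class TaxPolicy:
    """In-process collaborator: pure logic, so the test uses the real one."""

    def tax_for(self, net: Decimal) -> Decimal:
        return net * Decimal("0.20")


class StubRateClient:
    """Stands in for a (hypothetical) client that would call an HTTP service."""

    def __init__(self, rates):
        self._rates = rates

    def rate(self, currency: str) -> Decimal:
        return self._rates[currency]


class InvoiceCalculator:
    def __init__(self, tax_policy, rate_client):
        self._tax_policy = tax_policy
        self._rate_client = rate_client

    def gross_in(self, currency: str, net_usd: Decimal) -> Decimal:
        gross_usd = net_usd + self._tax_policy.tax_for(net_usd)
        return gross_usd * self._rate_client.rate(currency)


def test_gross_is_converted_after_tax():
    # Arrange: real TaxPolicy, stubbed network dependency
    calculator = InvoiceCalculator(
        tax_policy=TaxPolicy(),
        rate_client=StubRateClient({"EUR": Decimal("0.50")}),
    )

    # Act
    gross = calculator.gross_in("EUR", Decimal("100.00"))

    # Assert: (100.00 + 20% tax) * 0.50
    assert gross == Decimal("60.00")
```

If `TaxPolicy` were later merged into the calculator, this test would keep passing, because it never mentions how the tax is computed internally.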

The AAA pattern

Every test should be readable as a story in three acts: Arrange (set up the state), Act (run the code under test), Assert (verify the outcome). Separating these three phases visually makes the test self-documenting.

import pytest
from decimal import Decimal
from myapp.cart import ShoppingCart

def test_discount_applied_when_total_exceeds_threshold():
    # Arrange
    cart = ShoppingCart()
    cart.add_item("widget", price=Decimal("30.00"), qty=4)   # total = 120.00

    # Act
    cart.apply_loyalty_discount(threshold=Decimal("100.00"), rate=Decimal("0.10"))

    # Assert
    assert cart.total() == Decimal("108.00")

A test that does not follow this shape is harder to read and usually tests too many things at once. If you find yourself writing "act, assert, act, assert" in a single test, split it into two tests. Each test should verify one logical outcome: not one assertion, but one scenario.
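
One way the split might look, using a hypothetical `MiniCart` stand-in (not the `ShoppingCart` from the example above) so the sketch runs on its own. Each test has exactly one Act step; what was an earlier "act" in a combined test becomes part of Arrange in the second.

```python
from decimal import Decimal


class MiniCart:
    """Hypothetical stand-in for a cart, just enough to run the tests."""

    def __init__(self):
        self._items = {}

    def add_item(self, name, price, qty=1):
        self._items[name] = (price, qty)

    def remove_item(self, name):
        del self._items[name]

    def total(self):
        return sum(
            (price * qty for price, qty in self._items.values()),
            Decimal("0"),
        )


def test_total_reflects_added_item():
    # Arrange
    cart = MiniCart()

    # Act
    cart.add_item("widget", price=Decimal("30.00"), qty=2)

    # Assert
    assert cart.total() == Decimal("60.00")


def test_total_is_zero_after_removing_only_item():
    # Arrange: adding the item is setup here, not the behavior under test
    cart = MiniCart()
    cart.add_item("widget", price=Decimal("30.00"), qty=2)

    # Act
    cart.remove_item("widget")

    # Assert
    assert cart.total() == Decimal("0")
```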

Test behavior, not implementation

The most common source of brittle tests is coupling assertions to how the code does something rather than what it produces. A test that checks which private methods were called, or that inspects internal data structures, will break whenever you refactor those internals, even though the behavior has not changed.

# Brittle: coupled to implementation detail
def test_cache_hit():
    service = ProductService()
    service.get_product(42)
    service.get_product(42)
    assert service._cache._store.get(42) is not None   # reaches into internals

# Resilient: tests the observable outcome
def test_second_call_returns_same_product():
    service = ProductService()
    first = service.get_product(42)
    second = service.get_product(42)
    assert first == second

The second test survives a complete rewrite of the caching layer. The first one breaks if you rename a field.
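
If the caching itself is the behavior you care about, it can still be verified from the outside by counting calls at the boundary rather than inspecting the cache. The sketch below assumes a `ProductService` that takes its data source as a constructor argument; that signature and the `CountingBackend` class are illustrative assumptions, not the service from the example above.

```python
class CountingBackend:
    """Hypothetical stand-in for the slow data source behind the service."""

    def __init__(self):
        self.fetch_count = 0

    def fetch(self, product_id):
        self.fetch_count += 1
        return {"id": product_id, "name": f"product-{product_id}"}


class ProductService:
    """Illustrative service; a real implementation may look quite different."""

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}

    def get_product(self, product_id):
        if product_id not in self._cache:
            self._cache[product_id] = self._backend.fetch(product_id)
        return self._cache[product_id]


def test_repeated_lookup_hits_backend_once():
    # Arrange
    backend = CountingBackend()
    service = ProductService(backend=backend)

    # Act
    first = service.get_product(42)
    second = service.get_product(42)

    # Assert: observed at the boundary, not inside the cache
    assert first == second
    assert backend.fetch_count == 1
```

Swapping the dict for an LRU cache or an external cache client would not break this test, as long as the second lookup still avoids the backend.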

Mocks, stubs, and fakes

Test doubles are objects that stand in for real dependencies during testing. The vocabulary matters because each type makes different guarantees:

Stub: returns a pre-configured response. Used when you need the dependency to return something specific so the code under test can proceed. No assertions are made about the stub itself.
Mock: records how it was called and can assert that it was called in a specific way. Use mocks when the call itself is the side effect you want to verify (sending an email, writing to a queue).
Fake: a working implementation simplified for testing (an in-memory database, an in-memory filesystem). Fakes are more expensive to write but far more robust than mocks for complex interactions.

from unittest.mock import MagicMock

def test_order_confirmation_email_sent_on_success():
    # Arrange
    mailer = MagicMock()                        # mock: we will assert on it
    repo = MagicMock()                          # stub: just needs to not fail
    repo.save.return_value = True

    service = OrderService(repo=repo, mailer=mailer)

    # Act
    service.place_order(order_id=99, customer_email="customer@example.com")

    # Assert -- the call IS the behavior we care about
    mailer.send_confirmation.assert_called_once_with(
        to="customer@example.com",
        order_id=99
    )

Overusing mocks is one of the most common testing mistakes, second perhaps only to not writing tests at all. If a test mocks out ten objects to exercise one function, you are testing the wiring, not the logic. The function should be broken down, or the dependencies should be simplified, until the test doubles are minimal.
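
When a test needs a dependency that stores and returns state, a fake often reads better than a chain of configured mocks. The following is a minimal sketch; `InMemoryOrderRepository` and `MiniOrderService` are hypothetical names, and the service is deliberately simpler than the `OrderService` used earlier.

```python
class InMemoryOrderRepository:
    """Fake: a working repository that keeps orders in a dict."""

    def __init__(self):
        self._orders = {}

    def save(self, order_id, order):
        self._orders[order_id] = dict(order)
        return True

    def get(self, order_id):
        return self._orders.get(order_id)


class MiniOrderService:
    """Illustrative service; names and signatures are assumptions."""

    def __init__(self, repo):
        self._repo = repo

    def place_order(self, order_id, items):
        if not items:
            raise ValueError("an order needs at least one item")
        self._repo.save(order_id, {"items": list(items), "status": "placed"})

    def cancel_order(self, order_id):
        order = self._repo.get(order_id)
        if order is None:
            raise KeyError(order_id)
        order["status"] = "cancelled"
        self._repo.save(order_id, order)


def test_cancelled_order_is_stored_as_cancelled():
    # Arrange: no return values to configure, the fake simply works
    repo = InMemoryOrderRepository()
    service = MiniOrderService(repo=repo)
    service.place_order(order_id=99, items=["widget"])

    # Act
    service.cancel_order(order_id=99)

    # Assert on the resulting state, not on which methods were called
    assert repo.get(99)["status"] == "cancelled"
```

The same fake can be shared across many tests, which is where the up-front cost of writing it tends to pay off.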

Edge cases are where bugs live

The typical test verifies the happy path with a representative input. This is necessary but not sufficient. A disproportionate share of the bugs that reach production sit in edge cases: empty collections, zero values, maximum values, None inputs, concurrent access, network timeouts. Each of these deserves its own test.

@pytest.mark.parametrize("items,expected", [
    ([],           Decimal("0.00")),     # empty cart
    ([("x", "0.01", 1)], Decimal("0.01")),  # minimum price
    ([("y", "999.99", 100)], Decimal("99999.00")),  # large order
])
def test_cart_total_edge_cases(items, expected):
    cart = ShoppingCart()
    for name, price, qty in items:
        cart.add_item(name, price=Decimal(price), qty=qty)
    assert cart.total() == expected

Parametrized tests (pytest's @pytest.mark.parametrize, JUnit 5's @ParameterizedTest) let you express many input/output pairs without duplicating the test body. This is far better than repeating nearly identical test functions for each case.

What coverage tells you and what it does not

Code coverage measures which lines (or branches) were executed during the test run: 80% line coverage means 20% of lines were never executed. The metric says nothing about whether the results were checked. Reaching 100% does not mean your tests are good; it means every line was executed at least once. A test that calls a function with a valid input and never checks the return value still "covers" the function, but it is not a useful test.

Use coverage as a floor, not a goal. A coverage report that flags a large block of untested code is useful: go write tests for it. A project that treats 80% as a success criterion and stops there has learned the wrong lesson from the metric. The useful question is not "what percentage of lines did we execute?" but "do we have tests for the failure modes and edge cases that matter?"

Mutation testing is a more rigorous alternative: the tool automatically introduces small bugs into your code (flipping a > to >=, deleting a line) and checks whether your tests catch the change. A suite that misses most mutations is not doing its job regardless of coverage percentage.
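
The idea can be illustrated by hand, without any tool. In the sketch below, `is_adult_mutant` is the kind of variant a mutation tool would generate automatically, and the two "suites" are plain functions that return whether all their checks pass. All names are hypothetical.

```python
def is_adult(age):
    return age >= 18


def is_adult_mutant(age):
    # The kind of change a mutation tool makes: >= became >
    return age > 18


def weak_suite(fn):
    """Checks only values far from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Adds the boundary value, which is where the mutant differs."""
    return weak_suite(fn) and fn(18) is True


# A suite "kills" a mutant when it passes on the original
# and fails on the mutated version.
weak_kills_mutant = weak_suite(is_adult) and not weak_suite(is_adult_mutant)
strong_kills_mutant = strong_suite(is_adult) and not strong_suite(is_adult_mutant)
```

Here `weak_kills_mutant` is False and `strong_kills_mutant` is True, even though both suites execute the single line of `is_adult`. The surviving mutant points directly at the missing boundary test.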

Test speed and organization

Tests that run slowly do not get run. The goal is a suite that completes in seconds for local development, so developers run it constantly rather than at commit time only. The main strategies:

Keep unit tests (in-memory, no I/O) separate from integration tests (database, network). Run unit tests locally on every save; run integration tests in CI.
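Do not hit the real database in unit tests. A fake repository or an in-memory SQLite database is typically far faster than a real PostgreSQL instance reached over a network socket, though SQLite's dialect differs from PostgreSQL's, so it does not replace integration tests against the real engine.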
Avoid time.sleep in tests. If you need to test time-dependent behavior, inject the clock as a dependency and control it in the test.

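# Inject the clock so tests can control "now"
from dataclasses import dataclass
from datetime import datetime, timezone

def utc_now():
    # datetime.utcnow() is deprecated since Python 3.12; return an aware datetime
    return datetime.now(timezone.utc)

@dataclass
class Token:
    # Minimal stand-in so the example runs on its own
    expires_at: datetime

class TokenService:
    def __init__(self, clock=None):
        self._clock = clock or utc_now

    def is_expired(self, token):
        return self._clock() > token.expires_at

# In the test
def test_expired_token_rejected():
    future = datetime(2030, 1, 1, tzinfo=timezone.utc)
    service = TokenService(clock=lambda: future)
    token = Token(expires_at=datetime(2025, 1, 1, tzinfo=timezone.utc))
    assert service.is_expired(token)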
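
When a test needs time to pass, a controllable clock can take the place of time.sleep. The sketch below is one possible shape, assuming a hypothetical `FakeClock` with an `advance` method and an illustrative `Session` class; neither comes from a library.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Hypothetical controllable clock: time moves only when the test says so."""

    def __init__(self, start):
        self._now = start

    def __call__(self):
        return self._now

    def advance(self, delta):
        self._now += delta


class Session:
    """Illustrative class that expires a fixed time after creation."""

    def __init__(self, clock, ttl):
        self._clock = clock
        self._expires_at = clock() + ttl

    def is_active(self):
        return self._clock() < self._expires_at


START = datetime(2025, 1, 1, tzinfo=timezone.utc)


def test_session_active_before_ttl():
    clock = FakeClock(START)
    session = Session(clock=clock, ttl=timedelta(minutes=30))

    clock.advance(timedelta(minutes=29))

    assert session.is_active()


def test_session_expired_after_ttl():
    clock = FakeClock(START)
    session = Session(clock=clock, ttl=timedelta(minutes=30))

    clock.advance(timedelta(minutes=31))  # instant, no time.sleep needed

    assert not session.is_active()
```

Both tests simulate about half an hour of elapsed time and still finish immediately, which keeps them in the fast, local suite.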

A test suite that is fast, focused on behavior, and minimal in its use of mocks stays useful as the codebase grows. One that is slow, coupled to implementation details, and littered with assertions on internal state becomes a burden within months — and gets turned off rather than fixed.