Character Encoding and UTF-8: What Every Developer Needs to Know

Encoding bugs produce some of the most baffling output in software: garbled text, question-mark boxes, or the classic double-encoded mess that turns an em-dash into three strange symbols. The underlying model is simple enough to learn in one sitting — and once you have it, most encoding bugs become obvious.

Published June 28, 2026

Text in a computer is ultimately stored as bytes. A character encoding is the mapping that converts between characters (the symbols humans read) and bytes (the numbers computers store). When the encoder and decoder disagree about which mapping to use, you get garbage.

ASCII: 128 characters, one byte each

ASCII (American Standard Code for Information Interchange, 1963) maps 128 characters to the numbers 0–127. Every uppercase and lowercase Latin letter, digit, punctuation mark, and control character fits in seven bits, stored in one byte.

# ASCII in Python
>>> ord('A')     # character to code point
65
>>> chr(65)      # code point to character
'A'
>>> ord(' ')
32
>>> ord('~')
126

# ASCII was sufficient for English-only text.
# Code points 128-255 were left undefined by ASCII
# and were filled in differently by dozens of competing encodings.

The problem arrived when software needed to handle languages beyond English. Vendors and standards bodies assigned different characters to the byte values 128–255 (Latin-1, Windows-1252, ISO-8859-5 for Cyrillic, and many others), so the same bytes decode to different characters depending on which encoding the reader assumes. Even close relatives disagree: Windows-1252 places curly quotes and the euro sign in the range 0x80–0x9F, where ISO-8859-1 has control characters. This is the origin of most classic "garbage text" bugs.
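To make the disagreement concrete, the short sketch below decodes the same bytes under three of the legacy encodings named above. The codec names are Python's standard aliases, and the byte values were picked only for illustration.

```python
# One byte decoded under three legacy encodings.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")      # ISO-8859-1
as_cp1252 = raw.decode("cp1252")       # Windows-1252
as_cyrillic = raw.decode("iso8859_5")  # ISO-8859-5

# The two Western encodings agree here; the Cyrillic one does not.
print(as_latin1, as_cp1252, as_cyrillic)

# Latin-1 and Windows-1252 differ in the range 0x80-0x9F:
quote = bytes([0x93])
print(repr(quote.decode("cp1252")))   # left curly quotation mark, U+201C
print(repr(quote.decode("latin-1")))  # C1 control character, U+0093
```

Nothing in the bytes themselves says which reading is right, which is why the encoding has to be communicated separately.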

Unicode: a universal character set

Unicode solves the mapping problem by assigning a unique code point to every character in every writing system. Code points are written as U+ followed by a hexadecimal number and range from U+0000 to U+10FFFF. More than 150,000 characters have been assigned so far, covering scripts, symbols, emoji, and historical characters.

# Unicode code points in Python
>>> ord('A')         # Basic Latin
65
>>> hex(ord('A'))
'0x41'               # U+0041

>>> ord('é')         # e with acute accent, as in French "café"
233

>>> ord('\U0001F600') # emoji: grinning face
128512

# A "string" in Python 3 is a sequence of Unicode code points,
# not a sequence of bytes. The encoding question only arises
# when you write the string to disk or send it over a network.

Unicode defines what characters exist and their code point numbers, but not how to store those numbers as bytes. That is what encoding formats like UTF-8 do.
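As a small illustration of that split between code points and bytes, the sketch below encodes a single code point under three Unicode encoding forms. The little-endian variants are chosen here only so that Python does not prepend a byte order mark.

```python
# The same code point, U+00E9, under three Unicode encoding forms.
ch = "\u00e9"  # é

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-le")  # little-endian, no byte order mark
utf32 = ch.encode("utf-32-le")

for name, data in [("UTF-8", utf8), ("UTF-16LE", utf16), ("UTF-32LE", utf32)]:
    print(f"{name:9} {len(data)} bytes  {data.hex(' ')}")
```

The code point is identical in all three cases; only the byte-level representation changes.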

UTF-8: variable-width encoding

UTF-8 is the dominant encoding on the web and in modern systems. It stores each code point as 1 to 4 bytes, using a variable-width scheme designed so that ASCII code points (0–127) are stored as a single byte with the same value as in ASCII. This makes UTF-8 fully backwards-compatible with ASCII: any valid ASCII file is also a valid UTF-8 file.

# How UTF-8 encodes code points:
# U+0000  to U+007F    1 byte   0xxxxxxx                              ASCII
# U+0080  to U+07FF    2 bytes  110xxxxx 10xxxxxx                     accented Latin, Greek, Cyrillic, Hebrew, Arabic
# U+0800  to U+FFFF    3 bytes  1110xxxx 10xxxxxx 10xxxxxx            rest of the BMP, including most CJK
# U+10000 to U+10FFFF  4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   supplementary planes, including most emoji

# In Python: encode a string to bytes, decode bytes to string
>>> "cafe".encode("utf-8")
b'cafe'

>>> "café".encode("utf-8")   # e with accent
b'caf\xc3\xa9'                    # 5 bytes for 4 visible characters

>>> b'caf\xc3\xa9'.decode("utf-8")
'café'

# Always specify the encoding explicitly:
with open("file.txt", "r", encoding="utf-8") as f:
    content = f.read()
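The bit layout in the table above can be sketched as a hand-written encoder. This is for illustration only: the function name is hypothetical, and real code should call str.encode("utf-8"), which the sketch is checked against.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Out of range, or a surrogate, which UTF-8 does not allow
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against the built-in codec for one character per width class.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    by_hand = utf8_encode_code_point(ord(ch))
    print(f"U+{ord(ch):04X}", by_hand.hex(" "), by_hand == ch.encode("utf-8"))
```

Because every continuation byte starts with the bits 10, a decoder that lands in the middle of a sequence can tell it is not at a character boundary.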

Length, indexing, and the multi-byte trap

UTF-8's variable-width design creates a subtlety: the byte length of a string is not the same as its character count. Operations that count or index bytes rather than characters will silently break on non-ASCII text.

# Python 3 strings measure length in code points (correct)
s = "café"   # cafe with accent
len(s)            # 4: four characters

# The encoded form is longer in bytes
len(s.encode("utf-8"))   # 5: c, a, f, 0xC3, 0xA9

# C strings use byte length by default -- danger with UTF-8:
# strlen("caf\xc3\xa9") returns 5, not 4

# JavaScript: .length counts UTF-16 code units, not code points
# An emoji (U+1F600) uses two UTF-16 code units ("surrogate pair")
"grinning: 😀".length   // 12, not 11
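One place the byte/character mismatch tends to bite is truncation to a byte limit, such as a fixed-width database column or a protocol field. The sketch below is one possible approach, assuming a non-negative limit; the helper name truncate_utf8 is hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 form fits in max_bytes."""
    clipped = text.encode("utf-8")[:max_bytes]
    # A cut in the middle of a multi-byte sequence leaves an incomplete tail;
    # errors="ignore" drops that tail instead of raising UnicodeDecodeError.
    return clipped.decode("utf-8", errors="ignore")


s = "caf\u00e9"  # café: 4 code points, 5 bytes in UTF-8

# Naive byte slicing cuts between 0xC3 and 0xA9 and fails to decode.
try:
    s.encode("utf-8")[:4].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print(naive_ok)
print(truncate_utf8(s, 4))
print(truncate_utf8(s, 5))
```

This works at the code point level only. It does not try to keep combining sequences or multi-code-point emoji together.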

Mojibake: when encodings mismatch

Mojibake (from Japanese: "character transformation") is the garbled text that results when bytes are decoded with the wrong encoding. The classic example is opening a UTF-8 file in a Latin-1 reader: the two-byte sequence for the accented "e" (0xC3 0xA9) is instead read as two separate Latin-1 characters ("A with tilde" and the copyright symbol), producing "cafÃ©" instead of "café".

# Simulating mojibake
original = "café"
encoded = original.encode("utf-8")    # b'caf\xc3\xa9'
wrong_decode = encoded.decode("latin-1")
print(wrong_decode)   # prints: cafÃ©  (wrong)

# Fixing it: re-encode with the wrong encoding, then decode correctly
fixed = wrong_decode.encode("latin-1").decode("utf-8")
print(fixed)          # prints: café
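The round-trip repair above raises an exception when the text was never mis-decoded in the first place, so applying it blindly is risky. A more defensive variant might look like the sketch below. The helper name try_fix_mojibake is hypothetical, and the approach assumes a single round of UTF-8 bytes misread as Latin-1.

```python
def try_fix_mojibake(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo one wrong decode if the round trip is valid; otherwise return text unchanged."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside the wrong encoding, or the
        # recovered bytes are not valid in the right one: leave it alone.
        return text


garbled = "caf\u00c3\u00a9"  # what "café" looks like after a wrong Latin-1 decode
healthy = "caf\u00e9"        # already correct

print(try_fix_mojibake(garbled))
print(try_fix_mojibake(healthy))
```

This is a heuristic: a string that happens to round-trip cleanly will be rewritten even if it was correct, so it is better suited to one-off data cleanup than to a request path.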

The BOM (Byte Order Mark, U+FEFF) is a zero-width character that some Windows tools prepend to UTF-8 files to identify the encoding. It is invisible but causes problems when software does not expect it — the BOM appears as three unexpected bytes (EF BB BF) at the start of the file. Use utf-8-sig in Python to automatically strip BOMs when reading.
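The difference between the two codecs can be seen without touching the filesystem. The sketch below builds a BOM-prefixed byte string with the standard library's codecs.BOM_UTF8 constant and decodes it both ways.

```python
import codecs

# EF BB BF followed by the UTF-8 bytes of the text
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = with_bom.decode("utf-8")         # BOM survives as a leading U+FEFF
stripped = with_bom.decode("utf-8-sig")  # BOM is removed

print(repr(plain))
print(repr(stripped))

# utf-8-sig also accepts input that has no BOM at all
no_bom = "hello".encode("utf-8").decode("utf-8-sig")
print(repr(no_bom))
```

Note that writing with utf-8-sig adds a BOM, so it is usually a reading-side choice; plain utf-8 remains the safer default for output.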

The universal rule

One rule eliminates most encoding problems: use UTF-8 everywhere and be explicit about it. Configure your database connections, HTTP responses, file reads, and file writes to use UTF-8 explicitly. Never rely on a system default that could vary by locale, operating system, or server configuration.

# HTTP: declare encoding in Content-Type header
Content-Type: text/html; charset=utf-8

# HTML: declare in meta tag (before any non-ASCII characters)
<meta charset="UTF-8">

# MySQL: set the character set and collation when creating the database
CREATE DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Note: use utf8mb4, not utf8 -- MySQL's "utf8" only supports 3-byte sequences
-- and cannot store 4-byte emoji code points

# Python: always pass encoding= to open()
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(data)

The utf8mb4 distinction in MySQL catches developers by surprise: MySQL's legacy utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold code points above U+FFFF, which includes most emoji. Always use utf8mb4 in MySQL to get real UTF-8 support. PostgreSQL's UTF8 encoding has no such limitation.
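To see which text is affected by the 3-byte limit, a check along the following lines may help. The helper name needs_four_byte_utf8 is hypothetical and is not part of any MySQL client library.

```python
def needs_four_byte_utf8(text: str) -> bool:
    """True if any code point lies above U+FFFF and so takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


samples = {
    "accented Latin": "caf\u00e9",
    "heart symbol U+2764": "\u2764",          # inside the BMP: 3 bytes
    "grinning face U+1F600": "\U0001F600",    # supplementary plane: 4 bytes
}

for label, text in samples.items():
    print(label, len(text.encode("utf-8")), needs_four_byte_utf8(text))
```

A check like this is mainly useful for auditing existing data before a migration; for new schemas, choosing utf8mb4 from the start avoids the question.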