
How Does a Compiler Work?

By the CodeCompiler Team · 11 min read

In our companion article, What Is a Code Compiler?, we described a compiler as a translator that converts source code into something a machine can execute. That description is accurate, but it hides a great deal of internal machinery. A compiler isn't a single step; it's a pipeline of distinct phases, each responsible for one specific transformation. Understanding these phases doesn't just satisfy curiosity. It also explains why compilers produce the error messages they do, why some optimizations are possible and others aren't, and why compile times range from milliseconds for a single small file to many minutes for a large project.

Let's walk through the pipeline using a tiny, concrete example: the line of code total = price * quantity;. We'll follow this single statement all the way from raw text to something a machine could theoretically execute.

Stage 1: Lexical Analysis (Tokenizing)

The compiler's first job is deceptively simple: read the raw source code, which is just a sequence of characters, and group those characters into meaningful chunks called tokens. This stage is handled by a component usually called a lexer or scanner. For our example line, the lexer would produce a stream of tokens roughly like this: an identifier total, an assignment operator =, an identifier price, a multiplication operator *, an identifier quantity, and a statement terminator ;.

At this stage, the compiler doesn't yet understand what any of this means — it has no idea that total is a variable that will hold a number, or that this line is even a valid statement in the language. It is purely a text-processing step: turning a flat string of characters into a structured sequence of labeled pieces. Whitespace and comments are typically discarded here, since they don't carry meaning for the program's execution (though some compilers preserve comments as metadata for tooling purposes).
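To make this concrete, here is a minimal sketch of a lexer for our example line, written in Python with the standard re module. The token names (IDENT, ASSIGN, STAR, SEMI, NUMBER) and the tokenize function are our own illustrative choices, not the vocabulary of any particular compiler, and a real lexer would also track line and column numbers.

```python
import re

# Hypothetical token categories for a tiny C-like language (illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("STAR", r"\*"),
    ("SEMI", r";"),
    ("SKIP", r"\s+"),  # whitespace is recognized, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (kind, text) pairs; raise SyntaxError on an unknown character."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("total = price * quantity;"))
```

Running it prints six (kind, text) pairs, one for each token listed above. Notice that the lexer happily accepts total = * price quantity; as well: every character belongs to some token, and checking the order of tokens is not its job.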

Stage 2: Syntax Analysis (Parsing)

Once the source code has been broken into tokens, the next stage — parsing — organizes those tokens according to the grammatical rules of the programming language. The result is typically a tree structure called an abstract syntax tree, or AST. For our example, the AST would represent the fact that this is an assignment statement, where the left-hand side is the variable total, and the right-hand side is a multiplication expression involving price and quantity.

This is also where many "syntax errors" are caught — for example, if you forgot the semicolon, or wrote total = * price quantity; with the operator in the wrong place, the parser would fail here, because the token sequence doesn't match any valid grammatical pattern the language defines. The error messages you see in your editor or terminal that say things like "unexpected token" or "expected an expression" almost always originate from this stage.
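A small sketch can show both halves of this stage: building the tree and rejecting bad input. The Python below parses one assignment statement using a deliberately tiny grammar we made up for this article (an identifier, =, one or more operands joined by *, then ;). The token list is written out by hand in the shape a lexer might produce, and the nested-tuple AST format is an assumption chosen for brevity.

```python
def parse_assignment(tokens):
    """Parse  IDENT '=' operand ('*' operand)* ';'  into a nested-tuple AST."""
    pos = 0

    def peek():
        # Kind of the next token, or "EOF" once the input is exhausted.
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        text = tokens[pos][1]
        pos += 1
        return text

    def operand():
        if peek() == "IDENT":
            return ("name", expect("IDENT"))
        if peek() == "NUMBER":
            return ("num", float(expect("NUMBER")))
        raise SyntaxError(f"expected an expression, found {peek()}")

    target = expect("IDENT")
    expect("ASSIGN")
    value = operand()
    while peek() == "STAR":
        expect("STAR")
        value = ("mul", value, operand())  # left-associative multiplication
    expect("SEMI")
    if peek() != "EOF":
        raise SyntaxError(f"unexpected token {peek()}")
    return ("assign", target, value)


# Hand-written tokens for: total = price * quantity;
tokens = [("IDENT", "total"), ("ASSIGN", "="), ("IDENT", "price"),
          ("STAR", "*"), ("IDENT", "quantity"), ("SEMI", ";")]
print(parse_assignment(tokens))
```

The result is an ('assign', 'total', ('mul', ...)) tree with the two names as leaves. Feed it the tokens for total = * price quantity; and it raises SyntaxError with "expected an expression", because a * appears where the grammar requires an operand.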

Stage 3: Semantic Analysis

A program can be grammatically valid and still be nonsensical. total = price * quantity; is syntactically fine even if price was never declared anywhere in the program, or if price is a string of text rather than a number. Semantic analysis is the stage where the compiler checks these deeper rules: that every variable used has actually been declared, that types are compatible with the operations being performed on them, that function calls match the expected number and type of arguments, and so on.

This is the stage responsible for most of the "type errors" you encounter in statically typed languages like Java, C#, or TypeScript. It's also where the compiler builds and consults a symbol table — essentially a lookup structure that tracks every identifier in the program, what it refers to, and what type it has.
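Here is a rough sketch of what those checks could look like for our one-line program. The symbol table is modeled as a plain Python dict mapping names to type names, and the AST is a nested tuple. Both are simplifying assumptions for illustration, as are the two type names "number" and "string" and the wording of the error messages.

```python
def check_assignment(ast, symbols):
    """Check declarations and operand types for an ('assign', target, expr) tree."""
    errors = []

    def type_of(node):
        kind = node[0]
        if kind == "num":
            return "number"
        if kind == "name":
            name = node[1]
            if name not in symbols:
                errors.append(f"'{name}' is not declared")
                return None  # unknown type; avoids piling on follow-up errors
            return symbols[name]
        if kind == "mul":
            left, right = type_of(node[1]), type_of(node[2])
            for side in (left, right):
                if side is not None and side != "number":
                    errors.append(f"cannot multiply a value of type {side}")
            return "number"
        raise ValueError(f"unknown node kind {kind!r}")

    _, target, expr = ast
    expr_type = type_of(expr)
    if target not in symbols:
        errors.append(f"'{target}' is not declared")
    elif expr_type is not None and symbols[target] != expr_type:
        errors.append(f"cannot assign {expr_type} to '{target}' of type {symbols[target]}")
    return errors


ast = ("assign", "total", ("mul", ("name", "price"), ("name", "quantity")))
print(check_assignment(ast, {"total": "number", "price": "number", "quantity": "number"}))
print(check_assignment(ast, {"total": "number", "price": "string"}))
```

The first call prints an empty list, since every name is declared as a number. The second reports two problems for the very same tree: quantity is not declared, and price is a string that cannot be multiplied. The syntax never changed; only the information in the symbol table did.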

Stage 4: Intermediate Representation and Optimization

Many compilers don't jump directly from the AST to final machine code. Instead, they first translate the AST into one or more simplified intermediate representations (IR) — a form that is easier to analyze and transform than the original source syntax, but still independent of any specific target processor. This is where much of the "smart" work of a modern compiler happens.

Optimization passes analyze this intermediate representation looking for improvements that don't change the program's observable behavior. Classic examples include:

Constant folding — computing the result of an expression like 2 * 3 at compile time instead of at runtime, since the result will always be 6.
Dead code elimination — removing code that can never be reached or whose results are never used.
Inlining — replacing a function call with the body of the function itself, avoiding the overhead of the call when it's safe and beneficial to do so.
Loop optimizations — restructuring loops to reduce redundant work performed on every iteration.

Production compilers like GCC and Clang, and the JIT compilers inside the JVM, contain a great many individual optimization passes (well over a hundred in the case of GCC and of LLVM, the framework behind Clang), refined over decades. That is a large part of why compiled code can run dramatically faster than a naive, unoptimized translation of the same source.
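As a taste of what a single pass looks like, here is a minimal sketch of constant folding in Python. It works on a nested-tuple expression tree of our own invention, with ("num", value), ("name", identifier), and ("mul", left, right) nodes, and handles only multiplication. Real compilers typically run this kind of pass over a lower-level IR and cover many more operators.

```python
def fold_constants(node):
    """Return an equivalent expression tree with constant multiplications precomputed."""
    if node[0] == "mul":
        left = fold_constants(node[1])
        right = fold_constants(node[2])
        if left[0] == "num" and right[0] == "num":
            # Both operands are known at compile time, so do the arithmetic now.
            return ("num", left[1] * right[1])
        return ("mul", left, right)
    return node  # names and numbers are already as simple as they can be


# price * (2 * 3)
expr = ("mul", ("name", "price"), ("mul", ("num", 2), ("num", 3)))
print(fold_constants(expr))
```

The inner 2 * 3 collapses to a single ("num", 6) node, while the outer multiplication survives because the value of price is not known until the program runs.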

Stage 5: Code Generation

The final stage translates the (now optimized) intermediate representation into the actual target output. For a traditional native compiler, that means real machine instructions for a specific processor architecture (x86-64, ARM, RISC-V, and so on), taking into account that architecture's registers, instruction set, and calling conventions. For a language like Java, the target is Java bytecode, a platform-independent instruction format that the Java Virtual Machine knows how to execute (or further compile, just-in-time, into native code). For a JavaScript engine, the target is typically an internal bytecode at first, with frequently executed code compiled to native machine code on the fly while the program is running.

Whatever the specific target, this stage is where our original line total = price * quantity; finally becomes something concrete: a small handful of instructions that load two values, multiply them, and store the result — expressed in whatever format the destination machine or virtual machine actually understands.
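To sketch that last step, the Python below generates instructions for an imaginary stack machine and then runs them. Everything about the target is an assumption made for illustration: the four instruction names (PUSH, LOAD, MUL, STORE), the nested-tuple AST, and the dict standing in for memory. A real backend would emit instructions for an actual CPU or virtual machine and would also have to allocate registers.

```python
def generate(ast):
    """Emit instructions for a hypothetical stack machine from an ('assign', ...) tree."""
    code = []

    def emit_expr(node):
        kind = node[0]
        if kind == "num":
            code.append(("PUSH", node[1]))
        elif kind == "name":
            code.append(("LOAD", node[1]))
        elif kind == "mul":
            emit_expr(node[1])  # operands first, then the operator
            emit_expr(node[2])
            code.append(("MUL", None))
        else:
            raise ValueError(f"unknown node kind {kind!r}")

    _, target, expr = ast
    emit_expr(expr)
    code.append(("STORE", target))
    return code


def run(code, memory):
    """Execute the instructions against a dict standing in for memory."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(memory[arg])
        elif op == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "STORE":
            memory[arg] = stack.pop()
    return memory


ast = ("assign", "total", ("mul", ("name", "price"), ("name", "quantity")))
code = generate(ast)
print(code)
print(run(code, {"price": 4, "quantity": 3}))
```

The generated program is four instructions long: load price, load quantity, multiply, store into total. Running it with price set to 4 and quantity set to 3 leaves total equal to 12 in memory.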

Worth remembering: not every compiler executes every stage in isolation, and some modern compilers interleave stages or loop back and forth between them. But conceptually, lexing → parsing → semantic analysis → optimization → code generation remains the mental model that holds up across virtually every compiler in existence, from tiny teaching compilers to industrial-grade systems like LLVM.

Why This Matters Even If You'll Never Write a Compiler

Very few working developers will ever write a compiler from scratch. So why does any of this matter? Because understanding the pipeline makes you a better programmer in practical ways. When you understand that semantic analysis runs after parsing, you understand why a single syntax error can hide the type errors elsewhere in a file until it is fixed: many compilers never reach type checking for code they could not parse. When you understand optimization passes, you understand why "clever" micro-optimizations you write by hand are often unnecessary, since the compiler will frequently do them for you, and often better than you would have. And when you understand code generation, you start to understand why the same high-level code can perform very differently on different platforms.

Key takeaways

Compilation is a pipeline: lexical analysis, parsing, semantic analysis, optimization, and code generation.
Lexical analysis turns raw text into tokens; parsing organizes tokens into a tree that reflects the language's grammar.
Semantic analysis checks meaning beyond grammar: that names are declared, that types are compatible, and that function calls match the expected arguments.
Optimization passes improve performance without changing a program's observable behavior.
Code generation produces the final output: native machine code, bytecode, or another executable format.
