컴파일러 기초 — Lexer, Parser, IR, 최적화

개요

컴파일러는 소스 코드를 실행 가능한 기계어로 변환하는 프로그램입니다. 게임 개발에서도 셰이더 컴파일, 스크립트 언어(Lua, AngelScript), 블루프린트 컴파일 등 컴파일러 기술이 광범위하게 활용됩니다.

1. 컴파일러 파이프라인

소스 코드 (텍스트)
    ↓ 어휘 분석 (Lexer/Tokenizer)
토큰 스트림
    ↓ 구문 분석 (Parser)
AST (추상 구문 트리)
    ↓ 의미 분석 (Semantic Analysis)
타입 정보가 추가된 AST
    ↓ IR 생성 (IR Generation)
중간 표현 (SSA Form 등)
    ↓ 최적화 (Optimization Passes)
최적화된 IR
    ↓ 코드 생성 (Code Generation)
기계어 / 바이트코드

2. 어휘 분석 (Lexer)

#include <string>
#include <vector>

enum class TokenType {
    Number, Identifier, Plus, Minus, Star, Slash,
    LeftParen, RightParen, Equal, Semicolon, EndOfFile
};

struct Token {
    TokenType type;
    std::string value;
    int line;
};

class Lexer {
    std::string source;
    size_t pos = 0;
    int line = 1;

public:
    explicit Lexer(const std::string& src) : source(src) {}

    std::vector<Token> Tokenize() {
        std::vector<Token> tokens;

        while (pos < source.size()) {
            SkipWhitespace();
            if (pos >= source.size()) break;

            char c = source[pos];

            if (std::isdigit(c)) {
                tokens.push_back(ReadNumber());
            } else if (std::isalpha(c) || c == '_') {
                tokens.push_back(ReadIdentifier());
            } else {
                tokens.push_back(ReadSymbol());
            }
        }

        tokens.push_back({TokenType::EndOfFile, "", line});
        return tokens;
    }

private:
    Token ReadNumber() {
        std::string num;
        while (pos < source.size() && std::isdigit(source[pos]))
            num += source[pos++];
        return {TokenType::Number, num, line};
    }

    Token ReadIdentifier() {
        std::string id;
        while (pos < source.size() &&
            (std::isalnum(source[pos]) || source[pos] == '_'))
            id += source[pos++];
        return {TokenType::Identifier, id, line};
    }

    void SkipWhitespace() {
        while (pos < source.size() && std::isspace(source[pos])) {
            if (source[pos] == '\n') line++;
            pos++;
        }
    }

    Token ReadSymbol();
};

3. 구문 분석 (Parser) — 재귀 하강

// 간단한 수식 파서
// expr = term (('+' | '-') term)*
// term = factor (('*' | '/') factor)*
// factor = Number | '(' expr ')'

struct ASTNode {
    enum class Kind { Number, BinaryOp };
    Kind kind;
    int value; // Number일 때
    char op;   // BinaryOp일 때
    std::unique_ptr<ASTNode> left, right;
};

class Parser {
    std::vector<Token> tokens;
    size_t pos = 0;

    Token& Peek() { return tokens[pos]; }
    Token Consume() { return tokens[pos++]; }

    std::unique_ptr<ASTNode> ParseExpr() {
        auto left = ParseTerm();

        while (Peek().value == "+" || Peek().value == "-") {
            char op = Consume().value[0];
            auto right = ParseTerm();

            auto node = std::make_unique<ASTNode>();
            node->kind = ASTNode::Kind::BinaryOp;
            node->op = op;
            node->left = std::move(left);
            node->right = std::move(right);
            left = std::move(node);
        }

        return left;
    }

    std::unique_ptr<ASTNode> ParseFactor() {
        auto tok = Consume();
        auto node = std::make_unique<ASTNode>();
        node->kind = ASTNode::Kind::Number;
        node->value = std::stoi(tok.value);
        return node;
    }
};

4. 중간 표현 (IR) — SSA Form

// 소스 코드
int x = 3 + 4;
int y = x * 2;

// Three-Address Code IR
t1 = 3 + 4       // 임시 변수 t1
x = t1           // 대입
t2 = x * 2       // 임시 변수 t2
y = t2

// SSA (Static Single Assignment) — 각 변수는 한 번만 정의
x1 = 3 + 4      // 첨자로 버전 구분
y1 = x1 * 2

5. 최적화 패스

// 1. 상수 전파 (Constant Propagation)
// Before: x = 3 + 4; y = x * 2;
// After:  y = 14;  (컴파일 타임에 계산)

// 2. Dead Code Elimination
// if (false) { x = 5; }  → 제거

// 3. Loop Invariant Code Motion
// for (int i = 0; i < n; i++) {
//     y = heavy_compute(); // 루프 불변
//     arr[i] += y;
// }
// → 변환:
// y = heavy_compute(); // 루프 밖으로
// for (int i = 0; i < n; i++) arr[i] += y;

// 4. Inlining
// inline void f() { ... }
// f(); → f의 본문을 호출 위치에 직접 삽입

// 5. Strength Reduction
// x * 2  → x << 1  (곱셈 → 시프트)
// x * 8  → x << 3

6. LLVM 아키텍처

프론트엔드 (Clang, Rustc, swiftc)
  ↓ LLVM IR (.ll 파일) 생성
LLVM 최적화 패스 (opt)
  ↓ 최적화된 LLVM IR
백엔드 (x86-64, ARM, WebAssembly)
  ↓ 기계어 생성

# LLVM IR 확인
clang -S -emit-llvm hello.c -o hello.ll
cat hello.ll
# → define i32 @main() { ... }

# 최적화 적용
opt -O2 hello.ll -o hello_opt.ll

7. 인터프리터 vs 컴파일러 vs JIT

방식	장점	단점	사례
인터프리터	빠른 시작, 유연	느린 실행	Python, Lua
AOT 컴파일	빠른 실행	느린 빌드	C++, Rust
JIT 컴파일	빠른 실행 + 유연	워밍업 지연	Java, C#, JS V8

정리

컴파일러: Lexer → Parser → AST → IR → 최적화 → 코드 생성
Lexer: 문자 스트림 → 토큰
Parser: 토큰 → AST (재귀 하강이 가장 일반적)
IR: 기계 독립적 중간 표현 (SSA가 현대 표준)
최적화: 상수 전파, DCE, 인라이닝, 강도 축소
LLVM: 모듈화된 프론트/백엔드 분리 컴파일러 프레임워크

컴파일러 기초 — Lexer, Parser, IR, 최적화

개요

1. 컴파일러 파이프라인

2. 어휘 분석 (Lexer)

3. 구문 분석 (Parser) — 재귀 하강

4. 중간 표현 (IR) — SSA Form

5. 최적화 패스

6. LLVM 아키텍처

7. 인터프리터 vs 컴파일러 vs JIT

정리

참고 자료