Unified Optimization Stack for AI Systems - Semantic -> Memory -> Hardware

Overview

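The performance of large-scale AI models is not determined by compute optimization alone.

In practice, performance emerges from the interaction of three factors.

Mathematical meaning of the computation (Computation Semantics)
Structure of memory movement (Memory Movement)
Execution characteristics of the hardware (Hardware Behavior)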

Existing frameworks handle these three concerns separately.

Graph compiler - computation graph optimization
Kernel library - hardware-specific optimization
Autotuning - empirical performance search

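This approach has the following limitations.

Operator semantics are not reflected in kernel selection
Memory movement structure is not expressed explicitly in the IR
Hardware characteristics are not tied to a quantitative model

To address these limitations, three projects were designed as one connected structure.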

System Architecture

The overall structure consists of three layers, as follows.

GPU Probing Lab
      │
      ▼
MCIR (Memory-Centric IR)
      │
      ▼
AICF (AI Compiler Framework)

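Each layer addresses a different question.

GPU Probing : How does the GPU actually behave?
MCIR : In what structure does data move through memory?
AICF : What does the computation mean, and how can it be optimized?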

1. AICF - Semantic Optimization Framework

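AICF is a compiler system that performs optimization based on the mathematical meaning of operations.

A typical compiler optimizes based on the structure of the computation graph.

Examples

operator fusion
constant folding
kernel selection

These techniques do not directly exploit the mathematical properties of the operations.

AICF takes the following approach.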

Operator Semantics
      ↓
Semantic Properties
      ↓
IR Transformation
      ↓
Kernel Realization

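In other words, the set of possible transformations is defined from the mathematical properties an operation has (for example, associativity or commutativity).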
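As a minimal sketch of this idea, the Python snippet below attaches mathematical properties to operators and uses associativity to decide whether a three-matrix chain may be re-parenthesized into a cheaper form. This is not AICF's actual API: `OpSemantics`, `SEMANTICS`, `can_reassociate`, and `best_matmul_chain` are hypothetical names introduced only for illustration, and the cost model is a plain FLOP count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpSemantics:
    """Mathematical properties of an operator (hypothetical schema)."""
    associative: bool = False
    commutative: bool = False


# Hypothetical property table; a real system would cover many more operators.
SEMANTICS = {
    "matmul": OpSemantics(associative=True, commutative=False),
    "add": OpSemantics(associative=True, commutative=True),
    "sub": OpSemantics(associative=False, commutative=False),
}


def can_reassociate(op):
    """A chain may be re-parenthesized only if the operator is associative."""
    return SEMANTICS[op].associative


def matmul_flops(m, k, n):
    """FLOPs of an (m x k) @ (k x n) product, counting a multiply-add as 2."""
    return 2 * m * k * n


def best_matmul_chain(a_shape, b_shape, c_shape):
    """Pick the cheaper of (A B) C and A (B C) for a three-matrix chain."""
    (m, k), (k2, n), (n2, p) = a_shape, b_shape, c_shape
    assert k == k2 and n == n2, "shapes must be chain-compatible"
    left = matmul_flops(m, k, n) + matmul_flops(m, n, p)
    if not can_reassociate("matmul"):
        return "(A B) C", left
    right = matmul_flops(k, n, p) + matmul_flops(m, k, p)
    return ("(A B) C", left) if left <= right else ("A (B C)", right)


# Two square matrices followed by a column vector.
order, flops = best_matmul_chain((1024, 1024), (1024, 1024), (1024, 1))
print(order, flops)
```

The rewrite is legal only because the property table marks `matmul` as associative; the graph structure alone does not say so. For the shapes above, the right-nested form replaces one large matrix-matrix product with two matrix-vector products.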

2. MCIR - Memory-Centric Intermediate Representation

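MCIR is an IR built around the structure of memory movement.

In large models, the real bottleneck most often lies in memory movement.

This is especially visible in Transformer-era models, which show the following characteristic.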

Cost of Memory Movement
        >
Cost of Compute

Consequently, a plain computation graph is not enough to explain performance.
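A rough roofline-style calculation illustrates why the graph alone is insufficient. The sketch below compares the time a kernel would spend on compute with the time it would spend moving data. The peak numbers describe a hypothetical accelerator, not a measured device, and the per-element FLOP count for softmax is a coarse assumption.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel by comparing ideal compute time and ideal memory time."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return "memory-bound" if memory_time > compute_time else "compute-bound"


# Hypothetical accelerator (assumed values, not a real device).
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

n = 4096
ELEM_BYTES = 2  # fp16

# Row-wise softmax over an n x n matrix: read once, write once,
# roughly 5 FLOPs per element (coarse assumption).
softmax_flops = 5 * n * n
softmax_bytes = 2 * n * n * ELEM_BYTES

# n x n matmul: 2 * n^3 FLOPs, read A and B, write C.
matmul_flops = 2 * n ** 3
matmul_bytes = 3 * n * n * ELEM_BYTES

for name, f, b in [("softmax", softmax_flops, softmax_bytes),
                   ("matmul", matmul_flops, matmul_bytes)]:
    print(name, arithmetic_intensity(f, b), bound(f, b, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Under these assumptions both operators are single nodes in a computation graph, yet softmax is limited by memory traffic while the matmul is limited by compute.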

MCIR restructures a computation into the following form.

Computation Graph
->
Memory Flow Graph

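The core representation units are as follows.

Tile : unit of memory block
Stream : sequential processing structure
Rematerialization : recomputing a value instead of storing it
Online Reduction : turning a global reduction into a streaming one
Buffer Lifetime : range over which memory can be reused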
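To make one of these units concrete, the sketch below shows how buffer lifetimes can drive memory reuse: two buffers may share storage only if their lifetimes do not overlap. It is a simplified illustration, not MCIR's actual allocator. `assign_slots` is a hypothetical helper, all buffers are assumed to have the same size, and the buffer names and step numbers are made up.

```python
def assign_slots(lifetimes):
    """Greedily map buffers to reusable memory slots.

    lifetimes: dict of name -> (first_step, last_step), both inclusive.
    A slot is reused only when its previous buffer died before the new one starts.
    """
    slot_busy_until = []  # slot index -> last step at which the slot is occupied
    assignment = {}
    # Visit buffers in order of their first use.
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for slot, busy_until in enumerate(slot_busy_until):
            if busy_until < start:
                slot_busy_until[slot] = end
                assignment[name] = slot
                break
        else:
            # No free slot: open a new one.
            slot_busy_until.append(end)
            assignment[name] = len(slot_busy_until) - 1
    return assignment


# Hypothetical lifetimes for four intermediate buffers.
lifetimes = {"scores": (0, 1), "probs": (1, 2), "out": (2, 3), "tmp": (3, 4)}
slots = assign_slots(lifetimes)
print(slots, "slots used:", max(slots.values()) + 1)
```

Here four buffers fit into two slots, because each buffer is dead by the time a later one needs the storage.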

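For example, Attention can be expressed in MCIR as follows.

Conventional structure

QKᵀ
Softmax
Multiply V

MCIR structure

for K_tile, V_tile:
    partial_score = Q @ K_tileᵀ
    update running max
    rescale running sum and output to the new max
    update running sum
    accumulate output with V_tile
normalize output by running sum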

This form makes structures such as FlashAttention natural to express.
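The tiled loop can be checked against the three-step form with a small pure-Python sketch. It handles a single query vector, omits the usual 1/sqrt(d) scaling (the outline above does not include it either), and uses hypothetical function names; it illustrates the online-softmax recurrence rather than any particular kernel.

```python
import math


def attention_reference(q, K, V):
    """Three-step attention for one query: scores, softmax, weighted sum of V."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def attention_streaming(q, K, V, tile=2):
    """Same result, visiting K/V one tile at a time with running statistics."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), tile):
        K_tile, V_tile = K[start:start + tile], V[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (factor is 0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, V_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    # Final normalization by the softmax denominator.
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
K = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_reference(q, K, V))
print(attention_streaming(q, K, V, tile=2))
```

The streaming version never holds the full score vector; it keeps only a running max, a running sum, and an output accumulator, which is the property that lets the global softmax reduction be expressed as a stream over tiles.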

The core goal of MCIR is an IR representation that minimizes memory movement.

3. GPU Probing Lab - Hardware Behavior Analysis

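GPU Probing Lab is an experimental system for measuring the actual execution characteristics of GPUs.

The CUDA documentation alone does not make it possible to know the following precisely.

cache behavior
memory coalescing penalty
bank conflict
warp scheduler behavior
Tensor Core utilization

To analyze these, a variety of probe kernels are generated and executed.

The experiment pipeline is as follows.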

Probe Kernel Generator
        ↓
Execution
        ↓
Profiling (Nsight Compute)
        ↓
Metric Analysis
        ↓
Hardware Property Map

Representative experiments

Global stride sweep : memory coalescing
Shared memory bank test : bank conflict
Occupancy sweep : scheduler behavior
Latency chain : memory latency
Tensor Core sweep : Tensor Core throughput
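Before running the first two probes, it helps to have an analytical expectation to compare measurements against. The sketch below predicts how many memory sectors a warp touches for a given global stride and how severe a shared-memory bank conflict is for a given word stride. The constants (32-thread warp, 32-byte sectors, 32 banks of 4 bytes) are assumptions typical of recent NVIDIA GPUs, not values taken from this project; confirming or correcting them is exactly what the probes are for. The function names are hypothetical.

```python
WARP_SIZE = 32
SECTOR_BYTES = 32   # assumed memory sector size
NUM_BANKS = 32      # assumed number of shared-memory banks
# Each bank is assumed to be one 4-byte word wide.


def sectors_per_warp(stride_elems, elem_bytes=4):
    """Distinct sectors touched when thread t reads element t * stride_elems."""
    return len({(t * stride_elems * elem_bytes) // SECTOR_BYTES
                for t in range(WARP_SIZE)})


def bank_conflict_degree(stride_words):
    """Largest number of distinct words that threads of one warp request from a single bank."""
    words_per_bank = {}
    for t in range(WARP_SIZE):
        word = t * stride_words
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


# Predicted stride sweep: the measured curve would be compared against this.
for stride in (1, 2, 4, 8, 32, 33):
    print(stride, sectors_per_warp(stride), bank_conflict_degree(stride))
```

Under these assumptions a unit stride touches 4 sectors per warp while a stride of 8 or more touches 32, and a word stride of 32 serializes all 32 threads on one bank while a stride of 33 is conflict-free. Any gap between this model and the profiled metrics is information that belongs in the property map.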

The final goal of this project is to build a GPU Hardware Property Map.

Integration of the Three Systems

The three projects are not independent of one another; together they form a single optimization stack.

Mathematical Semantics
        │
        ▼
Semantic Transformations (AICF)
        │
        ▼
Memory Flow Representation (MCIR)
        │
        ▼
Hardware Execution Model (GPU Probing)

The role of each layer is as follows.

AICF : semantics-based transformation
MCIR : representation of memory movement structure
GPU Probing : model of GPU execution characteristics
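One way the layers could connect, sketched under heavy assumptions: a measured property map feeds a tiling decision for a streamed K/V loop of the kind MCIR describes. `PROPERTY_MAP`, `tile_bytes`, and `pick_tile_rows` are hypothetical names, the shared-memory figure is a placeholder rather than a measurement, and the footprint model counts only one K tile and one V tile.

```python
# Hypothetical hardware property map; the value is a placeholder, not a measurement.
PROPERTY_MAP = {"shared_mem_bytes": 48 * 1024}


def tile_bytes(tile_rows, head_dim, elem_bytes=2):
    """Shared-memory footprint of one K tile plus one V tile (fp16 by default)."""
    return 2 * tile_rows * head_dim * elem_bytes


def pick_tile_rows(head_dim, candidates=(16, 32, 64, 128, 256), props=PROPERTY_MAP):
    """Largest candidate tile whose K and V tiles fit in shared memory."""
    fitting = [t for t in candidates
               if tile_bytes(t, head_dim) <= props["shared_mem_bytes"]]
    if not fitting:
        raise ValueError("no candidate tile fits in shared memory")
    return max(fitting)


print(pick_tile_rows(head_dim=128))
print(pick_tile_rows(head_dim=64))
```

The point of the sketch is the direction of data flow: the transformation that makes tiling legal comes from the semantic layer, the tile is a memory-flow concept, and the number that bounds it comes from hardware measurement.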
