From Attention Graph to FlashAttention - A Memory-Centric Compilation Approach in AICF

1. Introduction

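In recent deep learning models, the performance bottleneck often lies not in FLOPs but in data movement, that is, memory traffic.

Attention, the core operation of transformer models, is a representative example of this problem.

The basic attention computation can be expressed as the following operation graph.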

Scores = Q × Kᵀ
Scores = Mask(Scores)
Scores = Softmax(Scores)
Output = Scores × V

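This formulation is mathematically correct, but executing it as written on real hardware generates a large amount of memory traffic.

In particular, the following intermediate tensors cause the problem.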

Attention score matrix
Softmax intermediate
Probability matrix

These tensors are often stored in HBM, which significantly increases the cost of the overall computation.
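To get a feel for the scale, here is a back-of-the-envelope sketch of the HBM traffic added by the intermediates. The function names, the fp16 element size, and the assumption that the score and probability matrices are each written once and read once are illustrative choices, not figures taken from AICF.

```python
def intermediate_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Estimate HBM traffic from materializing the seq_len x seq_len tensors.

    Assumption: the score matrix and the probability matrix are each
    written once and read back once.
    """
    elems = batch * heads * seq_len * seq_len
    tensors = 2   # score matrix + probability matrix
    accesses = 2  # one write + one read per tensor
    return elems * bytes_per_elem * tensors * accesses


def qkvo_bytes(batch, heads, seq_len, head_dim, bytes_per_elem=2):
    """Traffic for reading Q, K, V once and writing Output once."""
    return 4 * batch * heads * seq_len * head_dim * bytes_per_elem


inter = intermediate_bytes(batch=1, heads=16, seq_len=4096)
io = qkvo_bytes(batch=1, heads=16, seq_len=4096, head_dim=64)
print(f"intermediates: {inter / 2**20:.0f} MiB, Q/K/V/O: {io / 2**20:.0f} MiB")
```

Under these assumptions the intermediates account for far more traffic than the actual inputs and output, and the gap grows with sequence length because the intermediates scale quadratically.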

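FlashAttention is a representative approach proposed to address this.

However, in most systems FlashAttention is not a structure derived by the compiler. It is provided as a separate hand-written kernel.

AICF approaches this problem differently.

Rather than implementing FlashAttention directly, AICF derives a FlashAttention-like execution structure from the Semantic Attention Graph at compile time.

To do this, AICF introduces a Memory-Centric IR.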

2. The Limitation of Operator-Centric Execution

Most existing deep learning frameworks and compilers are designed around an operator graph.

That is, a program is composed of operator units such as the following.

MatMul
Softmax
Mask
Elementwise
Reduction

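This structure has the following advantages.

Preserves semantic correctness
Supports automatic differentiation
Enables graph-level optimization

However, it has a fundamental limitation.

Operator Boundaries Become Execution Boundaries

In most systems, an operator boundary becomes a kernel boundary.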

MatMul kernel
→ write to memory
Softmax kernel
→ write to memory
MatMul kernel

This process causes the following problems.

Intermediate tensor materialization
Global memory round trips
Kernel launch overhead

As a result, data movement ends up costing more than the computation itself.
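The effect of operator boundaries can be illustrated with a toy model. This is only a sketch: `execution_cost` is a hypothetical helper, and it assumes a simple linear chain in which every kernel except the last hands its result to the next one through global memory.

```python
def execution_cost(groups):
    """Count launches and materialized intermediates for a linear op chain.

    `groups` is a list of kernels, each given as a list of operator names.
    """
    launches = len(groups)
    # Each kernel boundary forces one intermediate tensor into global memory.
    intermediates = max(launches - 1, 0)
    return {"kernel_launches": launches,
            "materialized_intermediates": intermediates}


# One kernel per operator, as in operator-centric execution.
unfused = [["MatMul"], ["Mask"], ["Softmax"], ["MatMul"]]
# The same operators executed inside a single kernel.
fused = [["MatMul", "Mask", "Softmax", "MatMul"]]

print(execution_cost(unfused))
print(execution_cost(fused))
```

The operators are identical in both cases. Only the placement of kernel boundaries changes, and with it the number of global memory round trips.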

3. FlashAttention as a Memory-Centric Execution

To solve this problem, FlashAttention does not use the operator graph itself as the execution structure.

Instead, it uses the following memory-centric execution pipeline.

load Q tile
    ↓
stream K/V tiles
    ↓
compute partial scores
    ↓
online softmax update
    ↓
accumulate output

Its key characteristics are as follows.

Tile-based computation
Streaming execution
Online softmax
No intermediate materialization

In other words, it is not a new operation but a new execution structure.
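Online softmax is what makes streaming possible: the softmax normalizer can be computed chunk by chunk while keeping only a running maximum and a running sum. The sketch below shows the idea in plain Python. The function name and the chunked input format are illustrative, and it assumes every chunk is non-empty and contains finite values.

```python
import math


def online_softmax_stats(chunks):
    """Stream over chunks of scores, keeping only a running max and sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for chunk in chunks:
        new_max = max(running_max, max(chunk))
        # Rescale the old sum so it is expressed relative to the new max.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return running_max, running_sum


chunks = [[1.0, 3.0], [2.0, 5.0], [0.5]]
m, s = online_softmax_stats(chunks)

# Reference: the same statistics computed over the full row at once.
flat = [x for chunk in chunks for x in chunk]
s_ref = sum(math.exp(x - max(flat)) for x in flat)
print(m, abs(s - s_ref) < 1e-12)
```

Because the full row of scores never has to exist at once, the score and probability matrices do not need to be materialized.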

4. Why FlashAttention is Not Just a Kernel

In most systems, FlashAttention is provided as follows.

flash_attention(Q, K, V)

That is, it is simply provided as a specialized kernel library.

However, this approach has limitations.

Limited pattern coverage
Hard-coded kernel logic
Limited extensibility

Ultimately, this is closer to kernel replacement than to compiler optimization.

5. AICF Approach: Deriving FlashAttention via Compilation

AICF approaches FlashAttention in the following way.

Attention Semantic Graph
        ↓
Attention Pattern Detection
        ↓
Memory-Centric Execution Region
        ↓
Streaming Execution Schedule
        ↓
FlashAttention-like Kernel

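In other words, the FlashAttention-like kernel is a product of compilation.

The key ideas are as follows.

Detect Attention Pattern

The following pattern is recognized in the Semantic IR.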

MatMul(Q, Kᵀ)
Mask (optional)
Softmax
MatMul(..., V)

The matched pattern is converted into an AttentionRegion.
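A minimal sketch of such a pattern match is shown below. The `Op` record and `find_attention_region` are hypothetical and do not reflect the actual AICF Semantic IR. The sketch assumes the operators are listed in topological order and that the chain is contiguous.

```python
from dataclasses import dataclass


@dataclass
class Op:
    kind: str      # e.g. "MatMul", "Mask", "Softmax"
    inputs: tuple  # names of the values consumed
    output: str    # name of the value produced


def find_attention_region(ops):
    """Return (start, end) indices of MatMul -> [Mask] -> Softmax -> MatMul."""
    for i, first in enumerate(ops):
        if first.kind != "MatMul":
            continue
        prev, j = first.output, i + 1
        # The mask is optional.
        if j < len(ops) and ops[j].kind == "Mask" and prev in ops[j].inputs:
            prev, j = ops[j].output, j + 1
        # The softmax is required.
        if not (j < len(ops) and ops[j].kind == "Softmax"
                and prev in ops[j].inputs):
            continue
        prev, j = ops[j].output, j + 1
        if j < len(ops) and ops[j].kind == "MatMul" and prev in ops[j].inputs:
            return (i, j)
    return None


graph = [
    Op("MatMul", ("Q", "Kt"), "scores"),
    Op("Mask", ("scores",), "masked"),
    Op("Softmax", ("masked",), "probs"),
    Op("MatMul", ("probs", "V"), "out"),
]
print(find_attention_region(graph))
```

A real matcher would also need to verify that the intermediate values have no other consumers, since fusing them away is only legal when nothing else reads them.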

Convert to Streaming Execution Region

The AttentionRegion is then converted into a StreamingRegion like the following.

StreamingRegion {
    tile(Q)
    stream(K,V)
    online_softmax
    accumulate
}

The following information is determined at this stage.

tile size
reduction axis
residency policy
streaming order
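One way to picture the result of this stage is a small record holding those decisions. The sketch below is purely illustrative: the field names and the `num_steps` helper are assumptions and are not part of the AICF specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingRegion:
    q_tile: int          # tile size along the query axis
    kv_tile: int         # tile size along the key/value axis
    reduction_axis: str  # axis over which softmax and output are reduced
    resident: tuple      # values kept on-chip for a whole Q tile
    streamed: tuple      # values loaded tile by tile, in streaming order

    def num_steps(self, seq_q, seq_kv):
        """Return (outer, inner) loop trip counts, rounding up partial tiles."""
        outer = -(-seq_q // self.q_tile)
        inner = -(-seq_kv // self.kv_tile)
        return outer, inner


region = StreamingRegion(
    q_tile=128,
    kv_tile=64,
    reduction_axis="kv",
    resident=("Q_tile", "running_max", "running_sum", "acc"),
    streamed=("K_tile", "V_tile"),
)
print(region.num_steps(seq_q=1024, seq_kv=1000))
```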

Lower to Kernel IR

The StreamingRegion is lowered to an execution kernel like the following.

for q_tile:
    load Q_tile

    for kv_tile:
        load K_tile
        load V_tile

        compute score
        update softmax
        accumulate output

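This kernel structure is very similar to FlashAttention.

There is one important difference, however: it is an execution schedule generated by the compiler rather than a hand-written kernel.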
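To check that the loop nest really computes ordinary attention, here is a small pure-Python sketch of it next to a naive reference. It is not generated AICF code: the tile sizes, the variable names, and the omission of masking and score scaling are simplifying assumptions.

```python
import math


def naive_attention(Q, K, V):
    """Reference: builds the full score row for every query."""
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        top = max(scores)
        probs = [math.exp(s - top) for s in scores]
        total = sum(probs)
        out.append([sum(p * v[d] for p, v in zip(probs, V)) / total
                    for d in range(len(V[0]))])
    return out


def streaming_attention(Q, K, V, q_tile=2, kv_tile=2):
    dim, out = len(V[0]), []
    for qs in range(0, len(Q), q_tile):
        Q_tile = Q[qs:qs + q_tile]                   # load Q_tile
        row_max = [float("-inf")] * len(Q_tile)      # partial reduction state
        row_sum = [0.0] * len(Q_tile)
        acc = [[0.0] * dim for _ in Q_tile]          # unnormalized output
        for ks in range(0, len(K), kv_tile):
            K_tile = K[ks:ks + kv_tile]              # load K_tile
            V_tile = V[ks:ks + kv_tile]              # load V_tile
            for r, q in enumerate(Q_tile):
                # compute score
                s = [sum(a * b for a, b in zip(q, k)) for k in K_tile]
                # update softmax
                new_max = max(row_max[r], max(s))
                scale = math.exp(row_max[r] - new_max)
                p = [math.exp(x - new_max) for x in s]
                row_sum[r] = row_sum[r] * scale + sum(p)
                # accumulate output
                for d in range(dim):
                    acc[r][d] = acc[r][d] * scale + sum(
                        pi * v[d] for pi, v in zip(p, V_tile))
                row_max[r] = new_max
        out.extend([a / row_sum[r] for a in acc[r]]
                   for r in range(len(Q_tile)))
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 2.0, 1.0],
     [0.5, 0.5, 0.5], [1.0, -1.0, 3.0]]
ref, got = naive_attention(Q, K, V), streaming_attention(Q, K, V)
print(all(abs(a - b) < 1e-9 for x, y in zip(ref, got) for a, b in zip(x, y)))
```

The streaming version never holds more than one tile of scores at a time, yet it produces the same output as the reference up to floating-point rounding.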

6. The Role of Memory-Centric IR

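The key element that makes this transformation possible is the Memory-Centric IR.

The existing Semantic IR expresses the following.

Operation semantics
Tensor structure
Data dependencies

However, generating FlashAttention requires the following additional information.

tile lifecycle
on-chip memory residency
streaming execution order
partial reduction state

The Memory-Centric IR is a new layer that expresses this information.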

7. Why This Matters

The most important implications of this approach are as follows.

Execution structure becomes a compiler decision

The execution structure is no longer fixed inside a kernel library.

Memory optimization becomes systematic

Memory-centric optimization is performed at compile time.

Hardware adaptation becomes possible

Different execution schedules can be generated for different GPU architectures.
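As a rough illustration of hardware adaptation, a compiler could pick the tile size from the amount of on-chip memory available. The sketch below is hypothetical: `pick_tile`, the candidate sizes, the footprint formula, and the two memory budgets are assumptions made for the example, not AICF's actual policy.

```python
def pick_tile(head_dim, on_chip_bytes, bytes_per_elem=2,
              candidates=(256, 128, 64, 32, 16)):
    """Return the largest square tile size whose working set fits on-chip.

    Assumed working set: Q, K, V tiles and the output accumulator
    (tile x head_dim each) plus one tile x tile block of scores.
    """
    for tile in candidates:
        needed = (4 * tile * head_dim + tile * tile) * bytes_per_elem
        if needed <= on_chip_bytes:
            return tile
    return None  # nothing fits; a real compiler would need a fallback


# Two hypothetical on-chip memory budgets.
small_budget = pick_tile(head_dim=64, on_chip_bytes=48 * 1024)
large_budget = pick_tile(head_dim=64, on_chip_bytes=160 * 1024)
print(small_budget, large_budget)
```

The same StreamingRegion can therefore lower to different schedules depending on the target, without changing the semantic graph.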

8. Outlook

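The Attention -> FlashAttention transformation is a representative use case for the Memory-Centric IR.

The same approach can also be applied to operations such as the following.

LayerNorm streaming
Softmax streaming
Fused reduction pipelines
Persistent GEMM kernels

To express these transformations systematically, the Memory-Centric IR based compilation structure will continue to be extended.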
