Streaming Algorithms in Deep Learning Operators

Online Reduction 과 Rescaling 성질이 다른 딥러닝 연산에서 어떻게 일반화되는지 분석한다.

1. The Core Principle: Associative Reduction

모든 스트리밍 알고리즘의 기초는 결합법칙이다. 데이터 집합을 나누었을 때, 전역 통계량을 그 조합으로 구할수 있다면 그 연산은 스트리밍 가능하다.

1.1 Representative Mathematical Structure

연산 타입	수학적 구조	스트리밍 상태 (State)
Simple Sum	∑xi	sum
Mean / Variance	1/N ∑xi, 1/N ∑x_i^2	sum,sum_sq,count
Softmax	e^(x_i) / ∑e^(x_j)	max,exp_sum
L2 Norm	root(∑x_i^2)	sum_sq

2. Expansion 1 : Layer Normalization (Online Moments)

LayerNorm 은 각 샘플의 모든 피처에 대해 평균과 분산을 구해야 한다.

2.1 The Traditional Way (Two-pass)

첫 번째 pass 에서 모든 x_i 를 읽어 mu 와 sigma^2 를 계산하고 메모리에 씀
두 번째 pass 에서 x_i 를 다시 읽어 정규화 수행 ( High Memory Traffic )

2.2 The Streaming Way ( Welford's Algorithm)

Welford 알고리즘을 사용하면 데이터를 한 번만 읽으면서 평균과 분산을 업데이트할 수 있다.

State : count, mean, M_2(sum of squareds of differences)
Update : 새로운 x 가 들어오면 기존 mean 과의 차이를 이용해 M_2 의 보정
LayerNorm 패턴을 발견하면 Single-pass 커널로 퓨전 가능

3. Expansion 2 : Mixture-of-Experts (MoE)Gating

MoE 의 Gating 매커니즘은 수천 개의 전문가 중 상위 k 개를 선택하고 Softmax 를 적용한다.

3.1 The Memory Challenge

전문가 점수를 모두 정렬, softmax 를 취하는 것은 메모리 오버헤드 발생

3.2 Streaming Top-K Softmax

데이터를 타일 단위로 읽으며 Local Top-K 를 유지함
각 타일의 최대값을 기준으로 online softmax 의 m 과 l 을 업데이트함
최종적응로 전역 top-k 에 해당하는 전문가에게만 가중치 배분

4. Expansion 3 : LogSumExp (LSE)

LSE 는 어텐션의 역전파나 강화학습의 Value function 계산에서 자주 쓰인다.

4.1 Streaming LSE

oneline softamx 에서 사용하는 m 과 l 상태만 있으면 LSE 는 자동으로 계산된다.

generalization : softmax 를 계산하는 모든 곳에서 LSE 는 추가 메모리 접근 없이 덤으로 얻을 수 있는 상태 정보이다.

5. Summary of Generalization Properties

Flash-style 최적화가 가능한 연산들의 공통 성질을 Streaming Compatibility 로 정의

Monotonicity (단조성) : 데이터가 추가될 때 상태가 단조적으로 변하는가
Linear Composability (선형 합성성) : 부분 결과의 선형 결합으로 전체 결과를 만들 수 있는가.
Recale-ability (재조정 가능성) : 지수 함수처럼 기준점 변화에 따라 기존 결과값을 수학적으로 보정할 수 있는가

연산의 이름에 집중하면 최적화는 파편화된다 하지만 수학적 성질에 집중하면 새로운 사용자 정의 연산에 대해서도 메모리 최적화를 적용할 수 있다.

'Memory-Centric IR for AICF' 카테고리의 다른 글

Re-materializable Intermediate (0)	2026.03.12
Online Reducible Norm - Welford 기반 Streaming Statistics (0)	2026.03.12
Mathematical Properties Behind FlashAttention - Streaming Transformations for Memory-Efficient Computation (0)	2026.03.11
MCIR Minimal Implementation Design (Python) (0)	2026.03.11
Attention MCIR Example - Full IR Walkthough in AICF (0)	2026.03.10

뜻 지, 가르칠 훈

Streaming Algorithms in Deep Learning Operators

1. The Core Principle: Associative Reduction

1.1 Representative Mathematical Structure

2. Expansion 1 : Layer Normalization (Online Moments)

2.1 The Traditional Way (Two-pass)

2.2 The Streaming Way ( Welford's Algorithm)

3. Expansion 2 : Mixture-of-Experts (MoE)Gating

3.1 The Memory Challenge

3.2 Streaming Top-K Softmax

4. Expansion 3 : LogSumExp (LSE)

4.1 Streaming LSE

5. Summary of Generalization Properties

'Memory-Centric IR for AICF' 카테고리의 다른 글

티스토리툴바

Streaming Algorithms in Deep Learning Operators

1. The Core Principle: Associative Reduction

1.1 Representative Mathematical Structure

2. Expansion 1 : Layer Normalization (Online Moments)

2.1 The Traditional Way (Two-pass)

2.2 The Streaming Way ( Welford's Algorithm)

3. Expansion 2 : Mixture-of-Experts (MoE)Gating

3.1 The Memory Challenge

3.2 Streaming Top-K Softmax

4. Expansion 3 : LogSumExp (LSE)

4.1 Streaming LSE

5. Summary of Generalization Properties

'Memory-Centric IR for AICF' 카테고리의 다른 글

'Memory-Centric IR for AICF' Related Articles

티스토리툴바