Operator Semantics, IR Representation, and Kernel Fusion - Case Study : BatchNorm Stats & CrossEntropy

In deep learning frameworks, an operation is usually perceived as a single layer.

BatchNorm
CrossEntropy
Softmax
LayerNorm

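At GPU execution time, however, such an operation often does not map one-to-one onto a single kernel. Instead it is split into stages, and the computation steps inside a stage are frequently implemented together as one fused kernel.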

This article analyzes that gap through the following two operations.

BatchNorm statistics extraction
CrossEntropy forward loss

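The goal is to clarify the following three things.

The mathematical meaning of the operation (Semantic Operator)
The algorithmic decomposition as expressed in the IR
The kernel-level fusion that appears in actual CUDA implementations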

1. Layer Abstraction vs Kernel Reality

Operations at the framework API level

y = BatchNorm(x, gamma, beta)
loss = CrossEntropy(logits, target)

This abstraction expresses the mathematical meaning.

In actual GPU execution, however, the structure looks like this.

BatchNorm
 ├─ Stats kernel
 ├─ Mean/Var finalize
 └─ Normalize kernel

CrossEntropy
 └─ LogSumExp reduction kernel

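In other words, the layer-level abstraction and the kernel-level reality are different.

From a performance standpoint in particular, what matters is not the layer as a whole but specific sub-stage kernels.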

BatchNorm : stats accumulation
CrossEntropy : row reduction

2. IR Representation Levels

When representing an operation in an IR, the key question is at which level to represent it.

In practice, at least three levels are needed.

Semantic IR
    ↓
Algorithm IR
    ↓
Kernel IR

Each level plays a different role.

3. Semantic IR (Operator Meaning)

The Semantic IR expresses the semantic contract of an operation.

BatchNorm(x, gamma, beta)
CrossEntropy(logits, target)

The role of the Semantic IR is to preserve the meaning of the operation.

4. Algorithm IR (Computation Structure)

In the Algorithm IR, an operation is decomposed into primitive computation patterns.

Map
Reduce
Transform
Gather
Scatter

Opportunities for intra-op fusion appear at this level.

4.1 BatchNorm Stats Algorithm

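BatchNorm divides broadly into two stages.

Statistics computation (N is the number of elements reduced per channel)

mean = reduce_sum(x) / N
var = reduce_sum(x^2) / N - mean^2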
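Why two running sums per channel are enough follows from the standard identity for the biased variance. As a short derivation, with N_c denoting the number of elements in channel c (a symbol introduced here only for illustration):

```latex
\begin{aligned}
\mu_c &= \frac{1}{N_c}\sum_{i \in c} x_i, \\
\sigma_c^2 &= \frac{1}{N_c}\sum_{i \in c} (x_i - \mu_c)^2
            = \frac{1}{N_c}\sum_{i \in c} x_i^2 \;-\; \mu_c^2 .
\end{aligned}
```

So a single pass that accumulates the sum and the sum of squares is sufficient, and the division and subtraction can be deferred to a small finalize step. The subtraction can lose precision when the mean is large relative to the spread of the data, which is one reason to accumulate in a wider type than the input.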

Normalization

y = gamma * (x - mean) / sqrt(var + eps) + beta

The current analysis targets the stats extraction stage.

Expressed in the Algorithm IR:

for i, x_i in enumerate(tensor):
    c = channel_index(i)

    sum[c]   += x_i
    sumsq[c] += x_i * x_i

This structure has the following pattern.

Map + Dual Reduction

Characteristics

reduction target : channel
accumulator : sum / sumsq
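The dual reduction can be tried out on the CPU. The following is a minimal sketch, not the author's implementation: it assumes a flat NCHW buffer (so that the channel of element i is (i // HW) % C), and the function names bn_stats and bn_finalize are hypothetical.

```python
def bn_stats(x, C, HW):
    """Accumulate per-channel sum and sum of squares in one pass over a flat buffer."""
    total = [0.0] * C
    total_sq = [0.0] * C
    for i, v in enumerate(x):
        c = (i // HW) % C          # channel index, assuming NCHW layout
        total[c] += v              # reduction 1
        total_sq[c] += v * v       # map (square) + reduction 2
    return total, total_sq


def bn_finalize(total, total_sq, count):
    """Turn the two accumulators into per-channel mean and biased variance."""
    mean = [s / count for s in total]
    var = [sq / count - m * m for sq, m in zip(total_sq, mean)]
    return mean, var


# Example: N=2, C=2, HW=2, values 0..7 laid out as NCHW.
N, C, HW = 2, 2, 2
x = [float(v) for v in range(N * C * HW)]
total, total_sq = bn_stats(x, C, HW)
mean, var = bn_finalize(total, total_sq, N * HW)
print(total, total_sq)  # [10.0, 18.0] [42.0, 98.0]
print(mean, var)        # [2.5, 4.5] [4.25, 4.25]
```

Channel 0 receives the elements 0, 1, 4, 5 and channel 1 receives 2, 3, 6, 7, which is why both channels end up with the same variance.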

4.2 CrossEntropy Algorithm

The CrossEntropy forward pass uses log-sum-exp stabilization.

Algorithm IR

for each sample row:

    m = reduce_max(logits)

    s = reduce_sum(exp(logits - m))

    loss = log(s) + m - logits[target]
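Subtracting the row maximum does not change the result; it only keeps the exponentials in range. As a short derivation, for a row of logits z with target index t and m = max_j z_j (symbols chosen here for illustration):

```latex
\begin{aligned}
\mathrm{loss} &= -\log\frac{e^{z_t}}{\sum_j e^{z_j}}
              = \log\sum_j e^{z_j} - z_t \\
             &= m + \log\sum_j e^{z_j - m} - z_t .
\end{aligned}
```

Every exponent z_j - m is at most 0, so each term of the sum lies in (0, 1] and the sum itself is at least 1. Neither the exponentials nor the logarithm can overflow.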

Structurally:

ReduceMax
MapExp
ReduceSum
ScalarTransform

In other words, it has the form

map + reduction pipeline
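The four stages can be written out directly. This is a minimal CPU sketch under the assumption of one row of logits and one integer target; the function names are hypothetical, and the naive variant is included only to show what the stabilization avoids.

```python
import math


def cross_entropy_row(logits, target):
    """Stable forward loss for one row, one line per Algorithm IR stage."""
    m = max(logits)                              # ReduceMax
    exps = [math.exp(z - m) for z in logits]     # MapExp
    s = sum(exps)                                # ReduceSum
    return math.log(s) + m - logits[target]      # ScalarTransform


def cross_entropy_row_naive(logits, target):
    """Same loss without the max subtraction; overflows for large logits."""
    s = sum(math.exp(z) for z in logits)
    return math.log(s) - logits[target]


small = cross_entropy_row([1.0, 2.0, 3.0], 2)
large = cross_entropy_row([1000.0, 1001.0, 1002.0], 2)
print(small, large)  # both are about 0.4076

try:
    cross_entropy_row_naive([1000.0, 1001.0, 1002.0], 2)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print(naive_overflowed)  # True
```

Shifting every logit by the same constant leaves the loss unchanged, so the two stable calls agree, while the naive version fails on the shifted row because exp(1000) is out of range.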

5. Kernel Realization

The Kernel IR is where the actual CUDA implementation appears. Elements such as the following come into play.

memory layout
warp / block reduction
atomic operations
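As an illustration of the block reduction element, here is a sequential emulation of the common tree-reduction pattern. It is a sketch of the general technique only: the block_reduce name is hypothetical, the list stands in for shared memory, and a power-of-two block size is assumed.

```python
def block_reduce(values, op):
    """Tree reduction over one 'block', halving the active range each step."""
    buf = list(values)                      # stands in for shared memory
    n = len(buf)
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes a power-of-two block size"
    stride = n // 2
    while stride > 0:
        for tid in range(stride):           # only threads with tid < stride are active
            buf[tid] = op(buf[tid], buf[tid + stride])
        stride //= 2                        # a block-wide barrier would sit here on a GPU
    return buf[0]


data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(block_reduce(data, max))                  # 9.0
print(block_reduce(data, lambda a, b: a + b))   # 31.0
```

The same skeleton serves both a max reduction and a sum reduction, which is why a kernel can run the two back to back on the same row.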

5.1 BatchNorm Stats Kernel

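const int c = (i / HW) % C;       // channel of flat element i (NCHW indexing)

float v = __half2float(x[i]);     // load + cast half -> float

atomicAdd(&sum[c], v);            // reduction 1: per-channel sum
atomicAdd(&sumsq[c], v * v);      // square + reduction 2: per-channel sum of squares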

This kernel fuses the following operations.

load
type cast
square
dual reduction

In other words:

cast + transform + reduction fusion
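One consequence of fusing the cast ahead of the square is that the square is computed in the wider type. The sketch below emulates one thread of the stats kernel on the CPU. It is an illustration under assumptions: half-precision storage is emulated with the struct module's "e" format, Python floats stand in for the float accumulators, and the helper names are hypothetical.

```python
import struct


def to_half(v):
    """Round a Python float to IEEE half precision and back (emulates __half storage)."""
    return struct.unpack("<e", struct.pack("<e", v))[0]


def half_square_overflows(v):
    """Return True if v * v is too large to be stored as a half-precision value."""
    try:
        struct.pack("<e", v * v)
        return False
    except OverflowError:
        return True


def stats_thread(i, x_half, sum_, sumsq, C, HW):
    """Emulate one thread: load, cast, square, and two accumulations."""
    c = (i // HW) % C
    v = float(x_half[i])       # stands in for __half2float
    sum_[c] += v               # stands in for atomicAdd(&sum[c], v)
    sumsq[c] += v * v          # squared after the cast, in the wider type


C, HW = 1, 2
x_half = [to_half(300.0), to_half(2.0)]
sum_, sumsq = [0.0] * C, [0.0] * C
for i in range(len(x_half)):
    stats_thread(i, x_half, sum_, sumsq, C, HW)
print(sum_, sumsq)                   # [302.0] [90004.0]
print(half_square_overflows(300.0))  # True
```

The input 300 is exactly representable in half precision, but its square, 90000, exceeds the largest finite half value (65504). Squaring after the cast avoids that overflow.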

5.2 CrossEntropy Kernel

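m = block_max(m);                          // max reduction across the block
s = block_sum(expf(row[c] - m));           // exp transform + sum reduction

atomicAdd(out_loss, logf(s) + m - row[t]); // per-row loss added to the global total
atomicAdd(out_valid, 1);                   // count of rows that contributed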

The following operations are fused.

max reduction
exp transform
sum reduction
loss computation
global accumulation

In other words:

reduction pipeline fusion
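The global accumulation step can be emulated as follows. This is a sketch under stated assumptions rather than the author's kernel: the excerpt only shows the two atomicAdd calls, so the rule for skipping rows (a hypothetical ignore_index of -1) and the final division by out_valid are guesses at how the two outputs are meant to be used.

```python
import math


def ce_forward(logits_rows, targets, ignore_index=-1):
    """Accumulate the summed loss and the number of contributing rows."""
    out_loss = 0.0     # stands in for the global float accumulator
    out_valid = 0      # stands in for the global counter
    for row, t in zip(logits_rows, targets):    # one "block" per row
        if t == ignore_index:
            continue                            # assumed skip rule, not shown in the source
        m = max(row)
        s = sum(math.exp(z - m) for z in row)
        out_loss += math.log(s) + m - row[t]    # atomicAdd(out_loss, ...)
        out_valid += 1                          # atomicAdd(out_valid, 1)
    return out_loss, out_valid


rows = [[0.0, 0.0], [0.0, 0.0], [5.0, 1.0]]
loss_sum, valid = ce_forward(rows, [0, 1, -1])
mean_loss = loss_sum / valid                    # assumed final reduction to a mean
print(valid, mean_loss)
```

With two uniform rows contributing and the third row skipped, the mean loss equals log(2). On a GPU the order of the atomic additions is not fixed, so the floating-point sum may differ in the last bits from run to run.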

6. Types of Kernel Fusion

Fusion in GPU kernels falls broadly into three types.

6.1 Inter-Operator Fusion

Combines different operations into one kernel.

GEMM + Bias + GELU
Conv + BatchNorm + ReLU

This is mostly framework-level fusion.

6.2 Intra-Operator Fusion

Combines the internal stages of a single operation.

Examples

CrossEntropy

max
exp
sum
loss

BatchNorm stats

cast
square
reduction

The two kernels analyzed in this article belong to this type.

6.3 Epilogue Fusion

Attaches follow-up operations to the end of the main computation.

GEMM
  → Bias
  → Activation
  → Loss
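The difference between running the follow-up operations as separate passes and attaching them as an epilogue can be sketched on the CPU. This is an illustration only, with hypothetical function names; it covers the Bias and Activation steps and uses GELU as the activation.

```python
import math


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def gemm_unfused(A, B, bias):
    """Three passes: GEMM, bias, activation. Each pass materializes a full matrix."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)] for i in range(M)]
    C = [[C[i][j] + bias[j] for j in range(N)] for i in range(M)]
    return [[gelu(C[i][j]) for j in range(N)] for i in range(M)]


def gemm_fused_epilogue(A, B, bias):
    """One pass: bias and activation are applied before the accumulator is stored."""
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]        # main computation
            out[i][j] = gelu(acc + bias[j])     # epilogue: bias + activation
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [0.25, 2.0]]
bias = [0.1, -0.2]
unfused = gemm_unfused(A, B, bias)
fused = gemm_fused_epilogue(A, B, bias)
```

Both versions compute the same values. The fused form simply never writes the intermediate GEMM and bias results back to memory, which is the traffic that epilogue fusion removes on a GPU.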

7. Implications for IR Design

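This analysis matters because the following problem arises in IR design.

If the IR represents only operators, it cannot describe the actual kernel structure.

The IR must therefore support the following.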

Semantic Operator
↓
Algorithm Decomposition
↓
Kernel Realization

In particular, the patterns

Map
Reduce
Transform

appear in the Algorithm IR, and it is at this level that opportunities for intra-op fusion become visible.
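One way to picture the three levels is as plain data structures with a lowering step between them. The following is a hypothetical sketch: the class names, the DECOMPOSITION table, and the lower and fuse functions are all invented for illustration and do not describe an existing IR.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticOp:
    name: str          # e.g. "CrossEntropy.forward"
    inputs: tuple


@dataclass
class AlgorithmStage:
    pattern: str       # "Map", "Reduce", or "Transform"
    detail: str


@dataclass
class KernelPlan:
    name: str
    stages: list = field(default_factory=list)


# Hypothetical decomposition table: semantic op -> Algorithm IR stages.
DECOMPOSITION = {
    "CrossEntropy.forward": [
        AlgorithmStage("Reduce", "max over row"),
        AlgorithmStage("Map", "exp(logit - max)"),
        AlgorithmStage("Reduce", "sum over row"),
        AlgorithmStage("Transform", "log(s) + m - logits[target]"),
    ],
    "BatchNorm.stats": [
        AlgorithmStage("Map", "cast and square"),
        AlgorithmStage("Reduce", "sum per channel"),
        AlgorithmStage("Reduce", "sumsq per channel"),
    ],
}


def lower(op):
    """Semantic IR -> Algorithm IR."""
    return DECOMPOSITION[op.name]


def fuse(op, stages):
    """Algorithm IR -> Kernel IR: place all stages of one op in a single kernel."""
    return KernelPlan(name=op.name + ".fused", stages=list(stages))


op = SemanticOp("CrossEntropy.forward", ("logits", "target"))
plan = fuse(op, lower(op))
print(plan.name, [s.pattern for s in plan.stages])
```

Because the stages are explicit at the middle level, a fusion decision becomes a choice of how to group them into kernels instead of something hidden inside an opaque operator.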

8. Key Insight

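BatchNorm and CrossEntropy are usually perceived as single layers, but the actual GPU kernels have the following characteristics.

The internal stages of an operation are already implemented as fused kernels.
Reduction-centric operations exhibit algorithm pipeline fusion.
A kernel often handles only a sub-stage of an operation.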

BatchNorm stats kernel
CrossEntropy reduction kernel
