Tile-Compatible Compute

Method Overview

GPU kernel performance is determined largely by data reuse.

What matters in particular is how many operations can be performed while the data stays resident in shared memory, L1, or registers.

Tile-compatible compute addresses the following question.

Can this operation be given an execution structure that is closed at tile granularity?

That is,

are its dependencies resolved inside the tile?

Tile compatibility requires the following properties.

partial state accumulation is possible
dependency closure

Example

GEMM

C_ij += A_ik * B_kj

This operation allows partial accumulation along the k dimension.

That is, it can be computed as a sequence of

tile(A) x tile(B)

products, each accumulated into the same tile of C.
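
Written at block level, the partial accumulation along k can be sketched as follows. The tile indices I, J, K_t and the step counter t are notation introduced here for illustration; they do not appear in the original text.

```latex
C_{IJ} = \sum_{t=1}^{T} A_{I K_t} B_{K_t J},
\qquad
C_{IJ}^{(t)} = C_{IJ}^{(t-1)} + A_{I K_t} B_{K_t J},
\quad
C_{IJ}^{(0)} = 0
```

Each step t needs only one tile of A, one tile of B, and the running partial state C_IJ^(t-1), which is why the dependency stays inside the tile.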

A tile-compatible structure has the following characteristics.

shared memory reuse
register accumulation
reduced global traffic
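
To make "reduced global traffic" concrete, here is a rough count of global-memory element loads for a square n x n GEMM. This is an idealized model that ignores caches and the writes to C, and it assumes the tile size t divides n. The function names are hypothetical.

```python
def global_loads_naive(n):
    # Each of the n*n outputs reads n elements of A and n elements of B.
    return 2 * n * n * n


def global_loads_tiled(n, t):
    # Assumption: t divides n evenly.
    # Each of the (n/t)^2 output tiles takes n/t steps along k, and each
    # step loads one t x t tile of A and one t x t tile of B.
    assert n % t == 0
    tiles = n // t
    return tiles * tiles * tiles * 2 * t * t


n, t = 1024, 32
reduction = global_loads_naive(n) // global_loads_tiled(n, t)
print(reduction)  # equals t under this model
```

Under these assumptions the load count drops by a factor of t, because every element brought into shared memory is reused t times.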

Representative case

Property

tile_compatible_compute

Legality

working-set-fit
dependency closure
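
The working-set-fit condition can be sketched as a simple capacity check. This is a minimal sketch, assuming the A and B tiles live in shared memory while the C accumulators stay in registers. The 48 KiB budget and 4-byte elements are illustrative defaults, not values from the original text, and the function names are hypothetical.

```python
def tile_working_set_bytes(tile_m, tile_n, tile_k, elem_bytes=4):
    # Shared-memory footprint: one (tile_m x tile_k) tile of A plus
    # one (tile_k x tile_n) tile of B. C is assumed to be in registers.
    return (tile_m * tile_k + tile_k * tile_n) * elem_bytes


def fits_working_set(tile_m, tile_n, tile_k,
                     smem_bytes=48 * 1024, elem_bytes=4):
    # smem_bytes is an assumed per-block budget; check the real limit
    # for the target device before relying on it.
    needed = tile_working_set_bytes(tile_m, tile_n, tile_k, elem_bytes)
    return needed <= smem_bytes


print(fits_working_set(128, 128, 32))  # 32 KiB needed
print(fits_working_set(128, 128, 64))  # 64 KiB needed
```

A tiling is legal in this sense only if both this check and the dependency-closure condition hold.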

Rewrite

naive loops
-> tiled schedule
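
The rewrite from naive loops to a tiled schedule can be sketched in plain Python for GEMM. This is an illustrative CPU sketch of the loop structure only, not a GPU kernel. The function names and the `tile` parameter are hypothetical.

```python
def matmul_naive(A, B):
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


def matmul_tiled(A, B, tile=2):
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    # Outer loops walk over tiles; inner loops stay inside one tile.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # One tile(A) x tile(B) product, accumulated into tile(C).
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = C[i][j]  # partial state carried across k tiles
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

Both versions add the products for each C[i][j] in the same order of p, so the results match exactly. On a GPU, the inner three loops correspond to the work done while the A and B tiles are resident in shared memory.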

Kernel mapping

shared-memory resident kernel
