End-to-End Train Step Verification - A 1-Step Training Pipeline Built on Custom CUDA Ops

Notes on verifying that one training step runs correctly using only hand-written CUDA kernels.

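Verify the op_call-based execution path that runs from Python through C++ to CUDA
Confirm that the Registry / Variant / Dispatch structure works correctly in a real training scenario
Demonstrate that the Forward / Backward / Update stages can be composed from fully separated custom ops
Confirm that the pipeline is CUDA Graph capture-safe and can be profiled with Nsight Compute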

Overall Train Step Structure

Formulas ( Linear + Bias + MSE + SGD )

Forward:
  Z = X · W
  Y = Z + b
  L = mean((Y - T)^2)

Backward:
  dY = dL/dY = 2 * (Y - T) / numel(Y)
  dZ = dY
  dW = Xᵀ · dZ
  db = reduce_sum(dZ, axis=0)

Update:
  W -= lr * dW
  b -= lr * db
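
The formulas above can be cross-checked on the CPU before trusting the CUDA path. The following is a minimal pure-Python sketch, not the framework's code: the helper names (matmul, forward, backward, sgd_step, run), the tensor sizes, and the learning rate are all assumptions made for illustration, and it recomputes the gradients on every step.

```python
import random

def matmul(A, B):
    """Return A @ B for row-major nested lists (stands in for GEMM)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward(X, W, b, T):
    Z = matmul(X, W)                                        # Z = X . W
    Y = [[z + bj for z, bj in zip(row, b)] for row in Z]    # Y = Z + b
    n = len(Y) * len(Y[0])
    loss = sum((y - t) ** 2
               for ry, rt in zip(Y, T) for y, t in zip(ry, rt)) / n
    return Y, loss

def backward(X, Y, T):
    n = len(Y) * len(Y[0])
    # Gradient of mean((Y - T)^2) with respect to Y
    dY = [[2.0 * (y - t) / n for y, t in zip(ry, rt)] for ry, rt in zip(Y, T)]
    dZ = dY                                                 # BiasAdd passes the gradient through
    dW = matmul(transpose(X), dZ)                           # dW = X^T . dZ
    db = [sum(col) for col in zip(*dZ)]                     # reduce_sum over axis 0
    return dW, db

def sgd_step(W, b, dW, db, lr):
    """In-place SGD update: W -= lr * dW, b -= lr * db."""
    for row_w, row_g in zip(W, dW):
        for j, g in enumerate(row_g):
            row_w[j] -= lr * g
    for j, g in enumerate(db):
        b[j] -= lr * g

def run(steps=10, lr=0.1, seed=0):
    rng = random.Random(seed)
    M, K, N = 8, 4, 3                                       # assumed toy sizes
    X = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(M)]
    T = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(M)]
    W = [[rng.gauss(0, 0.1) for _ in range(N)] for _ in range(K)]
    b = [0.0] * N
    _, before = forward(X, W, b, T)
    for _ in range(steps):
        Y, _ = forward(X, W, b, T)
        dW, db = backward(X, Y, T)
        sgd_step(W, b, dW, db, lr)
    _, after = forward(X, W, b, T)
    return before, after
```

Calling run() returns the loss before and after the steps. A reference like this is also convenient for comparing dW and db element-wise against the output of the CUDA kernels.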

Custom CUDA Operations Used

GEMM, BiasAdd, MseGrad, GEMM( TN / NT ), ReduceSum, SGDStep

Dispatch / Variant Structure Verification

OpKind-Based Registry

enum class OpKind {
  EltwiseAdd,
  EltwiseRelu,
  Gemm,
  BiasAdd,
  ReduceSum,
  MseGrad,
  ReluBwd,
  SgdStep
};

Multiple KernelVariants are registered for each OpKind, and one is selected automatically based on the TensorDesc ( dtype / shape / contiguous / stride ) and the attrs.

SGDStep Variant Selection

f32 param / grad → sgd_step_f32_kernel

f16 + odd numel → sgd_step_f16_kernel

f16 + even numel + alignment → sgd_step_f16_half2_kernel

op_call invokes only SgdStep; the best kernel is selected automatically according to the dtype and the conditions above.
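
The selection rules above can be written down as a small decision function. This is a hedged sketch only: select_sgd_variant and its parameters are hypothetical names, and sending an even-sized but unaligned f16 tensor to the scalar f16 kernel is an assumption, since the post does not state that case.

```python
def select_sgd_variant(dtype: str, numel: int, aligned: bool = True) -> str:
    """Return the kernel name the SGDStep rules would pick (illustrative only)."""
    if dtype == "f32":
        return "sgd_step_f32_kernel"
    if dtype == "f16":
        # half2 processes two f16 values per thread, so it needs an even
        # element count and a suitably aligned base pointer.
        if numel % 2 == 0 and aligned:
            return "sgd_step_f16_half2_kernel"
        # Assumption: anything else in f16 falls back to the scalar kernel.
        return "sgd_step_f16_kernel"
    raise ValueError(f"no SgdStep variant for dtype {dtype!r}")
```

For example, select_sgd_variant("f16", 1024) picks the half2 kernel, while select_sgd_variant("f16", 1023) falls back to the scalar f16 kernel.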

Python - CUDA Execution Flow

Python (op_call)
  ↓
Build TensorDesc (dtype / shape / stride)
  ↓
dispatch_v0(OpKind, inputs, outputs, attrs)
  ↓
Select a variant from the Registry whose supported() == true
  ↓
launch() → run the CUDA kernel

Python passes only torch.Tensor objects
No Autograd, Optimizer, or Torch ops are used
The CUDA stream is PyTorch's current stream, used as-is
CUDA Graph capture / replay is possible
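
The flow above can be modeled in a few lines of Python to make the supported() / launch() contract concrete. This is a sketch under stated assumptions, not the framework's implementation: only the names TensorDesc, KernelVariant, and dispatch_v0 come from the post, while the fields, the REGISTRY dictionary, the register helper, the "aligned" attr, and the first-match policy are hypothetical. Here launch() just returns the kernel name instead of launching anything on the GPU.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class TensorDesc:
    dtype: str
    shape: Tuple[int, ...]
    contiguous: bool = True

    @property
    def numel(self) -> int:
        n = 1
        for d in self.shape:
            n *= d
        return n

@dataclass
class KernelVariant:
    name: str
    supported: Callable[[List[TensorDesc], List[TensorDesc], dict], bool]

    def launch(self, inputs, outputs, attrs) -> str:
        # A real variant would launch a CUDA kernel on the current stream.
        return self.name

REGISTRY: Dict[str, List[KernelVariant]] = {}

def register(op_kind: str, variant: KernelVariant) -> None:
    REGISTRY.setdefault(op_kind, []).append(variant)

def dispatch_v0(op_kind, inputs, outputs, attrs):
    """Pick the first registered variant whose supported() is true (assumed policy)."""
    for variant in REGISTRY.get(op_kind, []):
        if variant.supported(inputs, outputs, attrs):
            return variant.launch(inputs, outputs, attrs)
    raise LookupError(f"no supported variant for {op_kind}")

# More specialized variants are registered first so that they win the first match.
register("SgdStep", KernelVariant(
    "sgd_step_f16_half2_kernel",
    lambda i, o, a: i[0].dtype == "f16" and i[0].numel % 2 == 0
                    and a.get("aligned", False)))
register("SgdStep", KernelVariant(
    "sgd_step_f16_kernel", lambda i, o, a: i[0].dtype == "f16"))
register("SgdStep", KernelVariant(
    "sgd_step_f32_kernel", lambda i, o, a: i[0].dtype == "f32"))
```

With this model, extending an OpKind means calling register with one more variant and predicate; the caller of dispatch_v0 does not change.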

End-to-End Execution Verification

Test Scenario

Random input X and target T
Repeat several steps using the same dW and db
Check whether the loss decreases
Profile each stage kernel by kernel with Nsight Compute

Example Output

[loss] before = 510.399719
[loss] after  = 505.640900
DONE train steps: 10
LOSS delta: -4.758819
ALL OK
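
As a small sanity check on the log above, the reported delta should equal after minus before. The sketch below parses the log text and checks that; the function name check_log and the tolerance are assumptions for illustration and are not part of the test script.

```python
import re

LOG = """[loss] before = 510.399719
[loss] after  = 505.640900
DONE train steps: 10
LOSS delta: -4.758819
ALL OK"""

def check_log(text: str, tol: float = 1e-5) -> bool:
    """Return True if the loss went down and the printed delta matches after - before."""
    before = float(re.search(r"\[loss\] before\s*=\s*([-\d.]+)", text).group(1))
    after = float(re.search(r"\[loss\] after\s*=\s*([-\d.]+)", text).group(1))
    delta = float(re.search(r"LOSS delta:\s*([-\d.]+)", text).group(1))
    return after < before and abs((after - before) - delta) < tol
```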

The full Forward / Backward / Update pipeline works correctly.

CUDA Kernel Profiling Check

Each kernel was profiled individually with Nsight Compute
- gemm_f32_naive_kernel
- gemm_f16_tc_wmma_kernel
- reduce_sum_lastdim_f32_kernel
- sgd_step_f16_half2_kernel

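Confirmed that the selected variant is the kernel that actually runs.

Key Design Points

Op separation principle
- Forward / Backward / Update are split into fully independent ops
- GEMM, BiasAdd, and ReduceSum share no state with one another
Variant-based extensibility
- A new kernel can be added safely to an existing OpKind just by adding dtype / layout / alignment conditions
CUDA Graph-friendly design
- No dynamic allocation
- Only kernel launches are involved
- Capture / Replay works correctly
Minimal proof of trainability
- Shows that this framework is not just an inference engine but can be extended into a training-graph execution engine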

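Current Status Summary

One train step completed using only custom CUDA ops
dtype / variant automatic selection structure verified
Nsight Compute profiling succeeded
CUDA Graph capture-safe
Actual loss decrease confirmed

Next Steps

More optimizers
Fused ops
Workspace-based dtype transform
Automatic backward graph generation at the IR level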
