AICF Execution Architecture Overview - Python-driven Graph & Scheduling + CUDA Primitive Execution

1. Design Philosophy

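This framework is designed around the following principles.

Model semantics, the execution schedule, and meta control live in Python.
CUDA is responsible only for executing simple, stable primitive operations.
Kernel implementations live in the backend, but Python decides what gets executed.
CUDA Graph is used so that the execution sequence can be captured and replayed.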

2. High-Level Layering

[ Python ]
  ├─ Model / Train Step Definition
  ├─ IRGraph (semantic graph)
  ├─ Lowering (IR → primitive ops)
  ├─ BindingPlan (tensor ↔ vid mapping)
  ├─ Executor (schedule + op_call)
  └─ CUDA Graph capture / replay control

[ CUDA Backend ]
  ├─ op_call ABI (OpKind + TensorDesc + AttrBlob + stream)
  ├─ KernelRegistry / dispatch
  ├─ KernelVariant selection
  └─ CUDA kernel launch

3. Python Side Responsibilities

3.1 Model & Training Step Definition

The forward, backward, and optimizer steps are composed explicitly in Python.

linear → relu → mse_grad → linear_bwd → relu_bwd → adam_step

At this stage,

the model structure
the training steps
the optimizer logic

are all expressed as an explicit operation graph.

3.2 IRGraph (Semantic Graph)

Each operation is a node.
Each tensor is managed as an SSA-style value (vid).
shape / dtype / device are fixed in the IR.

The IR focuses only on **what is computed**;
how it will be executed is not decided yet.
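
To make the SSA-style bookkeeping above concrete, here is a minimal sketch of what such a graph could look like. The class names `Value`, `Node`, and `IRGraph` and their methods are hypothetical, chosen for illustration; the actual AICF classes may be organized differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Value:
    """One SSA value: defined once, with shape/dtype/device fixed in the IR."""
    vid: int
    shape: tuple
    dtype: str
    device: str


@dataclass
class Node:
    """One semantic operation; it says what is computed, not how."""
    op: str
    inputs: list
    outputs: list
    attrs: dict = field(default_factory=dict)


class IRGraph:
    """Hypothetical container: a vid table plus an ordered node list."""

    def __init__(self):
        self.values = {}
        self.nodes = []

    def new_value(self, shape, dtype="float32", device="cuda:0"):
        # vids are handed out sequentially and never reused (SSA-style).
        vid = len(self.values)
        self.values[vid] = Value(vid, tuple(shape), dtype, device)
        return vid

    def add_node(self, op, inputs, out_shape, **attrs):
        out = self.new_value(out_shape)
        self.nodes.append(Node(op, list(inputs), [out], attrs))
        return out


g = IRGraph()
x = g.new_value((8, 16))
w = g.new_value((32, 16))
b = g.new_value((32,))
h = g.add_node("Linear", [x, w, b], (8, 32))
y = g.add_node("ReLU", [h], (8, 32))
```

Nothing in this sketch touches a kernel or a stream, which matches the idea that the IR only records semantics.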

3.3 Lowering: IR → Backend Ops

IR nodes are converted into a sequence of primitive ops that the backend understands.

Examples:

Linear → gemm + bias_add
AdamStep → adam_step
Save → copy_saved

Result:

lowered = [
  {op: "gemm", inputs: [...], outputs: [...], attrs: {...}},
  {op: "bias_add", ...},
  ...
]

At this stage the order of operations is completely fixed.
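
As a rough illustration of the lowering rules listed above, the sketch below rewrites a small list of IR nodes into primitive ops. The function `lower`, the dict layout, the `new_vid` callback, and the `transB=True` choice for the weight are assumptions made for this example, not the framework's actual code.

```python
import itertools


def lower(ir_nodes, new_vid):
    """Turn semantic IR nodes into a flat, ordered list of primitive ops."""
    lowered = []
    for node in ir_nodes:
        if node["op"] == "Linear":
            x, w, b = node["inputs"]
            (y,) = node["outputs"]
            tmp = new_vid()  # intermediate value between gemm and bias_add
            lowered.append({
                "op": "gemm", "inputs": [x, w], "outputs": [tmp],
                # Assumption: weights are stored as [out, in], so B is transposed.
                "attrs": {"transA": False, "transB": True},
            })
            lowered.append({
                "op": "bias_add", "inputs": [tmp, b], "outputs": [y],
                "attrs": {},
            })
        elif node["op"] == "AdamStep":
            lowered.append({**node, "op": "adam_step"})
        elif node["op"] == "Save":
            lowered.append({**node, "op": "copy_saved"})
        else:
            raise NotImplementedError(f"no lowering rule for {node['op']}")
    return lowered


counter = itertools.count(100)  # hypothetical vid range for temporaries
ir = [
    {"op": "Linear", "inputs": [0, 1, 2], "outputs": [3], "attrs": {}},
    {"op": "Save", "inputs": [3], "outputs": [4], "attrs": {}},
]
lowered = lower(ir, lambda: next(counter))
```

Because the output is a plain list, the position of each entry is the schedule; nothing downstream reorders it.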

3.4 BindingPlan (Tensor ↔ vid Mapping)

BindingPlan guarantees the following at execution time.

Which role each vid has:
- input
- parameter
- static workspace
env[vid] always points to the storage of a tensor provided by the host.

As a result,

optimizer state
meta tensors (e.g. bc1_inv, bc2_inv)
parameters

can be managed directly from Python.
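
The aliasing guarantee is the important part here, so the following sketch focuses on it. It uses plain Python lists as stand-ins for tensors so that it runs without a GPU; the `BindingPlan` class shown, its `declare` and `bind` methods, and the role strings are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field

ROLES = ("input", "param", "static")


@dataclass
class BindingPlan:
    """Hypothetical plan: records the role of each vid, then builds env."""
    roles: dict = field(default_factory=dict)  # vid -> role

    def declare(self, vid, role):
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles[vid] = role

    def bind(self, provided):
        missing = set(self.roles) - set(provided)
        if missing:
            raise KeyError(f"unbound vids: {sorted(missing)}")
        # Keep the very same objects (no copies), so env[vid] aliases
        # the storage that the host handed in.
        return {vid: provided[vid] for vid in self.roles}


plan = BindingPlan()
plan.declare(0, "input")
plan.declare(1, "param")
plan.declare(2, "static")

bc1_inv = [0.0]  # stand-in for a one-element meta tensor
env = plan.bind({0: [1.0, 2.0], 1: [0.5], 2: bc1_inv})

bc1_inv[0] = 10.0  # host-side in-place update, visible through env[2]
```

With real tensors the same idea would hold as long as updates are written in place rather than by rebinding the Python name to a new tensor.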

3.5 Executor: Schedule + ABI Invocation

The role of PlannedExecutor is simple.

Iterate over the lowered ops in order
Look up vid → torch.Tensor
Pack attrs (dict) into schema_id + bytes
Call _C.op_call(...)

_C.op_call(
    kind,
    inputs, outputs,
    schema_id,
    attrs_bytes,
    stream
)

The Executor:

does not implement operations
does not compute any math

It is responsible only for execution order and ABI calls.
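
The four steps of the executor loop can be sketched as follows. Since `_C.op_call` is only available inside the real extension module, this example takes the call as a parameter and passes a recording stub instead; `run`, `pack_attrs`, and the schema id values 0 and 1 are assumptions made for the illustration.

```python
import struct


def pack_attrs(op, attrs):
    """Pack a dict into (schema_id, bytes). Schema ids here are hypothetical."""
    if op == "gemm":
        blob = struct.pack("<??", attrs.get("transA", False),
                           attrs.get("transB", False))
        return 1, blob
    return 0, b""  # ops without attributes get an empty blob


def run(lowered, env, op_call, stream=0):
    """Walk the lowered ops in order and forward each one through the ABI."""
    for op in lowered:
        inputs = [env[v] for v in op["inputs"]]    # vid -> tensor lookup
        outputs = [env[v] for v in op["outputs"]]
        schema_id, attrs_bytes = pack_attrs(op["op"], op.get("attrs", {}))
        op_call(op["op"], inputs, outputs, schema_id, attrs_bytes, stream)


calls = []


def fake_op_call(kind, inputs, outputs, schema_id, attrs_bytes, stream):
    # Stand-in for _C.op_call: records the call instead of launching a kernel.
    calls.append((kind, schema_id, attrs_bytes, stream))


env = {0: "x", 1: "w", 2: "b", 3: "tmp", 4: "y"}  # strings stand in for tensors
lowered = [
    {"op": "gemm", "inputs": [0, 1], "outputs": [3], "attrs": {"transB": True}},
    {"op": "bias_add", "inputs": [3, 2], "outputs": [4], "attrs": {}},
]
run(lowered, env, fake_op_call)
```

Note that `run` contains no arithmetic at all; swapping the stub for the real binding would be the only change needed to execute on a device.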

4. op_call ABI (Core Interface)

4.1 ABI Shape

op_call(
  OpKind kind,
  TensorDesc[] inputs,
  TensorDesc[] outputs,
  uint32 schema_id,
  bytes attrs_bytes,
  cudaStream_t stream
)

TensorDesc: pointer + shape + stride + dtype
AttrBlob: {schema_id, bytes, data}
stream:
- 0 → current CUDA stream
- non-zero → explicit stream (for CUDA Graph)

❗ Python dict attrs no longer cross this boundary; they are packed into schema_id + bytes first
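
For readers who want a picture of what a TensorDesc carries, here is a small host-side sketch. It uses a `ctypes` array as a stand-in for device memory; the field names, the element-based stride convention, and the `contiguous_strides` helper are assumptions for illustration and may not match the backend's actual struct.

```python
import ctypes
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorDesc:
    """Hypothetical mirror of the ABI descriptor: pointer + shape + stride + dtype."""
    data_ptr: int   # raw address of the first element
    shape: tuple
    stride: tuple   # assumed to be in elements, not bytes
    dtype: str


def contiguous_strides(shape):
    """Row-major strides: the last dimension is densest."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


buf = (ctypes.c_float * 6)()  # host buffer standing in for a 2x3 device tensor
desc = TensorDesc(
    data_ptr=ctypes.addressof(buf),
    shape=(2, 3),
    stride=contiguous_strides((2, 3)),
    dtype="float32",
)
```

A descriptor like this is just plain data, which is what lets it cross the Python/CUDA boundary without carrying any Python object along.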

4.2 AttrBlob Design Rationale

ABI stability
A simpler Python/CUDA boundary
Stable CUDA Graph capture

Examples:

Gemm: attrs_bytes = pack(transA, transB)
AdamStep: schema='ADAM' + pack(lr, beta1, beta2, eps)
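
One way the AdamStep packing above might look in Python is shown below. Treating 'ADAM' as a four-character code stored in a little-endian uint32, and laying the four hyperparameters out as float32 values, are both assumptions of this sketch; the source only states that a schema tag and a packed payload are sent.

```python
import struct


def fourcc(tag):
    """Encode a 4-character ASCII tag as a little-endian uint32 (assumed convention)."""
    raw = tag.encode("ascii")
    if len(raw) != 4:
        raise ValueError("tag must be exactly 4 ASCII characters")
    return int.from_bytes(raw, "little")


ADAM_SCHEMA = fourcc("ADAM")
ADAM_LAYOUT = struct.Struct("<4f")  # lr, beta1, beta2, eps as float32 (assumed)


def pack_adam(lr, beta1, beta2, eps):
    return ADAM_SCHEMA, ADAM_LAYOUT.pack(lr, beta1, beta2, eps)


def unpack_adam(schema_id, blob):
    # The receiving side checks the schema id before trusting the byte layout.
    if schema_id != ADAM_SCHEMA:
        raise ValueError(f"unexpected schema id: {schema_id:#x}")
    return ADAM_LAYOUT.unpack(blob)


schema_id, blob = pack_adam(1e-3, 0.9, 0.999, 1e-8)
lr, beta1, beta2, eps = unpack_adam(schema_id, blob)
```

A fixed-size byte payload like this contains no Python objects, which fits the stated goals of ABI stability and a simpler boundary.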

5. CUDA Backend Responsibilities

5.1 KernelRegistry & Dispatch

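The CUDA backend does only the following.

Look up the list of KernelVariants registered for the OpKind
Check each variant's supported condition
Select a variant by priority
Launch the kernel

CUDA knows nothing about model or training concepts.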

5.2 KernelVariant

Each variant has the following.

supported(inputs, outputs, attr)
launch(...)
expected_attr_schema_id
priority

In other words:

Python decides which op to execute;
CUDA picks the best way to execute that op.
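
The real registry lives in the CUDA backend, but the selection logic is easy to model in Python. In the sketch below, the class layout, the rule that a higher priority number wins, and the use of the schema id as a filter during selection are all assumptions; the variant names are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class KernelVariant:
    name: str
    priority: int
    expected_attr_schema_id: int
    supported: Callable  # (inputs, outputs, attr) -> bool
    launch: Callable     # (inputs, outputs, attr) -> result


class KernelRegistry:
    def __init__(self):
        self._variants = {}  # OpKind -> list of KernelVariant

    def register(self, kind, variant):
        self._variants.setdefault(kind, []).append(variant)

    def dispatch(self, kind, inputs, outputs, schema_id, attr):
        candidates = [
            v for v in self._variants.get(kind, [])
            if v.expected_attr_schema_id == schema_id
            and v.supported(inputs, outputs, attr)
        ]
        if not candidates:
            raise LookupError(f"no supported variant for {kind}")
        # Assumption: a larger priority value means "preferred".
        best = max(candidates, key=lambda v: v.priority)
        return best.launch(inputs, outputs, attr)


registry = KernelRegistry()
registry.register("gemm", KernelVariant(
    "gemm_f32_naive", 0, 1,
    supported=lambda i, o, a: True,
    launch=lambda i, o, a: "gemm_f32_naive",
))
registry.register("gemm", KernelVariant(
    "gemm_f16_tc", 10, 1,
    supported=lambda i, o, a: all(t["dtype"] == "f16" for t in i),
    launch=lambda i, o, a: "gemm_f16_tc",
))

f32 = [{"dtype": "f32"}, {"dtype": "f32"}]
f16 = [{"dtype": "f16"}, {"dtype": "f16"}]
picked_f32 = registry.dispatch("gemm", f32, [], 1, b"")
picked_f16 = registry.dispatch("gemm", f16, [], 1, b"")
```

The caller only ever names the op; which variant runs depends purely on the descriptors and attributes it passes in.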

6. CUDA Graph Integration

6.1 Capture

Call _C.graph_begin() from Python
Run op_call on a dedicated stream
_C.graph_end() → instantiate the CUDA Graph

6.2 Replay

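Only _C.graph_launch() is called repeatedly
Pointer addresses are fixed
The memory contents behind those pointers are read at replay time

→ If Python changes a meta tensor's value in place (same storage), the next replay picks up the new value.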
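
The capture/replay semantics described above can be imitated without a GPU. The toy model below is not CUDA Graph and does not use the `_C` bindings; `ToyGraph` and `scale` are hypothetical names. It only demonstrates the rule that buffer identities are frozen at capture time while buffer contents are read at each launch.

```python
class ToyGraph:
    """Records calls during capture; launch() re-runs them on the same buffers."""

    def __init__(self):
        self._calls = []
        self._capturing = False

    def begin(self):
        self._calls.clear()
        self._capturing = True

    def op_call(self, fn, inputs, outputs):
        if not self._capturing:
            raise RuntimeError("op_call outside of capture")
        # Only references to the buffers are stored, never their contents.
        self._calls.append((fn, inputs, outputs))

    def end(self):
        self._capturing = False

    def launch(self):
        for fn, inputs, outputs in self._calls:
            fn(inputs, outputs)


def scale(inputs, outputs):
    x, factor = inputs
    (y,) = outputs
    for i in range(len(x)):
        y[i] = x[i] * factor[0]


x, factor, y = [1.0, 2.0], [2.0], [0.0, 0.0]

graph = ToyGraph()
graph.begin()
graph.op_call(scale, (x, factor), (y,))
graph.end()

graph.launch()
first = list(y)

factor[0] = 10.0   # in-place update: same buffer, new contents
graph.launch()
second = list(y)

factor = [100.0]   # rebinding: the graph still holds the old buffer
graph.launch()
third = list(y)
```

The third launch is the cautionary case: replacing a meta tensor with a new object, rather than writing into it, would leave a captured graph reading the old storage.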

7. Key Properties

Python builds the program
CUDA executes the program
Kernels live in the backend, but
- execution order
- data flow
- hyperparameters
- meta control
  are all handled by Python
Host-managed meta tensors keep working under CUDA Graph replay

8. One-Sentence Summary

“Python generates the graph, the schedule, and the ABI calls, and CUDA only executes the selected primitive kernels.”
