TensorCore K 방향 타일 루프 + cp.async 2 stage pipeline

결과 해석

Device 0: NVIDIA GeForce RTX 3060 (SM 8.6)

[naive_ldg_kernel]
  Time   = 0.909 ms
  BW     = 92.25 GB/s
  Sample out[0] = 2048.062500

[cp_async_kernel]
  Time   = 0.507 ms
  BW     = 165.49 GB/s
  Sample out[0] = 2048.062500

1. 공통 인덱싱 / 워크로드 구조

int tid_in_block = threadIdx.x;
int global_tid   = blockIdx.x * blockDim.x + tid_in_block;
int stride       = blockDim.x * gridDim.x;

block 당 128 thread, grid 80 block, 전체 thread 10,240
global_tid = 0...10239
stride = 10240

각 thread 는

for (int tile_idx = 0; tile_idx < NUM_TILES; ++tile_idx) {
    int idx = global_tid + tile_idx * stride;
    // in[idx] 사용
}

thread 당 2048 개 원소
index 들이 NUM_ELEMS 를 한 번씩 컵거

즉, K 방향에서 thread 하나가 자기 K 방향 스트리드 구간을 따라 데이터를 모으는구조

2. naive_idg_kernel : 순차적 load + compute

for (int tile_idx = 0; tile_idx < NUM_TILES; ++tile_idx) {
    int idx = global_tid + tile_idx * stride;

    // 1) global → shared
    tile[tid_in_block] = in[idx];
    __syncthreads();

    // 2) shared → register 후 더미 compute
    float v = tile[tid_in_block];

    #pragma unroll 8
    for (int k = 0; k < 32; ++k) {
        v = v * 1.000001f + 0.000001f;
    }

    acc += v;
    __syncthreads();
}

한 타일에서 일어나는 일

global -> shared
- block 의 128 thread 가
- in[idx .. idx + 127] 처럼 연속 구간을 shared 에 채움 coalesced
__syncthreads()
- 공유 메모리에 모두 쓰기를 끝났다는 보장
shared -> register + compute
- 각 thread 는 자기 tile 을 읽어와서 v 에 넣고,
- 32 회 FMA 비슷한 연산 수행 - 약간의 compute 섞음
acc += v
마지막 __syncthreads()

for each tile:
  ld.global (global → L2/L1 → SM)  ← latency 그대로 기다림
  ↓
  __syncthreads()
  ↓
  compute 32xFMA
  ↓
  __syncthreads()

3. cp_asyc_kernel: load/compute 겹치지

3.1 shared double-buffer 구성

extern __shared__ float smem[];
float* buf0 = smem;
float* buf1 = smem + blockDim.x; // double-buffer

block 당 shared floats : BLOCK_SIZE * 2 -> buf0, buf1 두 개
tile 0 / tile 1 / tile 2 / ...를 buf0, buf1 에 번갈아 넣으면서 파이프라인

3.2 cp.async wrapper

static __device__ __forceinline__
void cp_async_ca_shared_global(void* smem_ptr, const void* global_ptr, int bytes)
{
    unsigned smem_addr = (unsigned)__cvta_generic_to_shared(smem_ptr);
    asm volatile(
        "cp.async.ca.shared.global [%0], [%1], %2;\n" :: 
        "r"(smem_addr), "l"(global_ptr), "n"(sizeof(float))
    );
}

__cvta_generic_to_shard -> generic 포인터 -> shared 주소
PTX
- cp.async.ca.shared.global [smem], [gmem], N;
- asynchronous copy : 스레드가 바로 stall 안 하고 넘어갈 수 있음
- .ca -> L2/L1 에 캐시 허용 모드

3.3 warm-up 단계

{
    int idx0 = global_tid + 0 * stride;
    cp_async_ca_shared_global(&buf0[tid_in_block], in + idx0, sizeof(float));
    asm volatile("cp.async.commit_group;\n" ::);

    if (NUM_TILES > 1) {
        int idx1 = global_tid + 1 * stride;
        cp_async_ca_shared_global(&buf1[tid_in_block], in + idx1, sizeof(float));
        asm volatile("cp.async.commit_group;\n" ::);
    }
}

tile 0 = buf0
tile 1 = buf1
둘 다 cp.async 로 걸어두고 comit_group 으로 파이프라인에 올림
두 타일은 로드 중 in-flight 상태, 아직 compute 전

3.4 메인 루프 2-stage pipeline

for (int tile = 0; tile < NUM_TILES; ++tile) {
    asm volatile("cp.async.wait_group 1;\n" ::);
    __syncthreads();

cp.async.wait_group 1
- 진행 중인 cp.assync 그룹의 수가 1 이하가 될 때까지 대기
- warm-up 에서 그룹 두 개를 날림
  - tile = 0 시점에는 최소한 tile 0 데이터는 shared 에 도착
- 이후 iteration 에선
  - 지금 쓸 타일 current tile 데이터는 준비 완료 보자
__syncthreads()
- block 전체가 shared 에 온 데이터를 볼 수 있도록 동기화

이렇게 해서

global load ( cp.async) <-> compute 가 겹쳐지는 2-stage 파이프라인 완성

warm-up: load tile0 → commit, load tile1 → commit

for tile = 0:
    wait_group 1    → tile0 ready
    compute tile0   (동시에 tile1,2 load in-flight)
    preload tile2

for tile = 1:
    wait_group 1    → tile1 ready
    compute tile1   (동시에 tile2,3 load in-flight)
    preload tile3
...

4. BW 가 다른 이유

두 커널 모두

같은 양의 데이터를 읽고
같은 양의 compute

naivie

각 tile 에서 global load latency 를 기다림
DRAM burst 는 빠른데, 기다리느라 SM 이 쉬는 구간이많음

cp.async

global load 를 compute 와겹쳐서 DRAM 이 데이터를 끌어오는 동안 SM 이 쉬지 않고 연산
같은 데이터량을 더 빨리 처리, BW 가 높게 측정 - 유효 BW 가 증가

'GPU-KERNEL' 카테고리의 다른 글

GPU Architecture & Execution Model (0)	2025.11.29
Optimization Principles 와 Tensor Core Optimization 의 분리 (0)	2025.11.28
Shared memory, fragment 의 차이 이해 (0)	2025.11.27
fragment = Tensor Core 가 처리할 타일 조각을 warp 가 레지스터에 담아놓은 형태 (0)	2025.11.27
스칼라 FMA, WMMA 방식의 GEMM 연산 분해 방식의 차이 이해 (0)	2025.11.27

뜻 지, 가르칠 훈

TensorCore K 방향 타일 루프 + cp.async 2 stage pipeline

결과 해석

1. 공통 인덱싱 / 워크로드 구조

2. naive_idg_kernel : 순차적 load + compute

3. cp_asyc_kernel: load/compute 겹치지

3.1 shared double-buffer 구성

3.2 cp.async wrapper

3.3 warm-up 단계

3.4 메인 루프 2-stage pipeline

4. BW 가 다른 이유

'GPU-KERNEL' 카테고리의 다른 글

티스토리툴바

TensorCore K 방향 타일 루프 + cp.async 2 stage pipeline

결과 해석

1. 공통 인덱싱 / 워크로드 구조

2. naive_idg_kernel : 순차적 load + compute

3. cp_asyc_kernel: load/compute 겹치지

3.1 shared double-buffer 구성

3.2 cp.async wrapper

3.3 warm-up 단계

3.4 메인 루프 2-stage pipeline

4. BW 가 다른 이유

'GPU-KERNEL' 카테고리의 다른 글

'GPU-KERNEL' Related Articles

티스토리툴바