wmma_fragment_layout

1. accumulator fragment 레이아웃 커널

__global__ void wmma_acc_frag_layout_kernel(float* out) {
    int lane = threadIdx.x % 32;   // 0 ~ 31

__global__ : 알지? GPU 실행 커널 함수
float* out : 16 * 16 타일 하나를 저장할 글로벌 메모리 버퍼
lane : 현재 스레드의 warp 내 번호, blockDim.x = 32 로 실행하므로, threadIdx.x 가 곧 lane id

#if __CUDA_ARCH__ >= 700
    wmma::fragment<
        wmma::accumulator,
        M, N, K,
        float
    > c_frag;

아키텍쳐에서만 WMMA 코드 활성화
wmma::fragment<wmma::accumulator, 16, 16, 16, float>
- warp 전체가 공유하는 16 16 float accumulator 타일을 나타내는추상 타입
- 내부적으로는 warp 32 lanes 의 레지스터에 분산 저장

wmma:fill_fragment(c_frag, -1.0f);

fragment 전체 요소를 초기화

    #pragma unroll
    for (int i = 0; i < c_frag.num_elements; ++i) {
        c_frag.x[i] = static_cast<float>(lane);
    }

fragment 내부는 x[] 배열로 접근 가능
num_elements 는 각 lane 이 들고 있는 요소 개수
이 루프는
- 현재 lane 이 자기 레지스터에 들고 있는 모든 요소에 lane id 를 써 넣는 것
이후 stroe_matric_sync 할 때, 이 값이 실ㅈ 메모리 상의 (row, col) 위치에 깔림 - lane mapping 확인용

    // row-major 로 16x16 타일 저장
    wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
#else
    (void)out;
#endif
}

store_matrix_sync(out, c_frag, N, mem_row_major)
- warp 전체가 가진 fragment 내용을 전역 메모리 16, 16 표로 덤프
- out : 시작 주소
- N : leading dimension
- mem_row_major : row-major 로 저장
이렇게 저장된 값을 host 에서 출력하면, (i, j) 위치를 담당하는 lane id 를 볼 수 있음

2. matrix_a fragment 레이아웃 커널

__global__ void wmma_a_frag_layout_kernel(float* out, const __half* B_identity) {
    int lane = threadIdx.x % 32;   // 0 ~ 31

out : 최종적으로 lane map 을 저장할 16, 16 float 타일
B_identity : 16, 16 half 행렬, identity(i) 가 들어 있는 버퍼
lane 은 위와 동일

#if __CUDA_ARCH__ >= 700
    wmma::fragment<
        wmma::matrix_a,
        M, N, K,
        __half,
        wmma::row_major
    > a_frag;

A operand fragment
- 역할 : GEMM 에서 왼쪽 행렬 A 의 16, 16 조각
- 타입 : half
- layout : row-major

    wmma::fragment<
        wmma::matrix_b,
        M, N, K,
        __half,
        wmma::row_major
    > b_frag;

B operand fragment
- 역할 : GEMM 에서 오른쪽 행렬, B 의 16, 16 조각
- row-major 설정
- B_frag 는 identity matirx 조각을 담는 용도

    wmma::fragment<
        wmma::accumulator,
        M, N, K,
        float
    > c_frag;

결과 accumulator fragment

    // A fragment: 각 lane 이 자기 fragment 요소를 lane id 로 채움
    #pragma unroll
    for (int i = 0; i < a_frag.num_elements; ++i) {
        a_frag.x[i] = __float2half(static_cast<float>(lane));
    }

핵심
- A fragment 내부의 모든 요소
- A 의 각 요소는 그걸 담당하는 lane 번호가 값으로 들어 있음

    // B fragment: 글로벌 메모리의 identity matrix 로딩
    wmma::load_matrix_sync(b_frag, B_identity, N);

B_identity 를 B fragment 에 로드
즉, B = I

    // C = 0 초기화 후 C = A * B
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

C fragment 를 0으로 초기화
mma_sync(c_frag, a_frag, b_frag, c_frag)
즉, 연산 결과에서는 A 의 값이 그대로 복사된 상태

    // C를 row-major 로 저장
    wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
#else
    (void)out;
    (void)B_identity;
#endif
}

accumulator fragment 는 store_matrix_sync 지원
A 에 lane id 를 넣고, C = A I 로 흘려서 out 에 찍힌 값 = A fragment 의 레이아웃 = matrix_a row_major 레이아웃 + lane mapping

결과적으로 [matrix_a fragment (row_major)] 출력이 accumulator 와 완전히 동일하게 나옴

3. matrix_b fragment 레아이웃 커널

__global__ void wmma_b_frag_layout_kernel(float* out, const __half* A_identity) {
    int lane = threadIdx.x % 32;   // 0 ~ 31

out 최종 lane map
A_identity : 16, 16

#if __CUDA_ARCH__ >= 700
    wmma::fragment<
        wmma::matrix_a,
        M, N, K,
        __half,
        wmma::row_major
    > a_frag;

    wmma::fragment<
        wmma::matrix_b,
        M, N, K,
        __half,
        wmma::row_major
    > b_frag;

    wmma::fragment<
        wmma::accumulator,
        M, N, K,
        float
    > c_frag;

A fragment 에 identity, B fragment 에 lane id 를채워서 C = I x B -> B 의 레이아웃을 C 로 복사하는 구조

    // A fragment: identity matrix 로딩
    wmma::load_matrix_sync(a_frag, A_identity, N);

A_identity (16, 16) 를 A fragment 에 로드
A = I

    // B fragment: lane id 로 채움
    #pragma unroll
    for (int i = 0; i < b_frag.num_elements; ++i) {
        b_frag.x[i] = __float2half(static_cast<float>(lane));
    }

B fragment 내부의 모든 요소에 lane id 를 half 로 기록

lane 개념 정리, lane = warp 안에서의 스레드 번호

lane = warp 내부에서 0~31 까지 번호가 붙은 스레드 ( threadIdx 와 다르게 warp 단위로 자른 번호 )

CUDA 의 근본..!

한 번에 32개 스레드를 묶어서
완전히 동시에
같은 명령어를 실행시키는 구조 (SIMT)

lane 개념이 필요한 이유...!

warp 내부의 스레드들은 lockstep(완전 동기)로 동작하므로, warp 내부에서 데이터교환이 가능한기 때문

lane 관점으로 보면 연속적인 타일이 절대 아님

lane 0 이 맡은 셀 8개는 타일 전체에 흩어져 있음
2x2 블록 x 상/하 반복 형태, 연속 영역이 아님
타일 전체의 서로 다른 8칸씩 채임지낟.

WMMA fragment = 연속한 sub-tile 1 개를 lane 이 갖는 구조가 아님, 타일 전체에 흩어진 여러 셀을 맡는다.

lane 0 → 타일 전체에 흩어진 8셀  
lane 1 → 타일 전체에 흩어진 8셀  
lane 2 → 타일 전체에 흩어진 8셀  
...
lane31 → 타일 전체에 흩어진 8셀

왜 이런 형태인지

하드웨어 입장에서의 목적

warp 전체가 동시에 같은 형태의 FMA 를 때릴 수 있도록, operand A/B/C 의 레지스터 배치를 균등하게 분리해야 한다.

lane 이 연속해서 각 부분을 들고 있으면

warp 레벨의 병렬성
warp 레벨 load / store alignment
warp-level MMA pipeline 스케줄링

이 깨져버림

실제 WMMA fragment 는

연속한 메모리를 그대로 들고 있지 않고
하드웨어가 요구하는 방식으로 조각내어
lane 별 레지스터 그룹에 스프레딩 swizzling 해놓는다.

'GPU-KERNEL' 카테고리의 다른 글

fragment = Tensor Core 가 처리할 타일 조각을 warp 가 레지스터에 담아놓은 형태 (0)	2025.11.27
스칼라 FMA, WMMA 방식의 GEMM 연산 분해 방식의 차이 이해 (0)	2025.11.27
Tensorcore_wmma_vs_fp32 test code (0)	2025.11.27
Tensor Cores & MMA Pipeline (0)	2025.11.27
ncu 리포트 (.ncu-rep) 생성 & 활용 매뉴얼 (0)	2025.11.25

뜻 지, 가르칠 훈

wmma_fragment_layout_test

1. accumulator fragment 레이아웃 커널

2. matrix_a fragment 레이아웃 커널

3. matrix_b fragment 레아이웃 커널

lane 개념 정리, lane = warp 안에서의 스레드 번호

lane 개념이 필요한 이유...!

lane 관점으로 보면 연속적인 타일이 절대 아님

왜 이런 형태인지

'GPU-KERNEL' 카테고리의 다른 글

티스토리툴바

wmma_fragment_layout_test

1. accumulator fragment 레이아웃 커널

2. matrix_a fragment 레이아웃 커널

3. matrix_b fragment 레아이웃 커널

lane 개념 정리, lane = warp 안에서의 스레드 번호

lane 개념이 필요한 이유...!

lane 관점으로 보면 연속적인 타일이 절대 아님

왜 이런 형태인지

'GPU-KERNEL' 카테고리의 다른 글

'GPU-KERNEL' Related Articles

티스토리툴바