Register Pressure test

The number of registers available per thread is limited, so a kernel that uses many registers lowers occupancy and can trigger spills to local memory.

The comparison uses two kernels.

low_reg_kernel

Low register usage
Light arithmetic, memory-bound

high_reg_kernel

Higher register usage
Many accumulators + unrolled loop
Compute-bound

Nsight Compute Results Summary

1) low_reg_kernel - memory-bound characteristics

Registers Per Thread	16
Achieved Occupancy	23.50%
Active Warps per SM	11.28 / 48
Waves per SM	0.23

Registers are plentiful; occupancy is low because of the workload size (0.23 waves per SM), not because of register usage.
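As a quick sanity check on the numbers above, achieved occupancy is simply the average number of active warps divided by the SM's warp slots. The sketch below assumes the limit of 48 warps per SM shown in the report; the function name is made up for illustration.

```python
# Maximum resident warps per SM, taken from the "/ 48" in the report above.
MAX_WARPS_PER_SM = 48


def achieved_occupancy(active_warps_per_sm: float,
                       max_warps: int = MAX_WARPS_PER_SM) -> float:
    """Return achieved occupancy as a fraction of the SM's warp slots."""
    return active_warps_per_sm / max_warps


# low_reg_kernel: 11.28 active warps per SM on average.
low = achieved_occupancy(11.28)
print(f"low_reg_kernel achieved occupancy: {low:.2%}")
```

The result matches the reported Achieved Occupancy, so the two metrics are the same measurement expressed in different units.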

DRAM Throughput	45.92%
SM Compute Throughput	16.92%
L1/TEX Hit Rate	0%
L2 Hit Rate	48.33%

A clearly memory-bound kernel: DRAM throughput (45.92%) is far higher than SM compute throughput (16.92%)

Warp Stall Analysis

Warp Cycles Per Issued Instruction = 13.79 cycles
Main stall

Each warp waits an average of 9.2 cycles
on an L1TEX scoreboard dependency
→ about 66.4% of all stall cycles

Waiting on memory is the execution bottleneck
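The stall share can be roughly reproduced by dividing the cycles spent on the L1TEX scoreboard by the total cycles per issued instruction. This is a minimal sketch using the rounded values quoted above, so it lands near the reported 66.4% rather than exactly on it; the small gap is presumably rounding in the displayed figures.

```python
def stall_share(stall_cycles: float, cycles_per_issued_inst: float) -> float:
    """Fraction of the per-instruction warp cycles spent in one stall reason."""
    return stall_cycles / cycles_per_issued_inst


# low_reg_kernel: 9.2 cycles on the L1TEX scoreboard out of 13.79 cycles total.
share = stall_share(9.2, 13.79)
print(f"L1TEX scoreboard share: {share:.1%}")
```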

Local Memory Spill

Local Memory Spilling Requests = 0, so registers are not scarce enough to cause spills

2) high_reg_kernel - compute-bound characteristics

Launch & Occupancy

Registers Per Thread	18
Achieved Occupancy	22.00%
Active Warps per SM	10.56 / 48

Slightly lower than low_reg_kernel; occupancy has not been cut significantly

Memory / Compute Throughput

DRAM Throughput	3.34%
SM Compute Throughput	65.36%
L1/TEX Hit Rate	0%
L2 Hit Rate	51.35%

The compute pipelines are very heavily utilized
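The memory-bound versus compute-bound call for both kernels comes down to which throughput percentage dominates. The helper below is only a rough heuristic written for this post's numbers, not the rule Nsight Compute itself applies; `classify` is a hypothetical name.

```python
def classify(dram_pct: float, sm_pct: float) -> str:
    """Label a kernel by whichever throughput percentage is higher."""
    if dram_pct > sm_pct:
        return "memory-bound"
    if sm_pct > dram_pct:
        return "compute-bound"
    return "balanced"


# (DRAM Throughput %, SM Compute Throughput %) from the two reports.
kernels = {
    "low_reg_kernel": (45.92, 16.92),
    "high_reg_kernel": (3.34, 65.36),
}

for name, (dram, sm) in kernels.items():
    print(f"{name}: {classify(dram, sm)}")
```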

Warp IPC / Stall

Executed IPC Active = 3.75 inst/cycle
Warp Cycles Per Issued Instruction = 2.82 cycles

Warps are packed almost entirely with compute work
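These two metrics are consistent with the occupancy figure: if each resident warp issues one instruction every 2.82 cycles on average, then 10.56 resident warps per SM issue about 10.56 / 2.82 instructions per cycle. The sketch below checks that, and also expresses the IPC as a share of peak under the assumption (not stated in the post) that the SM has 4 warp schedulers issuing one instruction per cycle each.

```python
SCHEDULERS_PER_SM = 4  # assumption: not given in the report


def estimated_ipc(active_warps_per_sm: float,
                  warp_cycles_per_issued_inst: float) -> float:
    """Estimate SM-wide issue rate from resident warps and per-warp latency."""
    return active_warps_per_sm / warp_cycles_per_issued_inst


def issue_utilization(ipc: float, schedulers: int = SCHEDULERS_PER_SM) -> float:
    """IPC as a fraction of the assumed peak of one instruction per scheduler."""
    return ipc / schedulers


est = estimated_ipc(10.56, 2.82)   # high_reg_kernel values
util = issue_utilization(3.75)     # reported Executed IPC Active

print(f"estimated IPC: {est:.2f} (reported: 3.75)")
print(f"issue utilization: {util:.1%}")
```

The estimate is based on issued rather than executed instructions, so a small difference from the reported value is expected.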

Local Memory Spill

Local Memory Spilling Requests = 0

The kernel has not yet reached the register-pressure range where spills occur

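Memory-bound <-> Compute-bound transition!!

Even though the two kernels were compared with nearly the same launch configuration,

longer register live ranges plus more arithmetic completely change the character of the kernel.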

More registers let more operands stay in registers, but past a certain point occupancy drops and spilling begins, which adds local/global memory accesses and makes performance fall sharply.
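To see where "a certain point" lies, the following sketch estimates how many warps fit on an SM for a given register count per thread. It assumes a 64K-entry register file per SM (an assumption, not from the report) together with the 48-warp limit shown in the report, and it ignores register allocation granularity, so real thresholds will sit slightly lower.

```python
REGISTERS_PER_SM = 65536  # assumption: 64K 32-bit registers per SM
MAX_WARPS_PER_SM = 48     # from the Nsight Compute report
WARP_SIZE = 32


def register_limited_warps(regs_per_thread: int) -> int:
    """Warps per SM that fit in the register file, capped at the warp limit."""
    fit = REGISTERS_PER_SM // (regs_per_thread * WARP_SIZE)
    return min(MAX_WARPS_PER_SM, fit)


for regs in (16, 18, 32, 42, 43, 64, 128, 255):
    warps = register_limited_warps(regs)
    print(f"{regs:>3} regs/thread -> {warps:>2} warps/SM "
          f"({warps / MAX_WARPS_PER_SM:.0%} of the limit)")
```

Under these assumptions, registers only start to cap occupancy above roughly 42 per thread, which fits the observation that 16 and 18 registers per thread leave both kernels far from the pressure range.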

In real GEMM / Tensor Core kernel optimization,

register count / Shared Memory tile size / block size are all intertwined, and the balance among them has to be right.
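The interaction can be sketched as a "which resource runs out first" calculation. Everything in `SmLimits` other than the 48-warp limit is an illustrative placeholder rather than a value from this post, and the model ignores allocation granularity; the point is only that changing any one of the three knobs can change the limiter.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass(frozen=True)
class SmLimits:
    """Hypothetical per-SM resource limits (placeholders, not measured)."""
    registers: int = 65536
    shared_mem_bytes: int = 100 * 1024
    max_warps: int = 48
    max_blocks: int = 16


def resident_blocks(block_size: int, regs_per_thread: int,
                    smem_per_block: int, sm: SmLimits = SmLimits()):
    """Return (blocks per SM, name of the limiting resource)."""
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
    limits = {
        "warps": sm.max_warps // warps_per_block,
        "registers": sm.registers // regs_per_block,
        "shared memory": (sm.shared_mem_bytes // smem_per_block
                          if smem_per_block else sm.max_blocks),
        "block slots": sm.max_blocks,
    }
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


# Three made-up configurations, each limited by a different resource.
print(resident_blocks(256, 18, 0))            # small kernel, no shared memory
print(resident_blocks(256, 128, 32 * 1024))   # register-heavy tile
print(resident_blocks(256, 32, 48 * 1024))    # large shared-memory tile
```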
