Beyond occupancy: we need more accurate tools and metrics for measuring GPU performance

Low occupancy can still give the best performance because occupancy carries no information about memory behavior.

Occupancy is a ratio measured on a single SM: warps in use / warps the SM can hold.

If a kernel is configured so that every warp slot in the SM is occupied, each warp gets less memory and less data to work with, and data reuse drops.

Registers and shared memory are partitioned among the resident warps
fewer registers per thread : spills to local memory occur
less shared memory per block : tiles get smaller and are replaced sooner
longer reuse distance : data is pushed out of the cache before it is reused
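
To make the resource partitioning above concrete, here is a minimal sketch of an occupancy estimate. The per-SM limits (65536 registers, 100 KB shared memory, 64 warps, 32 blocks) are illustrative assumptions, not the values of any specific GPU, and the function name `resident_warps` is hypothetical. Allocation granularity is ignored, so a real occupancy calculator may report slightly different numbers.

```python
def resident_warps(threads_per_block, regs_per_thread, smem_per_block,
                   regs_per_sm=65536, smem_per_sm=100 * 1024,
                   max_warps_per_sm=64, max_blocks_per_sm=32, warp_size=32):
    """Estimate resident warps per SM and the resulting occupancy.

    All hardware limits are assumed defaults; register and shared-memory
    allocation granularity is ignored for simplicity.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size

    # Each resource caps how many blocks can be resident at once.
    limits = [
        max_blocks_per_sm,
        max_warps_per_sm // warps_per_block,
        regs_per_sm // regs_per_block,
    ]
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)

    blocks = min(limits)
    warps = blocks * warps_per_block
    return warps, warps / max_warps_per_sm


# 32 registers per thread: the SM is full, but each thread holds little data.
print(resident_warps(256, 32, 0))    # (64, 1.0)
# 128 registers per thread: occupancy falls to 25%, but each thread can keep
# four times as many values in registers for reuse.
print(resident_warps(256, 128, 0))   # (16, 0.25)
```

Under these assumptions, the second configuration trades occupancy for per-thread storage, which is exactly the trade-off the list describes.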

Optimal occupancy is the point where arithmetic intensity is maximized. What matters is not occupancy itself but how long reused data stays resident.

Let's look at more decisive and more general performance metrics.

1. Arithmetic Intensity (Ops/Byte)

One of the best single predictors of GPU performance: the number of operations performed per byte of memory traffic
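
As a worked example, the sketch below computes arithmetic intensity for two common kernels. It assumes ideal memory traffic, meaning every input element is read from memory exactly once and every output element is written exactly once; real kernels usually move more bytes. The function names are hypothetical.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=4):
    """FLOPs per byte for C = A @ B, with A (m x k), B (k x n), C (m x n).

    Assumes ideal traffic: A and B are read once, C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


def vector_add_arithmetic_intensity(bytes_per_element=4):
    """FLOPs per byte for c[i] = a[i] + b[i]: one add, two loads, one store."""
    return 1 / (3 * bytes_per_element)


print(gemm_arithmetic_intensity(1024, 1024, 1024))  # about 170.7 FLOP/byte
print(vector_add_arithmetic_intensity())            # about 0.083 FLOP/byte
```

The gap between the two numbers shows why GEMM rewards data reuse while vector addition is limited by memory traffic no matter how it is launched.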

2. Roofline Model

One of the most trusted tools for evaluating GPU performance

Axes:

x-axis : arithmetic intensity
y-axis : attainable GFLOPS
There are two bounds
- memory bandwidth roof
- compute peak roof

Where a kernel sits on this graph determines the performance it can reach
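
The two roofs can be sketched in a few lines: attainable performance is the smaller of the compute peak and arithmetic intensity times memory bandwidth. The device numbers used below (10000 GFLOP/s peak, 500 GB/s bandwidth) are made up for illustration, and the function names are hypothetical.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute roof, memory roof) in GFLOP/s."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak_gflops / bandwidth_gbs


def roofline_region(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Which roof limits a kernel with the given arithmetic intensity."""
    if arithmetic_intensity < ridge_point(peak_gflops, bandwidth_gbs):
        return "memory-bound"
    return "compute-bound"


# Hypothetical device: 10000 GFLOP/s peak, 500 GB/s memory bandwidth.
PEAK, BW = 10_000.0, 500.0
print(ridge_point(PEAK, BW))             # 20.0
print(attainable_gflops(2.0, PEAK, BW))  # 1000.0
print(roofline_region(2.0, PEAK, BW))    # memory-bound
print(roofline_region(50.0, PEAK, BW))   # compute-bound
```

A kernel left of the ridge point gains from moving fewer bytes per operation; a kernel right of it gains only from better use of the compute units.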

3. Warp Stall Breakdown

A metric that tells you why a kernel is slow

Memory Dependency Stall
Long Scoreboard Stall
Barrier Stall
Dispatch Stall
Math Pipe Busy

It points to the cause of the slowdown, not just its presence
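
One way to read a stall breakdown is to convert the per-reason sample counts into percentages and sort them. The sketch below does that; the reason names and sample counts are invented for illustration and would in practice be copied by hand from a profiler report. The function name `stall_breakdown` is hypothetical.

```python
def stall_breakdown(samples):
    """Return (reason, percent) pairs sorted from most to least frequent.

    `samples` maps a stall reason to its number of profiler samples.
    """
    total = sum(samples.values())
    if total == 0:
        return []
    ranked = sorted(samples.items(), key=lambda item: item[1], reverse=True)
    return [(reason, 100.0 * count / total) for reason, count in ranked]


# Hypothetical sample counts, not measured data.
example = {
    "barrier": 200,
    "long_scoreboard": 600,
    "dispatch": 50,
    "math_pipe_busy": 150,
}
for reason, percent in stall_breakdown(example):
    print(f"{reason:16s} {percent:5.1f}%")
```

In this made-up case the dominant reason is the long scoreboard stall, which would suggest looking at memory access latency before tuning anything else.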

4. SM Utilization (how busy the compute units actually are)

sm__throughput.avg.pct_of_peak_sustained_active
sm__inst_executed_pipe_fma

Check whether these values are high

That is, they show how busy the ALUs are

High means compute-bound; low suggests memory stalls

5. Memory BW Utilization

dram__throughput
l1tex__throughput
lts__throughput (L2 cache; Nsight Compute names the L2 unit lts)

High means memory-bound; low suggests compute-bound
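
Sections 4 and 5 can be combined into a simple first-pass classification. This is a rough sketch: the 60% threshold is an arbitrary assumption, the function name is hypothetical, and the two inputs are percent-of-peak values that would be read from a profiler report.

```python
def classify_kernel(sm_pct, mem_pct, threshold=60.0):
    """Classify a kernel from SM and memory throughput (percent of peak).

    `threshold` is an assumed cut-off, not a value defined by any tool.
    """
    if sm_pct >= threshold and sm_pct >= mem_pct:
        return "compute-bound"
    if mem_pct >= threshold:
        return "memory-bound"
    # Neither unit is saturated: the warps are mostly waiting, so the
    # warp stall breakdown is the next thing to check.
    return "stall-bound"


print(classify_kernel(85.0, 30.0))  # compute-bound
print(classify_kernel(25.0, 90.0))  # memory-bound
print(classify_kernel(20.0, 15.0))  # stall-bound
```

The third case matters: low utilization on both sides usually means the kernel is limited by latency rather than by either roof.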

6. Instructions-per-cycle (IPC) & Issue Rate

sm__ipc

inst_issued

Close to 1 is good; a low value means stalls are present
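
For reference, IPC is simply instructions divided by cycles. The sketch below assumes the two counters have already been read from a profiler; the counter values and the function name are hypothetical.

```python
def instructions_per_cycle(instructions, active_cycles):
    """IPC: instructions executed (or issued) per active cycle."""
    if active_cycles <= 0:
        raise ValueError("active_cycles must be positive")
    return instructions / active_cycles


# Hypothetical counter values, not measured data.
print(instructions_per_cycle(800, 1000))  # 0.8
print(instructions_per_cycle(150, 1000))  # 0.15
```

Comparing the issued and executed variants of this ratio can hint at how many instructions are being replayed, under the assumption that both counters cover the same cycles.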
