Occupancy vs Performance test ( Block Size Sweep )

This experiment changes only the block size and observes how the following three metrics move together:
- Achieved Occupancy
- DRAM Throughput
- Execution time

The goal is to confirm that higher occupancy does not translate linearly into higher performance.

The kernel is identical in every run; only the block size differs ( 64, 256, 1024 ).

template<int BLOCK_SIZE>
__global__
void occupancy_kernel(float* __restrict__ out,
                      const float* __restrict__ in,
                      int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float acc = 0.f;

    for (int i = tid; i < n; i += stride) {
        float v = in[i];

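        // ITERS is a compile-time constant; its definition and value are not shown in this post.
        #pragma unroll
        for (int k = 0; k < ITERS; ++k) {
            acc = acc * 1.0000001f + v;   // dependent FMA chain: each step needs the previous acc
        }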
    }

    if (tid < n)
        out[tid] = acc;
}

template<int BLOCK_SIZE>
float run_case(const char* label, float* d_out, const float* d_in, int n)
{
    int grid = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    if (grid > 80) grid = 80;

    printf("[%s] BLOCK_SIZE = %d, grid = %d\n", label, BLOCK_SIZE, grid);

    // ... time the launch with cudaEvent, compute BW ...
}
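
The grid computation in run_case determines how many threads each block size actually launches. The following host-side sketch mirrors that computation so the launch geometry can be inspected without a GPU. The names LaunchGeometry and launch_geometry are hypothetical (they are not part of the original code), and the warp size of 32 is an assumption that holds on current NVIDIA GPUs.

```cpp
#include <algorithm>

// Launch geometry for one block size, mirroring the grid computation in run_case.
// Hypothetical helper type, introduced only for illustration.
struct LaunchGeometry {
    int grid;                 // number of blocks actually launched
    int warps_per_block;      // block_size / warp_size, rounded up
    long long total_threads;  // grid * block_size
};

// max_grid = 80 matches the cap in run_case.
// warp_size = 32 is an assumption (true on current NVIDIA GPUs).
inline LaunchGeometry launch_geometry(long long n, int block_size,
                                      int max_grid = 80, int warp_size = 32)
{
    long long blocks = (n + block_size - 1) / block_size;  // ceil(n / block_size)
    int grid  = static_cast<int>(std::min<long long>(blocks, max_grid));
    int warps = (block_size + warp_size - 1) / warp_size;
    return {grid, warps, static_cast<long long>(grid) * block_size};
}
```

For any n large enough to hit the cap of 80 blocks, block sizes 64, 256 and 1024 launch 5,120, 20,480 and 81,920 threads (2, 8 and 32 warps per block). The sweep therefore changes the total number of threads in flight, not only how they are grouped into blocks.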

== Occupancy vs Performance Test ==

[LOW  occupancy-ish] BLOCK_SIZE = 64, grid = 80
  Time   = 896.889 ms
  BW     = 0.15 GB/s

[MID  occupancy-ish] BLOCK_SIZE = 256, grid = 80
  Time   = 317.417 ms
  BW     = 0.42 GB/s

[HIGH occupancy-ish] BLOCK_SIZE = 1024, grid = 80
  Time   = 325.685 ms
  BW     = 0.41 GB/s

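Raising occupancy from a very low level has a large effect, but the benefit then flattens out: going from block size 64 to 256 cuts the time from 896.9 ms to 317.4 ms, while 1024 is no faster than 256 (325.7 ms).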

Occupancy vs Performance: summary table

BS	Achieved Occupancy	Warps/SM	DRAM Throughput	Time (ms)	BW (GB/s)	Notes
64	5.79%	2.78	28.11%	896.9	0.15	Too few warps; latency hiding fails
256	23.14%	11.11	81.93%	317.4	0.42	Memory bandwidth almost fully used; sweet spot
1024	65.27%	31.33	89.73%	325.7	0.41	Occupancy much higher, but performance almost unchanged
