Essential (strongly recommended; learning these now makes a noticeable difference in kernel quality)
- ILP (Instruction-Level Parallelism)
- Shared Memory Tiling / Double Buffering
- cp.async (async global → shared copy)
- Tensor Core MMA (warp-level matrix multiply)
- Warp-level primitives (__shfl_sync, __ballot_sync)
- Latency vs. throughput modeling
- Register Spill (local memory)
- Launch Bounds / Occupancy Tuning
- Cache Hinting (.cg, .ca, .cs)
- Memory Access Locality (sliding window, reuse)
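To make the register pressure and occupancy items in this list more concrete, here is a rough host-side sketch in plain C++ (not CUDA, so it runs without a GPU). It estimates how many threads can stay resident on one SM for a given register count per thread. The `SmLimits` values and both function names are hypothetical, picked only for illustration, and the model ignores shared memory and register allocation granularity, so treat the numbers as a first approximation rather than what a profiler would report.

```cpp
#include <algorithm>

// Hypothetical per-SM limits chosen for illustration only;
// look up the real values for your GPU architecture.
struct SmLimits {
    int regs_per_sm        = 65536;  // 32-bit registers per SM (assumed)
    int max_threads_per_sm = 2048;   // resident thread slots per SM (assumed)
    int max_blocks_per_sm  = 32;     // resident block slots per SM (assumed)
};

// Threads that can be resident on one SM, considering only registers,
// thread slots, and block slots. Shared memory and register allocation
// granularity are deliberately ignored in this simplified model.
inline int resident_threads(const SmLimits& sm, int regs_per_thread,
                            int threads_per_block) {
    const int regs_per_block    = regs_per_thread * threads_per_block;
    const int blocks_by_regs    = sm.regs_per_sm / regs_per_block;
    const int blocks_by_threads = sm.max_threads_per_sm / threads_per_block;
    const int blocks =
        std::min({blocks_by_regs, blocks_by_threads, sm.max_blocks_per_sm});
    return blocks * threads_per_block;
}

// Estimated occupancy = resident threads / maximum resident threads.
inline double estimated_occupancy(const SmLimits& sm, int regs_per_thread,
                                  int threads_per_block) {
    return static_cast<double>(
               resident_threads(sm, regs_per_thread, threads_per_block)) /
           sm.max_threads_per_sm;
}
```

Under these assumed limits, a 256-thread block at 32 registers per thread fills the SM, while doubling to 64 registers per thread halves the estimate. That trade-off is what launch bounds tuning tries to control, and pushing the register cap too low is what leads to spills into local memory.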
Intermediate (concepts that further improve kernel quality once you know them)
- SM Scheduling / persistent threads
- Data Prefetch / Pipeline staging
- Block shape tuning
- Pipeline hazards / dependency chain
- Branch predication
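The dependency chain item is easiest to see in a reduction loop. The sketch below is plain C++ rather than a CUDA kernel, and both function names are made up for illustration. The first version funnels every add through a single accumulator, so each add must wait for the previous one. The second splits the work across four independent accumulators, which gives the hardware independent instructions to overlap. The same transformation is commonly applied inside a per-thread loop of a kernel to raise ILP.

```cpp
#include <cstddef>
#include <vector>

// Baseline: every add depends on the result of the previous add,
// forming one long dependency chain.
inline float sum_single_chain(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Four independent accumulators: the four adds inside one iteration
// do not depend on each other, so they can be issued back to back.
inline float sum_four_chains(const std::vector<float>& v) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    // Tail: fewer than four elements remain.
    for (; i < n; ++i) a0 += v[i];
    // Combine the partial sums at the end.
    return (a0 + a1) + (a2 + a3);
}
```

Note that splitting the accumulator changes the order of floating-point additions, so for general inputs the two versions can differ in the last bits. Whether the four-chain version is actually faster depends on the hardware and on what the compiler already does, so it is worth measuring rather than assuming.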