Every number on this page was measured on physical hardware in our lab. No simulations, no projections, no cherry-picked runs. Full methodology and hardware specs are disclosed below so you can reproduce these results.
We believe performance claims require context. Here's exactly how we measured.
Default compiler flags, default OS scheduler settings, default block sizes. No manual tuning. This represents what most users get out of the box.
Parameters selected by Mesh Optimizer's behavior atlas after profiling 83,000+ configurations. The optimizer discovers optimal block sizes, ILP chains, tile dimensions, and memory access patterns per kernel type.
Each benchmark runs 20+ iterations after warmup. We report the median result. GPU benchmarks use hardware event counters (rocprofv3 on AMD, nvprof on NVIDIA) rather than wall-clock timing where possible.
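The wall-clock variant of this protocol can be sketched in a few lines of Python (the GPU paths use hardware counters instead, as noted above; `benchmark` and its parameters here are illustrative, not the actual harness):

```python
import statistics
import time

def benchmark(kernel, warmup=5, iters=20):
    """Run `kernel` a few times to warm caches and JIT paths,
    then time `iters` measured runs and report the median."""
    for _ in range(warmup):
        kernel()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - t0)
    # The median is robust to scheduler noise and thermal spikes,
    # unlike the mean or best-of-N.
    return statistics.median(samples)
```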
All probe kernels are open source. The atlas database files used for these measurements contain the raw data points. Results will vary by hardware, driver version, and system load.
Important: Percentage improvements are measured against unoptimized defaults on the same hardware. Larger improvements (e.g., atomics, WMMA) reflect cases where default parameters are far from optimal for the specific operation. Smaller improvements (e.g., overlap, reduction) reflect operations where defaults are already reasonable. Your results will depend on your workloads, hardware, and baseline configuration.
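Concretely, every percentage on this page is computed as relative gain over the same-hardware baseline:

```python
def improvement_pct(baseline, optimized):
    """Percentage improvement of `optimized` over `baseline`
    (same units, higher is better)."""
    return (optimized / baseline - 1.0) * 100.0

# A kernel that doubles its throughput is a +100% improvement:
improvement_pct(100.0, 200.0)  # -> 100.0
```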
| AMD RX 7900 XTX | |
|---|---|
| Architecture | RDNA 3 (gfx1100) |
| Compute Units | 96 (48 dual-CU WGPs) |
| VRAM | 24 GB GDDR6 |
| Peak BW | 960 GB/s |
| Peak FP32 | 61 TFLOPS |
| Driver | ROCm 7.2.0 |
| Data Points | 81,305 |
| NVIDIA Quadro RTX 5000 | |
|---|---|
| Architecture | Turing (sm_75) |
| CUDA Cores | 3,072 |
| VRAM | 16 GB GDDR6 |
| Peak BW | 448 GB/s |
| Peak FP32 | 11.2 TFLOPS |
| Driver | CUDA 13.0 |
| Data Points | 266 |
| Intel Xeon 4108 (CPU) | |
|---|---|
| Cores / Threads | 8C / 16T |
| Frequency | 1.8–3.0 GHz |
| RAM | 96 GB DDR4 |
| NUMA Nodes | 1 |
| Data Points | 410 |
| AMD Ryzen 9 9950X | |
|---|---|
| Cores / Threads | 16C / 32T (Zen 5) |
| RAM | 124 GB DDR5 |
| Accelerator | Xilinx FPGA |
| iGPU | RDNA 3.5 (2 GB) |
| Data Points | 285 |
| Intel E5-2640v2 (2-socket) | |
|---|---|
| Cores / Threads | 16C / 32T (2-socket) |
| RAM | 220 GB DDR3 |
| Platform | Dell PowerEdge R620 |
| Role | FPGA synthesis host |
| Data Points | 283 |
| NVIDIA GTX 1050 | |
|---|---|
| Architecture | Pascal (sm_61) |
| CUDA Cores | 640 |
| VRAM | 2 GB GDDR5 |
| Peak BW | 75 GB/s |
| Peak FP32 | 1.86 TFLOPS |
| Driver | CUDA 12.6 |
| Data Points | 451 |
Total atlas: 83,174 data points · 95,520 invariants · 29 kernel types · 8 hardware sources
Before/after comparison across 11 kernel types. "Before" uses default parameters (block_size=256, no ILP, stride=1). "After" uses atlas-discovered optimal parameters.
| Kernel Type | Before (Default) | After (Optimized) | Improvement | Key Optimization |
|---|---|---|---|---|
| Atomic Operations | 2.9 GOPS | 431 GOPS | +5,100% | Uncontended partitioning, optimal block size |
| WMMA (Tensor Ops) | ~1.8 TFLOPS | 88.8 TFLOPS | +4,776% | 6 WMMA chains, k_tiles=34 |
| Tiled GEMM | ~0.3 TFLOPS | ~9 TFLOPS | +2,831% | Tile size + shared memory padding |
| Wave Scheduling | ~0.5 TFLOPS | ~11 TFLOPS | +2,106% | Wavefront occupancy balancing |
| Strided Access | ~55 GB/s | ~884 GB/s | +1,506% | stride=1, coalesced access |
| Memory Bandwidth | ~148 GB/s | 867 GB/s | +484% | Vectorized loads, block=128 |
| LDS (Shared Memory) | ~237 GB/s | 878 GB/s | +271% | stride=22 (avoids bank conflicts) |
| FP32 Compute | ~16 TFLOPS | 50 TFLOPS | +208% | 7 FMA chains, 2.33x ILP speedup |
| FP16 Compute | ~28 TFLOPS | 57.9 TFLOPS | +102% | 11 half2 chains (less ILP-sensitive) |
| Reduction | ~450 GB/s | ~780 GB/s | +73% | Block=512, warp-level reduction |
| Overlap (Roofline) | ~595 GB/s | 781 GB/s | +31% | compute_intensity=2 saturation point |
Note on large improvements: Atomic (+5,100%) and WMMA (+4,776%) show the largest gains because their default configurations are particularly suboptimal. Atomics default to fully contended access patterns; WMMA defaults don't utilize instruction-level parallelism. These represent the upper bound of what optimization can achieve. Workloads that are already memory-bandwidth-bound (reduction, overlap) show more modest but still significant gains.
Oracle-verified measurements against vendor library implementations. These validate that our optimization framework produces correct results at near-peak performance.
| Matrix Size | GFLOPS | % of Peak |
|---|---|---|
| 128 × 128 | 152 | 1.4% |
| 1024 × 1024 | 7,836 | 70.0% |
| 4096 × 4096 | 10,821 | 96.6% |
| Batched (64×1024) | 9,652 | 86.2% |
| LLaMA FFN shape | 10,386 | 92.7% |
Peak: 11.2 TFLOPS FP32. 31/31 correctness tests pass (100%).
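The "% of Peak" column is simply measured throughput divided by the 11.2 TFLOPS device peak; the 4096 × 4096 row reproduces as:

```python
def pct_of_peak(measured_gflops, peak_tflops):
    """Measured throughput as a percentage of device peak."""
    return 100.0 * measured_gflops / (peak_tflops * 1000.0)

round(pct_of_peak(10_821, 11.2), 1)  # -> 96.6
round(pct_of_peak(7_836, 11.2), 1)   # -> 70.0
```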
| Operation | GFLOPS / GB/s | % of Peak |
|---|---|---|
| Conv2d (ResNet res4a) | 14,391 GFLOPS | 128%* |
| Conv2d (ResNet res3a) | 12,899 GFLOPS | 115%* |
| Conv2d (ResNet conv1) | 4,934 GFLOPS | 44% |
| Linear (LLaMA up) | 9,826 GFLOPS | 87.7% |
| BatchNorm | 290 GB/s | 64.8% |
| Softmax | 354 GB/s | 79.0% |
*Exceeds theoretical FP32 peak via Tensor Core acceleration. 44/48 tests pass; 2 are disabled due to a MIOpen JIT bug and 2 are excluded for flaky timing.
Even a 7-year-old 2GB GPU benefits from atlas-driven optimization. Measured with CUDA 12.6 (sm_61).
| Metric | Measured | % of Peak | Key Finding |
|---|---|---|---|
| Memory Bandwidth | 89 GB/s | 119%* | Cache-line hits exceed DRAM peak; block=256 optimal |
| FP32 Compute | 581 GFLOPS | 31% | 15 ILP chains at block=32 |
| GEMM (cuBLAS) | 1,772 GFLOPS | 95% | 2048×2048 optimal for 2GB VRAM |
| Shared Memory | 311 GB/s | — | stride=15 best; 79% of configs hit bank conflict cliffs |
| Occupancy | 2,000 GFLOPS | — | block=1024 with 205 blocks; steep cliff at low block counts |
*Bandwidth exceeds DRAM peak (75 GB/s) because L2 cache serves repeated accesses. 451 data points, 209 invariants, 70 performance cliffs detected.
Mesh Optimizer doesn't just tune GPUs. System-level optimizations target scheduler, hugepages, NUMA, and memory hierarchy.
Generated per-node based on hardware inventory and JEPA confidence data (e.g., the CPU governor is switched to performance from the default powersave).

| Metric | Value |
|---|---|
| Infinity Cache (7900 XTX) | 2,562 GB/s at <96MB working set (2.7× DRAM) |
| IC cliff point | 96 MB (beyond this, falls to DRAM speeds) |
| LDS optimal stride | 22 (avoids 32-bank conflicts) |
| LDS bank conflict penalty | 60% throughput loss at stride=32 |
| PCIe throughput | 13–14 GB/s (43–45% of PCIe 4.0 x16) |
| Launch latency | 2.6 µs per kernel dispatch |
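The bank-conflict rows can be compared against the textbook model, sketched below: with one 32-bit word per bank per cycle, a strided access serializes by a factor of gcd(stride, banks). Note that this naive model predicts odd strides are conflict-free and stride=22 is 2-way conflicted, while the atlas measures 22 as optimal on RDNA 3 — real LDS hardware deviates from the simple model, which is precisely why these parameters are probed empirically rather than derived analytically.

```python
from math import gcd

def conflict_degree(stride, banks=32, lanes=32):
    """Serialization factor for `lanes` threads reading word
    addresses i*stride from an LDS with `banks` banks, under
    the simple one-word-per-bank-per-cycle model."""
    distinct = banks // gcd(stride, banks)  # banks actually touched
    return -(-lanes // distinct)            # lanes queued per bank

conflict_degree(1)   # -> 1  (fully parallel)
conflict_degree(32)  # -> 32 (all lanes hit one bank)
conflict_degree(22)  # -> 2  (model; hardware measures 22 as best)
```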
3.15× geometric mean speedup across all 11 kernel types on the RX 7900 XTX. Individual results vary significantly by workload type.
Interpreting the 3.15x average: This is the geometric mean across all 11 GPU kernel types. It is heavily influenced by kernels with large improvements (atomics, WMMA). For real-world applications that primarily use GEMM, convolution, and memory-bound operations, expect improvements in the 1.5–3x range depending on how far your current configuration is from optimal. Workloads that are already well-tuned will see smaller gains.
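The averaging method matters here: the geometric mean is the appropriate average for multiplicative speedups, since the arithmetic mean overstates them. A minimal sketch with illustrative numbers (not the benchmark data above):

```python
from math import prod

def geomean(speedups):
    """Geometric mean: the nth root of the product of n speedups."""
    return prod(speedups) ** (1.0 / len(speedups))

# One 4x win, one 2x win, one unchanged kernel:
geomean([4.0, 2.0, 1.0])  # -> 2.0 (arithmetic mean would say 2.33)
```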
The Joint-Embedding Predictive Architecture model that drives continuous optimization.
| Property | Value |
|---|---|
| Parameters | 249K |
| Input dimensions | 28 (workload features) |
| Latent space | 256D |
| Output | 6D (performance prediction) |
| Training loss | 0.136 → 0.020 (60 epochs) |
| Inference latency | <1 ms (CPU) |
| Online learning | <1 ms per feedback call |
| Hardware Source | Data Points |
|---|---|
| AMD RX 7900 XTX | 81,305 |
| NVIDIA GTX 1050 | 451 |
| Intel Xeon 4108 (CPU) | 410 |
| AMD Ryzen 9 9950X | 285 |
| Intel E5-2640v2 (2-socket) | 283 |
| NVIDIA Quadro RTX 5000 | 266 |
| DDR4 Memory | 208 |
| Total | 83,174 |
Model continues learning from live workload observations via online feedback. Sample count grows as nodes run real workloads.
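One minimal way such online correction can work — purely illustrative, and not the actual JEPA update rule or Mesh Optimizer API — is to track an exponential moving average of prediction error per kernel type and fold it back into future predictions:

```python
class OnlineCorrector:
    """Illustrative sketch: per-kernel EMA bias correction
    applied on top of a base performance predictor."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.bias = {}  # kernel type -> EMA of (observed - predicted)

    def feedback(self, kernel, predicted, observed):
        """Fold one live observation into the running bias."""
        err = observed - predicted
        prev = self.bias.get(kernel, 0.0)
        self.bias[kernel] = (1 - self.alpha) * prev + self.alpha * err

    def correct(self, kernel, predicted):
        """Adjust a raw prediction by the learned bias."""
        return predicted + self.bias.get(kernel, 0.0)
```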
Exact versions used for all benchmarks on this page.
- OS: Pop!_OS 24.04 LTS, kernel 6.18.7
- ROCm 7.2.0 (rocprofv3, MIOpen, rocBLAS)
- CUDA 13.0 (RTX 5000); CUDA 12.6 (GTX 1050, sm_61)
- PyTorch 2.10.0+cu128
- Compilers: GCC 13, hipcc (ROCm), nvcc (CUDA)
- Profilers: rocprofv3 (AMD), nvprof / Nsight (NVIDIA)
Deploy Mesh Optimizer and let it discover your hardware's actual performance profile. Free tier includes one-time probing and behavior atlas.