Every number on this page was measured on physical hardware in our lab. No simulations, no projections, no cherry-picked runs. Full methodology and hardware specs are disclosed below so you can reproduce these results.
We believe performance claims require context. Here's exactly how we measured.
Default compiler flags, default OS scheduler settings, default block sizes. No manual tuning. This represents what most users get out of the box.
Parameters selected by Mesh Optimizer's behavior atlas after profiling 83,000+ configurations. The optimizer discovers optimal block sizes, ILP chains, tile dimensions, and memory access patterns per kernel type.
Each benchmark runs 20+ iterations after warmup. We report the median result. GPU benchmarks use hardware event counters (rocprofv3 on AMD, nvprof on NVIDIA) rather than wall-clock timing where possible.
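The wall-clock variant of this protocol can be sketched in a few lines of Python (the GPU paths use hardware counters instead, as noted above; `benchmark` and its parameters here are illustrative, not the actual harness):

```python
import statistics
import time

def benchmark(kernel, warmup=5, iters=20):
    """Run `kernel` a few times to warm caches and JIT paths,
    then time `iters` measured runs and report the median."""
    for _ in range(warmup):
        kernel()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - t0)
    # The median is robust to scheduler noise and thermal spikes,
    # unlike the mean or best-of-N.
    return statistics.median(samples)
```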
All probe kernels are open source. The atlas database files used for these measurements contain the raw data points. Results will vary by hardware, driver version, and system load.
Important: Percentage improvements are measured against unoptimized defaults on the same hardware. Larger improvements (e.g., atomics, WMMA) reflect cases where default parameters are far from optimal for the specific operation. Smaller improvements (e.g., overlap, reduction) reflect operations where defaults are already reasonable. Your results will depend on your workloads, hardware, and baseline configuration.
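Concretely, every percentage on this page is computed as relative gain over the same-hardware baseline:

```python
def improvement_pct(baseline, optimized):
    """Percentage improvement of `optimized` over `baseline`
    (same units, higher is better)."""
    return (optimized / baseline - 1.0) * 100.0

# A kernel that doubles its throughput is a +100% improvement:
improvement_pct(100.0, 200.0)  # -> 100.0
```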
| AMD RX 7900 XTX | |
|---|---|
| Architecture | RDNA 3 (gfx1100) |
| Compute Units | 96 (48 dual-CU WGPs) |
| VRAM | 24 GB GDDR6 |
| Peak BW | 960 GB/s |
| Peak FP32 | 61 TFLOPS |
| Driver | ROCm 7.2.0 |
| Data Points | 81,305 |
| NVIDIA Quadro RTX 5000 | |
|---|---|
| Architecture | Turing (sm_75) |
| CUDA Cores | 3,072 |
| VRAM | 16 GB GDDR6 |
| Peak BW | 448 GB/s |
| Peak FP32 | 11.2 TFLOPS |
| Driver | CUDA 13.0 |
| Data Points | 266 |
| Intel Xeon 4108 (CPU) | |
|---|---|
| Cores / Threads | 8C / 16T |
| Frequency | 1.8–3.0 GHz |
| RAM | 96 GB DDR4 |
| NUMA Nodes | 1 |
| Data Points | 410 |
| AMD Ryzen 9 9950X | |
|---|---|
| Cores / Threads | 16C / 32T (Zen 5) |
| RAM | 124 GB DDR5 |
| Accelerator | Xilinx FPGA |
| iGPU | RDNA 3.5 (2 GB) |
| Data Points | 285 |
| Intel E5-2640v2 (2-socket) | |
|---|---|
| Cores / Threads | 16C / 32T (2-socket) |
| RAM | 220 GB DDR3 |
| Platform | Dell PowerEdge R620 |
| Role | FPGA synthesis host |
| Data Points | 283 |
| NVIDIA GTX 1050 | |
|---|---|
| Architecture | Pascal (sm_61) |
| CUDA Cores | 640 |
| VRAM | 2 GB GDDR5 |
| Peak BW | 75 GB/s |
| Peak FP32 | 1.86 TFLOPS |
| Driver | CUDA 12.6 |
| Data Points | 451 |
Total atlas: 83,174 data points · 95,520 invariants · 29 kernel types · 8 hardware sources
Before/after comparison across 11 kernel types. "Before" uses default parameters (block_size=256, no ILP, stride=1). "After" uses atlas-discovered optimal parameters.
| Kernel Type | Before (Default) | After (Optimized) | Improvement | Key Optimization |
|---|---|---|---|---|
| Atomic Operations | 2.9 GOPS | 431 GOPS | +5,100% | Uncontended partitioning, optimal block size |
| WMMA (Tensor Ops) | ~1.8 TFLOPS | 88.8 TFLOPS | +4,776% | 6 WMMA chains, k_tiles=34 |
| Tiled GEMM | ~0.3 TFLOPS | ~9 TFLOPS | +2,831% | Tile size + shared memory padding |
| Wave Scheduling | ~0.5 TFLOPS | ~11 TFLOPS | +2,106% | Wavefront occupancy balancing |
| Strided Access | ~55 GB/s | ~884 GB/s | +1,506% | stride=1, coalesced access |
| Memory Bandwidth | ~148 GB/s | 867 GB/s | +484% | Vectorized loads, block=128 |
| LDS (Shared Memory) | ~237 GB/s | 878 GB/s | +271% | stride=22 (avoids bank conflicts) |
| FP32 Compute | ~16 TFLOPS | 50 TFLOPS | +208% | 7 FMA chains, 2.33x ILP speedup |
| FP16 Compute | ~28 TFLOPS | 57.9 TFLOPS | +102% | 11 half2 chains (less ILP-sensitive) |
| Reduction | ~450 GB/s | ~780 GB/s | +73% | Block=512, warp-level reduction |
| Overlap (Roofline) | ~595 GB/s | 781 GB/s | +31% | compute_intensity=2 saturation point |
Note on large improvements: Atomic (+5,100%) and WMMA (+4,776%) show the largest gains because their default configurations are particularly suboptimal. Atomics default to fully contended access patterns; WMMA defaults don't utilize instruction-level parallelism. These represent the upper bound of what optimization can achieve. Workloads that are already memory-bandwidth-bound (reduction, overlap) show more modest but still significant gains.
Oracle-verified measurements against vendor library implementations. These validate that our optimization framework produces correct results at near-peak performance.
| Matrix Size | GFLOPS | % of Peak |
|---|---|---|
| 128 × 128 | 152 | 1.4% |
| 1024 × 1024 | 7,836 | 70.0% |
| 4096 × 4096 | 10,821 | 96.6% |
| Batched (64×1024) | 9,652 | 86.2% |
| LLaMA FFN shape | 10,386 | 92.7% |
Peak: 11.2 TFLOPS FP32. 31/31 correctness tests pass (100%).
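The "% of Peak" column is simply measured throughput divided by the 11.2 TFLOPS device peak; the 4096 × 4096 row reproduces as:

```python
def pct_of_peak(measured_gflops, peak_tflops):
    """Measured throughput as a percentage of device peak."""
    return 100.0 * measured_gflops / (peak_tflops * 1000.0)

round(pct_of_peak(10_821, 11.2), 1)  # -> 96.6
round(pct_of_peak(7_836, 11.2), 1)   # -> 70.0
```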
| Operation | GFLOPS / GB/s | % of Peak |
|---|---|---|
| Conv2d (ResNet res4a) | 14,391 GFLOPS | 128%* |
| Conv2d (ResNet res3a) | 12,899 GFLOPS | 115%* |
| Conv2d (ResNet conv1) | 4,934 GFLOPS | 44% |
| Linear (LLaMA up) | 9,826 GFLOPS | 87.7% |
| BatchNorm | 290 GB/s | 64.8% |
| Softmax | 354 GB/s | 79.0% |
*Exceeds theoretical FP32 peak via Tensor Core acceleration. 44/48 tests pass; 2 are disabled due to a MIOpen JIT bug and 2 are excluded for flaky timing.
Even a 7-year-old 2GB GPU benefits from atlas-driven optimization. Measured with CUDA 12.6 (sm_61).
| Metric | Measured | % of Peak | Key Finding |
|---|---|---|---|
| Memory Bandwidth | 89 GB/s | 119%* | Cache-line hits exceed DRAM peak; block=256 optimal |
| FP32 Compute | 581 GFLOPS | 31% | 15 ILP chains at block=32 |
| GEMM (cuBLAS) | 1,772 GFLOPS | 95% | 2048×2048 optimal for 2GB VRAM |
| Shared Memory | 311 GB/s | — | stride=15 best; 79% of configs hit bank conflict cliffs |
| Occupancy | 2,000 GFLOPS | — | block=1024 with 205 blocks; steep cliff at low block counts |
*Bandwidth exceeds DRAM peak (75 GB/s) because L2 cache serves repeated accesses. 451 data points, 209 invariants, 70 performance cliffs detected.
Mesh Optimizer doesn't just tune GPUs. System-level optimizations target scheduler, hugepages, NUMA, and memory hierarchy.
Generated per-node based on hardware inventory and JEPA confidence data (e.g., the CPU governor is switched to performance from the default powersave).

| Metric | Value |
|---|---|
| Infinity Cache (7900 XTX) | 2,562 GB/s at <96MB working set (2.7× DRAM) |
| IC cliff point | 96 MB (beyond this, falls to DRAM speeds) |
| LDS optimal stride | 22 (avoids 32-bank conflicts) |
| LDS bank conflict penalty | 60% throughput loss at stride=32 |
| PCIe throughput | 13–14 GB/s (43–45% of PCIe 4.0 x16) |
| Launch latency | 2.6 µs per kernel dispatch |
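The bank-conflict rows can be compared against the textbook model, sketched below: with one 32-bit word per bank per cycle, a strided access serializes by a factor of gcd(stride, banks). Note that this naive model predicts odd strides are conflict-free and stride=22 is 2-way conflicted, while the atlas measures 22 as optimal on RDNA 3 — real LDS hardware deviates from the simple model, which is precisely why these parameters are probed empirically rather than derived analytically.

```python
from math import gcd

def conflict_degree(stride, banks=32, lanes=32):
    """Serialization factor for `lanes` threads reading word
    addresses i*stride from an LDS with `banks` banks, under
    the simple one-word-per-bank-per-cycle model."""
    distinct = banks // gcd(stride, banks)  # banks actually touched
    return -(-lanes // distinct)            # lanes queued per bank

conflict_degree(1)   # -> 1  (fully parallel)
conflict_degree(32)  # -> 32 (all lanes hit one bank)
conflict_degree(22)  # -> 2  (model; hardware measures 22 as best)
```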
3.15× geometric mean speedup across all 11 kernel types on the RX 7900 XTX. Individual results vary significantly by workload type.
Interpreting the 3.15x average: This is the geometric mean across all 11 GPU kernel types. It is heavily influenced by kernels with large improvements (atomics, WMMA). For real-world applications that primarily use GEMM, convolution, and memory-bound operations, expect improvements in the 1.5–3x range depending on how far your current configuration is from optimal. Workloads that are already well-tuned will see smaller gains.
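The averaging method matters here: the geometric mean is the appropriate average for multiplicative speedups, since the arithmetic mean overstates them. A minimal sketch with illustrative numbers (not the benchmark data above):

```python
from math import prod

def geomean(speedups):
    """Geometric mean: the nth root of the product of n speedups."""
    return prod(speedups) ** (1.0 / len(speedups))

# One 4x win, one 2x win, one unchanged kernel:
geomean([4.0, 2.0, 1.0])  # -> 2.0 (arithmetic mean would say 2.33)
```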
The Joint-Embedding Predictive Architecture model that drives continuous optimization.
| Property | Value |
|---|---|
| Parameters | 249K |
| Input dimensions | 28 (workload features) |
| Latent space | 256D |
| Output | 6D (performance prediction) |
| Training loss | 0.136 → 0.020 (60 epochs) |
| Inference latency | <1 ms (CPU) |
| Online learning | <1 ms per feedback call |
| Hardware Source | Data Points |
|---|---|
| AMD RX 7900 XTX | 81,305 |
| NVIDIA GTX 1050 | 451 |
| Intel Xeon 4108 (CPU) | 410 |
| AMD Ryzen 9 9950X | 285 |
| Intel E5-2640v2 (2-socket) | 283 |
| NVIDIA Quadro RTX 5000 | 266 |
| DDR4 Memory | 208 |
| Total | 83,174 |
Model continues learning from live workload observations via online feedback. Sample count grows as nodes run real workloads.
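One minimal way such online correction can work — purely illustrative, and not the actual JEPA update rule or Mesh Optimizer API — is to track an exponential moving average of prediction error per kernel type and fold it back into future predictions:

```python
class OnlineCorrector:
    """Illustrative sketch: per-kernel EMA bias correction
    applied on top of a base performance predictor."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.bias = {}  # kernel type -> EMA of (observed - predicted)

    def feedback(self, kernel, predicted, observed):
        """Fold one live observation into the running bias."""
        err = observed - predicted
        prev = self.bias.get(kernel, 0.0)
        self.bias[kernel] = (1 - self.alpha) * prev + self.alpha * err

    def correct(self, kernel, predicted):
        """Adjust a raw prediction by the learned bias."""
        return predicted + self.bias.get(kernel, 0.0)
```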
Exact versions used for all benchmarks on this page.
- OS: Pop!_OS 24.04 LTS, kernel 6.18.7
- ROCm 7.2.0 (rocprofv3, MIOpen, rocBLAS)
- CUDA 13.0 (RTX 5000); CUDA 12.6 (GTX 1050, sm_61)
- PyTorch 2.10.0+cu128
- Compilers: GCC 13, hipcc (ROCm), nvcc (CUDA)
- Profilers: rocprofv3 (AMD), nvprof / Nsight (NVIDIA)
Deploy Mesh Optimizer and let it discover your hardware's actual performance profile. Free tier includes one-time probing and behavior atlas.