Evidence first
Claims tie back to commands, hardware, and artifacts—not slides.
Crucible is Inference Foundry’s open journal for work that pairs mathematical clarity with real hardware: profiling methodology, environment specs, and reproducible procedures—no filler benchmarks.
CUDA, driver, and toolkit versions stated wherever measurements appear.
Static site, no trackers. Source and template live on GitHub.
Shah et al., “FlashAttention-3,” arXiv:2407.08608, 2024.
Kernel-level deconstruction of the warp-specialization and producer/consumer pipeline in CUTLASS 3.x on Hopper SM90a: WGMMA instruction throughput analysis, TMA async-copy pipeline overlap, and HBM access-pattern profiling under causal masking. Preliminary: ~92% of theoretical H100 FLOP/s at N≥4096 (utilization arithmetic sketched below).
H100 SXM5 · BF16 · CUDA 12.4
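For orientation, the denominator behind that ~92% figure, as back-of-envelope arithmetic. A minimal sketch assuming the standard 4·N²·d matmul FLOPs per head for the QKᵀ and PV GEMMs and the 989 TFLOP/s dense BF16 peak of the H100 SXM5; the kernel time in the example is a placeholder, not a measurement.

```python
# Back-of-envelope FLOP/s utilization for a fused attention kernel on
# H100 SXM5. Assumptions, not measurements: 989 TFLOP/s dense BF16 peak,
# and 4 * N^2 * d matmul FLOPs per head (QK^T plus PV; halved if causal).

H100_BF16_PEAK = 989e12  # dense tensor-core peak, FLOP/s

def attn_flops(batch, heads, seqlen, head_dim, causal=False):
    """Matmul FLOPs for one forward pass of scaled dot-product attention."""
    flops = 4 * batch * heads * seqlen**2 * head_dim
    return flops // 2 if causal else flops

def utilization(measured_seconds, **shape):
    return attn_flops(**shape) / (measured_seconds * H100_BF16_PEAK)

# Placeholder example: batch 8, 32 heads, N = 4096, d = 128, non-causal,
# with a made-up kernel time of 2.42 ms (not a measurement).
shape = dict(batch=8, heads=32, seqlen=4096, head_dim=128, causal=False)
print(f"{utilization(2.42e-3, **shape):.1%} of peak")  # ~91.9%
```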
Assran et al., “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture,” CVPR 2023. arXiv:2301.08243.
Teardown of the latent-prediction architecture. Mathematical analysis of the context/target block-sampling strategy and its effect on representation collapse, plus gradient-flow analysis of the EMA target encoder (update rule sketched below). Profiling of the ViT backbone's patch-embedding GEMM vs. attention compute ratio is queued for the A100 80GB.
A100 80GB · BF16 · CUDA 12.1 (queued)
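The gradient-flow question reduces to the fact that the target encoder receives no optimizer updates at all; it trails the context encoder through an exponential moving average. A minimal sketch of that update; the Linear stand-ins and the 0.996 momentum (the paper anneals momentum toward 1.0 over training) are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(target: nn.Module, context: nn.Module, momentum: float):
    """theta_target <- m * theta_target + (1 - m) * theta_context.
    The target encoder is never touched by the optimizer; all of its
    movement arrives through this update, which is why its 'gradient
    flow' is purely implicit."""
    for p_t, p_c in zip(target.parameters(), context.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

# Hypothetical stand-ins for the ViT context/target encoders.
context_enc = nn.Linear(768, 768)
target_enc = copy.deepcopy(context_enc)
for p in target_enc.parameters():
    p.requires_grad_(False)  # EMA only; no gradients ever reach it

ema_update(target_enc, context_enc, momentum=0.996)
```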
Dao & Gu, “Transformers are SSMs,” ICML 2024. arXiv:2405.21060.
State-space-model CUDA kernel efficiency and selective-scan throughput analysis. The SSD formulation and its implications for parallelism across the sequence dimension (sequential reference recurrence sketched below). Target: A100 80GB.
A100 80GB · pending
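For reference, the sequential recurrence that the selective-scan kernel must parallelize, written in the scalar-per-step-A form the SSD formulation exploits. Shapes and random inputs are illustrative; this is the correctness oracle, not the fast path.

```python
import torch

def ssd_reference(x, a, B, C):
    """Sequential oracle for the SSD recurrence
        h_t = a_t * h_{t-1} + B_t x_t^T,    y_t = C_t h_t
    with scalar a_t per step, state size N, model width D.
    The CUDA kernel's job is to produce the same y without the
    t-by-t dependency, via chunked matmuls along the sequence.
    Shapes: x (T, D), a (T,), B (T, N), C (T, N) -> y (T, D)."""
    T, D = x.shape
    N = B.shape[1]
    h = torch.zeros(N, D)
    y = torch.empty(T, D)
    for t in range(T):
        h = a[t] * h + torch.outer(B[t], x[t])  # rank-1 state update
        y[t] = C[t] @ h                          # read out D channels
    return y

T, D, N = 16, 8, 4  # illustrative sizes
y = ssd_reference(torch.randn(T, D), torch.rand(T),
                  torch.randn(T, N), torch.randn(T, N))
```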
DeepSeek-AI, “DeepSeek-V3 Technical Report,” arXiv:2412.19437, 2024.
MLA KV-cache compression ratio vs. attention-quality degradation. Low-rank projection dimension sweep. Decode throughput vs. cache footprint (footprint arithmetic sketched below). Target: H100 NVL.
H100 NVL · pending
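The footprint side of that trade is pure cache-width arithmetic: MHA stores full-width K and V per token, while MLA stores one compressed latent plus a narrow decoupled RoPE key. A sketch with dimensions loosely following the V3 report (d_c = 512, 64-wide RoPE key, 128 heads of d_h = 128, 61 layers); the sequence length is illustrative.

```python
def kv_bytes(seq_len, n_layers, width_per_token, dtype_bytes=2):
    """Cache footprint: one width_per_token vector per token per layer."""
    return seq_len * n_layers * width_per_token * dtype_bytes

# Assumed dimensions, loosely following the V3 report.
n_heads, d_h = 128, 128
mha_width = 2 * n_heads * d_h   # full-width K and V per token
mla_width = 512 + 64            # compressed latent + decoupled RoPE key

seq, layers = 32_768, 61        # illustrative decode context
mha = kv_bytes(seq, layers, mha_width)
mla = kv_bytes(seq, layers, mla_width)
print(f"MHA {mha / 2**30:.0f} GiB vs. MLA {mla / 2**30:.2f} GiB "
      f"({mha / mla:.0f}x smaller)")
```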
Cai et al., “Medusa,” arXiv:2401.10774, 2024.
Draft-head acceptance rate under temperature variation. Memory overhead of the extra heads vs. decode speedup. Tree-attention overhead (expected-speedup arithmetic sketched below). Target: A10G.
A10G · pending
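Why acceptance rate is the headline number: with K draft heads and per-position acceptance probability α (assumed independent here, which real acceptance trees are not), expected tokens per decode step is a short geometric sum, so speedup saturates quickly as α falls. A sketch:

```python
def tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per decode step with k draft heads,
    assuming each speculated position is accepted independently with
    probability alpha and the chain stops at the first rejection.
    Equals 1 + alpha + ... + alpha^k; the base token always lands."""
    return sum(alpha**i for i in range(k + 1))

for alpha in (0.9, 0.7, 0.5):
    row = "  ".join(f"k={k}: {tokens_per_step(alpha, k):.2f}" for k in (3, 5))
    print(f"alpha={alpha:.1f}  {row} tokens/step")
```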
Su et al., “RoFormer,” Neurocomputing, 2024. arXiv:2104.09864.
Numerical precision of rotary embeddings at extended context: BF16 vs. FP32 angle-accumulation error beyond 32K positions (minimal probe sketched below).
CPU baseline + CUDA · pending
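The probe itself runs on CPU: compute the angle table pos·θᵢ in FP32 and BF16 and compare. BF16 keeps only about 8 significant bits, so it cannot represent integer positions beyond 256 exactly, and the drift predates any cos/sin. A minimal sketch; base 10000 and d = 128 follow the RoFormer convention, and the 64K horizon is illustrative.

```python
import torch

def rope_angle_error(max_pos: int, dim: int = 128, base: float = 10_000.0):
    """Max absolute error of BF16 rotary angles vs. an FP32 reference.
    BF16 keeps ~8 significant bits, so integer positions above 256
    already round; pos * theta drifts before any cos/sin is applied."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    pos = torch.arange(max_pos, dtype=torch.float32)
    ref = pos[:, None] * inv_freq[None, :]                       # FP32
    lo = (pos.bfloat16()[:, None] * inv_freq.bfloat16()[None, :]).float()
    return (ref - lo).abs().max().item()

for n in (4_096, 32_768, 65_536):
    print(f"{n:>6} positions: max angle error {rope_angle_error(n):.2f} rad")
```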
Ainslie et al., “GQA,” EMNLP 2023. arXiv:2305.13245.
Direct measurement of KV-cache footprint under GQA vs. MHA vs. MQA, and decode throughput per MiB of KV memory (footprint arithmetic sketched below). Target: L40S.
L40S · pending
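The footprint half is arithmetic before it is measurement: KV bytes scale linearly in the number of KV heads, the only dimension that changes across MHA, GQA, and MQA. A sketch with Llama-2-70B-like dimensions (80 layers, 64 query heads, d_h = 128), which are assumptions for illustration:

```python
def kv_cache_mib(n_kv_heads, seq_len, n_layers=80, head_dim=128, dtype_bytes=2):
    """KV-cache footprint in MiB: K and V, one vector per token,
    per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes / 2**20

seq = 8_192  # illustrative context length
for name, kv_heads in (("MHA", 64), ("GQA-8", 8), ("MQA", 1)):
    print(f"{name:>6}: {kv_cache_mib(kv_heads, seq):>8.0f} MiB")
```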
Shazeer et al., “Outrageously Large Neural Networks,” ICLR 2017. arXiv:1701.06538.
Expert load imbalance under top-1 vs. top-2 routing. All-to-all overhead on 8× H100 NVLink. Auxiliary-loss sweep (router and load-balancing loss sketched below).
8× H100 SXM5 NVLink · pending
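A minimal router sketch for the sweep. One hedge up front: the load-balancing term below is the later Switch-Transformer form, L_aux = E · Σᵢ fᵢPᵢ, not the importance/CV² penalty of the 2017 paper; shapes and the random logits are illustrative.

```python
import torch
import torch.nn.functional as F

def route_topk(logits: torch.Tensor, k: int):
    """Top-k routing with a Switch-style load-balancing auxiliary loss
    L_aux = E * sum_i f_i * P_i, where f_i is the fraction of dispatch
    slots assigned to expert i and P_i the mean router probability.
    logits: (tokens, experts)."""
    n_tokens, n_experts = logits.shape
    probs = F.softmax(logits, dim=-1)
    weights, experts = probs.topk(k, dim=-1)   # (tokens, k) dispatch plan
    f = torch.bincount(experts.flatten(), minlength=n_experts).float()
    f /= experts.numel()
    aux = n_experts * (f * probs.mean(dim=0)).sum()
    return experts, weights, aux

logits = torch.randn(16_384, 8)                # illustrative shapes
for k in (1, 2):
    _, _, aux = route_topk(logits, k)
    print(f"top-{k}: load-balancing loss {aux.item():.3f}")
```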