
Instrumented research, published honestly

Crucible is Inference Foundry’s open journal for work that pairs mathematical clarity with real hardware: profiling methodology, environment specs, and reproducible procedures—no filler benchmarks.


Evidence first

Claims tie back to commands, hardware, and artifacts—not slides.

Reproducible context

CUDA, driver, and toolkit versions stated wherever measurements appear.

Open repository

Static site, no trackers. Source and template live on GitHub.


In progress

Teardown WIP 2024-08

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Shah et al., “FlashAttention-3,” arXiv:2407.08608, 2024.

Kernel-level deconstruction of the warp-specialized producer/consumer pipeline in CUTLASS 3.x on Hopper SM90a: WGMMA instruction throughput analysis, TMA async-copy pipeline overlap, and HBM access-pattern profiling under causal masking. Preliminary: ~92% of peak H100 BF16 FLOP/s at N ≥ 4096 (the utilization arithmetic is sketched after this entry).

H100 SXM5 · BF16 · CUDA 12.4
LLM Hardware-Level Math
Profiling in progress — article not yet published
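
For context on the utilization figure above, a minimal sketch of the arithmetic, assuming the standard 4·B·H·N²·d forward-attention FLOP count and the ~989 TFLOP/s dense BF16 peak of the H100 SXM5. The shapes and runtime below are illustrative, not measurements from this teardown.

```python
# Utilization arithmetic for an attention benchmark (sketch; the runtime
# below is illustrative, not a result from this teardown).
H100_BF16_PEAK_FLOPS = 989e12  # dense BF16 peak, H100 SXM5 (no sparsity)

def attention_flops(batch, heads, seqlen, head_dim, causal=False):
    """Standard 4*B*H*N^2*d count for fused attention forward
    (QK^T and PV matmuls, 2 FLOPs per multiply-add); causal halves it."""
    flops = 4 * batch * heads * seqlen**2 * head_dim
    return flops // 2 if causal else flops

def utilization(flops, runtime_s, peak=H100_BF16_PEAK_FLOPS):
    return flops / runtime_s / peak

f = attention_flops(batch=4, heads=32, seqlen=4096, head_dim=128)
print(f"{utilization(f, runtime_s=1.25e-3):.1%} of peak")  # ~88.9%
```
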
Teardown WIP 2024-10

I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Assran et al., “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture,” CVPR 2023. arXiv:2301.08243.

Latent-prediction architecture teardown. Mathematical analysis of the context/target block-sampling strategy and its effect on representation collapse; gradient-flow analysis of the EMA target encoder (update rule sketched after this entry). Profiling of the ViT backbone's patch-embedding GEMM vs. attention compute ratio is queued for A100 80GB.

A100 80GB · BF16 · CUDA 12.1 (queued)
Vision Math
Math formalization complete — hardware profiling queued
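
For readers unfamiliar with the EMA target encoder named above, a minimal sketch of the exponential-moving-average weight update, assuming a torch.nn.Module pair; the momentum value and the Linear stand-in are illustrative (the paper ramps momentum toward 1.0 over training).

```python
import copy
import torch

@torch.no_grad()
def ema_update(target, context, momentum: float):
    """Exponential moving average of context-encoder weights into the
    target encoder; no gradients ever flow through the target branch."""
    for t, c in zip(target.parameters(), context.parameters()):
        t.mul_(momentum).add_(c, alpha=1.0 - momentum)

context_encoder = torch.nn.Linear(16, 16)        # stand-in for the ViT
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)                      # EMA-only, never trained

ema_update(target_encoder, context_encoder, momentum=0.996)
```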

Queued

Teardown Queued Priority 1

Mamba-2: Transformers are SSMs — State Space Duality and CUDA Kernel Efficiency

Dao & Gu, “Transformers are SSMs,” ICML 2024. arXiv:2405.21060.

State-space-model CUDA kernel efficiency and selective-scan throughput analysis (a sequential reference scan is sketched after this entry). The state space duality (SSD) formulation and its implications for parallelism across the sequence dimension. Target: A100 80GB.

A100 80GB · pending
LLM SSM Hardware-Level
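
As a baseline for the throughput question above, a deliberately sequential reference implementation of the selective scan; shapes and the single-channel simplification are illustrative. The SSD formulation exists precisely to replace this O(T) loop with work that parallelizes across the sequence dimension.

```python
import torch

def selective_scan_ref(A, B, C, x):
    """Sequential reference for the selective SSM recurrence
        h_t = A_t * h_{t-1} + B_t * x_t,   y_t = <C_t, h_t>.
    A, x: (T,) per-step decay and input (single channel for brevity);
    B, C: (T, N) projections into/out of an N-dimensional state."""
    T, N = B.shape
    h = x.new_zeros(N)
    ys = []
    for t in range(T):
        h = A[t] * h + B[t] * x[t]
        ys.append(torch.dot(C[t], h))
    return torch.stack(ys)

y = selective_scan_ref(torch.rand(64), torch.randn(64, 16),
                       torch.randn(64, 16), torch.randn(64))
```
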
Teardown Queued Priority 2

DeepSeek-V3: Multi-head Latent Attention and KV-Cache Compression Analysis

DeepSeek-AI, “DeepSeek-V3 Technical Report,” arXiv:2412.19437, 2024.

MLA KV-cache compression ratio vs. attention-quality degradation. Low-rank projection dimension sweep. Decode throughput vs. cache footprint (back-of-envelope footprint arithmetic after this entry). Target: H100 NVL.

H100 NVL · pending
LLM Hardware-Level Math
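
Back-of-envelope arithmetic for the compression ratio under study, with illustrative dimensions (not DeepSeek-V3's exact configuration) and ignoring the decoupled RoPE key dimension that MLA also caches, so the true ratio is somewhat lower.

```python
def kv_bytes_mha(layers, seqlen, n_heads, head_dim, dtype_bytes=2):
    """Full K and V cached per head per layer (BF16 -> 2 bytes)."""
    return 2 * layers * seqlen * n_heads * head_dim * dtype_bytes

def kv_bytes_mla(layers, seqlen, d_latent, dtype_bytes=2):
    """MLA caches one compressed latent per token per layer."""
    return layers * seqlen * d_latent * dtype_bytes

# Illustrative configuration:
mha = kv_bytes_mha(layers=61, seqlen=32768, n_heads=128, head_dim=128)
mla = kv_bytes_mla(layers=61, seqlen=32768, d_latent=512)
print(f"{mha / 2**30:.0f} GiB vs {mla / 2**30:.2f} GiB, ~{mha / mla:.0f}x")
```
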
Benchmark Queued Priority 3

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Cai et al., “Medusa,” arXiv:2401.10774, 2024.

Draft-head acceptance rate under temperature variation. Memory overhead of the extra heads vs. decode speedup (a simplified acceptance model is sketched after this entry). Tree-attention overhead. Target: A10G.

A10G · pending
LLM Hardware-Level
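
The speedup side of this benchmark reduces to expected accepted tokens per decode step. A simplified model, assuming independent per-head acceptance with stop-at-first-rejection; real tree attention scores multiple candidate continuations, so this is only a rough guide.

```python
def expected_tokens_per_step(p_accept: float, n_heads: int) -> float:
    """1 base token plus sum_{k=1..n} p^k speculative tokens, under the
    simplifying assumption of i.i.d. per-head acceptance probability p."""
    return 1.0 + sum(p_accept**k for k in range(1, n_heads + 1))

# Example: 4 heads at 60% per-head acceptance -> ~2.31 tokens/step,
# an upper bound on decode speedup before tree-attention overhead.
print(f"{expected_tokens_per_step(0.6, 4):.2f} tokens/step")
```
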
Analysis Queued Priority 4

RoPE: Precision Sensitivity in BF16 vs. FP32 at Long Context (128K+ tokens)

Su et al., “RoFormer,” Neurocomputing, 2024. arXiv:2104.09864.

Numerical precision of rotary embeddings at extended context. Because the rotation angle grows linearly with position index, BF16's ~8-bit significand discards phase that FP32 retains; BF16 vs. FP32 angle error beyond 32K positions (demonstration sketched after this entry).

CPU baseline + CUDA · pending
LLM Math
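
A minimal demonstration of the effect to be measured: form the rotary angles pos · base^(-2i/d) in FP32 and in BF16 and compare after the sine. The head dimension and base are common defaults, not tied to any particular model.

```python
import torch

def rope_angle_error(pos: int, dim: int = 128, base: float = 10000.0):
    """Max |sin| discrepancy when the angle pos * base^(-2i/d) is formed
    in BF16 instead of FP32. The angle error scales with pos, and the
    sin discrepancy saturates at O(1) once whole periods of phase are
    lost for the highest-frequency components."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    inv_freq = base ** (-2.0 * i / dim)
    ang_fp32 = pos * inv_freq
    ang_bf16 = (pos * inv_freq.to(torch.bfloat16)).float()
    return (torch.sin(ang_fp32) - torch.sin(ang_bf16)).abs().max().item()

for pos in (16, 1024, 32768, 131072):
    print(pos, f"{rope_angle_error(pos):.3e}")
```
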
Benchmark Queued Priority 5

GQA / MQA: KV-Cache VRAM Footprint and Decode Throughput per Byte of KV Memory

Ainslie et al., “GQA,” EMNLP 2023. arXiv:2305.13245.

Direct measurement of KV-cache footprint under GQA vs. MHA vs. MQA (the footprint itself is closed-form; see the sketch after this entry). Decode throughput per MiB of KV memory. Target: L40S.

L40S · pending
LLM Hardware-Level
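
The footprint half of this measurement is closed-form. A sketch with an illustrative 32-layer, 32-query-head configuration in BF16 (2 bytes per element):

```python
def kv_bytes_per_token(layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K plus V cached per generated token."""
    return 2 * layers * n_kv_heads * head_dim * dtype_bytes

# Illustrative 32-layer model, 32 query heads, head_dim 128:
for name, n_kv in (("MHA", 32), ("GQA-8", 8), ("MQA", 1)):
    kib = kv_bytes_per_token(layers=32, n_kv_heads=n_kv, head_dim=128) / 1024
    print(f"{name}: {kib:.0f} KiB/token")   # 512 / 128 / 16
```
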
Teardown Queued Priority 6

MoE Router Analysis: Expert Load Imbalance and All-to-All Communication Overhead

Shazeer et al., “Outrageously Large Neural Networks,” ICLR 2017. arXiv:1701.06538.

Expert load imbalance under top-1 vs. top-2 routing. All-to-all communication overhead on 8× H100 SXM5 over NVLink. Auxiliary-loss sweep (a reference load-balancing loss is sketched after this entry).

8× H100 SXM5 NVLink · pending
LLM MoE Hardware-Level
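
For reference in the auxiliary-loss sweep, a sketch of the Switch-Transformer-style load-balancing loss E·Σᵢ fᵢPᵢ, used here as a stand-in; Shazeer et al.'s original formulation uses a separate importance/load pair.

```python
import torch

def load_balance_loss(router_logits: torch.Tensor, top_k: int = 1):
    """Switch-style auxiliary loss E * sum_i f_i * P_i, where f_i is the
    fraction of routing slots assigned to expert i and P_i is the mean
    router probability for expert i. router_logits: (tokens, n_experts).
    Minimized when routing is uniform across experts."""
    n_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits.float(), dim=-1)    # (T, E)
    top = probs.topk(top_k, dim=-1).indices                 # (T, k)
    mask = torch.zeros_like(probs).scatter_(-1, top, 1.0)   # hard assignments
    f = mask.sum(dim=0) / (mask.shape[0] * top_k)           # slot fractions
    p = probs.mean(dim=0)
    return n_experts * torch.sum(f * p)

loss = load_balance_loss(torch.randn(1024, 8), top_k=2)
```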