Speakers:
Memory Wall for AI
Date:
Wednesday, May 6, 2026
Time:
9:35 am
Summary:
Modern generative AI systems, from LLMs to multimodal models, are no longer compute-bound; they are memory-bound. As model sizes soar, inference latency is dominated by memory bandwidth, memory fragmentation, KV-cache bloat, checkpoint restore time, and PCIe/NVLink bottlenecks. This session breaks down the “Memory Wall” limiting generative model performance and shares practical techniques such as model compression, quantization, memory-efficient attention, sharding, and cold-start optimization, offering actionable insights for practitioners building large-scale generative AI infrastructure.
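To see why KV-cache bloat dominates long-context inference, a back-of-the-envelope sizing calculation helps. The sketch below uses a hypothetical 7B-class transformer shape (32 layers, 32 KV heads, head dimension 128); these numbers are illustrative assumptions, not a specific model from the talk, and the function name is invented for this example.

```python
# Back-of-the-envelope KV-cache sizing, illustrating why long contexts
# make inference memory-bound. The model shape is a hypothetical
# 7B-class transformer, chosen only for illustration.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold the keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# fp16 cache (2 bytes/element) at a 4K context:
fp16 = kv_cache_bytes(32, 32, 128, seq_len=4096)
# int8 KV-cache quantization (1 byte/element) halves it:
int8 = kv_cache_bytes(32, 32, 128, seq_len=4096, bytes_per_elem=1)

print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # prints 2.0 GiB
print(f"int8 KV cache: {int8 / 2**30:.1f} GiB")  # prints 1.0 GiB
```

The cache grows linearly with both sequence length and batch size, so at a 32K context the same model would need 16 GiB of KV cache per sequence in fp16, which is why quantization and memory-efficient attention matter at scale.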