Speakers:
Memory Wall for AI
Date:
Wednesday, May 6, 2026
Time:
9:35 am
Summary:
Modern generative AI systems, from LLMs to multimodal models, are no longer compute-bound; they are memory-bound. As model sizes soar, inference latency is dominated by memory bandwidth, memory fragmentation, KV-cache bloat, checkpoint restore time, and PCIe/NVLink bottlenecks. This session breaks down the “Memory Wall” limiting generative model performance and shares practical techniques such as model compression, quantization, memory-efficient attention, sharding, and cold-start optimization, offering actionable insights for practitioners building large-scale generative AI infrastructure.
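To see why KV-cache bloat dominates long-context inference, a back-of-the-envelope sizing calculation helps. The sketch below uses a hypothetical 7B-class transformer shape (32 layers, 32 KV heads, head dimension 128); these numbers are illustrative assumptions, not a specific model from the talk, and the function name is invented for this example.

```python
# Back-of-the-envelope KV-cache sizing, illustrating why long contexts
# make inference memory-bound. The model shape is a hypothetical
# 7B-class transformer, chosen only for illustration.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold the keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# fp16 cache (2 bytes/element) at a 4K context:
fp16 = kv_cache_bytes(32, 32, 128, seq_len=4096)
# int8 KV-cache quantization (1 byte/element) halves it:
int8 = kv_cache_bytes(32, 32, 128, seq_len=4096, bytes_per_elem=1)

print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # prints 2.0 GiB
print(f"int8 KV cache: {int8 / 2**30:.1f} GiB")  # prints 1.0 GiB
```

The cache grows linearly with both sequence length and batch size, so at a 32K context the same model would need 16 GiB of KV cache per sequence in fp16, which is why quantization and memory-efficient attention matter at scale.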