LPUs: Modern Computing Architectures Accelerating AI
Specialised computer processing architectures for AI inference
2 min read · Dec 3, 2024
Ever wonder how LLM inference tools like Groq are so fast? — I sure do…
This paper explains the computer architecture of HyperAccel’s latency processing unit (LPU) — one of the modern LLM-optimised compute engines — and it’s AWESOME.
Self-described as “a latency-optimized and highly scalable architecture that accelerates large language model inference for GenAI.”
In short, there are a few inefficiencies when using existing GPUs for LLM inference:
- The shape of the computational graph & memory bandwidth limits: Existing GPUs are designed for highly parallel matrix-matrix operations, but the generative (decode) stage of LLM inference is sequential, requiring repeated computation over a single vector. This can leave cores in standard GPUs underutilised (see the first sketch after this list).
- Synchronisation across multiple LPUs/GPUs: LLMs are now so large that fast synchronisation across devices is essential. Nvidia’s GPUs do offer high-speed interconnects via NVLink, but the authors highlight that “the synchronization overhead in tensor parallelism is significant because computation is stalled during communication” (see the second sketch after this list).
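To make the first point concrete, here’s a rough back-of-the-envelope sketch. The numbers are my own illustrative assumptions (a 7B FP16 model and approximate H100 specs), not figures from the paper: during decode, every generated token has to stream essentially all of the model’s weights from HBM for what amounts to a matrix-vector pass, so memory time dwarfs compute time.

```python
# Back-of-the-envelope sketch of why token-by-token decode is memory-bound on a GPU.
# Illustrative numbers only (assumed, not taken from the paper):
# a 7B-parameter model in FP16 and rough H100 SXM specs.

params = 7e9                  # model parameters
bytes_per_param = 2           # FP16 weights
weight_bytes = params * bytes_per_param    # ~14 GB streamed from HBM per token

flops_per_token = 2 * params  # one multiply-add per weight for a single-vector pass

hbm_bandwidth = 3.35e12       # ~3.35 TB/s HBM bandwidth (approximate H100 SXM)
peak_fp16_flops = 990e12      # ~990 TFLOPS dense FP16 (approximate H100 SXM)

time_memory = weight_bytes / hbm_bandwidth        # time just to read the weights once
time_compute = flops_per_token / peak_fp16_flops  # time if compute were the limit

print(f"memory-bound time per token : {time_memory * 1e3:.2f} ms")
print(f"compute-bound time per token: {time_compute * 1e3:.4f} ms")
# Memory time is ~300x larger with these numbers: the matrix cores idle while
# weights stream in, which is the underutilisation described above.
```

With these assumptions, streaming the weights takes a few milliseconds per token while the actual arithmetic would take a few hundredths of a millisecond, so the decode stage is bandwidth-bound no matter how many FLOPS the chip advertises.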
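And for the second point, a similarly rough estimate of the per-token cost of the all-reduces that tensor parallelism performs at every layer during decode. Again, the model size, parallel degree, and link latency below are assumptions for illustration rather than measurements from the paper.

```python
# Rough estimate of the per-token synchronisation cost of tensor parallelism
# during decode. All numbers below are assumptions for illustration, not
# measurements from the paper.

hidden = 4096          # hidden size of an assumed ~7B model
tp = 4                 # tensor-parallel degree (number of GPUs)
layers = 32            # transformer layers
bytes_per_elem = 2     # FP16 activations

msg_bytes = hidden * bytes_per_elem    # one token's activation vector per all-reduce
link_latency = 5e-6                    # assumed ~5 us per all-reduce step

# A ring all-reduce takes 2*(tp-1) steps; with a single-token message the payload
# is tiny, so per-step latency dominates rather than bandwidth.
allreduce_time = 2 * (tp - 1) * link_latency

# Megatron-style tensor parallelism needs two all-reduces per layer in the
# forward pass (one after attention, one after the MLP).
sync_per_token = layers * 2 * allreduce_time

print(f"all-reduce message size : {msg_bytes / 1024:.0f} KB")
print(f"sync time per token     : {sync_per_token * 1e6:.0f} us")
# Roughly 2 ms of pure communication per generated token with these assumptions,
# during which the compute units are stalled.
```

With these assumed numbers that’s on the order of 2 ms of pure communication per token, which is exactly the kind of stall the quote above is describing.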
Benefits of this Architecture:
- More efficient memory use, designed to match the sequential computational graph of LLMs.
- Better scalability across LPUs using their proprietary “Expandable Synchronization Link” (ESL).
- A custom software layer (HyperDex) making this accessible to developers.
All resulting in faster inference time with lower power consumption.
Their chip is benchmarked against Nvidia’s H100, the current state-of-the-art GPU.
Worth a read…