LPUs: Modern Computing Architectures Accelerating AI

Specialised processor architectures for AI inference

Zach Wolpe
2 min read · Dec 3, 2024

Ever wonder how LLM inference tools like Groq are so fast? — I sure do…

This paper explains the computer architecture of HyperAccel’s latency processing unit (LPU) — one of the modern LLM-optimised compute engines — and it’s AWESOME.

The LPU is self-described as “a latency-optimized and highly scalable architecture that accelerates large language model inference for GenAI”.

In short, there are a few inefficiencies when using existing GPUs for LLM inference:

  1. The flow of the computational graph & memory bandwidth limits: Existing GPUs are designed for massively parallel matrix operations, but the generative stage of LLM inference is sequential, requiring repeated computation over a single vector. This can leave cores in standard GPUs underutilised (see the sketches after this list).
  2. Synchronisation across multiple LPUs/GPUs: LLMs are getting so large that fast synchronisation across GPUs is essential. Nvidia’s GPUs do offer high-speed interconnects via NVLink, but the authors highlight “the synchronization overhead in tensor parallelism is significant because computation is stalled during communication.”
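To make point 1 concrete, here is a purely illustrative NumPy sketch of the two phases of inference (none of this is the paper’s code; the shapes, sizes, and function names are hypothetical toy values). Prefill processes all prompt tokens in one matrix-matrix product, which keeps parallel cores busy, while decode generates one token at a time with matrix-vector products that re-read the full weight matrix for comparatively little arithmetic.

```python
# Purely illustrative NumPy sketch (not from the paper): the two phases of
# LLM inference. Shapes and names are hypothetical toy values.
import numpy as np

d_model, vocab, steps = 1024, 8000, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, d_model)).astype(np.float32)    # stand-in for a layer's weights
W_out = rng.standard_normal((d_model, vocab)).astype(np.float32)  # stand-in output projection

def prefill(prompt_embeddings):
    # Prompt processing: one matrix-matrix product over all prompt tokens.
    # Lots of independent work, so parallel GPU cores stay busy.
    return prompt_embeddings @ W

def decode_step(hidden_vector):
    # Generation: a matrix-VECTOR product per token. The whole of W is read
    # from memory to do comparatively little arithmetic, so this stage is
    # limited by memory bandwidth rather than raw FLOPs.
    return hidden_vector @ W

prompt = rng.standard_normal((16, d_model)).astype(np.float32)
hidden = prefill(prompt)[-1]          # last prompt position seeds generation

# The autoregressive loop: each step depends on the previous token, so the
# steps cannot be parallelised across the sequence dimension.
generated = []
for _ in range(steps):
    hidden = decode_step(hidden)
    generated.append(int(np.argmax(hidden @ W_out)))
```

Point 2 can be sketched in a similar spirit: in tensor parallelism, each device holds a slice of the weights, computes a partial result, and the partials must be combined (an all-reduce) before the next layer can start. The single-process simulation below only mimics that structure; in a real multi-GPU setup the computation stalls while that communication happens, which is the overhead the authors call out.

```python
# Purely illustrative, single-process mimic of tensor parallelism (not the
# paper's method): the weight matrix is split across "devices", each computes
# a partial output, and the partials are summed before the next layer can run.
# In a real cluster this combine step is an all-reduce over the interconnect,
# and compute stalls while it completes.
import numpy as np

d_model, n_devices = 1024, 4
rng = np.random.default_rng(1)
W = rng.standard_normal((d_model, d_model)).astype(np.float32)
x = rng.standard_normal(d_model).astype(np.float32)

shards = np.split(W, n_devices, axis=0)   # one row-block of W per "device"
x_slices = np.split(x, n_devices)         # matching slice of the input

partials = [x_i @ W_i for x_i, W_i in zip(x_slices, shards)]  # local compute
y = np.sum(partials, axis=0)              # the "all-reduce": sum partial outputs

assert np.allclose(y, x @ W, rtol=1e-3, atol=1e-3)  # matches the unsharded result
```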

Benefits of this Architecture:

  • More efficient memory use, designed to match the sequential computational graph of LLMs.
  • Better scalability across LPUs using their proprietary “Expandable Synchronization Link” (ESL).
  • A custom software layer (HyperDex) that makes the hardware accessible to developers.

All of this results in faster inference with lower power consumption.

The paper benchmarks their chip against Nvidia’s H100, the current state of the art.

Worth a read…

Written by Zach Wolpe

Machine Learning Engineer. Writing for fun.
