LPUs: Modern Computing Architectures Accelerating AI
Specialised computer processing architectures for AI inference
2 min read · Dec 3, 2024
Ever wonder how LLM inference tools like Groq are so fast? — I sure do…
This paper explains the computer architecture of HyperAccel’s latency processing unit (LPU) — one of the modern LLM-optimised compute engines — and it’s AWESOME.
Self-described as “a latency-optimized and highly scalable architecture that accelerates large language model inference for GenAI.”
In short, there are a few inefficiencies when using existing GPUs for LLM inference:
- The shape of the computational graph & memory bandwidth limits: Existing GPUs are designed for highly parallel matrix-matrix operations, but the generative (decode) stage of LLM inference is sequential, requiring repeated computation over a single vector. This can leave cores in standard GPUs underutilised (see the first sketch after this list).
- Synchronisation across multiple LPUs/GPUs: LLMs are now so large that fast synchronisation across devices is essential. Nvidia’s GPUs do offer high-speed interconnects via NVLink, but the authors highlight that “the synchronization overhead in tensor parallelism is significant because computation is stalled during communication” (see the second sketch after this list).
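To make the first point concrete, here’s a rough back-of-the-envelope sketch. The numbers are my own illustrative assumptions (a 7B FP16 model and approximate H100 specs), not figures from the paper: during decode, every generated token has to stream essentially all of the model’s weights from HBM for what amounts to a matrix-vector pass, so memory time dwarfs compute time.

```python
# Back-of-the-envelope sketch of why token-by-token decode is memory-bound on a GPU.
# Illustrative numbers only (assumed, not taken from the paper):
# a 7B-parameter model in FP16 and rough H100 SXM specs.

params = 7e9                  # model parameters
bytes_per_param = 2           # FP16 weights
weight_bytes = params * bytes_per_param    # ~14 GB streamed from HBM per token

flops_per_token = 2 * params  # one multiply-add per weight for a single-vector pass

hbm_bandwidth = 3.35e12       # ~3.35 TB/s HBM bandwidth (approximate H100 SXM)
peak_fp16_flops = 990e12      # ~990 TFLOPS dense FP16 (approximate H100 SXM)

time_memory = weight_bytes / hbm_bandwidth        # time just to read the weights once
time_compute = flops_per_token / peak_fp16_flops  # time if compute were the limit

print(f"memory-bound time per token : {time_memory * 1e3:.2f} ms")
print(f"compute-bound time per token: {time_compute * 1e3:.4f} ms")
# Memory time is ~300x larger with these numbers: the matrix cores idle while
# weights stream in, which is the underutilisation described above.
```

With these assumptions, streaming the weights takes a few milliseconds per token while the actual arithmetic would take a few hundredths of a millisecond, so the decode stage is bandwidth-bound no matter how many FLOPS the chip advertises.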
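And for the second point, a similarly rough estimate of the per-token cost of the all-reduces that tensor parallelism performs at every layer during decode. Again, the model size, parallel degree, and link latency below are assumptions for illustration rather than measurements from the paper.

```python
# Rough estimate of the per-token synchronisation cost of tensor parallelism
# during decode. All numbers below are assumptions for illustration, not
# measurements from the paper.

hidden = 4096          # hidden size of an assumed ~7B model
tp = 4                 # tensor-parallel degree (number of GPUs)
layers = 32            # transformer layers
bytes_per_elem = 2     # FP16 activations

msg_bytes = hidden * bytes_per_elem    # one token's activation vector per all-reduce
link_latency = 5e-6                    # assumed ~5 us per all-reduce step

# A ring all-reduce takes 2*(tp-1) steps; with a single-token message the payload
# is tiny, so per-step latency dominates rather than bandwidth.
allreduce_time = 2 * (tp - 1) * link_latency

# Megatron-style tensor parallelism needs two all-reduces per layer in the
# forward pass (one after attention, one after the MLP).
sync_per_token = layers * 2 * allreduce_time

print(f"all-reduce message size : {msg_bytes / 1024:.0f} KB")
print(f"sync time per token     : {sync_per_token * 1e6:.0f} us")
# Roughly 2 ms of pure communication per generated token with these assumptions,
# during which the compute units are stalled.
```

With these assumed numbers that’s on the order of 2 ms of pure communication per token, which is exactly the kind of stall the quote above is describing.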
Benefits of this Architecture:
- More efficient memory use, designed to match the sequential computational graph of LLMs.
- Better scalability across LPUs using their proprietary “Expandable Synchronization Link” (ESL).
- A custom software layer (HyperDex) making this accessible to developers.
All resulting in faster inference time with lower power consumption.
Their chip is benchmarked against Nvidia’s H100, the current state-of-the-art GPU.
Worth a read…