Zing Forum


PROJECT SAINATH: Implementing a Large Language Model Hardware Accelerator from Scratch Using Verilog

An RTL-level AI hardware accelerator designed from scratch by a VLSI enthusiast, implementing Transformer core computations on FPGA with a systolic array architecture to provide a high-performance solution for edge inference of large models.

Tags: Hardware Accelerator · Systolic Array · Verilog · Transformer · Large Language Model · FPGA · RTL Design · Edge AI · VLSI
Published 2026-04-28 18:10 · Recent activity 2026-04-28 18:19 · Estimated read: 7 min

Section 01

PROJECT SAINATH Project Introduction

PROJECT SAINATH is an RTL-level AI hardware accelerator designed from scratch by a VLSI enthusiast. It implements Transformer core computations on FPGA using a systolic array architecture, aiming to provide a high-performance solution for edge inference of large models. The project is fully designed with Verilog RTL without relying on off-the-shelf IP cores, serving both as a technical implementation case and a learning resource for understanding the working principles of AI accelerators.


Section 02

Project Background and Motivation

With the rapid development of Large Language Models (LLMs), efficient inference on edge devices has become an industry focus. Traditional CPUs and GPUs struggle to meet edge AI requirements for power consumption and latency. A VLSI enthusiast therefore initiated PROJECT SAINATH, aiming to build a hardware accelerator dedicated to Transformer inference. The project takes a 'from scratch' RTL design approach with no off-the-shelf IP cores, both demonstrating hardware design skill and providing a learning case.


Section 03

Core Architecture: Systolic Array Design

The core of PROJECT SAINATH is a 2D systolic array architecture inspired by the Google TPU. In this dataflow-driven structure, data moves through the array like a wave: each processing element computes on the operands it receives and passes them on to its neighbors. In the design, matrix A flows from left to right, matrix B flows from top to bottom, and partial sums are accumulated in place, avoiding frequent memory accesses and improving efficiency. For the Q×K^T matrix multiplication in Transformer self-attention, the systolic array completes multiple multiply-accumulate operations per cycle, one in every active PE.
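As a concrete illustration, here is how a 2×2 matrix multiplication C = A·B would flow through such an array (rows of A fed from the left, columns of B from the top, each row i and column j delayed by i or j cycles; PEij denotes the accumulator of the PE in row i, column j). This schedule is the standard output-stationary scheme for skewed inputs, not a trace taken from the project's waveforms:

```
cycle 0: PE00 += a00*b00
cycle 1: PE00 += a01*b10            (C00 done)
         PE01 += a00*b01
         PE10 += a10*b00
cycle 2: PE01 += a01*b11            (C01 done)
         PE10 += a11*b10            (C10 done)
         PE11 += a10*b01
cycle 3: PE11 += a11*b11            (C11 done)
```

For an N×N array this pattern finishes in 3N−2 cycles, with every PE busy during the middle of the wavefront.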


Section 04

Key Modules and Implementation Details

Processing Unit (PE/mac)

Each PE contains a custom multiply-accumulate (MAC) unit that performs single-cycle accumulation and synchronous data forwarding, keeping the data flow uninterrupted.
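A minimal sketch of such a PE, assuming 8-bit operands and a 32-bit accumulator; the module name, port names, and widths here are illustrative, as the actual SAINATH RTL is not reproduced in this post:

```verilog
// Hypothetical processing element: single-cycle MAC plus registered
// pass-through so operands keep flowing to neighboring PEs.
module pe #(
    parameter DATA_W = 8,
    parameter ACC_W  = 32
)(
    input  wire              clk,
    input  wire              rst_n,
    input  wire              en,      // valid operands this cycle
    input  wire [DATA_W-1:0] a_in,    // operand arriving from the left
    input  wire [DATA_W-1:0] b_in,    // operand arriving from above
    output reg  [DATA_W-1:0] a_out,   // forwarded to the PE on the right
    output reg  [DATA_W-1:0] b_out,   // forwarded to the PE below
    output reg  [ACC_W-1:0]  acc      // in-place partial sum
);
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            a_out <= 0; b_out <= 0; acc <= 0;
        end else if (en) begin
            acc   <= acc + a_in * b_in;  // single-cycle accumulate
            a_out <= a_in;               // synchronous forwarding keeps
            b_out <= b_in;               // the wavefront moving
        end
    end
endmodule
```

Registering the forwarded operands (rather than passing them combinationally) is what gives the array its one-hop-per-cycle wave behavior.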

2x2 Systolic Array Engine

Currently, a fully functional 2x2 array is implemented, where four PEs work collaboratively through an interconnection structure to achieve conflict-free concurrent matrix multiplication. Pipeline design verification has been completed, laying the foundation for subsequent expansion.
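The interconnection can be pictured as follows, assuming a PE module with registered pass-through ports (operand A left-to-right, operand B top-to-bottom). All names and parameter widths are assumptions for illustration, not the project's actual hierarchy:

```verilog
// Hypothetical 2x2 systolic array: four PEs chained horizontally for A
// and vertically for B, each holding one output element in place.
module systolic_2x2 #(
    parameter DATA_W = 8,
    parameter ACC_W  = 32
)(
    input  wire              clk, rst_n, en,
    input  wire [DATA_W-1:0] a_row0, a_row1,   // skewed rows of A
    input  wire [DATA_W-1:0] b_col0, b_col1,   // skewed columns of B
    output wire [ACC_W-1:0]  c00, c01, c10, c11
);
    // Inter-PE links: A moves right, B moves down
    wire [DATA_W-1:0] a00_01, a10_11, b00_10, b01_11;

    pe #(.DATA_W(DATA_W), .ACC_W(ACC_W)) pe00 (.clk(clk), .rst_n(rst_n), .en(en),
        .a_in(a_row0), .b_in(b_col0), .a_out(a00_01), .b_out(b00_10), .acc(c00));
    pe #(.DATA_W(DATA_W), .ACC_W(ACC_W)) pe01 (.clk(clk), .rst_n(rst_n), .en(en),
        .a_in(a00_01), .b_in(b_col1), .a_out(), .b_out(b01_11), .acc(c01));
    pe #(.DATA_W(DATA_W), .ACC_W(ACC_W)) pe10 (.clk(clk), .rst_n(rst_n), .en(en),
        .a_in(a_row1), .b_in(b00_10), .a_out(a10_11), .b_out(), .acc(c10));
    pe #(.DATA_W(DATA_W), .ACC_W(ACC_W)) pe11 (.clk(clk), .rst_n(rst_n), .en(en),
        .a_in(a10_11), .b_in(b01_11), .a_out(), .b_out(), .acc(c11));
endmodule
```

Because each result stays in its own PE's accumulator, the four products-in-flight never contend for a shared write port, which is the "conflict-free" property mentioned above.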

Dataflow Controller (valid_fsm)

A finite state machine (FSM) manages memory reads, data skewing, and cycle-precise data feeding, ensuring each operand arrives at the correct position at the right time.
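A minimal sketch of what such a controller might look like for a 2×2 array, assuming the job is to enable the array for the skewed feed window and then let the last wavefront drain. The state names, counter widths, and cycle counts are assumptions, not the project's actual `valid_fsm`:

```verilog
// Hypothetical feed-schedule FSM: IDLE -> FEED (drive skewed operands)
// -> DRAIN (let the final wavefront reach the last PE) -> IDLE.
module valid_fsm (
    input  wire       clk, rst_n, start,
    output reg        en,      // enables the systolic array
    output reg [2:0]  cycle    // position in the feed schedule
);
    localparam IDLE = 2'd0, FEED = 2'd1, DRAIN = 2'd2;
    reg [1:0] state;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            state <= IDLE; cycle <= 0; en <= 1'b0;
        end else case (state)
            IDLE: if (start) begin
                state <= FEED; cycle <= 0; en <= 1'b1;
            end
            FEED: begin
                cycle <= cycle + 1;
                // 2x2 with skew: last new operand enters on cycle 2
                if (cycle == 3'd2) state <= DRAIN;
            end
            DRAIN: begin
                cycle <= cycle + 1;
                // one extra hop for the wave to reach PE11
                if (cycle == 3'd3) begin en <= 1'b0; state <= IDLE; end
            end
            default: state <= IDLE;
        endcase
    end
endmodule
```

The `cycle` counter is what an outer mux would use to select which (possibly zero-padded) operand each row and column receives, implementing the data skew.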


Section 05

Verification Environment and Technical Toolchain

The project includes a cycle-accurate testbench, using Icarus Verilog (iverilog) for simulation and GTKWave for waveform analysis to verify both computational correctness and timing. Technology stack: Verilog (RTL) as the design language, Icarus Verilog for simulation, GTKWave for waveform analysis, FPGA as the target platform, and Domain-Specific Architecture (DSA) as the design paradigm.
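A testbench skeleton for this toolchain might look like the following, compiled with `iverilog -o sim tb.v` and inspected with `gtkwave dump.vcd`. The DUT instantiation and expected values are placeholders, since the project's actual testbench is not shown here:

```verilog
`timescale 1ns/1ps
// Hypothetical cycle-accurate testbench skeleton for iverilog + GTKWave.
module tb;
    reg clk = 1'b0, rst_n = 1'b0;

    always #5 clk = ~clk;             // 10 ns period clock

    // systolic_2x2 dut (.clk(clk), .rst_n(rst_n), /* ... */);

    initial begin
        $dumpfile("dump.vcd");        // waveform file for GTKWave
        $dumpvars(0, tb);             // dump everything under tb
        #20 rst_n = 1'b1;             // release reset after two cycles
        #200;                         // run long enough for the array to drain
        // if (dut.c00 !== expected_c00) $display("FAIL"); else $display("PASS");
        $finish;
    end
endmodule
```

Comparing the dumped waveforms against the cycle schedule of the array is what makes the verification "cycle-accurate" rather than just end-result checking.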


Section 06

Practical Significance and Application Prospects

Although the array is currently only 2x2, the project demonstrates a complete mapping path from algorithm to hardware. For edge AI applications, custom accelerators offer clear advantages: low latency (no instruction-decode overhead of general-purpose processors), high energy efficiency (fewer power-hungry memory accesses), determinism (predictable timing), and scalability (systolic arrays naturally scale up). Larger arrays could in the future handle real Transformer models, providing a feasible path for edge LLM inference.


Section 07

Learning Value and Project Summary

Learning Value

Provides learning resources for AI hardware developers, demonstrating algorithm-to-hardware mapping, systolic array design details, RTL design flow and verification methods, and hardware-software co-design thinking.

Summary

PROJECT SAINATH is an important attempt by the open-source hardware community in the AI accelerator field, serving both as a technical project and an educational platform. As AI models grow, such domain-specific architectures will become more important in edge computing and other fields, worthy of continuous attention and learning. The project's motto 'No IP cores. No shortcuts.' embodies the craftsmanship spirit of hardware engineers.