This repository contains a hardware accelerator for General Matrix Multiply (GEMM). The accelerator is written in SystemC and is synthesized to Verilog RTL using the Stratus HLS tool from Cadence. The repository can be added as a submodule to an ESP repository and simulated without any changes.
- We accelerate matrix multiplication by overlapping compute and memory operations, and by exploiting parallelism within the compute phase through HLS directives. The accelerator's operation is described in the following sections.
- For a 64x64 matrix multiplication, our optimized accelerator achieves speedups of 23x over a naive GEMM accelerator and 430x over a 64-bit RISC-V Ariane CPU.
- With respect to the area overhead, our optimized accelerator occupies only 1/6th the area of a 64-bit RISC-V Ariane CPU.
- Our results have been validated using SystemC behavioural simulations, accelerator-level RTL simulations, full ESP system RTL simulations, and on a Xilinx VCU118 FPGA board.
- Clone the ESP repository with all its submodules:
  ```sh
  git clone git@github.com:sld-columbia/esp.git --recursive
  ```
- Add our GEMM accelerator as a submodule:
  ```sh
  git submodule add git@github.com:vsuresh95/gemm_accelerator_stratus.git accelerators/stratus_hls/gemm_accelerator_stratus
  ```
- Set the required tool paths (Stratus HLS, Xcelium simulator, Vivado, etc.) and change directory to the target SoC folder:
  ```sh
  cd $ESP_ROOT/socs/<target_SoC_folder>
  ```
- Compile, synthesize and simulate:
  ```sh
  make gemm_accelerator_stratus-sim
  ```
This will run a behavioural simulation, followed by HLS and then an RTL simulation.
Code reference: `/hw/src/gemm_accelerator.cpp`
The accelerator is divided into three phases: a load phase (reads the input matrices from memory), a compute phase (performs the GEMM), and a store phase (writes the output back to memory). All three phases are implemented as parallel processes, with handshakes defined for communication between them.
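A minimal structural sketch of this organization is shown below. It is illustrative only: the module, process, and signal names are assumptions, not the identifiers in `/hw/src/gemm_accelerator.cpp`, and the real accelerator uses ESP's handshake primitives rather than the plain flags used here.

```cpp
#include <systemc.h>

// Sketch of the three-process structure. Plain boolean flags stand in
// for ESP's request/acknowledge handshakes (assumed names throughout).
SC_MODULE(gemm_sketch) {
    sc_in<bool> clk;
    sc_in<bool> rst;

    sc_signal<bool> load_done;     // load    -> compute handshake
    sc_signal<bool> compute_done;  // compute -> store   handshake

    SC_CTOR(gemm_sketch) {
        SC_CTHREAD(load_phase, clk.pos());
        reset_signal_is(rst, false);
        SC_CTHREAD(compute_phase, clk.pos());
        reset_signal_is(rst, false);
        SC_CTHREAD(store_phase, clk.pos());
        reset_signal_is(rst, false);
    }

    void load_phase() {
        load_done.write(false);
        wait();
        while (true) {
            // DMA-read the next pair of 64x64 tiles into the input PLM,
            // then tell the compute process its data is ready.
            load_done.write(true);
            wait();
        }
    }

    void compute_phase() {
        compute_done.write(false);
        wait();
        while (true) {
            while (!load_done.read()) wait();  // wait for input tiles
            // ... run the 64x64 GEMM kernel on the loaded tiles ...
            compute_done.write(true);
            wait();
        }
    }

    void store_phase() {
        wait();
        while (true) {
            while (!compute_done.read()) wait();  // wait for results
            // ... DMA-write the output PLM back to memory ...
            wait();
        }
    }
};
```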
- Matrices are read from memory through the DMA interface that ESP provides by default. The DMA fetches data into private local memories (PLMs) inside the accelerator. The input PLM is configured to store 8192 integers, and the output PLM is configured to store 4096 integers.
- For matrices larger than 64x64, we employ a blocking algorithm that computes a partial GEMM for each 64x64 block of the matrix.
- To ensure that the compute phase never waits for input data to be fetched or for output data to be drained, the PLMs are further duplicated and used in a ping-pong manner (see the load-phase sketch after this list).
- The compute phase performs a 64x64 GEMM by further blocking the data in the input PLM into 16x16 chunks (see the kernel sketch after this list).
- We use HLS directives to leverage the inherent parallelism within the 64x64 GEMM:
- We use loop unrolling to parallelize multiplies, accumulates and PLM accesses.
- We sample the PLM arrays into a flattened register array before performing the arithmetic operations.
- We pipeline all the operations across different iterations to maximize resource utilization.
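The sketch below illustrates the ping-pong load phase described above. It is a hypothetical rendering under stated assumptions: `dma_read_block()` and `sync_with_compute()` are placeholders for ESP's DMA and handshake primitives, not the real API, and the PLM sizing follows the text (two 64x64 input tiles per half, duplicated for ping-pong).

```cpp
#include <cstdint>

constexpr int DIM  = 64;         // one tile is 64x64
constexpr int TILE = DIM * DIM;  // 4096 words per tile

// {ping, pong} halves, each holding an A tile and a (transposed) B tile.
static int32_t plm_in[2][2 * TILE];

// Placeholders for ESP's DMA read and inter-process handshake.
void dma_read_block(int32_t *dst, int tile_index);
void sync_with_compute(bool half);

void load_phase(int num_tile_pairs) {
    bool pong = false;
    for (int t = 0; t < num_tile_pairs; t++) {
        // Fetch the next A tile and transposed-B tile into one half
        // while the compute phase consumes the other half.
        dma_read_block(&plm_in[pong][0],    t);  // A tile
        dma_read_block(&plm_in[pong][TILE], t);  // B tile (transposed)
        sync_with_compute(pong);  // hand this half to compute
        pong = !pong;             // alternate halves
    }
}
```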
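The following kernel sketch shows the compute phase's 16x16 chunking and register sampling in plain C++. The comments mark where the unroll and pipeline directives would be applied in Stratus HLS; the exact directive macros and the function name `compute_block` are assumptions, not copied from the source.

```cpp
#include <cstdint>

constexpr int DIM   = 64;  // tile dimension
constexpr int CHUNK = 16;  // inner blocking factor

// C = A * B, with B stored transposed (Bt[j][k] == B[k][j]).
void compute_block(const int32_t A[DIM * DIM],
                   const int32_t Bt[DIM * DIM],
                   int32_t C[DIM * DIM]) {
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            int32_t acc = 0;
            // In HLS this loop would carry a pipeline directive.
            for (int kk = 0; kk < DIM; kk += CHUNK) {
                int32_t a_reg[CHUNK], b_reg[CHUNK];
                // Sample the PLM words into flattened registers first,
                // so the unrolled arithmetic below is not serialized
                // on the PLM ports. (Unrolled copy loop in HLS.)
                for (int k = 0; k < CHUNK; k++) {
                    a_reg[k] = A[i * DIM + kk + k];
                    b_reg[k] = Bt[j * DIM + kk + k];
                }
                // Unrolled multiply-accumulate in HLS.
                for (int k = 0; k < CHUNK; k++)
                    acc += a_reg[k] * b_reg[k];
            }
            C[i * DIM + j] = acc;
        }
    }
}
```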
The image below shows the system used to evaluate the accelerator. The CPU, memory and auxiliary tiles are generated using the ESP SoC Generator.
| Mode of execution | Time for 64x64 GEMM (cycles) | Speedup |
|---|---|---|
| 64-bit RISC-V Ariane CPU | 15299496 | 1 |
| Naive GEMM | 813753 | 18.8 |
| Optimized GEMM | 35545 | 430.43 |
| Design | CLB LUTs | CLB Registers |
|---|---|---|
| 64-bit RISC-V Ariane CPU | 40814 | 28283 |
| Naive GEMM | 2886 | 2393 |
| Optimized GEMM | 6761 | 3371 |
- The accelerator only accepts matrices whose dimensions are multiples of 64.
- The current implementation requires the second matrix in the multiplication to be transposed in memory. However, the algorithm can easily be modified to accept non-transposed matrices: the load phase must fetch 64x64 blocks in column-major rather than row-major order, as sketched below.
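The sketch below contrasts the two access patterns for fetching a 64x64 tile of B. It is illustrative only: `dma_read()` is a placeholder for ESP's DMA read, and `ld` denotes B's row length (leading dimension) in memory, both assumed names.

```cpp
#include <cstdint>

constexpr int DIM = 64;

// Placeholder for ESP's DMA read: copy `len` words from memory
// offset `src` into `dst`.
void dma_read(int32_t *dst, uint64_t src, int len);

// Current behaviour: B is transposed in memory, so each 64-word row
// of the tile is contiguous and fetched with a single burst.
void load_b_transposed(int32_t plm_b[DIM * DIM], uint64_t base, int ld) {
    for (int r = 0; r < DIM; r++)
        dma_read(&plm_b[r * DIM], base + r * ld, DIM);
}

// Proposed change: with a non-transposed B, the same tile is gathered
// column by column. Consecutive elements of a column are `ld` words
// apart, so the bursts shrink to single words, which is why the
// transposed layout was chosen in the first place.
void load_b_plain(int32_t plm_b[DIM * DIM], uint64_t base, int ld) {
    for (int c = 0; c < DIM; c++)
        for (int r = 0; r < DIM; r++)
            dma_read(&plm_b[c * DIM + r], base + r * ld + c, 1);
}
```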
- Embedded Scalable Platforms (ESP): An open-source research platform for heterogeneous SoC design