The software programs described in this document are confidential and proprietary products of Synopsys Corp. or its licensors. The terms and conditions governing the sale and licensing of Synopsys products are set forth in written agreements between Synopsys Corp. and its customers. No representation or other affirmation of fact contained in this publication shall be deemed to be a warranty or give rise to any liability of Synopsys Corp. whatsoever. Images of software programs in use are assumed to be copyright and may not be reproduced.
This document is for informational and instructional purposes only. The ECE 411 teaching staff reserves the right to make changes in specifications and other information contained in this publication without prior notice, and the reader should, in all cases, consult the teaching staff to determine whether any changes have been made.
After MP2 you should have a working machine that implements the RV32I Instruction Set. Now, you will be augmenting this design with a simple one-level cache.
You will need to design and verify a one-level, unified, 4-way, set-associative cache with the following specifications:
- 16 sets with 4 ways per set
- Each way holds an 8-word (256 bit) cache line
- Write-back with a write allocate policy
- Pseudo-LRU replacement policy
- Read/Write hits must take exactly two clock cycles to complete
- Indexing scheme following Figure 2
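For reference, these parameters imply the address breakdown sketched below, assuming the conventional tag/index/offset split; confirm the exact bit positions against Figure 2, and treat the module and signal names as illustrative only.
module address_fields_sketch
(
    input  logic [31:0] mem_address,
    output logic [22:0] tag,     // mem_address[31:9]
    output logic [3:0]  index,   // mem_address[8:5], selects one of the 16 sets
    output logic [4:0]  offset   // mem_address[4:0], byte offset within a 32-byte line
);
    // 16 sets -> 4 index bits; 8 words (32 bytes) per line -> 5 offset bits;
    // the remaining 23 bits form the tag.
    assign tag    = mem_address[31:9];
    assign index  = mem_address[8:5];
    assign offset = mem_address[4:0];
endmodule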
Previously, the CPU datapath was interacting with the main memory directly. Now, you will need to modify the interface to implement the memory hierarchy. That is, you will need to insert a cache between the CPU's datapath and the main memory. You may NOT add additional signals between the cache and the CPU datapath. Your cache must work with the same signals that MP2 main memory used to communicate with the CPU; i.e., the datapath must have no knowledge of your memory hierarchy. The signals used are described in the Signal Specifications section below.
In MP3, the main memory code will be provided as burst_memory.sv. This memory module mimics the timing characteristics of a real-world, off-the-shelf 512 MiB SDRAM DIMM. The memory interface is 64 bits wide, with 4 bursts per access, so that a single load will fill an entire cache line.
You must use OpenRAM for your data and tag arrays. See Appendix A for an overview of how SRAM circuits work and how they fit into the IC design flow. Note that only the data and tag arrays are SRAM. You must implement all other arrays using flip-flops.
Initially, when all your valid bits are zero, you will populate each set in PLRU order. That is, you should not give invalid cachelines priority over whichever cacheline the PLRU logic dictates you use.
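For reference, a 4-way pseudo-LRU is commonly implemented as a 3-bit tree per set. The sketch below shows one possible encoding and update rule; the bit convention and names are assumptions, not a required implementation.
module plru_sketch
(
    input  logic [2:0] plru,        // current PLRU tree bits for one set
    input  logic [1:0] hit_way,     // way just accessed (hit or fill)
    output logic [1:0] victim_way,  // way the tree currently points at for replacement
    output logic [2:0] plru_next    // updated tree bits after the access
);
    // plru[0]: 0 = replace within ways 0/1, 1 = replace within ways 2/3
    // plru[1]: picks between ways 0 and 1; plru[2]: picks between ways 2 and 3
    assign victim_way = plru[0] ? {1'b1, plru[2]} : {1'b0, plru[1]};

    // On an access (including a fill of an invalid way, per the note above),
    // point every tree node on the accessed path away from the way just used.
    always_comb begin
        plru_next    = plru;
        plru_next[0] = ~hit_way[1];
        if (hit_way[1]) plru_next[2] = ~hit_way[0];
        else            plru_next[1] = ~hit_way[0];
    end
endmodule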
Read/Write hits MUST take exactly two clock cycles to complete in this cache. Other operations may take multiple cycles, if necessary. Figure 3 illustrates exactly what is meant by a two-cycle hit.
These signals define the interface between the CPU datapath and the memory hierarchy. Each of these signals must be present, and no additional signals are allowed.
mem_address[31:0]
- The memory system is accessed using this 32 bit signal. It specifies the address that is to be read or written.
mem_rdata[31:0]
- 32-bit data bus for receiving data from the memory system.
mem_wdata[31:0]
- 32-bit data bus for sending data to the memory system.
mem_read
- Active high signal that tells the memory system that the address is valid and the processor is trying to perform a memory read.
mem_write
- Active high signal that tells the memory system that the address is valid and the processor is trying to perform a memory write.
mem_byte_enable[3:0]
- A mask describing which byte(s) of memory should be written on a memory write. The behavior of this signal is summarized in the following table:
mem_byte_enable    Behavior
4'b0000            Don't write to memory even if mem_write becomes active
4'b????            Write only the bytes specified in the mask (by a 1) when mem_write becomes active
4'b1111            Write all bytes of the word to memory when mem_write becomes active
mem_resp
- Active high signal generated by the memory system indicating that the memory has finished the requested operation.
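As an illustration of the byte-enable behavior above, the following sketch merges only the enabled bytes of the write data into a stored word; all names besides mem_byte_enable are placeholders.
module byte_merge_sketch
(
    input  logic [31:0] old_word,         // word currently stored (placeholder)
    input  logic [31:0] wdata,            // incoming write data (placeholder)
    input  logic [3:0]  mem_byte_enable,  // which bytes to write
    output logic [31:0] new_word          // word after the masked write
);
    always_comb begin
        new_word = old_word;
        for (int i = 0; i < 4; i++) begin
            if (mem_byte_enable[i])
                new_word[8*i +: 8] = wdata[8*i +: 8];
        end
    end
endmodule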
Note that your cache requires a "bus adapter" placed between the CPU and the cache to convert the 32-bit interface into a 256-bit interface. This is a provided module, with the following declaration:
module bus_adapter
(
output [255:0] mem_wdata256,
input [255:0] mem_rdata256,
input [31:0] mem_wdata,
output [31:0] mem_rdata,
input [3:0] mem_byte_enable,
output logic [31:0] mem_byte_enable256,
input [31:0] address
);
This module appropriately shifts mem_wdata and mem_byte_enable on a write, and selects the appropriate 32 bits from the 256-bit mem_rdata256 input on a read. You should use it between your cache and the CPU.
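For reference, one way to instantiate the provided bus_adapter inside your top level might look like the sketch below; the 256-bit and byte-enable nets on the right-hand side are placeholder names for whatever your cache's ports are called.
bus_adapter bus_adapter
(
    .mem_wdata256       (mem_wdata256),        // to cache
    .mem_rdata256       (mem_rdata256),        // from cache
    .mem_wdata          (mem_wdata),           // from CPU
    .mem_rdata          (mem_rdata),           // to CPU
    .mem_byte_enable    (mem_byte_enable),     // from CPU
    .mem_byte_enable256 (mem_byte_enable256),  // to cache
    .address            (mem_address)          // from CPU
);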
pmem_address[31:0]
- Physical memory is accessed using this 32-bit signal. It specifies the physical memory address that is to be read or written.
pmem_rdata[255:0]
- 256-bit data bus for receiving data from physical memory.
pmem_wdata[255:0]
- 256-bit data bus for sending data to physical memory.
pmem_read
- Active high signal that tells the memory interface that the address is valid and the cache is trying to perform a physical memory read.
pmem_write
- Active high signal that tells the memory interface that the address is valid and the cache is trying to perform a physical memory write.
pmem_resp
- Active high signal generated by the memory interface indicating that the memory operation has completed.
The main memory takes multiple cycles to respond to requests. When a response is ready, the memory will assert the pmem_resp signal. Once a memory request is asserted, the input signals to memory should be held constant until a response is received. You may assume in your design that the memory response will always occur, so the processor never waits indefinitely. As before, make sure that you never attempt to read from and write to memory at the same time. Note that these signals have been defined for you in mp3/hdl/mp3.sv.
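As a sketch of the hold-until-response behavior, assuming the request signals are driven from registers inside your cache, and that start_read and miss_address are hypothetical signals from your cache controller:
logic        start_read;    // hypothetical: controller wants a line fill
logic [31:0] miss_address;  // hypothetical: line-aligned miss address

always_ff @(posedge clk) begin
    if (rst) begin
        pmem_read    <= 1'b0;
    end else if (start_read && !pmem_read) begin
        pmem_read    <= 1'b1;          // raise the request
        pmem_address <= miss_address;  // and hold the address constant...
    end else if (pmem_read && pmem_resp) begin
        pmem_read    <= 1'b0;          // ...until the response is received
    end
end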
bmem_address[31:0]
- Physical memory is accessed using this 32-bit signal. It specifies the physical memory address that is to be read or written.
bmem_rdata[63:0]
- 64-bit data bus for receiving data from physical memory. Data is sent in bursts over 4 cycles.
bmem_wdata[63:0]
- 64-bit data bus for sending data to physical memory. Data is written in bursts over 4 cycles.
bmem_read
- Active high signal that tells physical memory that the address is valid and the cache is trying to perform a physical memory read.
bmem_write
- Active high signal that tells physical memory that the address is valid and the cache is trying to perform a physical memory write.
bmem_resp
- Active high signal generated by physical memory indicating that the memory operation is executing. This signal will stay high for 4 cycles during a single read or write.
Note that you will need your MP1 cacheline adaptor to send 256-bit cache lines to the burst memory. You should refer to its interface from MP1, and use it in your design between the cache's physical memory interface (pmem_*) and the burst memory interface (bmem_*):
module cacheline_adaptor
(
input clk,
input reset_n,
// Port to LLC (Lowest Level Cache)
input logic [255:0] line_i,
output logic [255:0] line_o,
input logic [31:0] address_i,
input read_i,
input write_i,
output logic resp_o,
// Port to memory
input logic [63:0] burst_i,
output logic [63:0] burst_o,
output logic [31:0] address_o,
output logic read_o,
output logic write_o,
input resp_i
);
The specification for the cacheline adaptor is in the MP1 documentation.
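For reference, a sketch of one possible wiring inside your top level is shown below. The reset connection assumes an active-high rst in mp3.sv (the adaptor expects an active-low reset_n), and the pmem_*/bmem_* nets are the ones described above; verify everything against your own files.
cacheline_adaptor cacheline_adaptor
(
    .clk       (clk),
    .reset_n   (~rst),          // assumes an active-high rst in mp3.sv
    // LLC side: cache pmem_* port
    .line_i    (pmem_wdata),
    .line_o    (pmem_rdata),
    .address_i (pmem_address),
    .read_i    (pmem_read),
    .write_i   (pmem_write),
    .resp_o    (pmem_resp),
    // memory side: burst memory bmem_* port
    .burst_i   (bmem_rdata),
    .burst_o   (bmem_wdata),
    .address_o (bmem_address),
    .read_o    (bmem_read),
    .write_o   (bmem_write),
    .resp_i    (bmem_resp)
);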
Since MP3 is an extension of the work done in MP2, you should copy your completed MP2 design to a new folder for MP3. The steps for copying and beginning MP3 are below.
Merge the provided MP3 files into your repository:
$ git fetch release
$ git merge --allow-unrelated-histories release/mp3 -m "Merging MP3"
Copy your MP1 cacheline adaptor design to your mp3/hdl directory:
$ cp -p mp1/cacheline_adaptor/hdl/cacheline_adaptor.sv mp3/hdl/
Copy your MP2 design to your MP3 directory:
$ cp -p mp2/hdl/* mp3/hdl/cpu
$ cp -p mp2/testcode/* mp3/testcode    # optional, do this if you wrote your own tests
Rename your MP2 module, located in mp3/hdl/cpu/mp2.sv, from mp2 to cpu. You should rename both the file and the SystemVerilog module name inside the file.
$ mv mp3/hdl/cpu/mp2.sv mp3/hdl/cpu/cpu.sv
- DO NOT start working on MP3 without being sure your MP2 works. While you can (and should) test your cache without the CPU, you will ultimately need to ensure that your designs work correctly together. The autograder for MP2 will continue running for some time. The autograder for MP3 will use your MP2 CPU located in the mp3 directory, in the commit made at MP3 checkpoint deadlines.
- DO NOT make any changes to the CPU datapath or CPU controller beyond those required to fix bugs from MP2. Your CPU should have no knowledge of the memory hierarchy attached to it. If you find yourself changing your CPU to accommodate your cache, you've done something wrong.
- DO NOT model the cache behaviorally in SystemVerilog. Ensure that it is synthesizable.
- DO NOT modify the provided files listed below:
bin/*
hdl/bus_adapter.sv
hdl/cache/ff_array.sv
hdl/cpu/alu.sv (from MP2)
hdl/cpu/ir.sv (from MP2)
hdl/cpu/regfile.sv (from MP2)
pkg/rv32i_mux_types.sv
pkg/rv32i_types.sv
hvl/mp3_data_array.sv
hvl/mp3_tag_array.sv
sram/*
synth/*
- DO NOT add new files in the pkg/ directory. Add your own cache types in pkg/my_types.sv.
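As a sketch of the kinds of definitions that could go in pkg/my_types.sv; the package name, parameter names, and state encoding below are placeholders, so match them to the provided file and your own design.
package my_types;  // assumed package name; check the provided pkg/my_types.sv

    // Widths implied by the cache parameters (16 sets, 4 ways, 32-byte lines).
    localparam int s_offset = 5;
    localparam int s_index  = 4;
    localparam int s_tag    = 32 - s_offset - s_index;
    localparam int num_ways = 4;

    // One possible four-state controller encoding (see the FSM requirements below).
    typedef enum logic [1:0] {
        s_idle,        // wait for a CPU request
        s_check_tag,   // compare tags; respond here on a hit
        s_write_back,  // evict a dirty line to physical memory
        s_allocate     // fill the line from physical memory
    } cache_state_t;

endpackage : my_types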
We have provided a skeleton testbench for testing your cache as a DUT (decoupled from the CPU). You are strongly encouraged to test your cache as a DUT, so that you can verify timing information that testing with the CPU would hide.
We have provided an unfinished module called "shadow memory", which sits on the CPU side of the cache and makes sure all data read from the cache is correct. It does so by maintaining a side-channel memory that functions like the one in MP2. You should complete it if you wish to use it.
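As a sketch of the kind of check the shadow memory performs (byte enables are ignored here for brevity, and the provided skeleton's actual structure may differ):
// Word-addressed mirror of everything the CPU has written, checked on reads.
logic [31:0] shadow_mem [bit [31:0]];

always @(posedge clk) begin
    if (mem_resp && mem_write)
        shadow_mem[{mem_address[31:2], 2'b00}] = mem_wdata;  // byte enables ignored for brevity
    if (mem_resp && mem_read && shadow_mem.exists({mem_address[31:2], 2'b00}))
        assert (mem_rdata == shadow_mem[{mem_address[31:2], 2'b00}])
            else $error("Shadow memory mismatch at address %h", mem_address);
end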
There will be three deadlines for MP3:
For the first checkpoint, you will need to submit a digital drawing (i.e., nothing hand-drawn, we recommend using https://draw.io/ or https://www.lucidchart.com/) of your cache datapath and cache controller. Your paper design should include a complete schematic of your datapath and a complete state machine design for your controller. It should be detailed enough for TAs to trace the execution of cache reads and writes. The specific requirements for your datapath are:
- Draw all the components, including the correct number of data arrays, tag arrays, valid arrays, dirty arrays, and LRU arrays.
- Ensure that you specify the dimensions of these arrays in the datapath diagram.
- Ensure that you show the connections for each interface port (except clk) for each of these array-like components.
- The datapath must have explicitly labeled signals from the controller or other modules. Ensure that these modules are labeled.
- The datapath must handle the cases where:
- Data is read from the data arrays on a read hit.
- Data is loaded into the data arrays from main memory on a read/write miss.
- Data is written to the data arrays on a write hit.
- Data is written from the data arrays to main memory on a dirty eviction.
- The datapath must show how the PLRU is designed, and how the output of the PLRU is used in the rest of the design.
- Feel free to use additional combinational components like gates, MUXes, decoders, encoders, and other "well-known" components in the datapath.
- Keep your datapath schematic clean, complete, and concise. Label your connections and organize your wires well. Poorly formatted schematics will receive a grade penalty.
The requirements for the FSM description are:
- A state diagram of the cache controller with exactly four states and labeled transition conditions.
- A table for your FSM that describes your states, their transitions, and their outputs.
- Note that you must indicate the transition conditions both on the state diagram and in the table.
- The FSM must hit in two cycles.
Your design should be detailed enough for any student taking this course to build an identical, working cache based on your specification.
In addition to the "paper" design, you should start planning how you will test your design. In no more than a single page, answer the following questions:
- Analyze your cache design to identify two tricky cases you will deliberately test. (2 points)
- Provide a brief description of how you will test one of your identified cases. This may be either RISC-V assembly or cache input stimuli. (2 points)
- Briefly describe how you will unit test your cache as the DUT itself, rather than as part of your processor. (4 points)
Upload, as a single PDF document, your design (datapath and controller) and testing analysis to Gradescope before the posted deadline. Your testing analysis should not be longer than a single page (not including test code).
For this checkpoint, you will be required to have cache reads and PLRU working.
For the final hand-in, you will be required to have both cache reads and cache writes working, along with PLRU. Your design should have an area smaller than 100,000 square micrometers.
Total: 140 points
- Design Checkpoint: 40 points
- Paper Design: 32 points (hand-drawn design will receive a zero)
- Testing Strategy: 8 points
- Checkpoint 1: 30 points
- Cache Reads: 30 points
- Checkpoint 2: 70 points
- Targeted Tests (using cache as DUT): 45 points
- CPU Oriented Test (using cache with your CPU): 10 points
- Timing And Synthesis: 15 points
In the past, to generate small memories, you have used a simple array of flip-flops (for example, in the MP2 register file). Such a design does not scale for large memories like your cache data and tag arrays. For these, we use an SRAM block, which is a hard IP. SRAMs offer better power and area outcomes than flip-flop-based implementations. However, SRAMs are not purely digital circuits and need to be explicitly generated and instantiated. The tool we use to generate such IP is known as a memory compiler. For ECE 411, we use the OpenRAM memory compiler, whose output includes a simulation-only behavioral model and a timing model. GDS layout can also be generated, but is out of scope for this class. VCS will use the simulation model to do, you guessed it, simulation. DC will use the timing model as a black box during synthesis to give a best-effort timing estimate.
We have already pre-generated the two required arrays for this MP: data array and tag array. You do not need to directly use OpenRAM for this MP, but we suggest you play with it in preparation for MP4.
To use OpenRAM, after sourcing the usual ECE 411 script, do:
$ source /class/ece411/OpenRAM/env.sh
Then, go to mp3/sram and run:
$ make
This provided Makefile will call the OpenRAM generator with the configurations in mp3/sram/config.
To get the list of available configurations, read the OpenRAM documentation.
This will generate all relevant files in mp3/sram/output.
The Makefile also converts the timing model to a format that DC can use.
This timing model is used by the provided synthesis script.
We have provided pre-generated files in the aforementioned directories. You should not modify them for this MP.
Here is the list of signals for the SRAM blocks:
clk0
- The clock.
csb0
- Chip select. Active low. Assert when you need to read or write. You may keep it permanently asserted for this MP.
web0
- Write enable. Active low. Assert when you need to write, or deassert for reading.
addr0
- The address.
wmask0
- Write mask. Active high. Valid only when web0 is asserted. Only available for the data array.
din0
- Write data.
dout0
- Read data.
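For reference, a data-array instance might be hooked up roughly as follows. The module name is inferred from the pre-generated files (check sram/output and hvl/ for the exact name), and every net on the right-hand side is a placeholder.
mp3_data_array data_array
(
    .clk0   (clk),
    .csb0   (1'b0),        // chip select: active low, held asserted for this MP
    .web0   (data_web),    // placeholder: active-low write enable from the controller
    .wmask0 (data_wmask),  // placeholder: per-byte write mask for the line
    .addr0  (set_index),   // placeholder: set index
    .din0   (data_in),     // placeholder: cache line to write
    .dout0  (data_out)     // cache line read out
);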
Here is the timing diagram for the SRAM blocks:
Technically, this is the behavior of a write-through SRAM. OpenRAM is non-write-through. Using non-write-through SRAM in this cache design is a little bit more difficult. For the purpose of this class, since we only care about the approximate area and timing characteristics, we elect to patch the provided simulation models in hvl/ and sram/ to make them write-through.