The Board Test sample is a reference design that contains tests to check FPGA board interfaces and reports the following metrics:
- Host-to-device global memory interface bandwidth
- Kernel clock frequency
- Kernel launch latency
- Kernel-to-device global memory bandwidth
- Unified Shared Memory bandwidth
Area | Description |
---|---|
What you will learn | How to test board interfaces to ensure that the designed platform provides expected performance |
Time to complete | 30 minutes (not including compile time) |
This reference design implements tests to check FPGA board interfaces and measure host-to-device and kernel-to-global memory interface metrics. Use this reference design as a starting point to validate platform interfaces when you customize a BSP.
This sample is part of the FPGA code samples. It is categorized as a Tier 4 sample that demonstrates a reference design.
flowchart LR
tier1("Tier 1: Get Started")
tier2("Tier 2: Explore the Fundamentals")
tier3("Tier 3: Explore the Advanced Techniques")
tier4("Tier 4: Explore the Reference Designs")
tier1 --> tier2 --> tier3 --> tier4
style tier1 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier2 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier3 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier4 fill:#f96,stroke:#333,stroke-width:1px,color:#fff
Find more information about how to navigate this part of the code samples in the FPGA top-level README.md. The top-level README also covers troubleshooting build errors, running the sample on the Intel® DevCloud, using Visual Studio Code with the code samples, and links to selected documentation.
Optimized for | Description |
---|---|
OS | Ubuntu* 20.04, RHEL*/CentOS* 8, SUSE* 15, Windows* 10, Windows Server* 2019 |
Hardware | Intel® Agilex® 7, Arria® 10, and Stratix® 10 FPGAs |
Software | Intel® oneAPI DPC++/C++ Compiler |
Note: Although the Intel® oneAPI DPC++/C++ Compiler alone is enough to compile for emulation, generate reports, and generate RTL, the simulation flow and FPGA hardware compiles have extra software requirements.
To use the simulator flow, Intel® Quartus® Prime Pro Edition and one of the following simulators must be installed and accessible through your PATH:
- Questa*-Intel® FPGA Edition
- Questa*-Intel® FPGA Starter Edition
- ModelSim® SE
When using the hardware compile flow, Intel® Quartus® Prime Pro Edition must be installed and accessible through your PATH.
⚠️ Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.
⚠️ This sample benchmarks an FPGA board, so it is only meaningful when you target an actual FPGA board/BSP.
A oneAPI Board Support Package (BSP) consists of software layers and an FPGA hardware scaffold design, making it possible to target an FPGA through the Intel® oneAPI DPC++/C++ Compiler.
The compiler stitches the generated FPGA design into the oneAPI BSP framework. Refer to the Intel® oneAPI Programming Guide for information about oneAPI BSPs.
The BSP hardware components typically comprise RTL for all interfaces the oneAPI kernel requires: for example, a PCIe IP for host-to-kernel communication and an EMIF (External Memory Interface) IP for kernel-to-memory and host-to-FPGA-board-memory communication, among other things.
The BSP software components typically consist of a Memory Mapped Device (MMD) layer and a driver. The implementation is vendor-dependent.
The BSP consists of components operating at different clock domains. PCIe and external memories operate at a fixed frequency. Corresponding RTL IPs are parametrized to operate at these fixed frequencies by platform vendors. The kernel clock frequency varies and is calculated as part of the oneAPI offline compilation flow for FPGAs. The BSP has logic to handle the data transfer across these clock domains.
Board Test measures the frequency that the kernel is running at in the FPGA and compares this to the compiled kernel clock frequency.
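The comparison described above can be sketched as a small tolerance check. This is a minimal illustration, not the sample's actual code: the function name `FrequencyWithinTolerance` and its parameters are hypothetical, and the 2% default is the threshold this README gives for the Kernel Clock Frequency test.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper for illustration only (not taken from board_test.cpp).
// Returns true when the measured kernel clock is within `tolerance_percent`
// of the compiled frequency. Both frequencies are expressed in MHz.
bool FrequencyWithinTolerance(double measured_mhz, double compiled_mhz,
                              double tolerance_percent = 2.0) {
  assert(compiled_mhz > 0.0 && "compiled frequency must be positive");
  // Relative deviation of the measured clock from the compiled clock.
  const double deviation_percent =
      std::fabs(measured_mhz - compiled_mhz) / compiled_mhz * 100.0;
  return deviation_percent <= tolerance_percent;
}
```

For instance, calling `FrequencyWithinTolerance(511.062, 512.0)` with the values from the sample output further down evaluates a deviation of roughly 0.18%, which is inside the default tolerance.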
The following block diagram shows an overview of a typical oneAPI FPGA BSP hardware design and the numbered arrows depict the following:
- Path 1 represents the host to kernel interface.
- Path 2 represents the host-to-device global memory interface.
- Path 3 represents the kernel-to-device global memory interface.
- Path 4 represents the kernel-to-shared host memory interface.
Note: The block diagram shown is an overview of a typical oneAPI FPGA platform. See the oneAPI platform or BSP vendor documentation for more details about platform components.
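As a reading aid, the four paths can be related to the test numbers that this README associates with each interface. The sketch below is only one way to encode that relationship; the type `InterfacePath` and the functions `BoardInterfacePaths` and `TestsForPath` are hypothetical names, and the pairing of Path 4 with the USM test is an interpretation of the interface descriptions rather than something the sample states explicitly. The Kernel Clock Frequency test (Test 2) is not tied to a single path, so it is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record tying a numbered path in the block diagram to the
// board tests that exercise it (illustration only).
struct InterfacePath {
  int path_number;
  std::string interface_name;
  std::vector<int> test_numbers;
};

// Mapping assembled from the interface descriptions in this README.
std::vector<InterfacePath> BoardInterfacePaths() {
  return {
      {1, "host-to-kernel", {3, 4}},
      {2, "host-to-device global memory", {1}},
      {3, "kernel-to-device global memory", {5, 6}},
      {4, "kernel-to-shared host memory (USM)", {7}},  // assumed pairing
  };
}

// Returns the test numbers for a path, or an empty list for unknown paths.
std::vector<int> TestsForPath(int path_number) {
  for (const InterfacePath& path : BoardInterfacePaths()) {
    if (path.path_number == path_number) return path.test_numbers;
  }
  return {};
}
```

Under these assumptions, `TestsForPath(3)` yields tests 5 and 6, which are the two kernel-to-memory tests.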
File | Description |
---|---|
board_test.cpp | Contains the main() function and the test selection logic as well as calls to each test. |
board_test.hpp | Contains the definitions for all the individual tests in the sample. |
host_speed.hpp | Header for the host speed test. Contains definitions of functions used in the host speed test. |
usm_speed.hpp | Header for the USM bandwidth test. Contains definitions of functions used in the USM bandwidth test. |
helper.hpp | Contains constants (for example, binary name) used throughout the code as well as definitions of functions that print help and measure execution time. |
Flag | Description |
---|---|
-Xsno-interleaving | By default, the oneAPI compiler burst-interleaves accesses across memory banks of the same type. -Xsno-interleaving disables burst interleaving so that each memory bank can be tested independently. (See the FPGA Optimization Guide for Intel® oneAPI Toolkits Developer Guide for more information.) |
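To see why disabling interleaving lets each bank be tested on its own, consider a deliberately simplified address-to-bank model. This is a conceptual sketch and not the BSP's actual memory mapping: the function names, the burst size, and the bank size are all hypothetical values chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model (assumption): with burst interleaving, consecutive
// bursts rotate across the banks, so a linear buffer touches every bank.
std::uint32_t BankInterleaved(std::uint64_t address, std::uint64_t burst_bytes,
                              std::uint32_t num_banks) {
  assert(burst_bytes > 0 && num_banks > 0);
  return static_cast<std::uint32_t>((address / burst_bytes) % num_banks);
}

// Simplified model (assumption): without interleaving, each bank owns one
// contiguous address range, so a buffer placed inside that range exercises
// only that bank.
std::uint32_t BankNonInterleaved(std::uint64_t address,
                                 std::uint64_t bank_bytes,
                                 std::uint32_t num_banks) {
  assert(bank_bytes > 0 && num_banks > 0);
  const std::uint64_t bank = address / bank_bytes;
  assert(bank < num_banks && "address outside the modeled memory");
  return static_cast<std::uint32_t>(bank);
}
```

In this model, with a hypothetical 1024-byte burst and four banks, addresses 0 and 1024 land in banks 0 and 1 when interleaved, whereas with hypothetical 1 GiB banks and no interleaving both addresses stay in bank 0.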
Note: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the setvars script located in the root of your oneAPI installation every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.
Linux*:
- For system wide installations:
. /opt/intel/oneapi/setvars.sh
- For private installations:
. ~/intel/oneapi/setvars.sh
- For non-POSIX shells, like csh, use the following command:
bash -c 'source <install-dir>/setvars.sh ; exec csh'
Windows*:
C:\"Program Files (x86)"\Intel\oneAPI\setvars.bat
- For Windows PowerShell*, use the following command:
cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'
For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS* or Use the setvars Script with Windows*.
Linux*:
- Change to the sample directory.
- Configure the build system for your BSP.
mkdir build
cd build
cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
Note: You can poll your system for available BSPs using the aoc -list-boards command. The board list that is printed out will be of the form:
$> aoc -list-boards
Board list:
  <board-variant>
     Board Package: <path/to/board/package>/board-support-package
  <board-variant2>
     Board Package: <path/to/board/package>/board-support-package
Note: You must set FPGA_DEVICE to point to your BSP in order to build this sample.
- Compile the design. (The provided targets match the recommended development flow.)
- Compile and run for emulation (fast compile time, targets emulated FPGA device).
make fpga_emu
- Generate the optimization report.
make report
- Compile and run for FPGA hardware (longer compile time, targets an FPGA device).
make fpga
Windows*:
- Change to the sample directory.
- Configure the build system for your BSP.
mkdir build
cd build
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
Note: You can poll your system for available BSPs using the aoc -list-boards command. The board list that is printed out will be of the form:
$> aoc -list-boards
Board list:
  <board-variant>
     Board Package: <path/to/board/package>/board-support-package
  <board-variant2>
     Board Package: <path/to/board/package>/board-support-package
Note: You must set FPGA_DEVICE to point to your BSP in order to build this sample.
- Compile the design. (The provided targets match the recommended development flow.)
- Compile and run for emulation (fast compile time, targets emulated FPGA device).
nmake fpga_emu
- Generate the optimization report.
nmake report
- Compile and run for FPGA hardware (longer compile time, targets an FPGA device).
nmake fpga
Note: If you encounter any issues with long paths when compiling under Windows*, you may have to create your 'build' directory in a shorter path, for example C:\samples\build. You can then run cmake from that directory, and provide cmake with the full path to your sample directory, for example:
C:\samples\build> cmake -G "NMake Makefiles" C:\long\path\to\code\sample\CMakeLists.txt
The complete board test is divided into seven subtests. By default, all tests run. You can choose to run a single test by using the -test=<test_number> option. Refer to the Running the Sample section for test usage instructions.
Test Number | Test Name |
---|---|
1 | Host Speed and Host Read Write Test |
2 | Kernel Clock Frequency Test |
3 | Kernel Launch Test |
4 | Kernel Latency Measurement |
5 | Kernel-to-Memory Read Write Test |
6 | Kernel-to-Memory Bandwidth Test |
7 | Unified Shared Memory (USM) Bandwidth Test |
Note: You should run all tests at least once to ensure that the platform interfaces are fully functional.
To view test details and usage information using the binary, use the -help option: <program> -help.
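The test selection described above could be handled along the following lines. This is a hedged sketch rather than the logic in board_test.cpp: the function `SelectTest`, the constants, and the choice to reject out-of-range values are assumptions made for illustration; only the `-test=<test_number>` option format and the seven test numbers come from this README.

```cpp
#include <cassert>
#include <string>
#include <vector>

constexpr int kRunAllTests = 0;        // hypothetical "no selection" value
constexpr int kInvalidSelection = -1;  // hypothetical error value
constexpr int kNumTests = 7;           // tests 1 through 7 in the table above

// Scans the command-line arguments for "-test=<n>" and returns n (1..7),
// kRunAllTests when the option is absent, or kInvalidSelection when the
// value is missing or out of range. Other options are ignored here.
int SelectTest(const std::vector<std::string>& args) {
  const std::string prefix = "-test=";
  int selection = kRunAllTests;
  for (const std::string& arg : args) {
    if (arg.rfind(prefix, 0) != 0) continue;  // does not start with "-test="
    const std::string value = arg.substr(prefix.size());
    // Accept exactly one digit in the range '1'..'7'.
    if (value.size() != 1 || value[0] < '1' || value[0] > '0' + kNumTests) {
      return kInvalidSelection;
    }
    selection = value[0] - '0';
    assert(selection >= 1 && selection <= kNumTests);
  }
  return selection;
}
```

With this sketch, an argument list containing `-test=3` selects the Kernel Launch Test, while an empty argument list falls back to running every test.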
The tests listed above check the following interfaces in a platform:
- Host-to-device global memory interface (Test 1): This interface is checked by performing explicit data movement between the host and device global memory. Host-to-device global memory bandwidth is measured and reported. As a part of this interface check, unaligned data transfers are also performed to verify that non-DMA transfers complete successfully.
- Kernel clock frequency (Test 2): The test measures the frequency the programmed kernel is running at on the FPGA device and reports it. By default, this test fails if the measured frequency is not within 2% of the compiled frequency.
Note: The test allows overriding this failure; however, overriding may lead to functional errors and is not recommended. The override option is provided to allow debugging in cases where platform design changes are done to force the kernel to run at a slower clock frequency (though this is not a common use case). To override, set the report_chk variable to false in board_test.cpp and recompile only the host code by using the -reuse-exe=board_test.fpga option in your compile command (this flag is added by default for you in the CMake file included with this code sample).
- Host-to-kernel interface (Tests 3 & 4): The test ensures that the host-to-kernel communication is correct and that the host can launch a kernel successfully. It also measures the round-trip kernel launch latency and throughput (number of kernels/ms) of single-task no-operation kernels.
- Kernel-to-device global memory interface (Tests 5 & 6): This interface is checked by performing kernel-to-memory data transfers using simple read and write kernels. Kernel-to-memory bandwidth is measured and reported.
- Unified shared memory (USM) interface (Test 7): This interface is checked by copying data between, reading data from, and writing data to host USM. The bandwidth is measured and reported for each case. This test applies only to board variants with USM support; to run it, you must define the SUPPORTS_USM macro at compile time, for example: cmake .. -DSUPPORTS_USM=1.
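The two host-to-kernel figures mentioned above, round-trip latency and throughput in kernels/ms, follow from the number of kernels launched and the total elapsed time. The sketch below shows that arithmetic under the assumption that both are simple ratios; `LaunchMetrics` and `ComputeLaunchMetrics` are hypothetical names, not identifiers from the sample.

```cpp
#include <cassert>

// Hypothetical container for the two reported launch figures.
struct LaunchMetrics {
  double round_trip_us;   // average time per kernel, in microseconds
  double kernels_per_ms;  // throughput, in kernels per millisecond
};

// Derives both figures from a kernel count and the total elapsed time.
LaunchMetrics ComputeLaunchMetrics(int num_kernels, double total_ms) {
  assert(num_kernels > 0 && total_ms > 0.0);
  LaunchMetrics metrics;
  metrics.round_trip_us = total_ms * 1000.0 / num_kernels;  // ms -> us
  metrics.kernels_per_ms = num_kernels / total_ms;
  return metrics;
}
```

Using the figures from the sample output in this README (10000 kernels in 118.6319 ms), this arithmetic gives about 11.863 us per kernel and about 84.294 kernels/ms, which agrees with the reported values.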
Linux*:
- Run the sample on the FPGA emulator (the kernel executes on the CPU).
./board_test.fpga_emu
By default the program runs all tests. To run a specific test, enter the test number as an argument to the -test option:
./board_test.fpga_emu -test=<test_number>
- Run the sample on the FPGA device.
./board_test.fpga
By default the program runs all tests. To run a specific test, enter the test number as an argument to the -test option:
./board_test.fpga -test=<test_number>
Windows*:
- Run the sample on the FPGA emulator (the kernel executes on the CPU).
board_test.exe
By default the program runs all tests. To run a specific test, enter the test number as an argument to the -test option:
board_test.exe -test=<test_number>
- Run the sample on the FPGA device.
board_test.fpga.exe
By default the program runs all tests. To run a specific test, enter the test number as an argument to the -test option:
board_test.fpga.exe -test=<test_number>
Running on FPGA device (Intel® FPGA SmartNIC N6001-PL). Performance results are based on testing as of May 10, 2024.
Note: Refer to the Performance Disclaimers section for important performance information.
*** Board_test usage information ***
Command to run board_test using generated binary:
> To run all tests (default): run board_test.fpga
> To run a specific test (see list below); pass the test number as argument to "-test" option:
Linux: ./board_test.fpga -test=<test_number>
Windows: board_test.exe -test=<test_number>
> To see more details on what each test does use -help option
The tests are:
1. Host Speed and Host Read Write Test
2. Kernel Clock Frequency Test
3. Kernel Launch Test
4. Kernel Latency Measurement
5. Kernel-to-Memory Read Write Test
6. Kernel-to-Memory Bandwidth Test
7. Unified Shared Memory Bandwidth Test
Note: Kernel Clock Frequency is run along with all tests except 1 (Host Speed and Host Read Write test)
Running all tests
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
clGetDeviceInfo CL_DEVICE_GLOBAL_MEM_SIZE = 17179869184
clGetDeviceInfo CL_DEVICE_MAX_MEM_ALLOC_SIZE = 17179868160
Device buffer size available for allocation = 17179868160 bytes
*****************************************************************
*********************** Host Speed Test *************************
*****************************************************************
Size of buffer created = 17179868160 bytes
Writing 16383 MiB to device global memory ... 7592.7 MB/s
Reading 16383 MiB from device global memory ... 7628.8 MB/s
Verifying data ...
Successfully wrote and readback 16383 MB buffer
Transferring 8192 KBs in 256 32 KB blocks ...
Transferring 8192 KBs in 128 64 KB blocks ...
Transferring 8192 KBs in 64 128 KB blocks ...
Transferring 8192 KBs in 32 256 KB blocks ...
Transferring 8192 KBs in 16 512 KB blocks ...
Transferring 8192 KBs in 8 1024 KB blocks ...
Transferring 8192 KBs in 4 2048 KB blocks ...
Transferring 8192 KBs in 2 4096 KB blocks ...
Transferring 8192 KBs in 1 8192 KB blocks ...
Writing 8192 KBs with block size (in bytes) below:
Block_Size Avg Max Min End-End (MB/s)
32768 381.23 426.21 249.94 4775.26
65536 510.22 546.47 406.00 7332.94
131072 757.51 1073.87 701.72 13826.89
262144 977.82 1954.74 869.33 16369.93
524288 1272.50 3282.34 1037.37 14452.23
1048576 1746.77 4678.85 1083.52 8202.39
2097152 5797.93 5983.74 5546.83 20416.05
4194304 6436.09 6557.66 6318.95 12325.60
8388608 6919.11 6919.11 6919.11 6919.11
Reading 8192 KBs with block size (in bytes) below:
Block_Size Avg Max Min End-End (MB/s)
32768 416.23 463.80 149.57 4051.34
65536 588.92 634.71 261.48 6861.12
131072 769.07 1089.64 397.59 12046.73
262144 1017.03 2195.03 654.95 16790.08
524288 1204.36 3581.23 815.31 13943.92
1048576 1512.05 4862.33 953.71 8371.75
2097152 2775.19 6196.34 1046.06 4133.37
4194304 2893.07 6699.52 1844.87 3673.20
8388608 2977.26 2977.26 2977.26 2977.26
Host write top speed = 20416.05 MB/s
Host read top speed = 16790.08 MB/s
HOST-TO-MEMORY BANDWIDTH = 18603 MB/s
*****************************************************************
********************* Host Read Write Test **********************
*****************************************************************
--- Running host read write test with device offset 0
--- Running host read write test with device offset 3
HOST READ-WRITE TEST PASSED!
*****************************************************************
******************* Kernel Clock Frequency Test ***************
*****************************************************************
Measured Frequency = 511.062 MHz
Quartus Compiled Frequency = 512 MHz
Measured Clock frequency is within 2 percent of Quartus compiled frequency.
*****************************************************************
********************* Kernel Launch Test ************************
*****************************************************************
Launching kernel KernelSender ...
Launching kernel KernelReceiver ...
... Waiting for sender
Sender sent the token to receiver
... Waiting for receiver
KERNEL_LAUNCH_TEST PASSED
*****************************************************************
******************** Kernel Latency **************************
*****************************************************************
Processed 10000 kernels in 118.6319 ms
Single kernel round trip time = 11.8632 us
Throughput = 84.2943 kernels/ms
Kernel execution is complete
*****************************************************************
************* Kernel-to-Memory Read Write Test ***************
*****************************************************************
Maximum device global memory allocation size is 17179868160 bytes
Finished host memory allocation for input and output data
Creating device buffer
Finished writing to device buffers
Launching kernel MemReadWriteStream ...
Launching kernel with global offset : 0
Launching kernel with global offset : 1073741824
Launching kernel with global offset : 2147483648
Launching kernel with global offset : 3221225472
... kernel finished execution.
Finished Verification
KERNEL TO MEMORY READ WRITE TEST PASSED
*****************************************************************
***************** Kernel-to-Memory Bandwidth *****************
*****************************************************************
Note: This test assumes that design was compiled with -Xsno-interleaving option
Performing kernel transfers of 4096 MiBs on the default global memory (address starting at 0)
Launching kernel MemWriteStream ...
Launching kernel MemReadStream ...
Launching kernel MemReadWriteStream ...
Summarizing bandwidth in MB/s/bank for banks 1 to 8
8765.24 8765.28 8765.26 8765.29 8765.27 8765.24 8765.3 8765.28 MemWriteStream
8786.28 8786.28 8786.27 8786.26 8786.27 8786.26 8786.3 8786.26 MemReadStream
8059.25 8062.61 8061.29 8054.25 8058.78 8061.35 8062.6 8058.39 MemReadWriteStream
KERNEL-TO-MEMORY BANDWIDTH = 8537.12 MB/s/bank
*****************************************************************
*********************** USM Bandwidth *************************
*****************************************************************
Board does not support USM, skipping this test.
BOARD TEST PASSED
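When reading the summary lines in this output, it can help to check how they relate to the individual measurements. In this particular run, the HOST-TO-MEMORY BANDWIDTH figure is consistent with the arithmetic mean of the top host write and read speeds, and the KERNEL-TO-MEMORY BANDWIDTH figure is consistent with the mean of all 24 per-bank values. That relationship is inferred from the numbers above, not taken from the sample's source, and the helper `Mean` below is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Arithmetic mean of a non-empty list of bandwidth figures (MB/s).
// Hypothetical helper used only to cross-check the summary lines.
double Mean(const std::vector<double>& values) {
  assert(!values.empty());
  double sum = 0.0;
  for (double value : values) sum += value;
  return sum / static_cast<double>(values.size());
}
```

For example, `Mean({20416.05, 16790.08})` evaluates to about 18603.07, in line with the 18603 MB/s reported for the host-to-memory summary.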
Code samples are licensed under the MIT license. See License.txt for details.
Third party program Licenses can be found here: third-party-programs.txt.