This sample is an FPGA tutorial that demonstrates how to use the intel::initiation_interval attribute to change the initiation interval (II) of a loop in scenarios where doing so improves performance.
Area | Description |
---|---|
What you will learn | The fMAX-II tradeoff; the default behavior of the compiler when scheduling loops; how to use intel::initiation_interval to attempt to set the II for a loop; scenarios in which intel::initiation_interval can be helpful in optimizing kernel performance. |
Time to complete | 20 minutes |
Category | Concepts and Functionality |
This FPGA tutorial demonstrates how to use the intel::initiation_interval
attribute to set the II for a loop. The attribute serves two purposes:
- Relax the II of a loop with a loop-carried dependency in order to achieve a higher kernel fMAX
- Enforce the II of a loop such that the compiler will error out if it cannot achieve the specified II
Note: The tutorial assumes you are familiar with the concepts of loop-carried dependencies and initiation interval (II).
- A loop-carried dependency refers to a situation where an operation in a loop iteration cannot proceed until an operation from a previous loop iteration has completed.
- The initiation interval, or II, is the number of clock cycles between the launch of successive loop iterations.
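As a quick refresher, the two loops below contrast a loop that has a loop-carried dependency with one that does not. This is a plain C++ sketch for illustration only; the function names Recurrence and ScaleAndShift are hypothetical and do not come from the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop-carried dependency: each iteration reads the value of s that the
// previous iteration wrote, so the multiply-add of iteration i+1 cannot
// complete before iteration i has produced its result.
inline float Recurrence(float s, float a, float b, int n) {
  for (int i = 0; i < n; i++) {
    s = s * a + b;
  }
  return s;
}

// No loop-carried dependency: every iteration reads and writes only its
// own element, so successive iterations are independent of each other.
inline void ScaleAndShift(std::vector<float>& v, float a, float b) {
  for (std::size_t i = 0; i < v.size(); i++) {
    v[i] = v[i] * a + b;
  }
}
```

In the first loop, the path from one value of s to the next is the feedback path that the scheduler has to fit within II clock cycles.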
Optimized for | Description |
---|---|
OS | Ubuntu* 20.04, RHEL*/CentOS* 8, SUSE* 15, Windows* 10, Windows Server* 2019 |
Hardware | Intel® Agilex® 7, Agilex® 5, Arria® 10, Stratix® 10, and Cyclone® V FPGAs |
Software | Intel® oneAPI DPC++/C++ Compiler |
Note: Although the Intel® oneAPI DPC++/C++ Compiler is enough to compile for emulation, generate reports, and generate RTL, the simulation flow and FPGA hardware compiles have extra software requirements.
For using the simulator flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) and one of the following simulators must be installed and accessible through your PATH:
- Questa*-Intel® FPGA Edition
- Questa*-Intel® FPGA Starter Edition
- ModelSim® SE
When using the hardware compile flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) must be installed and accessible through your PATH.
Warning: Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.
This sample is part of the FPGA code samples. It is categorized as a Tier 2 sample that demonstrates a compiler feature.
flowchart LR
tier1("Tier 1: Get Started")
tier2("Tier 2: Explore the Fundamentals")
tier3("Tier 3: Explore the Advanced Techniques")
tier4("Tier 4: Explore the Reference Designs")
tier1 --> tier2 --> tier3 --> tier4
style tier1 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier2 fill:#f96,stroke:#333,stroke-width:1px,color:#fff
style tier3 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier4 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
Find more information about how to navigate this part of the code samples in the FPGA top-level README.md. You can also find more information about troubleshooting build errors, running the sample on the Intel® DevCloud, using Visual Studio Code with the code samples, links to selected documentation, and more.
The sample illustrates the following important concepts.
- The fMAX-II tradeoff.
- Default behavior of the compiler when scheduling loops.
- How to use intel::initiation_interval to set the II for a loop.
- Scenarios in which intel::initiation_interval can be helpful in optimizing kernel performance.
The intel::initiation_interval attribute is useful when optimizing kernels that have loop-carried dependencies in loops with a short trip count. It prevents the compiler from scheduling such a loop with an fMAX-II combination that lowers the system-wide fMAX and therefore decreases throughput.
Generally, striving for the lowest possible II of 1 is preferred. However, in some cases, it may be suboptimal for the scheduler to do so.
For example, consider a loop with loop-carried dependencies. The compiler must ensure that these dependencies are satisfied. To achieve an II of 1, the compiler must schedule all of the operations necessary to compute loop-carried dependencies within a single clock cycle. As the number of operations in a clock cycle increases, the circuit's clock frequency (fMAX) must decrease. The lower clock frequency slows down the entire circuit, not just the single loop. This is the fMAX-II tradeoff. Sometimes, the benefits of achieving an II of 1 for a particular loop may not outweigh the negative impact of reducing fMAX for the entire system.
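The tradeoff can be made concrete with a rough timing model. The sketch below assumes a simplified pipeline in which a loop takes about latency + (trip_count - 1) * II cycles, and in which a short loop and a long loop run back to back on one shared clock. The function names and all numbers are hypothetical; they are not measurements from this sample.

```cpp
#include <cassert>
#include <cstdint>

// Approximate cycle count of a pipelined loop: the first iteration needs
// `latency` cycles, and each later iteration launches `ii` cycles after
// the one before it. This is a simplification, not the compiler's model.
inline std::uint64_t LoopCycles(std::uint64_t trip_count, std::uint64_t ii,
                                std::uint64_t latency) {
  if (trip_count == 0) return 0;
  return latency + (trip_count - 1) * ii;
}

// Run time in seconds of a short loop followed by a long loop. Both loops
// are clocked at the same fmax_hz because they share one clock domain.
inline double KernelSeconds(std::uint64_t short_trips, std::uint64_t short_ii,
                            std::uint64_t long_trips, std::uint64_t long_ii,
                            std::uint64_t latency, double fmax_hz) {
  const std::uint64_t cycles = LoopCycles(short_trips, short_ii, latency) +
                               LoopCycles(long_trips, long_ii, latency);
  return static_cast<double>(cycles) / fmax_hz;
}
```

With assumed values of 16 short iterations, 1,000,000 long iterations, and a latency of 10 cycles, KernelSeconds(16, 1, 1000000, 1, 10, 200e6) is about 5.0 ms, whereas KernelSeconds(16, 5, 1000000, 1, 10, 400e6) is about 2.5 ms. Relaxing the short loop to an II of 5 adds only 60 cycles, and the higher shared clock speeds up every cycle of the long loop.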
In the presence of loop-carried dependencies, it may be impossible for the compiler to schedule a given loop with II = 1 while respecting a target fMAX.
In this case, the compiler can either:
- Increase the cycle time (trading off fMAX) to allow operations with loop-carried dependencies to be executed in one clock cycle in order to achieve an II of 1.
- Maintain the cycle time so the loop body executes in multiple clock cycles, while increasing the number of clock cycles between subsequent loop iterations (trading off II), until the next loop iteration is able to execute after the last operation of a loop-carried dependency has finished.
The intel::initiation_interval
attribute gives the user explicit control over the fMAX-II tradeoff.
By default, the compiler attempts to schedule each loop with the minimum product of the II and cycle time (1/fMAX), while ensuring that all loop-carried dependencies are satisfied. The resulting loop block might not achieve the targeted fMAX, because the heuristic can be satisfied by either a low II or a high fMAX: the combination of fMAX and II that scores best may still fall short of the target fMAX. This can cause performance bottlenecks because fMAX is a global constraint, whereas II is a local constraint.
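The default heuristic described above can be sketched as choosing, among candidate schedules, the one with the smallest II multiplied by cycle time. This is an illustration of the stated rule only, not the compiler's actual algorithm; the Candidate type, the function names, and the example frequencies are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One hypothetical scheduling option for a loop.
struct Candidate {
  int ii;           // initiation interval in clock cycles
  double fmax_mhz;  // estimated clock frequency in MHz
};

// Time between successive iteration launches in nanoseconds:
// II * cycle time = II * (1000 / fMAX in MHz).
inline double IterationIntervalNs(const Candidate& c) {
  return c.ii * 1000.0 / c.fmax_mhz;
}

// Returns the index of the candidate with the minimum II * cycle time.
// Precondition: `candidates` is not empty.
inline std::size_t PickMinProduct(const std::vector<Candidate>& candidates) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < candidates.size(); i++) {
    if (IterationIntervalNs(candidates[i]) <
        IterationIntervalNs(candidates[best])) {
      best = i;
    }
  }
  return best;
}
```

For example, given the assumed candidates {II = 1, 300 MHz} and {II = 2, 480 MHz}, the first one wins (about 3.33 ns versus about 4.17 ns per iteration) even though it misses a 480 MHz target. Because the clock is shared, that locally better choice can slow down every other loop in the design.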
The intel::initiation_interval attribute can be used to specify an II for a particular loop. It instructs the compiler to ignore the default heuristic and to try to schedule the loop that the attribute is applied to with the specific II the user provides.
The targeted fMAX can be specified using the -Xsclock compiler argument. The argument determines the pipelining effort of the compiler, which uses an internal model of the FPGA fabric to estimate fMAX. The true fMAX is known only after compiling to hardware. Without the argument, the default target fMAX is 240 MHz for the Intel® Arria® 10 FPGAs and 480 MHz for the Intel® Stratix® 10 and Agilex® 7 FPGAs, but the compiler will not strictly enforce reaching that default target when scheduling loops.
Note: The scheduler prioritizes II over fMAX if both -Xsclock and intel::initiation_interval are used. Your kernel may be able to achieve a lower II than the one given in the intel::initiation_interval attribute while targeting a specific fMAX, but the loop will not be scheduled with the lower II.
To let the compiler attempt to set the II for a loop to a positive constant expression of integer type n, declare the attribute above the loop. For example:
[[intel::initiation_interval(n)]] // n is required
for (int i = 0; i < N; i++) {
s *= a;
s += b;
}
The attribute has two use cases:
- Allow users to assert an II for a loop. This is useful during development when making changes that could compromise a previously achieved II. After finding out that a loop can be scheduled with a specific II, you can use the intel::initiation_interval attribute to set the achieved II as the II the compiler must achieve. If, after later changes, the compiler can no longer schedule the loop with that II, it produces an error. This makes changes that cause throughput drops easy to identify in larger designs.
- Alter the compiler's default fMAX-II tradeoff, usually by relaxing II. An in-depth example is given in this code sample.
The code sample gives a trivial kernel in which the choice made by the compiler is suboptimal and the intel::initiation_interval
attribute can be used to improve performance.
This tutorial contains two distinct pipelineable loops:
- A short-running initialization loop that has a long feedback path as a result of a loop-carried dependence
- A long-running loop that does the bulk of the processing, with a feedback path
Note: The operations performed in the short and long-running loops are for illustrative purposes only.
Because the tutorial shows performance impacts in terms of fMAX, and the compiler implements all kernels in a common clock domain, the two variants cannot be compared as two kernels within a single compile. To see the impact of the intel::initiation_interval optimization in this tutorial, compile the design twice.
Part 1 compiles the kernel code without setting the ENABLE_II macro, whereas Part 2 compiles the kernel while setting this macro. The macro chooses between two code segments that are functionally equivalent, but the ENABLE_II version of the code demonstrates the two use cases of intel::initiation_interval.
In Part 1, the compiler follows its default behavior. It does not know that the initialization loop has a smaller impact on the overall throughput, so it schedules the loop using the minimum II/fMAX ratio. Because the initialization loop has a loop-carried dependence, it has a feedback path in the generated hardware. The targeted clock frequency might not be achieved by the scheduler when optimizing for the minimum II/fMAX.
Depending on the feedback path in the long-running loop, the rest of the kernel could have run at a higher fMAX, which is the case in this design. The long-running loop is able to achieve an II of 1 while targeting the default fMAX, but because all blocks share one clock, it is bottlenecked by the block with the lowest fMAX, resulting in lowered throughput.
In Part 2, intel::initiation_interval is used for both the short-running and long-running loops to show the two scenarios where using the attribute is appropriate.
The first intel::initiation_interval
declaration sets an II value of 3 for the Intel® Arria® 10 FPGA, and an II value of 5 for the Intel® Stratix® 10 and Agilex® 7 FPGAs. Since the initialization loop has a low trip count compared to the long-running loop, a higher II for the initialization loop is a reasonable tradeoff to allow for a higher overall fMAX for the entire kernel.
Note: For Intel® Stratix® 10 FPGA, the estimated fMAX of the long-running loop is not able to reach the default targeted fMAX of 480MHz while maintaining an II of 1. This is due to the nature of the feedback path that exists in the long running loop. Setting the II of the initialization loop to 5 ensures that the initialization loop is not the bottleneck when finding the maximum operating frequency.
The second intel::initiation_interval
declaration sets an II of 1 for the long-running loop. We might not want to compromise the II of 1 achieved for this loop while performing optimizations on other parts of the kernel. By declaring that the loop should have an II of 1, the compiler will produce an error if it cannot schedule this loop with that II. The error implies that the other optimization will have a negative performance impact on this loop. This makes it easier to find the cause of any throughput drops in larger designs.
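The two declarations described above could look roughly like the following sketch. It is not the sample's actual source: the LOOP_II helper macro, the TwoLoopBody function, the trip count of 8, and the arithmetic are all hypothetical. Only the ENABLE_II macro name and the II values (5 for the Intel® Stratix® 10 and Agilex® 7 case, and 1 for the long-running loop) are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper macro: expands to the attribute only when ENABLE_II
// is defined, and to nothing otherwise. A compiler that does not know the
// attribute may warn about it, so leave ENABLE_II undefined in that case.
#if defined(ENABLE_II)
#define LOOP_II(n) [[intel::initiation_interval(n)]]
#else
#define LOOP_II(n)
#endif

// Illustrative stand-in for a kernel body with a short initialization
// loop followed by a long-running loop. Both loops carry a dependency.
inline int TwoLoopBody(const std::vector<int>& data) {
  int seed = 1;
  // Use case 1: relax the II of the short loop to allow a higher fMAX.
  LOOP_II(5)
  for (int i = 0; i < 8; i++) {
    seed = seed * 3 + 1;
  }

  int sum = seed;
  // Use case 2: assert an II of 1 so that a later change which breaks it
  // shows up as a compile error.
  LOOP_II(1)
  for (std::size_t i = 0; i < data.size(); i++) {
    sum += data[i];
  }
  return sum;
}
```

Both variants compute the same result; the attribute changes only how the loops are scheduled in hardware.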
Note: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the setvars script in the root of your oneAPI installation every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.
Linux*:
- For system wide installations:
. /opt/intel/oneapi/setvars.sh
- For private installations:
. ~/intel/oneapi/setvars.sh
- For non-POSIX shells, like csh, use the following command:
bash -c 'source <install-dir>/setvars.sh ; exec csh'
Windows*:
C:\"Program Files (x86)"\Intel\oneAPI\setvars.bat
- For Windows PowerShell*, use the following command:
cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'
For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS* or Use the setvars Script with Windows*.
Linux*:
- Change to the sample directory.
- Build the program for the Intel® Agilex® 7 device family, which is the default.
mkdir build
cd build
cmake .. -DPART=<X>
where -DPART=<X> is one of:
- -DPART=II_ENABLED
- -DPART=II_DISABLED
Use -DPART=II_ENABLED to build the project with the II attribute enabled.
Note: You can change the default target by using the command:
cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
Note: You can poll your system for available BSPs using the aoc -list-boards command. The board list that is printed out will be of the form:
$> aoc -list-boards
Board list:
<board-variant>
Board Package: <path/to/board/package>/board-support-package
<board-variant2>
Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.
- Compile the design. (The provided targets match the recommended development flow.)
- Compile and run for emulation (fast compile time, targets emulated FPGA device).
make fpga_emu
- Generate the HTML optimization reports. (See Read the Reports below for information on finding and understanding the reports.)
make report
- Compile for simulation (fast compile time, targets simulated FPGA device).
make fpga_sim
- Compile and run on FPGA hardware (longer compile time, targets an FPGA device).
make fpga
Windows*:
- Change to the sample directory.
- Build the program for the Intel® Agilex® 7 device family, which is the default.
mkdir build
cd build
cmake -G "NMake Makefiles" .. -DPART=<X>
where -DPART=<X> is one of:
- -DPART=II_ENABLED
- -DPART=II_DISABLED
Use -DPART=II_ENABLED to build the project with the II attribute enabled.
Note: You can change the default target by using the command:
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
Note: You can poll your system for available BSPs using the aoc -list-boards command. The board list that is printed out will be of the form:
$> aoc -list-boards
Board list:
<board-variant>
Board Package: <path/to/board/package>/board-support-package
<board-variant2>
Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.
- Compile the design. (The provided targets match the recommended development flow.)
- Compile for emulation (fast compile time, targets emulated FPGA device).
nmake fpga_emu
- Generate the optimization report. (See Read the Reports below for information on finding and understanding the reports.)
nmake report
- Compile for simulation (fast compile time, targets simulated FPGA device, reduced problem size).
nmake fpga_sim
- Compile for FPGA hardware (longer compile time, targets FPGA device):
nmake fpga
Note: If you encounter any issues with long paths when compiling under Windows*, you may have to create your 'build' directory in a shorter path, for example c:\samples\build. You can then run cmake from that directory, and provide cmake with the full path to your sample directory, for example:
C:\samples\build> cmake -G "NMake Makefiles" C:\long\path\to\code\sample\CMakeLists.txt
Locate the report.html file in either:
- Report-only compile:
loop_initiation_interval.report.prj
- FPGA hardware compile:
loop_initiation_interval.fpga.prj
In the report for the design without the intel::initiation_interval attribute (when cmake was configured without -DUSER_FLAGS=-DENABLE_II), navigate to the Loop Analysis report (Throughput Analysis > Loop Analysis). Click the SimpleMath kernel in the Loop List panel and use the Bottlenecks viewer panel in the bottom left. You will see that a throughput bottleneck exists in the SimpleMath kernel.
Select the bottleneck. The report shows that the estimated fMAX is significantly lower than the target fMAX and shows the feedback path responsible, which is the feedback path in the initialization loop.
The Loop Analysis report shows that the long-running loop achieves the target fMAX with an II of 1.
Compare these results to the report for the version of the design that uses the intel::initiation_interval attribute (when cmake was configured with -DUSER_FLAGS=-DENABLE_II). Navigate to the Loop Analysis report (Throughput Analysis > Loop Analysis). Here, both loops achieve the target fMAX.
Note: Only the report generated after the FPGA hardware compile will reflect the true performance benefit of using the initiation_interval extension. The difference is not apparent in the reports generated by make report because a design's fMAX cannot be predicted. The final achieved fMAX can be found in loop_initiation_interval.fpga.prj/reports/report.html (after make fpga completes), in Clock Frequency Summary on the main page of the report.
Linux*:
- Run the sample on the FPGA emulator (the kernel executes on the CPU).
./loop_initiation_interval.fpga_emu
- Run the sample on the FPGA simulator device (the kernel executes on the CPU).
CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./loop_initiation_interval.fpga_sim
- Run the sample on the FPGA device (only if you ran cmake with -DFPGA_DEVICE=<board-support-package>:<board-variant>).
./loop_initiation_interval.fpga
Windows*:
- Run the sample on the FPGA emulator (the kernel executes on the CPU).
loop_initiation_interval.fpga_emu.exe
- Run the sample on the FPGA simulator device (the kernel executes on the CPU).
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
loop_initiation_interval.fpga_sim.exe
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
- Run the sample on the FPGA device (only if you ran cmake with -DFPGA_DEVICE=<board-support-package>:<board-variant>).
loop_initiation_interval.fpga.exe
Output of sample without the intel::initiation_interval
attribute.
Kernel Throughput: 0.0635456MB/s
Exec Time: 60.0309s , InputMB: 3.8147MB
PASSED
Output of sample with the intel::initiation_interval
attribute.
Kernel_ENABLE_II Throughput: 0.117578MB/s
Exec Time: 32.4439s , InputMB: 3.8147MB
PASSED
Total throughput improved with the use of the intel::initiation_interval
attribute because the increase in kernel fMAX is more significant than the II relaxation of the low trip-count loop.
This performance difference will be apparent only when running on FPGA hardware. The emulator, while useful for verifying functionality, will generally not reflect differences in performance.
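The reported throughput figures follow directly from the input size and execution time in the sample output above, and the same arithmetic gives the overall speedup. The helper names below are hypothetical and only restate that arithmetic.

```cpp
#include <cassert>
#include <cmath>

// Throughput in MB/s: input size divided by execution time.
inline double ThroughputMBps(double input_mb, double exec_seconds) {
  return input_mb / exec_seconds;
}

// Speedup of the improved run relative to the baseline run.
inline double Speedup(double baseline_seconds, double improved_seconds) {
  return baseline_seconds / improved_seconds;
}
```

With the values printed above, ThroughputMBps(3.8147, 60.0309) is about 0.0635 MB/s and ThroughputMBps(3.8147, 32.4439) is about 0.1176 MB/s, which matches the two runs. Speedup(60.0309, 32.4439) is about 1.85, so the version with the attribute finishes in a little over half the time on this hardware run.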
Code samples are licensed under the MIT license. See License.txt for details.
Third-party program Licenses can be found here: third-party-programs.txt.