
Loop `initiation_interval` Sample

This sample is an FPGA tutorial that demonstrates how to use the intel::initiation_interval attribute to change the initiation interval (II) of a loop in scenarios where doing so improves performance.

Area	Description
What you will learn	The f_MAX-II tradeoff. Default behavior of the compiler when scheduling loops. How to use `intel::initiation_interval` to attempt to set the II for a loop. Scenarios in which `intel::initiation_interval` can be helpful in optimizing kernel performance.
Time to complete	20 minutes
Category	Concepts and Functionality

Purpose

This FPGA tutorial demonstrates how to use the intel::initiation_interval attribute to set the II for a loop. The attribute serves two purposes:

Relax the II of a loop with a loop-carried dependency in order to achieve a higher kernel f_MAX
Enforce the II of a loop such that the compiler will error out if it cannot achieve the specified II

Note: The tutorial assumes you are familiar with the concepts of loop-carried dependencies and initiation interval (II).

A loop-carried dependency refers to a situation where an operation in a loop iteration cannot proceed until an operation from a previous loop iteration has completed.
The initiation interval, or II, is the number of clock cycles between the launch of successive loop iterations.
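
As a rough illustration of the first definition, the sketch below contrasts a loop without a loop-carried dependency with one that has it. It is plain C++ rather than kernel code, and the function names `ScaleEach`, `MultiplyAccumulate`, and `CheckDependencyExamples` are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// No loop-carried dependency: each iteration touches only its own element,
// so iteration i+1 does not have to wait for iteration i.
std::vector<int> ScaleEach(const std::vector<int>& in, int a) {
  std::vector<int> out(in.size());
  for (std::size_t i = 0; i < in.size(); i++) {
    out[i] = a * in[i];
  }
  return out;
}

// Loop-carried dependency: iteration i needs the value of `s` produced by
// iteration i-1 before it can start its multiply and add.
int MultiplyAccumulate(int trip_count, int a, int b) {
  int s = 0;
  for (int i = 0; i < trip_count; i++) {
    s *= a;
    s += b;
  }
  return s;
}

// Small self-check with hand-computed results.
void CheckDependencyExamples() {
  assert((ScaleEach({1, 2, 3}, 2) == std::vector<int>{2, 4, 6}));
  assert(MultiplyAccumulate(3, 2, 1) == 7);  // 0 -> 1 -> 3 -> 7
}
```

In the second loop, the multiply and add form the feedback path that limits how soon the next iteration can launch, which is what the II measures.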

Prerequisites

Optimized for	Description
OS	Ubuntu* 20.04, RHEL/CentOS 8, SUSE* 15, Windows* 10, Windows Server* 2019
Hardware	Intel® Agilex® 7, Agilex® 5, Arria® 10, Stratix® 10, and Cyclone® V FPGAs
Software	Intel® oneAPI DPC++/C++ Compiler

Note: The Intel® oneAPI DPC++/C++ Compiler alone is enough to compile for emulation, generate reports, and generate RTL. The simulation flow and FPGA hardware compiles have additional software requirements.

For using the simulator flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) and one of the following simulators must be installed and accessible through your PATH:

Questa*-Intel® FPGA Edition

Questa*-Intel® FPGA Starter Edition

ModelSim® SE

When using the hardware compile flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) must be installed and accessible through your PATH.

Warning: Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.

This sample is part of the FPGA code samples. It is categorized as a Tier 2 sample that demonstrates a compiler feature.

flowchart LR
   tier1("Tier 1: Get Started")
   tier2("Tier 2: Explore the Fundamentals")
   tier3("Tier 3: Explore the Advanced Techniques")
   tier4("Tier 4: Explore the Reference Designs")

   tier1 --> tier2 --> tier3 --> tier4

   style tier1 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
   style tier2 fill:#f96,stroke:#333,stroke-width:1px,color:#fff
   style tier3 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
   style tier4 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff

Find more information about how to navigate this part of the code samples in the FPGA top-level README.md. You can also find more information about troubleshooting build errors, running the sample on the Intel® DevCloud, using Visual Studio Code with the code samples, links to selected documentation, and more.

Key Implementation Details

The sample illustrates the following important concepts.

The f_MAX-II tradeoff.
Default behavior of the compiler when scheduling loops.
How to use intel::initiation_interval to set the II for a loop.
Scenarios in which intel::initiation_interval can be helpful in optimizing kernel performance.

The intel::initiation_interval attribute is useful when optimizing kernels that have loop-carried dependencies in loops with a short trip count. It prevents the compiler from scheduling such a loop with an f_MAX-II combination that lowers the system-wide f_MAX and therefore decreases throughput.

The f_MAX-II Tradeoff

Generally, striving for the lowest possible II of 1 is preferred. However, in some cases, it may be suboptimal for the scheduler to do so.

For example, consider a loop with loop-carried dependencies. The compiler must ensure that these dependencies are satisfied. To achieve an II of 1, the compiler must schedule all of the operations necessary to compute loop-carried dependencies within a single clock cycle. As the number of operations in a clock cycle increases, the circuit's clock frequency (f_MAX) must decrease. The lower clock frequency slows down the entire circuit, not just the single loop. This is the f_MAX-II tradeoff. Sometimes, the benefits of achieving an II of 1 for a particular loop may not outweigh the negative impact of reducing f_MAX for the entire system.

In the presence of loop-carried dependencies, it may be impossible for the compiler to schedule a given loop with II = 1 while respecting a target f_MAX.

In this case, the compiler can either:

Increase the cycle time (trading off f_MAX) to allow operations with loop-carried dependencies to be executed in one clock cycle in order to achieve an II of 1.
Maintain the cycle time so the loop body executes in multiple clock cycles, while increasing the number of clock cycles between subsequent loop iterations (trading off II), until the next loop iteration is able to execute after the last operation of a loop-carried dependency has finished.

The intel::initiation_interval attribute gives the user explicit control over the f_MAX-II tradeoff.
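
The two choices can be sketched with a deliberately simplified model. This is not how the compiler schedules loops internally: the names `Schedule`, `StretchClock`, `RelaxII`, and `FmaxMHz` are hypothetical, and the 10 ns dependency chain and 4 ns target cycle time are made-up numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// One way to schedule a loop: its II and the clock period it needs.
struct Schedule {
  int ii;                // clock cycles between successive iteration launches
  double cycle_time_ns;  // clock period, that is, 1 / f_MAX
};

// Choice 1: stretch the clock so the dependency chain fits in one cycle (II = 1).
Schedule StretchClock(double chain_delay_ns, double target_cycle_ns) {
  assert(chain_delay_ns > 0.0 && target_cycle_ns > 0.0);
  return Schedule{1, std::max(chain_delay_ns, target_cycle_ns)};
}

// Choice 2: keep the target clock and launch the next iteration only after
// enough cycles have passed for the dependency chain to finish.
Schedule RelaxII(double chain_delay_ns, double target_cycle_ns) {
  assert(chain_delay_ns > 0.0 && target_cycle_ns > 0.0);
  const int ii = static_cast<int>(std::ceil(chain_delay_ns / target_cycle_ns));
  return Schedule{ii, target_cycle_ns};
}

// Converts a clock period in nanoseconds to a frequency in MHz.
double FmaxMHz(const Schedule& s) { return 1000.0 / s.cycle_time_ns; }

// Hypothetical numbers: a 10 ns dependency chain and a 4 ns (250 MHz) target.
void CheckTradeoffExample() {
  const Schedule stretched = StretchClock(10.0, 4.0);
  const Schedule relaxed = RelaxII(10.0, 4.0);
  assert(stretched.ii == 1 && FmaxMHz(stretched) == 100.0);
  assert(relaxed.ii == 3 && FmaxMHz(relaxed) == 250.0);
}
```

Under these assumed numbers, the first choice yields II = 1 at 100 MHz and the second yields II = 3 at 250 MHz.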

Compiler Default Heuristics and Overrides

By default, the compiler attempts to schedule each loop with the minimum product of the II and the cycle time (1/f_MAX), while ensuring that all loop-carried dependencies are fulfilled. Because this product can be minimized by either a low II or a high f_MAX, the combination with the best product might not achieve the target f_MAX. This can cause a performance bottleneck, because f_MAX is a global constraint that applies to the whole design, whereas II is local to a single loop.
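
A minimal sketch of this selection rule follows, assuming the compiler is comparing a small set of candidate (II, f_MAX) pairs. The real scheduler is more involved; `Candidate`, `NsPerIteration`, and `PickByProduct` are hypothetical names and the candidate values are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A candidate schedule for one loop.
struct Candidate {
  int ii;           // initiation interval in clock cycles
  double fmax_mhz;  // estimated clock frequency in MHz
};

// Time between iteration launches in nanoseconds: II * (1 / f_MAX).
double NsPerIteration(const Candidate& c) {
  assert(c.ii >= 1 && c.fmax_mhz > 0.0);
  return c.ii * 1000.0 / c.fmax_mhz;
}

// Returns the index of the candidate with the smallest II * cycle-time product.
std::size_t PickByProduct(const std::vector<Candidate>& candidates) {
  assert(!candidates.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < candidates.size(); i++) {
    if (NsPerIteration(candidates[i]) < NsPerIteration(candidates[best])) {
      best = i;
    }
  }
  return best;
}

// Hypothetical candidates: II = 1 at 100 MHz (10 ns per iteration) versus
// II = 3 at 250 MHz (12 ns per iteration).
void CheckHeuristicExample() {
  const std::vector<Candidate> candidates = {{1, 100.0}, {3, 250.0}};
  assert(PickByProduct(candidates) == 0);
}
```

With these invented values the rule prefers II = 1 at 100 MHz, even though only the second candidate would meet a 250 MHz target.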

The intel::initiation_interval attribute can be used to specify an II for a particular loop. It instructs the compiler to ignore the default heuristic and to try to schedule the loop that the attribute is applied to with the II that the user provides.

The targeted f_MAX can be specified using the -Xsclock compiler argument. The argument determines the pipelining effort of the compiler, which uses an internal model of the FPGA fabric to estimate f_MAX. The true f_MAX is known only after compiling to hardware. Without the argument, the default target f_MAX is 240MHz for the Intel® Arria® 10 FPGAs and 480MHz for the Intel® Stratix® 10 and Agilex® 7 FPGAs, but the compiler will not strictly enforce reaching that default target when scheduling loops.

Note: The scheduler prioritizes II over f_MAX if both -Xsclock and intel::initiation_interval are used. Your kernel may be able to achieve a lower II for the loop with the intel::initiation_interval attribute while targeting a specific f_MAX, but the loop will not be scheduled with the lower II.

Syntax

To let the compiler attempt to set the II for a loop to a positive constant expression of integer type n, declare the attribute above the loop. For example:

[[intel::initiation_interval(n)]] // n is required
for (int i = 0; i < N; i++) {
  s *= a;
  s += b;
}
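
For context, here is a slightly fuller sketch that places the same loop inside a function. The `LOOP_II` macro and the `MultiplyAdd` helper are hypothetical and not part of the sample; the macro is an assumption that lets the snippet also build with compilers that do not know the attribute. In the actual sample, loops like this live inside a SYCL kernel.

```cpp
#include <cassert>

// Assumption for this sketch: expand to the attribute only when building with
// the Intel oneAPI DPC++/C++ Compiler, and to nothing otherwise.
#if defined(__INTEL_LLVM_COMPILER)
#define LOOP_II(n) [[intel::initiation_interval(n)]]
#else
#define LOOP_II(n)
#endif

// Hypothetical helper that wraps the multiply-add loop shown above.
int MultiplyAdd(int s, int a, int b, int trip_count) {
  assert(trip_count >= 0);
  // Ask the compiler to try to schedule this loop with an II of 2.
  LOOP_II(2)
  for (int i = 0; i < trip_count; i++) {
    s *= a;
    s += b;
  }
  return s;
}

// The attribute changes scheduling only, never the computed result.
void CheckMultiplyAdd() {
  assert(MultiplyAdd(0, 2, 1, 3) == 7);   // 0 -> 1 -> 3 -> 7
  assert(MultiplyAdd(1, 3, 2, 2) == 17);  // 1 -> 5 -> 17
}
```

The value passed to the attribute must be a compile-time constant, which is why the sketch passes a literal rather than a runtime variable.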

Use Cases for `intel::initiation_interval`

Allow users to assert an II for a loop.

This is useful during development when making changes that could compromise a previously achieved II. After finding that a loop can be scheduled with a specific II, you can use the intel::initiation_interval attribute to set that II as the one the compiler must achieve. If, after later changes, the compiler can no longer schedule the loop with that II, it produces an error. This allows changes that cause throughput drops to be easily identified in larger designs.

Alter the compiler's default f_MAX-II tradeoff, usually by relaxing II.

An in-depth example is given in this code sample.

Code Sample: Overriding the f_MAX-II Heuristic in the Compiler

The code sample gives a trivial kernel in which the choice made by the compiler is suboptimal and the intel::initiation_interval attribute can be used to improve performance.

This tutorial contains two distinct pipelineable loops:

A short-running initialization loop that has a long feedback path as a result of a loop-carried dependence
A long-running loop that does the bulk of the processing, with a feedback path

Note: The operations performed in the short and long-running loops are for illustrative purposes only.

Since the tutorial shows performance impacts in terms of f_MAX and all kernels are implemented by the compiler in a common clock domain, the results cannot be shown in two kernels that are compiled once. To see the impact of the intel::initiation_interval optimization in this tutorial, compile the design twice.

Part 1 compiles the kernel code without setting the ENABLE_II macro, whereas Part 2 compiles the kernel while setting this macro. The macro chooses between two code segments that are functionally equivalent, but the ENABLE_II version of the code demonstrates the two use cases of intel::initiation_interval.

Part 1: Without `ENABLE_II`

According to the default behavior, the compiler does not know that the initialization loop has a smaller impact on the overall throughput. Thus, the compiler schedules the loop using the minimum II/f_MAX ratio. Because the initialization loop has a loop-carried dependence, it has a feedback path in the generated hardware. The targeted clock frequency might not be achieved by the scheduler when optimizing for the minimum II/f_MAX.

Depending on the feedback path in the long-running loop, the rest of the kernel could have run at a higher f_MAX, which is the case in this design. The long-running loop is able to achieve an II of 1 while targeting the default f_MAX, but because all blocks share one clock, the kernel is limited to the highest f_MAX that every block can achieve, which here is set by the initialization loop. The result is lowered throughput.

Part 2: With `ENABLE_II`

In this part, intel::initiation_interval is used for both the short and long running loops to show the two scenarios where using the attribute is appropriate.

The first intel::initiation_interval declaration sets an II value of 3 for the Intel® Arria® 10 FPGA, and an II value of 5 for the Intel® Stratix® 10 and Agilex® 7 FPGAs. Since the initialization loop has a low trip count compared to the long-running loop, a higher II for the initialization loop is a reasonable tradeoff to allow for a higher overall f_MAX for the entire kernel.
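
The reasoning can be checked with a back-of-the-envelope model, sketched below. It assumes total cycles are roughly II times trip count for each loop, ignores pipeline fill latency, and uses invented trip counts and frequencies; `LoopSchedule` and `KernelTimeUs` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Schedule of one loop in the kernel.
struct LoopSchedule {
  long long trip_count;  // number of iterations
  int ii;                // initiation interval in clock cycles
  double fmax_mhz;       // f_MAX this loop's hardware could reach on its own
};

// Estimated kernel time in microseconds. Both loops share one clock, so the
// slower of the two f_MAX values applies to the whole kernel.
double KernelTimeUs(const LoopSchedule& init, const LoopSchedule& main_loop) {
  assert(init.fmax_mhz > 0.0 && main_loop.fmax_mhz > 0.0);
  const double system_fmax_mhz = std::min(init.fmax_mhz, main_loop.fmax_mhz);
  const double cycles =
      static_cast<double>(init.trip_count) * init.ii +
      static_cast<double>(main_loop.trip_count) * main_loop.ii;
  return cycles / system_fmax_mhz;  // MHz equals cycles per microsecond
}

// Hypothetical case: 100 initialization iterations, 1,000,000 main iterations.
void CheckSharedClockExample() {
  const LoopSchedule main_loop{1000000, 1, 250.0};
  const double default_us = KernelTimeUs({100, 1, 100.0}, main_loop);
  const double relaxed_us = KernelTimeUs({100, 3, 250.0}, main_loop);
  assert(std::fabs(default_us - 10001.0) < 1e-6);  // 1,000,100 cycles at 100 MHz
  assert(std::fabs(relaxed_us - 4001.2) < 1e-6);   // 1,000,300 cycles at 250 MHz
  assert(relaxed_us < default_us);
}
```

In this made-up case, relaxing the initialization loop costs 200 extra cycles but lets the whole kernel run at 250 MHz instead of 100 MHz.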

Note: For Intel® Stratix® 10 FPGA, the estimated f_MAX of the long-running loop is not able to reach the default targeted f_MAX of 480MHz while maintaining an II of 1. This is due to the nature of the feedback path that exists in the long running loop. Setting the II of the initialization loop to 5 ensures that the initialization loop is not the bottleneck when finding the maximum operating frequency.

The second intel::initiation_interval declaration sets an II of 1 for the long-running loop. We might not want to compromise the II of 1 achieved for this loop while performing optimizations on other parts of the kernel. By declaring that the loop should have an II of 1, the compiler will produce an error if it cannot schedule this loop with that II. The error implies that the other optimization will have a negative performance impact on this loop. This makes it easier to find the cause of any throughput drops in larger designs.
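
The two declarations might be arranged roughly as follows. This is a hedged sketch and not the sample's source: the function `SimpleMathSketch`, the macros `RELAXED_II` and `ASSERT_II_1`, and the loop bodies are hypothetical, and the II of 5 is the value the tutorial quotes for Intel® Stratix® 10 and Agilex® 7 FPGAs. The sketch also assumes the attributes should be emitted only when building with the Intel oneAPI DPC++/C++ Compiler.

```cpp
#include <cassert>

// Emit the attributes only when ENABLE_II is defined and the Intel compiler
// is in use (an assumption made so the sketch builds elsewhere too).
#if defined(ENABLE_II) && defined(__INTEL_LLVM_COMPILER)
#define RELAXED_II [[intel::initiation_interval(5)]]
#define ASSERT_II_1 [[intel::initiation_interval(1)]]
#else
#define RELAXED_II
#define ASSERT_II_1
#endif

// Hypothetical stand-in for a kernel body with a short initialization loop
// followed by a long-running loop.
int SimpleMathSketch(int seed, int init_trips, int main_trips) {
  assert(init_trips >= 0 && main_trips >= 0);

  // Short loop with a feedback path: relax its II to protect kernel f_MAX.
  int state = seed;
  RELAXED_II
  for (int i = 0; i < init_trips; i++) {
    state = state * 3 + 1;
  }

  // Long loop: assert II = 1 so the compiler reports an error if a later
  // change makes that II unachievable.
  int sum = 0;
  ASSERT_II_1
  for (int i = 0; i < main_trips; i++) {
    sum += state + i;
  }
  return sum;
}

// Both variants are functionally equivalent; only the schedule differs.
void CheckSimpleMathSketch() {
  assert(SimpleMathSketch(1, 2, 3) == 42);  // state: 1 -> 4 -> 13; 13 + 14 + 15
}
```

Because the macros expand to nothing when ENABLE_II is not defined, the same source yields both parts of the tutorial's comparison.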

Build the `Loop Initiation Interval` Tutorial

Note: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the setvars script in the root of your oneAPI installation every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.

Linux*:

For system wide installations: . /opt/intel/oneapi/setvars.sh

For private installations: . ~/intel/oneapi/setvars.sh

For non-POSIX shells, like csh, use the following command: bash -c 'source <install-dir>/setvars.sh ; exec csh'

Windows*:

C:\"Program Files (x86)"\Intel\oneAPI\setvars.bat

Windows PowerShell*, use the following command: cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'

For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS* or Use the setvars Script with Windows*.

On Linux*

Change to the sample directory.
Build the program for the Intel® Agilex® 7 device family, which is the default.
```
mkdir build
cd build
cmake .. -DPART=<x>
```
where -DPART=<x> is one of:
- -DPART=II_ENABLED
- -DPART=II_DISABLED

Use -DPART=II_ENABLED to build the project with the II attribute enabled.
Note: You can change the default target by using the command:
```
cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
```
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
```
cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
```

Note: You can poll your system for available BSPs using the aoc -list-boards command. The board list that is printed out will be of the form
$> aoc -list-boards
Board list:
  <board-variant>
     Board Package: <path/to/board/package>/board-support-package
  <board-variant2>
     Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.

Compile the design. (The provided targets match the recommended development flow.)
1. Compile and run for emulation (fast compile time, targets emulated FPGA device).
```
make fpga_emu
```
2. Generate the HTML optimization reports. (See Read the Reports below for information on finding and understanding the reports.)
```
make report
```
3. Compile for simulation (fast compile time, targets simulated FPGA device).
```
make fpga_sim
```
4. Compile and run on FPGA hardware (longer compile time, targets an FPGA device).
```
make fpga
```

On Windows*

Change to the sample directory.
Build the program for the Intel® Agilex® 7 device family, which is the default.
```
mkdir build
cd build
cmake -G "NMake Makefiles" .. -DPART=<x>
```
where -DPART=<x> is one of:
- -DPART=II_ENABLED
- -DPART=II_DISABLED

Use -DPART=II_ENABLED to build the project with the II attribute enabled.
Note: You can change the default target by using the command:
```
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
```
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
```
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
```

Note: You can poll your system for available BSPs using the aoc -list-boards command. The board list that is printed out will be of the form
$> aoc -list-boards
Board list:
  <board-variant>
     Board Package: <path/to/board/package>/board-support-package
  <board-variant2>
     Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.

Compile the design. (The provided targets match the recommended development flow.)
1. Compile for emulation (fast compile time, targets emulated FPGA device).
```
nmake fpga_emu
```
2. Generate the optimization report. (See Read the Reports below for information on finding and understanding the reports.)
```
nmake report
```
3. Compile for simulation (fast compile time, targets simulated FPGA device, reduced problem size).
```
nmake fpga_sim
```
4. Compile for FPGA hardware (longer compile time, targets FPGA device).
```
nmake fpga
```

Note: If you encounter any issues with long paths when compiling under Windows*, you may have to create your 'build' directory in a shorter path, for example c:\samples\build. You can then run cmake from that directory, and provide cmake with the full path to your sample directory, for example:
C:\samples\build> cmake -G "NMake Makefiles" C:\long\path\to\code\sample\CMakeLists.txt

Read the Reports

Locate the report.html file in either:

Report-only compile: loop_initiation_interval.report.prj
FPGA hardware compile: loop_initiation_interval.fpga.prj

In the report for the design without the intel::initiation_interval attribute (when cmake was configured with -DPART=II_DISABLED), navigate to the Loop Analysis report (Throughput Analysis > Loop Analysis). Click the SimpleMath kernel in the Loop List panel and use the Bottlenecks viewer panel in the bottom left. You will see that a throughput bottleneck exists in the SimpleMath kernel.

Select the bottleneck. The report shows that the estimated f_MAX is significantly lower than the target f_MAX and shows the feedback path responsible, which is the feedback path in the initialization loop.

The Loop Analysis report shows that the long-running loop achieves the target f_MAX with an II of 1.

Compare these results to the report for the version of the design that uses the intel::initiation_interval attribute (when cmake was configured with -DPART=II_ENABLED). Navigate again to the Loop Analysis report (Throughput Analysis > Loop Analysis). Here both loops achieve the target f_MAX.

Note: Only the report generated after the FPGA hardware compile will reflect the true performance benefit of using the intel::initiation_interval attribute. The difference is not apparent in the reports generated by make report because those reports contain only an estimate of f_MAX, and the true f_MAX is known only after compiling to hardware. The final achieved f_MAX can be found in loop_initiation_interval.fpga.prj/reports/report.html (after make fpga completes), in Clock Frequency Summary on the main page of the report.

Run the `Loop Initiation Interval` Sample

On Linux

Run the sample on the FPGA emulator (the kernel executes on the CPU).
```
./loop_initiation_interval.fpga_emu
```
Run the sample on the FPGA simulator device (the kernel executes on the CPU).
```
CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./loop_initiation_interval.fpga_sim
```
Run the sample on the FPGA device (only if you ran cmake with -DFPGA_DEVICE=<board-support-package>:<board-variant>).
```
./loop_initiation_interval.fpga
```

On Windows

Run the sample on the FPGA emulator (the kernel executes on the CPU).
```
loop_initiation_interval.fpga_emu.exe
```

Run the sample on the FPGA simulator device (the kernel executes on the CPU).

set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
loop_initiation_interval.fpga_sim.exe
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=

Run the sample on the FPGA device (only if you ran cmake with -DFPGA_DEVICE=<board-support-package>:<board-variant>).
```
loop_initiation_interval.fpga.exe
```

Example Output

Output of sample without the intel::initiation_interval attribute.

Kernel Throughput: 0.0635456MB/s
Exec Time: 60.0309s , InputMB: 3.8147MB
PASSED

Output of sample with the intel::initiation_interval attribute.

Kernel_ENABLE_II Throughput: 0.117578MB/s
Exec Time: 32.4439s , InputMB: 3.8147MB
PASSED

Total throughput improved with the use of the intel::initiation_interval attribute because the increase in kernel f_MAX is more significant than the II relaxation of the low trip-count loop.
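
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes throughput is simply input size divided by execution time, which matches the numbers printed above; `ThroughputMBps` and `Speedup` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>

// Throughput in MB/s from the input size and the execution time.
double ThroughputMBps(double input_mb, double exec_time_s) {
  assert(exec_time_s > 0.0);
  return input_mb / exec_time_s;
}

// Ratio of the baseline execution time to the improved execution time.
double Speedup(double baseline_time_s, double improved_time_s) {
  assert(baseline_time_s > 0.0 && improved_time_s > 0.0);
  return baseline_time_s / improved_time_s;
}

// Values copied from the example output above.
void CheckExampleOutput() {
  const double input_mb = 3.8147;
  const double without_attr_s = 60.0309;
  const double with_attr_s = 32.4439;
  assert(std::fabs(ThroughputMBps(input_mb, without_attr_s) - 0.0635456) < 1e-5);
  assert(std::fabs(ThroughputMBps(input_mb, with_attr_s) - 0.117578) < 1e-5);
  assert(std::fabs(Speedup(without_attr_s, with_attr_s) - 1.85) < 0.01);
}
```

For this particular run, the version with the attribute is therefore about 1.85 times faster.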

This performance difference will be apparent only when running on FPGA hardware. The emulator, while useful for verifying functionality, will generally not reflect differences in performance.

License

Code samples are licensed under the MIT license. See License.txt for details.

Third-party program Licenses can be found here: third-party-programs.txt.
