Expand section on profilers (perf and VTune) #381
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -4,39 +4,228 @@ | |||||
\frametitle{Profiling} | ||||||
\begin{block}{Conceptually} | ||||||
\begin{itemize} | ||||||
\item take a measurement of a performance aspect of a program | ||||||
\item Take a measurement of a performance aspect of a program | ||||||
\begin{itemize} | ||||||
\item where in my code is most of the time spent? | ||||||
\item is my program compute or memory bound? | ||||||
\item does my program make good use of the cache? | ||||||
\item is my program using all cores most of the time? | ||||||
\item how often are threads blocked and why? | ||||||
\item which API calls are made and in which order? | ||||||
\item Where in my code is most of the time spent? | ||||||
\item Is my program compute or memory bound? | ||||||
\item Does my program make good use of the cache? | ||||||
\item Is my program using all cores most of the time? | ||||||
\item How often are threads blocked and why? | ||||||
\item Which API calls are made and in which order? | ||||||
\item ... | ||||||
\end{itemize} | ||||||
\item the goal is to find performance bottlenecks | ||||||
\item is usually done on a compiled program, not on source code | ||||||
\item The goal is to find performance bottlenecks | ||||||
\item Usually done on a compiled program, not on source code | ||||||
\end{itemize} | ||||||
\end{block} | ||||||
\end{frame} | ||||||
|
||||||
\begin{frame}[fragile] | ||||||
\frametitle{perf, VTune and uProf} | ||||||
\begin{block}{perf} | ||||||
\frametitle{\mintinline{bash}{perf} -- Performance analysis tools for Linux} | ||||||
\setlength{\leftmargini}{0pt} | ||||||
\begin{itemize} | ||||||
\item perf is a powerful command line profiling tool for linux | ||||||
\item compile with \mintinline{bash}{-g -fno-omit-frame-pointer} | ||||||
\item \mintinline{bash}{perf stat -d <prg>} gathers performance statistics while running \mintinline{bash}{<prg>} | ||||||
\item \mintinline{bash}{perf record -g <prg>} starts profiling \mintinline{bash}{<prg>} | ||||||
\item \mintinline{bash}{perf report} displays a report from the last profile | ||||||
\item More information in \href{https://perf.wiki.kernel.org/index.php/Main_Page}{this wiki}, \href{https://www.brendangregg.com/linuxperf.html}{this website} or \href{https://indico.cern.ch/event/980497/contributions/4130271/attachments/2161581/3647235/linux-systems-performance.pdf}{this talk}. | ||||||
\item Powerful command line profiling tool for Linux | ||||||
\item Not portable: the source code is part of the Linux kernel itself | ||||||
\item Much lower overhead compared with \mintinline{bash}{valgrind} | ||||||
\item In order to profile your code, make sure to compile with | ||||||
\texttt{CXXFLAGS="-O2 -g -fno-omit-frame-pointer"} | ||||||
\item Counting and sampling | ||||||
\begin{itemize} | ||||||
\item Counting -- count occurrences of a given event (e.g.\ cache misses) | ||||||
\item Time-based sampling -- sample the stack at regular time intervals | ||||||
\item Event-based sampling -- take samples when event counter overflows | ||||||
\item Instruction-based sampling -- sample instructions and precisely count events they create | ||||||
\end{itemize} | ||||||
amadio marked this conversation as resolved.
|
||||||
\item Static and dynamic tracing | ||||||
\begin{itemize} | ||||||
\item Static -- pre-defined tracepoints in software (e.g.\ scheduling events) | ||||||
\item Dynamic -- tracepoints created dynamically with \mintinline{bash}{perf probe} | ||||||
\end{itemize} | ||||||
\end{itemize} | ||||||
\end{block} | ||||||
\begin{block}{Intel VTune and AMD uProf} | ||||||
\begin{itemize} | ||||||
\item Graphical profilers from CPU vendors with rich features | ||||||
\item Needs vendor's CPU for full experience | ||||||
\item More information on \href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}{Intel's website} and \href{https://developer.amd.com/amd-uprof/}{AMD's website} | ||||||
\end{itemize} | ||||||
\end{block} | ||||||
\end{frame} | ||||||
|
||||||
\begin{frame}[fragile] | ||||||
\frametitle{\mintinline{bash}{perf} commands} | ||||||
{ \scriptsize | ||||||
\begin{block}{} | ||||||
\begin{minted}{shell-session} | ||||||
$ perf | ||||||
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS] | ||||||
The most commonly used perf commands are: | ||||||
annotate Read perf.data and display annotated code | ||||||
c2c Shared Data C2C/HITM Analyzer. | ||||||
config Get and set variables in a configuration file. | ||||||
diff Read perf.data and display the differential profile | ||||||
evlist List the event names in a perf.data file | ||||||
list List all symbolic event types | ||||||
mem Profile memory accesses | ||||||
record Run a command and record its profile into perf.data | ||||||
report Read perf.data and display the profile | ||||||
sched Tool to trace/measure scheduler properties (latencies) | ||||||
script Read perf.data and display trace output | ||||||
stat Run command and gather performance counter statistics | ||||||
top System profiling tool. | ||||||
version display the version of perf binary | ||||||
probe Define new dynamic tracepoints | ||||||
trace strace inspired tool | ||||||
See 'perf help COMMAND' for more information on a specific command. | ||||||
\end{minted} | ||||||
\end{block} | ||||||
} | ||||||
\end{frame} | ||||||
Comment on lines +51 to +75
Is this useful? I think I would drop it.
I use a similar slide to this to give a general overview of perf in my own presentations, mentioning that there are more commands than the ones I cover. If you don't want to go into details, this could be a useful slide for that. However, other than that, it's probably fine to drop. I did have to shorten the description of the commands to fit in the slide anyway, so this is not quite what you'd get by running perf without arguments.
On first thought I also found this too much. On second thought, yeah, why shouldn't we leave an overview here.
The point is that this slide would be systematically skipped when you present. So if it's a pure reference, then let's put it in a reference section at the very end. Otherwise, let's drop it.
Useful indeed, but then I would mention that there are a lot of commands, not list them.
It seems that most people don't think it's useful, so I will drop this slide.
||||||
|
||||||
\begin{frame}[fragile] | ||||||
\frametitle{Listing events with \mintinline{bash}{perf list}} | ||||||
{ \scriptsize | ||||||
\begin{block}{} | ||||||
\begin{minted}{shell-session} | ||||||
$ # List main hardware events | ||||||
$ perf list hw | ||||||
|
||||||
List of pre-defined events (to be used in -e): | ||||||
|
||||||
branch-instructions OR branches [Hardware event] | ||||||
branch-misses [Hardware event] | ||||||
cache-misses [Hardware event] | ||||||
cache-references [Hardware event] | ||||||
cpu-cycles OR cycles [Hardware event] | ||||||
instructions [Hardware event] | ||||||
|
||||||
$ # List main software/cache events | ||||||
$ perf list sw | ||||||
$ perf list cache | ||||||
|
||||||
$ # List all pre-defined metrics | ||||||
$ perf list metric | ||||||
|
||||||
$ # List all currently known events: | ||||||
$ perf list | ||||||
\end{minted} | ||||||
\end{block} | ||||||
} | ||||||
\end{frame} | ||||||
|
||||||
\begin{frame}[fragile] | ||||||
\frametitle{Counting events with \mintinline{bash}{perf stat}} | ||||||
{ \scriptsize | ||||||
\begin{block}{} | ||||||
\begin{minted}{shell-session} | ||||||
$ # Standard CPU counter statistics for the specified command: | ||||||
$ perf stat <command> | ||||||
|
||||||
$ # Detailed CPU counter statistics for the specified command: | ||||||
$ perf stat -d <command> | ||||||
$ perf stat -dd <command> | ||||||
|
||||||
$ # Top-down microarchitecture analysis for the entire system, for 10s: | ||||||
$ perf stat -a --topdown -- sleep 10 | ||||||
|
||||||
$ # L1 cache hit rate reported every 1000 ms for the specified command: | ||||||
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses -I 1000 <command> | ||||||
|
||||||
$ # Instructions per cycle and instruction-level parallelism, for command: | ||||||
$ perf stat -M IPC,ILP -- <command> | ||||||
|
||||||
$ # Measure GFLOPs system-wide, until Ctrl-C is used to stop: | ||||||
$ perf stat -M GFLOPs | ||||||
|
||||||
$ # Measure cycles and instructions 10 times, report results with stddev: | ||||||
$ perf stat -e cycles,instructions -r 10 -- <command> | ||||||
\end{minted} | ||||||
\end{block} | ||||||
} | ||||||
\end{frame} | ||||||
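The raw counts printed by perf stat for L1-dcache-loads and L1-dcache-load-misses do not include a hit rate; it has to be derived as 1 - misses/loads. Likewise, instructions per cycle is simply instructions divided by cycles. The following is a minimal sketch of that arithmetic, assuming a POSIX shell and awk are available. The helper names l1_hit_rate and ipc are hypothetical (they are not part of perf), and the counts are made-up illustrative values, not measured output.

```shell
# Hypothetical helpers (not part of perf): derive ratios from raw counts.

# L1 data cache hit rate in percent: 100 * (1 - misses / loads)
l1_hit_rate() {
    awk -v loads="$1" -v misses="$2" 'BEGIN {
        if (loads == 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * (1 - misses / loads)
    }'
}

# Instructions per cycle: instructions / cycles
ipc() {
    awk -v instructions="$1" -v cycles="$2" 'BEGIN {
        if (cycles == 0) { print "n/a"; exit }
        printf "%.2f\n", instructions / cycles
    }'
}

# Illustrative counts only, not real measurements
l1_hit_rate 2000000 50000    # prints 97.50
ipc 3000000 2000000          # prints 1.50
```

In practice the two arguments would be copied from the corresponding lines of the perf stat output for the same run.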
|
||||||
|
||||||
\begin{frame}[fragile] | ||||||
\frametitle{Recording profiling information with \mintinline{bash}{perf record}} | ||||||
{ \scriptsize | ||||||
\begin{block}{} | ||||||
\begin{minted}{shell-session} | ||||||
$ # Sample on-CPU functions for the specified command, at 100 Hertz: | ||||||
What is an I just tried that command and it counted
I'm starting to wonder whether it's worth keeping examples that cannot be understood simply. The explanation you just gave is already far above the expected knowledge of the people attending the course. In order to explain that, you would need a whole set of slides starting with "thread scheduling", "sampling", etc.
||||||
$ perf record -F 100 -- <command> | ||||||
sponce marked this conversation as resolved.
|
||||||
|
||||||
$ # Sample CPU stack traces system-wide (via frame pointers), at 100 Hertz, for 10s: | ||||||
$ perf record -a -F 100 -g -- sleep 10 | ||||||
Comment on lines +148 to +149
Is the
Ah, good catch, I did intend to have
||||||
|
||||||
$ # Sample stack traces for PID using DWARF to unwind stacks, for 10s: | ||||||
$ perf record -p <PID> --call-graph=dwarf -- sleep 10 | ||||||
Comment on lines +151 to +152
Here, it is even more surprising for me. The PID should give the process to profile. What does the
Yes, the
And here we suppose that people are at ease with frame pointers (previous line) and DWARF. That would require another set of slides by itself. Less and less convinced that we should not simplify drastically and give only one slide of examples with one line of each list/stat/record/report
I tend to agree with @sponce. Maybe I'm assuming too much prior knowledge that the average student doesn't/won't have. I guess in that case, showing just how to do the simplest case, which is to collect and view a report just using the default of
But I'm sure HSF people would love to create a full course dedicated to perf. And I promise I would be one of your first students :-)
I've given a few talks here and there, so I have many slides on perf (not using LaTeX, though). I could think about converting the material I have into a course on performance analysis, and including other less known tools, like bpftrace, uftrace, bcc, etc.
That said, perf itself is more than enough for a full course, as I doubt many people have used
||||||
|
||||||
$ # Precise on-CPU user stack traces (low skid) using PEBS (Intel CPUs): | ||||||
What is an Maybe we need a slide introducing some terms of art and defining the acronyms. Or a glossary.
I explained on-CPU above. Basically, there is a margin of error to attribute samples to instructions, as a number of instructions are in flight in parallel on the CPU at any given time. This error is called the skid in the sampling (see more information here). PEBS stands for Precise Event Based Sampling, and is a feature on Intel CPUs that allows sampling with low or no skid. The sort of equivalent thing on AMD CPUs is IBS, or Instruction-based Sampling.
I hope that someone presenting
Do we need a tool section in the expert part? That could be a solution.
Maybe a tools course, separate from a C++ course. VTune, perf, valgrind, can all be used for much more than just C++, so we can bundle this together with bash, coreutils, and some other command line tools that are used very often and make a new course.
||||||
$ perf record -g -e cycles:up -- <command> | ||||||
|
||||||
$ # Sample CPU stack traces using Instruction-based sampling (AMD CPUs): | ||||||
$ # (Note that you need to use system-wide sampling for IBS on AMD CPUs) | ||||||
$ perf record -a -g -e cycles:pp -- <command> | ||||||
Comment on lines +157 to +159
Isn't
IBS is explained above. The requirement to use system-wide sampling is a hardware requirement when using IBS on AMD CPUs. This is also explained in
||||||
|
||||||
$ # Sample CPU stack traces once every 10k L1 data cache misses, for 5s: | ||||||
$ perf record -a -g -e L1-dcache-load-misses -c 10000 -- sleep 5 | ||||||
|
||||||
$ # Sample CPUs at 100 Hertz, and show top addresses and symbols, live: | ||||||
$ perf top -F 100 | ||||||
\end{minted} | ||||||
\end{block} | ||||||
} | ||||||
\end{frame} | ||||||
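The two sampling modes used in the perf record examples differ in what triggers a sample: -F requests a target number of samples per second, while -c requests one sample every N occurrences of the event. The following is a rough sketch of how many samples to expect in each case. The helper names and the event total are hypothetical, and the frequency-based figure is only an approximate upper bound per CPU, since perf adjusts the sampling period dynamically and a CPU that is mostly idle produces fewer samples.

```shell
# Hypothetical helpers (not part of perf): estimate sample counts.

# Frequency mode (-F): about freq * seconds samples per busy CPU
samples_for_freq() {
    echo $(( $1 * $2 ))
}

# Period mode (-c): one sample every <period> events
samples_for_period() {
    echo $(( $1 / $2 ))
}

# Illustrative values only
samples_for_freq 100 10             # prints 1000
samples_for_period 25000000 10000   # prints 2500
```

Estimates like these help choose -F or -c values that give enough samples for a meaningful profile without producing an overly large perf.data file.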
|
||||||
\begin{frame}[fragile] | ||||||
\frametitle{Reporting and annotating source code with \mintinline{bash}{perf}} | ||||||
{ \scriptsize | ||||||
\begin{block}{} | ||||||
\begin{minted}{shell-session} | ||||||
$ # Standard reporting of perf.data in the text UI: | ||||||
$ perf report | ||||||
|
||||||
$ # Report by self-time (excluding time spent in callees): | ||||||
$ perf report --no-children | ||||||
|
||||||
$ # Report per source line of code (needs debugging info to work): | ||||||
$ perf report --no-children -s srcline | ||||||
|
||||||
$ # Single inverted (caller-based) call-graph per binary: | ||||||
$ perf report --inverted -s comm | ||||||
|
||||||
$ # Text-based report per library, without call graph: | ||||||
$ perf report --stdio -g none -s dso | ||||||
|
||||||
$ # Hierarchical report for functions taking at least 1% of runtime: | ||||||
$ perf report --stdio -g none --hierarchy --percent-limit 1 | ||||||
|
||||||
$ # Disassemble and annotate a symbol (instructions with percentages): | ||||||
$ # (Needs debugging information available to show source code as well) | ||||||
$ perf annotate <symbol> | ||||||
\end{minted} | ||||||
\end{block} | ||||||
} | ||||||
\end{frame} | ||||||
|
||||||
\begin{frame}[fragile] | ||||||
\frametitle{Further information on \mintinline{bash}{perf}} | ||||||
\begin{itemize} | ||||||
\item Official documentation in the Linux repository at | ||||||
\href{https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/Documentation} | ||||||
{linux/tools/perf/Documentation} | ||||||
\item Perf Wiki at \url{https://perf.wiki.kernel.org/} | ||||||
\item Linux \mintinline{bash}{perf} examples by Brendan Gregg | ||||||
\url{https://www.brendangregg.com/linuxperf.html} | ||||||
\item Scripts to visualize profiles as flamegraphs | ||||||
\url{https://github.com/brendangregg/FlameGraph} | ||||||
\item HSF Tools \& Packaging Working Group talk on Indico\\ | ||||||
\href{https://indico.cern.ch/event/974382/} | ||||||
{Linux Systems Performance: Tracing, Profiling \& Visualization} | ||||||
\end{itemize} | ||||||
\end{frame} | ||||||
|
||||||
\begin{frame}[fragile] | ||||||
\frametitle{Intel VTune Profiler} | ||||||
\centering | ||||||
\includegraphics[width=0.75\textwidth]{tools/vtune.png} | ||||||
\begin{itemize} | ||||||
\item Very powerful GUI-based profiler for Intel CPUs and GPUs | ||||||
\item Now free to use with | ||||||
\href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html}{Intel oneAPI Base Toolkit} or | ||||||
\href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}{standalone} | ||||||
\item See the \href{https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/} | ||||||
{official online documentation} for more information | ||||||
\end{itemize} | ||||||
\end{frame} | ||||||
Comment on lines +219 to 231
Does the picture bring something for people not knowing the tool? I would maybe replace it with a bullet highlighting the things it can do which perf cannot (if any) and another giving the downsides.
Since VTune is a graphical tool, I thought it would be nice to show what it looks like when you open it. You can use the picture to show the types of analyses that VTune is able to do instead of a bullet list, and just tell people when presenting about the extra features it has over perf. For detailed usage information, I'd point people to the online docs. One thing I'd mention while presenting is the Top-Down Microarchitecture Analysis, which is a very good method to find bottlenecks. While perf can also do it, it cannot show you detailed information for each symbol like VTune does, and the annotation of source code by VTune is also a lot easier to use than perf's.
We could also link a talk from Ahmad Yasin, who was behind the creation of the Top-Down Microarchitecture Analysis Method at Intel. It's a very nice talk.
I would even like to have more pictures. E.g. I love the microarchitecture analysis with the pipeline visualization. Or what a general hierarchical profile looks like. Or the pane showing contention between threads. Or even better, a live demonstration :)
I do not care about the pictures themselves. I care that if there is a picture, it's understandable, that is that we explain what appears there. In this case, there are a LOT of explanations missing, and I'm not sure we want to include them actually.