Skip to content
This repository has been archived by the owner on Mar 30, 2021. It is now read-only.

Euro LLVM 17 Extended Abstract

Daniel Krupp edited this page Jan 26, 2017 · 54 revisions

Abstract

Today Clang SA can perform (context-sensitive) inter-procedural analysis for C,C++ and Objective C files by ''inlining'' the called function into the callers context. This means that that the full calling context (assumptions about function parameters, global variables) is passed when analyzing the called function and then the assumptions about the returned value is passed back to the caller. This works well for function calls within a translation unit (TU), but when the symbolic execution reaches a function that is implemented in another TU, the analyzer engine skips the analysis of the called function definition. In particular, assumptions about references and pointers passed as function parameter get invalidated, and the return value of the function will be unknown. Loosing precision this way may lead to false positive and false negative findings.

The cross translation unit (CTU) feature would allow the analysis of called functions even if the definition of the function is external to the currently analyzed TU. This would allow detection of bugs in library functions stemming from incorrect usage (e.g. a library assumes that the user will free a memory block allocated by the library), and allows for more precise analysis in general of the caller if a TU external function is invoked (by not loosing assumptions).

We implemented (based on the prototype by A. Sidorin et al. [1]) the Cross Translation Unit analysis feature for Clang SA (4.0) and evaluated its performance on various open source projects. In our presentation we show that with the CTU feature we found many new true positive reports and eliminated some false positives in real open source projects. We show that while the total analysis time increases to 2-3 times, the execution remains scalable to the number of processors. We also point out how the analysis coverage changes that may lead to the loss of reports compared to the non-CTU baseline.

Description of the CTU prototype

In this project we are working on a method which enables CTU analysis by inlining external function definitions using clang's existing ASTImporter (see http://clang.llvm.org/doxygen/classclang_1_1ASTImporter.html) functionality. One of our goals was to have minimal, isolated, and lightweight changes. We added less than 300 lines of code to the clang codebase (1200 lines of code including the support utilities that are separate executables and tests).

2-pass analysis

To perform the analysis we need to run clang on the whole source code two times. Cross Translation Analysis Overview 1st pass: The first analysis phase generates 3 important outputs: AST dump of all Tranlsation units, a map file for functions with external linkage, the global call flow graph. The AST binary (using the clang -cc1 -emit-pch http://clang.llvm.org/docs/PCHInternals.html) of each TU is generated into a temporary directory called preanalyze-dir. The mangled name and location of all externally linkable functions are collected into the function definition index (externalFnMap.txt). The global call graph of the program is stored in a file called (cfg.txt).

2nd pass: The Clang Static Analyzer is run for all translation units. When during the analysis a TU external function is reached, the location of the definition of that function is looked up from in the function definition index and the definition is imported from the containing AST binary into the caller's context using the ASTImpoter library.

The source code of the prototype is available in this clang Github fork: https://github.com/dkrupp/clang/tree/ctu-master

Results on Open Source projects

We have analyzed many open source projects (ffmpeg, tmux, curl, vim, postgresql...) using CTU and found many new true positive reports.

In the following table we highlighted the most important findings from 3 C projects.

Analyzed project All Non-CTU Findings (baseline) All CTU Findings New CTU findings Disappeared findings Successfully analyzed Failed to analyze Analysis Time (baseline)[s] Analysis Time XTU (1st Phase + 2nd Phase)[s] Median of bug path length (BPL) in baseline Median of BPL CTU Median of BPL of new findings Median of BPL of disappeared findings
FFMpeg 339 399 101 41 1555 files 8 files 318 44+693 7 9 16 14
TMux 49 77 29 1 133 files 0 files 39 4+75 16 16 17 6
Curl 7 11 4 0 280 files 13 files 32 6.5+54 1 1 12 n.a.

The measurements were taken using clang 4.0 as the baseline and the same clang 4.0 with the CTU patch. Only the by default enabled (stable) checkers were switched on.

Quality of new results The new analysis found significant number of new results on all projects. The number of findings increased by 17% to 50%. We examined the new findings in detail and most of them are true positive. According to our evaluation the CTU analysis did not introduce any new false positives (there was no precision loss at CTU boundary).

The CTU analysis revealed many new bugs that require length symbolic analysis, such as memory leak problems (unix.Malloc), dereference of null pointers (core.NullDereference, core.NonNullParamChecker, core.CallAndMessage) and usage of uninitialized values (core.uninitialized.Assign, core.uninitialized.Branch,core.UndefinedBinaryOperatorResult). See for example distribution of new findings for FFMPeg project.

Checker ID Number of new findings
core.CallAndMessage 5
core.DivideZero 5
core.NonNullParamChecker 12
core.NullDereference 18
core.UndefinedBinaryOperatorResult 21
core.uninitialized.Assign 11
core.uninitialized.Branch 6
unix.Malloc 23

We expected that the bug path length of the new results will be longer than the results in the baseline, due to the long traversal of function calls into external source files. This assumption turned out to be true as the median of the bug path length of new bugs is 16 compared to the non-ctu case 7 (for ffmpeg). However in practice 16 long bug path are still manageable for programmers.

Some of the results disappeared compared to the baseline analysis. We are still analyzing the result for this, but we primarily suspect two reasons:

  1. The coverage of the analysis somewhat changed for the base file analyzed, because in some cases the analysis node budget is consumed by the traversal of long CTU paths instead of paths inside the base file. It must be noted that in the Non-CTU case the paths inside the base file have inherently imprecise assumptions about external functions.

???2. Some of the findings disappeared in the baseline because they were false positives as they were based on false assumptions abount unkown external functions.???

Performance In general the analysis time (including both analysis passes) increased ~2 to ~2.5 times the baseline non-CTU analysis time on all analyzed projects. The increment is due to the first analysis phase and the extra effort of loading the definition of external function from the disc during the analysis. The measurements were performed on the same machine, on CPU cores.

Remaining problems to work on

  • CTU is great to test the ASTImporter. We found bugs where the AST is imported successfully but the resulting AST is not correct (assertion fails triggered by the analyzer, due to faulty import).
  • Uniquing needs to be done between different TUs.
  • What to do with too long paths?
  • Do we lose true positives and coverage? We plan to measure coverage. Different strategy to build the exploded graph?
  • The size of the dumped ASTs. The possibility to improve the situation.

Credits

This work is based on earlier work of Aleksei Sidorin , Artem Dergachev et al. See http://lists.llvm.org/pipermail/cfe-dev/2015-October/045730.html

References

[1] Cross-TU Analysis results on open source projects: http://cc.elte.hu/ [2] Artem Dergachev, Aleksei Sidorin Cross TU Analysis for Clang 3.4 http://lists.llvm.org/pipermail/cfe-dev/2015-October/045730.html