Modern optimizing compilers are becoming increasingly dependent on dynamic profile information. Because the profile information collected by these compilers also is sufficient for QA, it is likely that code-coverage analysis will become an integrated development environment option. This integration should help to simplify your code development and testing processes and should also improve the accuracy of your coverage information and the performance of your optimized code.
You may already be using code-coverage tools to quantify application testing and to identify untested portions of your code. These tools automatically instrument your code to provide detailed information about dynamic application behavior. You may also be using optimizing compilers (static or just-in-time) that are specialized for particular microprocessors. Increasingly, these compilers also measure dynamic application behavior and use the measurements to improve optimization.
In practice, coverage tools and optimizing compilers use similar instrumentation mechanisms. Because of this overlap, code coverage analysis will soon be an integral part of optimizing compilers and will be provided as an option by major vendors. This will simplify software development and testing by eliminating potential differences between the way compilers and coverage tools interpret code. It will also reduce compiler overhead for collecting coverage profile information because a compiler can more effectively optimize code it has instrumented itself. Finally, the uniform coverage information will help you accelerate testing cycles by enabling you to prioritize your regression tests and run the most relevant tests first.
Why Optimizing Compilers Need Coverage Profiles
In many cases, you can improve an application’s performance by monitoring its typical behavior and then allowing an optimizing compiler to re-optimize the code based on the application profile. This is called profile-guided optimization (PGO) and it has been shown to deliver significant performance gains for large-scale commercial database applications as well as performance benchmarks.
There are several reasons for the effectiveness of PGO. First, certain architectural and micro-architectural features of modern processors — instruction caches, branch prediction, TLB management, etc. — can be used more effectively when the typical behavior of the application is understood. Second, dynamic profile information identifies which regions of code are performance-critical. The compiler can then generate specialized and optimized code along the critical path of the application to deliver a very high performance.
In the following sections, we look at several of the most effective optimizations performed by profile-guided optimizing compilers such as code flayout optimizations. We also discuss function inlining and virtual function dispatch optimizations, both of which are especially effective for object-oriented programs such as C++ and Java.
Code Layout Optimizations
Code- ayout optimizations include basic-block ordering, function layout, and function splitting. The main idea is to put related components of code close to each other. This generally helps to reduce branch mispredictions, instruction-cache misses, and paging traffic.
If a compiler lacks profiling information, it will typically place basic blocks in the same order they are found in the source code, possibly with some additional syntactic heuristics to enhance performance. A better approach is to order blocks to reduce the frequency of jumps. However, without profiling information, it is often impossible to accurately predict the most likely outcome of conditional statements at compile time.
Modern microprocessors employ sophisticated multilevel branch-prediction mechanisms, which partially compensate for inefficient block ordering. The history of branch outcomes is recorded in a special type of cache, called the branch-target buffer (BTB), which is implemented in silicon. The information in the buffer is used to dynamically predict the outcome of branches, so the microprocessor can execute the predicted code speculatively. Depending on the number of branches and the accuracy of branch predictions, this can substantially improve performance.
The size of the BTB is constrained by economic issues, which limits the number of branch outcomes that can be recorded. For example, if a processor has 1024 BTB entries, it will record the outcome of the last distinct 1024 branch instructions that were executed. If the critical path of the application contains a lot more than 1024 dynamic branches, the processor may experience frequent BTB misses. In this case, the processor usually resorts to heuristic rules for predicting the outcome, which are generally less accurate than BTB data. For example, the processor might assume that forward branches (jumps to a higher address) should be taken and backward branches (jumps to a lower address) should not.