This section reports our optimization's effect on code size and execution time. We applied our optimization to the C++ programs listed in Table 1. The original and optimized programs were each compiled on two machines: a DECStation 5000 with a MIPS R3000 33 MHz processor running Ultrix 4.2A (the DEC machine), and a Gateway 2000 with a Pentium 90 MHz processor running Linux 1.0.9 (the Intel machine). Each program was compiled with GNU's C++ compiler (version 2.3.1 on the DEC machine and version 2.5.8 on the Intel machine) using the -O3 optimization level, which does (limited) interprocedural optimization, (limited) inlining, and substantial intraprocedural optimization.

Table 1: Descriptions of Programs Used for Data Collection

Program             Description
city simulation     A toy example that simulates cars, trucks, roads, and stop lights.
employee database   An example with different kinds of workers (e.g., hourly, salaried, by-piece, ...).
geqn                The equation formatter from the groff text formatting package.

The standard Unix time utility was used to collect the number of CPU seconds used by each program. Each run was done when the machine was unloaded, so each program received an average of 99.9% of the CPU time. As a result, elapsed time and CPU time were virtually identical.

The choice of execution time--a direct measure of performance--rather than an indirect measure such as instruction count is intentional. Pipeline effects, cache effects, and the like impact the average number of cycles per instruction, which, in turn, impacts any computed execution time [7]. This occurs in our last example, where the computed execution times before and after our optimization are nearly identical while the observed execution times are not.

Each program was run thirty times at random physical memory locations to account for cache effects (see [2] for a description of why this is necessary and of the pitfalls of ignoring memory-location-induced cache effects when timing programs on modern architectures). Table 2 reports the average times over the thirty runs, with margins of error, and the object code size before and after optimization. Note that the percentage growth in the resulting binary files (which include library code) was considerably smaller: the first two programs showed essentially zero growth, while geqn showed about 50% growth. We now discuss each program in more detail.

Table 2: Data Obtained with GNU C++'s Optimizer Enabled (-O3)

                  Size (bytes)                        Time (seconds)
Program           Unoptimized  Optimized  Change     Unoptimized       Optimized         Change

Intel machine:
city simulation   12618        12762      +1.14%     9.650 ± 0.067     9.235 ± 0.074     -4.29%
employee          7867         8715       +10.78%    95.500 ± 2.706    90.978 ± 1.268    -4.74%
geqn              132331       264739     +100.06%   51.265 ± 0.170    52.610 ± 0.156    +2.62%

DEC machine:
city simulation   25348        24052      -5.11%     41.865 ± 0.874    41.621 ± 0.766    -0.58%
employee          21852        24056      +10.09%    295.613 ± 11.606  282.953 ± 13.364  -4.28%
geqn              302868       586416     +93.62%    101.884 ± 0.162   85.878 ± 0.248    -15.71%

Size represents object code size in bytes. Time represents CPU time in seconds.

The city simulation simulates traffic flow on the roads of a city. The program has only one virtual function and was chosen to illustrate the effects on programs that contain very few virtual functions. On the Intel machine, our optimization caused about a 4% decrease in execution time and a nominal 1% increase in object code size. On the DEC machine, the optimization made only a nominal difference in execution time. Interestingly, however, it decreased the object code size, which provides indirect evidence that our optimization enables other optimizations.

The employee program, based on code from [5], has a high proportion of virtual functions and virtual function calls; it was chosen to provide expected results for programs that use virtual functions heavily. The program has a class hierarchy of eight classes and virtual functions that (1) print the employee's data (name, current wages, etc.), (2) report the employee's earnings for the current week, (3) give the employee a raise, and (4) reset the employee for a new week. Each type of worker implements these functions differently.

We created a test input file with 1000 employees (200 of each kind), and then iterated for 300 weeks, printing the data and the earnings for each employee each week and giving each employee a raise every ten weeks. For this example our results were very promising. On the Intel machine, we achieved an almost 5% decrease in execution time with only a 10% increase in code size. On the DEC machine, we achieved a slightly over 4% decrease in execution time with a similar 10% increase in code size.

Finally, geqn is the equation formatter distributed with GNU's groff text formatting package. It was chosen for its size and to give results for a large "real world" program. The code contains 6400 source lines, of which 400 are virtual function calls. Geqn was written with a single class (called box) from which its other seventeen classes are derived. Class box declares fourteen virtual functions, of which, on average, only seven are overridden. This class hierarchy leads to very large inlined functions and consequently large growth in code size. Better object-oriented analysis and design could lead to a better class organization and avert this explosive growth in code size.

We used a large (3.4 MB) input file to obtain our data. On the Intel machine, our optimization provided no help. In contrast, on the DEC machine, we obtained a significant improvement (15%). The explanation lies in the cache sizes of the two machines. The Pentium processor in the Intel machine has separate two-way set-associative 8K code and data caches. In the first two test programs, the size of the optimized program was small enough that cache conflicts were minimal. With geqn, the large code size leads to excessive cache misses and thus to an increase in execution time. The MIPS processor in the DEC machine has a larger 64K cache, so the optimized version of geqn did not suffer a prohibitive number of cache misses.

The experimental results shown in Table 2 represent a promising step towards the efficient compilation of C++ programs containing virtual functions. Since our optimization enables inlining, we expected a code size increase. This was not a problem except for the largest program running on the Intel machine, which has the smallest cache. While the execution time reduction was not universal, the employee program, with its high percentage of virtual function calls, showed a consistent performance improvement. Additional compiler optimizations, for example interprocedural constant propagation (of types), should further improve these results.


Copyright © 1995, 1996 Bradley M. Kuhn, David W. Binkley.

Verbatim copying and distribution of this entire paper is permitted in any medium, provided this notice is preserved.