2008-10-22 update: new benchmarks comparing several versions of GCC and Intel C++ on a Core2 are available here.
This was written some years ago and is now entirely obsolete. GCC is now at 4.1, Intel C++ is at 9.0, and the company that made KAI C++ was bought by Intel and the product was discontinued. Not to mention that Botan has gone from 0.8.1 to 1.5.6...
I'm planning on recreating these results with updated software (amusingly, the 1.4 GHz Athlon is still my primary machine), but until then keep in mind that nothing here reflects on the current versions of any of these compilers. So, with that said, here are the results:
I spend far too much of my time working on Botan, a C++ crypto library that I have been developing for the last 5 years. Crypto really needs good compilers to get decent performance, and so I've spent a lot of time tuning optimization flags for various compilers. I decided to compare the performance of code generated by GCC 3.0.4, Intel C++ 6.0, and KAI C++ 4.0 on a 1.4 GHz AMD Athlon (Thunderbird) running Linux 2.4.16. The version of Botan used was 0.8.1.
There is a pre-existing benchmark application for Botan, which is included in the distribution. This was used to compare the code generation. At this time, Botan's test application does not benchmark public key algorithms, so the performance of these is not compared.
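As a rough illustration of what such a benchmark measures, here is a minimal sketch of a throughput timer in modern C++. It is not Botan's actual benchmark code: `toy_transform`, `megabytes_per_second`, and `measure_throughput` are hypothetical names, the transform is a stand-in rather than a real cipher, and one megabyte is assumed to mean 1024 * 1024 bytes.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an algorithm under test: XORs every byte with a
// changing key byte. It is not a real cipher and is not part of Botan.
void toy_transform(std::uint8_t* buf, std::size_t len) {
    std::uint8_t k = 0x5A;
    for (std::size_t i = 0; i != len; ++i) {
        buf[i] ^= k;
        k = static_cast<std::uint8_t>(k * 5 + 1);
    }
}

// Converts a byte count and an elapsed time into megabytes per second
// (assuming 1 megabyte = 1024 * 1024 bytes).
double megabytes_per_second(std::size_t bytes, double seconds) {
    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / seconds;
}

// Runs fn over the same buffer `iterations` times and returns the throughput.
template <typename Fn>
double measure_throughput(Fn fn, std::size_t buf_size, std::size_t iterations) {
    std::vector<std::uint8_t> buf(buf_size, 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != iterations; ++i)
        fn(buf.data(), buf.size());
    const auto stop = std::chrono::steady_clock::now();

    // Read one byte through a volatile so the work is harder to optimize away.
    volatile std::uint8_t sink = buf.empty() ? 0 : buf[0];
    (void)sink;

    const std::chrono::duration<double> elapsed = stop - start;
    return megabytes_per_second(buf_size * iterations, elapsed.count());
}
```

Calling `measure_throughput(toy_transform, 4096, 100000)` would give a figure in the same units as the results table, though a serious benchmark would also need to control for timer resolution and cache effects.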
First, a couple of notes. Botan uses virtually no floating point; this test is only a comparison of integer code. There is a good deal of rather general C++ code, but all of the hot spots are in code that is very heavy with integer and logical operations. So please only consider these results useful if your application is also heavily integer-based. For example, Intel C++ is said to excel at floating point code, due to its support of SSE/SSE2 and its auto-vectorization of loops, but we would never see much benefit from these features with this benchmark.
Botan compiles the main library and the test application with different optimization flags. Generally, the library is compiled with the best possible optimization flags, whereas the test app is compiled with what a "typical" developer might use (usually something like "-O2"). This is to help show a better estimate of "real world" performance, as most developers are not going to spend several days trying to find the right set of optimization flags.
Compiler: GCC
Version: 3.0.4
Library Flags: -O3 -fstrict-aliasing -fomit-frame-pointer -march=athlon
Check Flags: -O2 -fstrict-aliasing
Compiler: Intel C++ (ICC)
Version: 6.0 (Build 020312Z)
Library Flags*: -O3 -ip -unroll -fno-alias -fno-fnalias -tpp6 -xiM
Check Flags: -O2
* Blowfish and ISAAC were compiled without -fno-alias and -fno-fnalias. Blowfish's key schedule, unfortunately, virtually requires a certain amount of aliasing, and ICC gets caught up by this. ISAAC, however, seems to be victim of a bug in ICC.
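To show why a "no aliasing" promise can break code, here is a generic sketch. It is not taken from Blowfish's key schedule, and `add_twice` is a hypothetical function made up for illustration. When the two pointers refer to the same object, a standard-conforming compiler must reload the source after the first store; a compiler told that pointers never alias may load it only once and produce a different answer.

```cpp
// Adds *src to *dst twice. If dst and src may point to the same int, the
// second read of *src has to observe the first store to *dst.
void add_twice(int* dst, const int* src) {
    *dst += *src;
    *dst += *src;
}
```

With distinct objects, `x = 1, y = 2` gives `x == 5` either way. With `add_twice(&a, &a)` and `a = 1`, correct code yields 4 (1 + 1, then 2 + 2), whereas a compiler that assumes no aliasing is free to compute 1 + 2 * 1 = 3.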
I also tried -tpp7, to schedule for a Pentium 4, in the hopes that the compiler would try harder to schedule instructions, so that the Athlon could use all of its ALUs. The result was that the code, in general, got slightly slower.
Compiler: KAI C++ (KCC)
Version: 4.0e
Library Flags*: +K3 --inline_auto_space_time=65 --abstract_pointer
Check Flags: +K3
* KAI C++ could also benefit from some additional options, such as "--backend -march=athlon" and "--backend -fomit-frame-pointer". However, these make assumptions about which backend compiler KAI C++ is using, so they are not portable (for example, KAI can use anything from egcs 1.0.3a to gcc 3.1, not to mention the C compilers shipped by Compaq, IBM, HP, Sun, etc.).
Speeds are in megabytes per second. Note that a lot of the benchmarks rely directly on DES performance: DESX, Triple-DES, all 8 cipher modes, and the X9.19 MAC. So taking the lead in DES means beating the other compilers in 11 other benchmarks as well. This is arguably fair, since DES is an important and oft-used algorithm. Using AES for the cipher modes might be a more realistic benchmark, as it will be used a lot in coming years.
| Algorithm | GCC | ICC | KCC-4.0 | Winner |
|---|---|---|---|---|
| Blowfish | 23.00 | 17.11 | 20.55 | GCC |
| CAST256 | 11.78 | 7.80 | 11.15 | GCC |
| CAST5 | 18.58 | 10.54 | 24.00 | KCC-4.0 |
| CS-Cipher | 2.15 | 1.76 | 2.21 | KCC-4.0 |
| DES | 10.64 | 6.91 | 13.46 | KCC-4.0 |
| DESX | 9.81 | 6.42 | 11.91 | KCC-4.0 |
| Triple-DES | 3.74 | 2.64 | 5.38 | KCC-4.0 |
| GOST | 10.92 | 6.01 | 17.33 | KCC-4.0 |
| IDEA | 11.78 | 15.79 | 11.01 | ICC |
| Lion<MD5,ISAAC> | 30.08 | 22.00 | 34.08 | KCC-4.0 |
| Lion<SHA1,SEAL> | 13.21 | 13.58 | 15.61 | KCC-4.0 |
| Luby-Rackoff<SHA1> | 2 | 3.52 | 4.43 | GCC |
| MISTY1 | 12.43 | 5.48 | 11.86 | GCC |
| RC2 | 11.49 | 7.68 | 10.36 | GCC |
| RC5(12) | 19.55 | 32.52 | 39.77 | KCC-4.0 |
| RC5(16) | 15.61 | 26.70 | 35.45 | KCC-4.0 |
| RC6 | 38.47 | 22.58 | 37.93 | GCC |
| Rijndael (r = 10) | 21.09 | 20.41 | 24.23 | KCC-4.0 |
| Rijndael (r = 12) | 18.06 | 17.67 | 21.60 | KCC-4.0 |
| Rijndael (r = 14) | 15.72 | 15.49 | 19.57 | KCC-4.0 |
| SAFER-SK128 | 8.31 | 12.11 | 8.12 | ICC |
| Serpent | 11.54 | 7.42 | 13.20 | KCC-4.0 |
| SHARK | 19.00 | 20.09 | 17.06 | ICC |
| Skipjack | 8.89 | 3.26 | 9.76 | KCC-4.0 |
| Square | 25.17 | 24.94 | 26.76 | KCC-4.0 |
| TEA | 18.25 | 19.59 | 16.37 | ICC |
| ThreeWay | 21.75 | 25.32 | 18.96 | ICC |
| Twofish | 22.07 | 18.84 | 22.80 | KCC-4.0 |
| XTEA | 20.65 | 20.07 | 10.89 | GCC |
| CBC<DES> | 10.40 | 6.85 | 13.05 | KCC-4.0 |
| CTS<DES> | 10.42 | 6.87 | 12.92 | KCC-4.0 |
| CFB<DES>(8) | 9.78 | 6.50 | 12.00 | KCC-4.0 |
| CFB<DES>(4) | 4.75 | 3.22 | 5.89 | KCC-4.0 |
| CFB<DES>(2) | 2.36 | 1.60 | 2.92 | KCC-4.0 |
| CFB<DES>(1) | 1.17 | 0.80 | 1.46 | KCC-4.0 |
| OFB<DES> | 10.33 | 6.80 | 13.14 | KCC-4.0 |
| Counter<DES> | 10.03 | 6.69 | 12.91 | KCC-4.0 |
| ARC4 | 47.07 | 49.88 | 37.93 | ICC |
| ISAAC | 103.72 | 109.45 | 124.66 | KCC-4.0 |
| SEAL | 46.23 | 83.75 | 68.25 | ICC |
| Adler32 | 541.99 | 790.11 | 448.24 | ICC |
| CRC24 | 152.62 | 152.58 | 154.66 | KCC-4.0 |
| CRC32 | 172.55 | 196.58 | 158.33 | ICC |
| HAVAL | 73.43 | 31.07 | 49.40 | GCC |
| MD2 | 2.79 | 2.00 | 2.37 | GCC |
| MD4 | 176.17 | 68.19 | 147.65 | GCC |
| MD5 | 127.08 | 48.78 | 104.35 | GCC |
| RIPEMD-128 | 103.42 | 27.74 | 85.96 | GCC |
| RIPEMD-160 | 69.13 | 18.29 | 58.07 | GCC |
| SHA-1 | 81.50 | 38.08 | 53.37 | GCC |
| SHA2-256 | 39.68 | 26.04 | 30.04 | GCC |
| SHA2-512 | 7.17 | 12.96 | 14.10 | KCC-4.0 |
| Tiger | 34.86 | 26.68 | 33.06 | GCC |
| EMAC<Square> | 25.36 | 25.49 | 27.67 | KCC-4.0 |
| HMAC-SHA1 | 81.24 | 37.94 | 53.73 | GCC |
| MD5-MAC | 81.54 | 41.77 | 96.91 | KCC-4.0 |
| X9.19-MAC | 11.07 | 6.92 | 14.38 | KCC-4.0 |
| Randpool | 0.50 | 0.28 | 0.40 | GCC |
| X917<Square> | 1.23 | 1.16 | 1.39 | KCC-4.0 |
GCC 3.0.4 does a really good job compiling the hash functions; it won every hash function except SHA2-512. Its performance there is quite poor, losing to both ICC 6.0 and KCC 4.0. This suggests an optimization problem, probably related to SHA2-512's heavy use of the 64-bit "long long" type. (Update: it turns out it was due to keeping a number of constants in an array rather than inlining them. Apparently array accesses cause problems for the GCC function inliner.)
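The two coding styles mentioned in the update can be sketched as follows. This is only an illustration of "constants in an array" versus "constants inlined": the `mix` step is a toy, not the real SHA2-512 round function, the constant values are arbitrary 64-bit numbers, and none of this is Botan's actual code.

```cpp
#include <cstdint>

typedef std::uint64_t u64;

// Arbitrary 64-bit constants, chosen only for illustration.
static const u64 K[4] = {
    0x428A2F98D728AE22ULL, 0x7137449123EF65CDULL,
    0xB5C0FBCFEC4D3B2FULL, 0xE9B5DBA58189DBBCULL
};

// Rotate right; n must be in the range 1..63.
inline u64 rotr(u64 x, unsigned n) { return (x >> n) | (x << (64 - n)); }

// Toy mixing step (hypothetical, not the SHA2-512 round function).
inline u64 mix(u64 state, u64 word, u64 k) {
    return rotr(state + word + k, 14) ^ state;
}

// Style 1: each constant is fetched from an array inside the loop.
u64 mix_with_table(u64 state, const u64 w[4]) {
    for (unsigned i = 0; i != 4; ++i)
        state = mix(state, w[i], K[i]);
    return state;
}

// Style 2: each constant appears as a literal at the call site.
u64 mix_with_literals(u64 state, const u64 w[4]) {
    state = mix(state, w[0], 0x428A2F98D728AE22ULL);
    state = mix(state, w[1], 0x7137449123EF65CDULL);
    state = mix(state, w[2], 0xB5C0FBCFEC4D3B2FULL);
    state = mix(state, w[3], 0xE9B5DBA58189DBBCULL);
    return state;
}
```

Both functions compute the same value; the only difference is how the constants reach the inlined `mix` calls, which is the kind of difference that can matter to an optimizer.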
Intel C++, in general, does quite poorly. It does win on nine algorithms, but for the most part its performance is below par, in particular with the hash functions, where its code is as little as 1/4 the speed of GCC's code. It is interesting to note that every algorithm ICC wins is one where good instruction scheduling is a must for high performance. For example, Adler32's hotspot is a straight line of loads, additions, and stores. It seems ICC knows how to do this right, leading to an impressive 790 megabytes per second.
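For readers unfamiliar with Adler32, a textbook version of the checksum looks like the sketch below. It is written from the standard definition of the algorithm, not copied from Botan; an optimized implementation would typically unroll the loop and defer the modulo operations, which is where the long straight line of loads and additions comes from.

```cpp
#include <cstddef>
#include <cstdint>

// Straightforward Adler-32: two running sums modulo 65521, the largest
// prime below 2^16. The per-byte modulo is simple but slow.
std::uint32_t adler32(const std::uint8_t* data, std::size_t len) {
    std::uint32_t a = 1, b = 0;
    for (std::size_t i = 0; i != len; ++i) {
        a = (a + data[i]) % 65521;  // sum of bytes, plus one
        b = (b + a) % 65521;        // sum of the running values of a
    }
    return (b << 16) | a;
}
```

The loop body has almost no branching and each iteration is just a load and two additions, so the quality of the generated code depends heavily on how well the compiler schedules those operations.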
These results also point to a scheduling problem with GCC 3.0.4. RC5 and SEAL, in particular, seem to perform very poorly with GCC compared to the other compilers, and to published estimates of their speed. Both are very dependent upon the compiler doing good instruction scheduling to work around the data dependencies present in the algorithms; it seems likely that GCC is not quite up to the task yet. This is probably unrelated to the problem with SHA2-512, because SHA2-512's internal structure is very similar to MD4, MD5, SHA-1, etc., and GCC does very well with those.
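The data dependencies in RC5 are easy to see in code. The sketch below follows the published description of RC5 with 32-bit words; it is not Botan's implementation, the key schedule is omitted, and the subkey table `S` is assumed to be supplied by the caller with 2 * (rounds + 1) entries.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint32_t u32;

// Rotations by a data-dependent amount; only the low 5 bits of n are used.
inline u32 rotl(u32 x, u32 n) { n &= 31; return (x << n) | (x >> ((32 - n) & 31)); }
inline u32 rotr(u32 x, u32 n) { n &= 31; return (x >> n) | (x << ((32 - n) & 31)); }

// Encrypts one block (A, B) in place. S must hold 2 * (rounds + 1) subkeys.
void rc5_encrypt(u32& A, u32& B, const u32* S, std::size_t rounds) {
    A += S[0];
    B += S[1];
    for (std::size_t i = 1; i <= rounds; ++i) {
        // Each line needs the result of the previous one, both as an
        // operand and as a rotation count, so the chain is strictly serial.
        A = rotl(A ^ B, B) + S[2 * i];
        B = rotl(B ^ A, A) + S[2 * i + 1];
    }
}

// Inverse of rc5_encrypt, undoing the rounds in reverse order.
void rc5_decrypt(u32& A, u32& B, const u32* S, std::size_t rounds) {
    for (std::size_t i = rounds; i >= 1; --i) {
        B = rotr(B - S[2 * i + 1], A) ^ A;
        A = rotr(A - S[2 * i], B) ^ B;
    }
    B -= S[1];
    A -= S[0];
}
```

Because every operation in the round depends on the one before it, there is little instruction-level parallelism within a block; the compiler can mostly help by overlapping the subkey loads and keeping A and B in registers.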
Feel free to send me email if you have any questions or comments on this analysis.