Linux/x86 Compilers Comparison

2008-10-22 update: new benchmarks comparing several versions of GCC and Intel C++ on a Core2 are available here.


This was written some years ago and is now entirely obsolete. GCC is now at 4.1, Intel C++ is at 9.0, and the company that made KAI C++ was bought by Intel and the product was discontinued. Not to mention that Botan has gone from 0.8.1 to 1.5.6...

I'm planning on recreating these results with updated software (amusingly, the 1.4 GHz Athlon is still my primary machine), but until then keep in mind that nothing here reflects on the current versions of any of these compilers. So, with that said, here are the results:


I spend far too much of my time working on Botan, a C++ crypto library that I have been developing for the last 5 years. Crypto really needs good compilers to get decent performance, and so I've spent a lot of time tuning optimization flags for various compilers. I decided to compare the performance of code generated by GCC 3.0.4, Intel C++ 6.0, and KAI C++ 4.0 on a 1.4 GHz AMD Athlon (Thunderbird) running Linux 2.4.16. The version of Botan used was 0.8.1.

Botan's distribution includes a benchmark application, and it was used here to compare the generated code. At this time, the test application does not benchmark public key algorithms, so their performance is not compared.
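For readers who want a feel for how a throughput figure of this kind can be produced, here is a minimal sketch. It is not Botan's benchmark code: the names throughput_mb_per_sec and xor_fold are hypothetical, the stand-in "algorithm" is a trivial XOR fold, and the choice of 2^20 bytes per megabyte is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a cipher or hash: XORs every byte together.
std::uint8_t xor_fold(const std::uint8_t* data, std::size_t len) {
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < len; ++i) acc ^= data[i];
    return acc;
}

// Runs `op` over a buffer of `buf_size` bytes until at least `min_seconds`
// have elapsed, then returns the throughput in megabytes per second.
// Assumption: one megabyte is taken to be 2^20 bytes.
template <typename Op>
double throughput_mb_per_sec(Op op, std::size_t buf_size, double min_seconds) {
    assert(buf_size > 0 && min_seconds > 0.0);
    std::vector<std::uint8_t> buf(buf_size, 0x5A);
    using clock = std::chrono::steady_clock;

    volatile std::uint8_t sink = 0;  // keeps the compiler from discarding the work
    std::size_t bytes = 0;
    double elapsed = 0.0;
    const auto start = clock::now();
    do {
        sink = op(buf.data(), buf.size());
        bytes += buf.size();
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (elapsed < min_seconds);
    (void)sink;

    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / elapsed;
}
```

Calling throughput_mb_per_sec(xor_fold, 4096, 1.0) would time the stand-in over a 4 KB buffer for about one second. A real harness would also want warm-up runs and several repetitions.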

First, a couple of notes. Botan uses virtually no floating point; this test is only a comparison of integer code. There is a good deal of rather general C++ code, but all of the hot spots are in code that is very heavy with integer and logical operations. So please only consider these results useful if your application is also heavily integer-based. For example, Intel C++ is said to excel at floating point code, due to its support of SSE/SSE2 and its auto-vectorization of loops, but we would never see much benefit from these features with this benchmark.

Botan compiles the main library and the test application with different optimization flags. Generally, the library is compiled with the best possible optimization flags, whereas the test app is compiled with what a "typical" developer might use (usually something like "-O2"). This gives a better estimate of "real world" performance, as most developers are not going to spend several days trying to find the right set of optimization flags.

GCC

Version: 3.0.4
Library Flags: -O3 -fstrict-aliasing -fomit-frame-pointer -march=athlon
Check Flags: -O2 -fstrict-aliasing

Intel C++

Version: 6.0 (Build 020312Z)
Library Flags*: -O3 -ip -unroll -fno-alias -fno-fnalias -tpp6 -xiM
Check Flags: -O2

* Blowfish and ISAAC were compiled without -fno-alias and -fno-fnalias. Blowfish's key schedule, unfortunately, virtually requires a certain amount of aliasing, and ICC gets caught up by this. ISAAC, however, seems to be the victim of a bug in ICC.

I also tried -tpp7, to schedule for a Pentium 4, in the hopes that the compiler would try harder to schedule instructions, so that the Athlon could use all of its ALUs. The result was that the code, in general, got slightly slower.
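The aliasing issue noted for Blowfish can be illustrated with a small sketch. A flag of the -fno-alias kind is, roughly, a promise to the compiler that pointers do not refer to overlapping memory. The hypothetical function below (add_twice is not taken from Botan) gives a different answer depending on whether that promise holds.

```cpp
#include <cassert>

// Adds *src to *dst twice. When dst and src may point at the same object,
// the second read of *src has to observe the first write through dst.
void add_twice(int* dst, const int* src) {
    assert(dst != nullptr && src != nullptr);
    *dst += *src;  // if dst == src, this doubles the value
    *dst += *src;  // and this doubles it again
}

// With distinct objects:   a = 1, b = 1  ->  a becomes 3.
// With the same object:    x = 1         ->  x becomes 4.
// A compiler told to assume no aliasing could load *src only once and
// compute *dst + 2 * (*src), which would yield 3 in the second case too.
```

Code that deliberately passes overlapping pointers, as the Blowfish key schedule apparently does, is exactly the kind that has to be built without such a flag.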

KAI C++

Version: 4.0e
Library Flags*: +K3 --inline_auto_space_time=65 --abstract_pointer
Check Flags: +K3

* KAI C++ could also benefit from some additional options, such as "--backend -march=athlon" and "--backend -fomit-frame-pointer". However, this assumes properties of the backend compiler KAI C++ is using, which is not portable (for example, KAI can use anything from egcs 1.0.3a to gcc 3.1, not to mention the C compilers shipped by Compaq, IBM, HP, Sun, etc.).

Results

Speeds are in megabytes per second. Note that a lot of the benchmarks rely directly on DES performance: DESX, Triple-DES, all 8 cipher modes, and the X9.19 MAC. So taking the lead in DES means beating the other compilers in 11 other benchmarks as well. This is arguably fair, since DES is an important and oft-used algorithm. Perhaps using AES for the cipher modes might be a more realistic benchmark, as it will be used a lot in coming years.

Algorithm                GCC     ICC KCC-4.0  Winner
Blowfish               23.00   17.11   20.55  GCC
CAST256                11.78    7.80   11.15  GCC
CAST5                  18.58   10.54   24.00  KCC-4.0
CS-Cipher               2.15    1.76    2.21  KCC-4.0
DES                    10.64    6.91   13.46  KCC-4.0
DESX                    9.81    6.42   11.91  KCC-4.0
Triple-DES              3.74    2.64    5.38  KCC-4.0
GOST                   10.92    6.01   17.33  KCC-4.0
IDEA                   11.78   15.79   11.01  ICC
Lion<MD5,ISAAC>        30.08   22.00   34.08  KCC-4.0
Lion<SHA1,SEAL>        13.21   13.58   15.61  KCC-4.0
Luby-Rackoff<SHA1>2 3.52 4.43 GCC
MISTY1                 12.43    5.48   11.86  GCC
RC2                    11.49    7.68   10.36  GCC
RC5(12)                19.55   32.52   39.77  KCC-4.0
RC5(16)                15.61   26.70   35.45  KCC-4.0
RC6                    38.47   22.58   37.93  GCC
Rijndael (r = 10)      21.09   20.41   24.23  KCC-4.0
Rijndael (r = 12)      18.06   17.67   21.60  KCC-4.0
Rijndael (r = 14)      15.72   15.49   19.57  KCC-4.0
SAFER-SK128             8.31   12.11    8.12  ICC
Serpent                11.54    7.42   13.20  KCC-4.0
SHARK                  19.00   20.09   17.06  ICC
Skipjack                8.89    3.26    9.76  KCC-4.0
Square                 25.17   24.94   26.76  KCC-4.0
TEA                    18.25   19.59   16.37  ICC
ThreeWay               21.75   25.32   18.96  ICC
Twofish                22.07   18.84   22.80  KCC-4.0
XTEA                   20.65   20.07   10.89  GCC
CBC<DES>               10.40    6.85   13.05  KCC-4.0
CTS<DES>               10.42    6.87   12.92  KCC-4.0
CFB<DES>(8)             9.78    6.50   12.00  KCC-4.0
CFB<DES>(4)             4.75    3.22    5.89  KCC-4.0
CFB<DES>(2)             2.36    1.60    2.92  KCC-4.0
CFB<DES>(1)             1.17    0.80    1.46  KCC-4.0
OFB<DES>               10.33    6.80   13.14  KCC-4.0
Counter<DES>           10.03    6.69   12.91  KCC-4.0
ARC4                   47.07   49.88   37.93  ICC
ISAAC                 103.72  109.45  124.66  KCC-4.0
SEAL                   46.23   83.75   68.25  ICC
Adler32               541.99  790.11  448.24  ICC
CRC24                 152.62  152.58  154.66  KCC-4.0
CRC32                 172.55  196.58  158.33  ICC
HAVAL                  73.43   31.07   49.40  GCC
MD2                     2.79    2.00    2.37  GCC
MD4                   176.17   68.19  147.65  GCC
MD5                   127.08   48.78  104.35  GCC
RIPEMD-128            103.42   27.74   85.96  GCC
RIPEMD-160             69.13   18.29   58.07  GCC
SHA-1                  81.50   38.08   53.37  GCC
SHA2-256               39.68   26.04   30.04  GCC
SHA2-512                7.17   12.96   14.10  KCC-4.0
Tiger                  34.86   26.68   33.06  GCC
EMAC<Square>           25.36   25.49   27.67  KCC-4.0
HMAC-SHA1              81.24   37.94   53.73  GCC
MD5-MAC                81.54   41.77   96.91  KCC-4.0
X9.19-MAC              11.07    6.92   14.38  KCC-4.0
Randpool                0.50    0.28    0.40  GCC
X917<Square>            1.23    1.16    1.39  KCC-4.0

Conclusions

GCC 3.0.4 does a really good job compiling the hash functions; it won every hash function except SHA2-512. Its performance there is quite poor, losing to both ICC 6.0 and KCC 4.0. This suggests an optimization problem, probably related to SHA2-512's heavy use of the 64-bit "long long" type. (Update: it turns out it was due to keeping a number of constants in an array rather than inlining them. Apparently array accesses cause problems for the GCC function inliner.)

Intel C++, in general, does quite poorly. It does win on nine algorithms, but for the most part its performance is below par, in particular with the hash functions, where its code runs at as little as roughly 1/4 the speed of GCC's code. It is interesting to note that every algorithm ICC wins is one where good instruction scheduling is a must for high performance. For example, Adler32's hotspot is a straight line of loads, additions, and stores. It seems ICC knows how to do this right, leading to an impressive performance of 790 megabytes per second.
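To show what such a hotspot looks like, here is a textbook Adler-32 in C++. This is a sketch of the standard algorithm, not Botan's implementation; the function name and the chunking scheme are my own choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Textbook Adler-32 (illustrative; not Botan's code).
std::uint32_t adler32(const std::uint8_t* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    const std::uint32_t MOD = 65521;  // largest prime below 2^16
    std::uint32_t s1 = 1, s2 = 0;

    while (len > 0) {
        // 5552 bytes is the most that can be summed before the 32-bit
        // accumulators could overflow, so reduce once per chunk.
        const std::size_t chunk = len < 5552 ? len : 5552;
        len -= chunk;

        // The hot loop: one load and two additions per input byte.
        for (std::size_t i = 0; i < chunk; ++i) {
            s1 += data[i];
            s2 += s1;
        }
        data += chunk;

        s1 %= MOD;
        s2 %= MOD;
    }
    return (s2 << 16) | s1;
}
```

The inner loop has no branches on the data and no table lookups, so how fast it runs depends almost entirely on how well the compiler unrolls and schedules it.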

These results also point to a scheduling problem with GCC 3.0.4. RC5 and SEAL, in particular, seem to perform very poorly with GCC compared to the other compilers, and to published estimates of their speed. Both are very dependent upon the compiler doing good instruction scheduling to work around the data dependencies present in the algorithms; it seems likely that GCC is not quite up to the task yet. This is probably unrelated to the problem with SHA2-512, because SHA2-512's internal structure is very similar to MD4, MD5, SHA-1, etc., and GCC does very well with those.
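The data dependencies in RC5 are easy to see in code. The sketch below shows the standard RC5 round structure for 32-bit words; it is not Botan's implementation, the function names are hypothetical, and the key schedule is omitted (S is assumed to already hold 2 * (rounds + 1) expanded key words).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

inline std::uint32_t rotl32(std::uint32_t x, std::uint32_t n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

inline std::uint32_t rotr32(std::uint32_t x, std::uint32_t n) {
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

// Encrypts one block (A, B) in place. Assumption: S holds 2 * (rounds + 1)
// expanded key words produced by a key schedule not shown here.
void rc5_encrypt_block(std::uint32_t& A, std::uint32_t& B,
                       const std::uint32_t* S, std::size_t rounds) {
    assert(S != nullptr);
    A += S[0];
    B += S[1];
    for (std::size_t i = 1; i <= rounds; ++i) {
        // Each line needs the result of the previous one, and the rotation
        // amount itself is data-dependent, so little can run in parallel.
        A = rotl32(A ^ B, B) + S[2 * i];
        B = rotl32(B ^ A, A) + S[2 * i + 1];
    }
}

// Inverse of rc5_encrypt_block, with the same assumptions about S.
void rc5_decrypt_block(std::uint32_t& A, std::uint32_t& B,
                       const std::uint32_t* S, std::size_t rounds) {
    assert(S != nullptr);
    for (std::size_t i = rounds; i >= 1; --i) {
        B = rotr32(B - S[2 * i + 1], A) ^ A;
        A = rotr32(A - S[2 * i], B) ^ B;
    }
    B -= S[1];
    A -= S[0];
}
```

With a serial chain like this, the only freedom the compiler has is in overlapping the key word loads and loop overhead with the rotate chain, which is where scheduling quality would show up.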

Feel free to send me email if you have any questions or comments on this analysis.