Much attention has been given to what modern optimizing compilers can do with your code, but little is ever said as to how to make the compiler invoke these optimizations. Of course, the answer is compiler switches! But which ones are needed to generate the best code? How many switches does it take to get the best performance? How do different compilers compare when using the same set of switches? I explore all of these questions and more to shed light on the interplay between C++ compilers and modern hardware drawing on my work in high performance scientific computing.
Enabling modern optimizing compilers to exploit current-generation processor features is critical to success in this field. Yet, modernizing aging codebases to utilize these processor features is a daunting task that often results in non-portable code. Rather than relying on hand-tuned optimizations, I explore the ability of today’s compilers to breathe new life into old code. In particular, I examine how industry-standard compilers like those from gcc, clang, and Intel perform when compiling operations common to scientific computing without any modifications to the source code. Specifically, I look at streaming data manipulations, reduction operations, compute-intensive loops, and selective array operations. By comparing the quality of the code generated and time to solution from these compilers with various optimization settings for several different C++ implementations, I am able to quantify the utility of each compiler switch in handling varying degrees of abstractions in C++ code. Finally, I measure the effects of these compiler settings on the up-and-coming industrial benchmark High Performance Conjugate Gradient that focuses more on the effects of the memory subsystem than current benchmarks like the traditional High Performance LinPACK suite.