The modern hardware of the HPCx system and its sophisticated compilers mean that many time-honoured 'hand-crafting' code optimisation techniques are no longer relevant, and should not be necessary except in pathological cases.
The use of appropriate optimisation switches at compile time can lead to significant (sometimes dramatic) improvements in execution time.
The thread-safe re-entrant versions of the compilers ([mp]xlf90_r, [mp]xlc_r, [mp]xlC_r, etc.) are essential for any code which is required to be thread-safe. There is no performance hit when using these compilers for codes which are not required to be thread-safe, so we recommend their use for all programs.
The Fortran and C compilers have a similar set of optimisation switches, some of which are described briefly here. Users should refer to the relevant compiler documentation for more detailed information.
The -q64 flag enables 64-bit addressing, which allows better memory management for all programs, even if 64-bit addressing is not explicitly exploited in the program.
The -qarch flag specifies the instruction set architecture of the target machine, and allows the compiler to use instructions that are only available on that machine. The value pwr4 specifies the POWER 4 system. -qarch=auto may take advantage of instructions only available on the compiling machine (or similar machines). Initial tests show that compiling with pwr5 can actually slow codes down. It may be worth experimenting with both pwr4 and pwr5 to see which works best for your code.
The -qhot flag forces the compiler to carry out certain high-order transformations of the source code. For example, loops with similar trip counts and no dependencies may be merged; inner and outer loops may be interchanged so that the innermost loop counter varies most rapidly; in some cases, intrinsic functions will be extracted from loops and computed in batches (vectorisation). -qhot is turned on automatically at -O4 and higher.
The -qtune flag biases the optimisation towards execution on a given machine. The value pwr4 specifies the POWER 4 system. -qtune=auto generates code which is automatically tuned for the compiling machine (or similar machines). As with the -qarch flag, pwr5 may slow codes down. Again, it is worth experimenting with both options to see which works best for your code.
The -O flag is the main compiler optimisation flag and can be specified with a range of values.
A good mix of compiler switches to use when starting serious optimisation might be:
-qhot -qarch=pwr4 -O3 -q64
Please note that it is better to compile and debug your code first without optimisation. If your code is in some way non-standard, then optimising may break it. Optimising your code in the above ways can also alter the precise numerical results. If this is not acceptable, try compiling with the -qstrict flag, which will overcome this problem but may result in some performance degradation.