Q: How can I get reliable execution times?

A:

Timing codes on HPCx
====================

Firstly, HPCx is a shared resource, so care must be taken when producing execution timing figures. Timing the same code several times, on any platform, produces times that fluctuate. This fluctuation is far greater on shared resources such as HPCx, where a job may share the interconnect and processors with system daemons, other users, etc. Therefore, for HPCx in particular, great care must be taken when producing execution timing figures.

Timing Parallel Codes
---------------------

Parallel codes run on the 'back end' of HPCx. Each user is given exclusive use of each LPAR requested; however, the user must share processors with system daemons, and the interconnect with other users' data movements.

Using single LPARs

If only one LPAR is employed and it is full, i.e. cpus=16, then the code must share the LPAR with the operating system; however, this overhead is a constant time penalty. If a code employs only fifteen processors, the operating system daemons are free to run on the one remaining processor, which may well improve the code's efficiency; however, the user is still charged for all 16 processors.

Timing fluctuations can occur because system daemons wake up and check that the LPAR is functioning correctly. We have found that approximately every 5 seconds one particular daemon wakes up, probes the state of the system, and runs for about 0.01 seconds. For this short duration, it can slow down computation (but not communication) by a factor of 100.

One must be aware that, when timing a piece of code which runs in less than 1 second, the execution time may fluctuate wildly. In this case, one must run around 100 instances of the code to obtain an ensemble of times. One may then take the average, once the outliers have been removed from the sample, or simply take the minimum time.
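The ensemble approach for short-running code can be sketched in a few lines of Python. This is only an illustration: `kernel` is a hypothetical stand-in for the code being timed, and the outlier rule (discard samples more than twice the median) is an assumption, not something prescribed for HPCx.

```python
import statistics
import time


def time_repeated(fn, repeats=100):
    """Run fn() `repeats` times and return each wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times


def summarise(times, outlier_factor=2.0):
    """Return (minimum, mean of the samples kept after dropping outliers).

    A sample is treated as an outlier if it exceeds outlier_factor times
    the median; this threshold is an illustrative assumption.
    """
    med = statistics.median(times)
    kept = [t for t in times if t <= outlier_factor * med]
    return min(times), statistics.mean(kept)


def kernel():
    # Hypothetical stand-in for the short piece of code being timed.
    return sum(i * i for i in range(10000))


if __name__ == "__main__":
    samples = time_repeated(kernel, repeats=100)
    fastest, trimmed_mean = summarise(samples)
    print(f"min = {fastest:.6f} s, mean without outliers = {trimmed_mean:.6f} s")
```

For code that runs in well under a second, the minimum is usually the figure least disturbed by daemon activity, while the mean without outliers gives a feel for typical behaviour.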
However, if one times a piece of code which takes minutes rather than seconds, the time will fluctuate much less. In this case, one need only run around 5 instances of the code and take the average or minimum time.

Using multiple frames

When employing more than one frame, times can fluctuate quite dramatically, since communications are now transmitted through the switch. The speed of communications therefore depends on how busy the switch is, both with MPI messages (including your own) and with I/O to disk. Further, if a packet is dropped, the message is resent. We have found that, during busy periods, the latency of a communicated message can increase ten-fold. We recommend timing your code around 10 times and taking the median or minimum time, after removing any outliers from the sample.

Timing Serial Codes
-------------------

Serial codes are normally run on the 'back end' of HPCx via LoadLeveler. Up to 16 serial jobs can share a single LPAR at any given time, so times will be affected by other codes competing for memory accesses. To reduce this effect, one can request exclusive use of an LPAR when timing a serial code. This gives the code exclusive access to the LPAR's total memory (~32 GB). However, this is an expensive option, since the user has to pay for all 16 processors.

One could instead run a short job on the 'front end'; however, the times will fluctuate greatly, as these processors are shared with other users compiling, editing, running postmortem profilers, etc. This practice is actively discouraged.
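When each run is a separate batch job, the wall-clock times are typically collected by hand and reduced afterwards. The sketch below shows one way to apply the "median or minimum after removing outliers" advice to around 10 recorded times. The run times are made-up illustrative values, and the outlier rule (above Q3 + 1.5 * IQR) is a common convention assumed here, not an HPCx recommendation.

```python
import statistics


def robust_summary(wall_times):
    """Return (median, minimum) of the run times after discarding outliers.

    Outliers are taken to be values above Q3 + 1.5 * IQR; this rule is
    an assumption made for illustration.
    """
    q1, _, q3 = statistics.quantiles(wall_times, n=4)
    upper = q3 + 1.5 * (q3 - q1)
    kept = [t for t in wall_times if t <= upper]
    return statistics.median(kept), min(kept)


# Made-up wall-clock times in seconds for ten runs of the same job;
# the two large values stand for runs made while the switch was busy.
runs = [41.2, 40.9, 41.0, 58.7, 41.1, 40.8, 41.3, 41.0, 73.4, 41.1]
median_time, min_time = robust_summary(runs)
print(f"median = {median_time:.2f} s, minimum = {min_time:.2f} s")
```

If the median and the minimum differ substantially even after outliers are removed, that suggests the sample is still too noisy and more runs are needed.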