There are three MPI profiling libraries:

  mpitrace : wrappers for low-overhead MPI elapsed-time measurements
  mpihpm   : the trace wrappers above plus POWER4 HPM counter data
  mpiprof  : provides elapsed-time call-graph data for MPI routines

mpitrace
--------
To use the trace wrappers, just link with -L/usr/local/lib -lmpitrace and
then run the application. By default each MPI task will create an output
file in the working directory; the files are named mpi_profile.0,
mpi_profile.1, ... If you want to reduce the number of output files, set
the environment variable TRACE_SOME to "yes" or "1" (or any other value);
you will then get output from only some of the tasks. The output is
written when the application calls MPI_Finalize().

mpihpm
------
To use mpihpm, link with -L/usr/local/lib -lmpihpm -lpmapi. Choose a
POWER4 performance counter group (for example, group 5):

  export HPM_GROUP=5

and then run the code. The pmapi library (bos.pmapi.lib) must be
installed. A list of counter groups is in the file power4.ref. By default
you will get only counter group 60. The counters are started in
MPI_Init() and stopped in MPI_Finalize(). You get an output file for each
task, mpi_profile_group5.0, mpi_profile_group5.1, ..., where the group
number is identified in the file name and the MPI task id is appended to
it. In general you will have to run with several different values of
HPM_GROUP; good choices are groups 5, 53, 56, 58, and 60.

mpiprof
-------
To use the mpiprof library, the code must be compiled and linked with
-L/usr/local/lib -lmpiprof, plus "-qtbtable=full" or "-g" as an
additional compiler option. The wrappers in the mpiprof library use a
trace-back method to find the name of the routine that called the MPI
function, and this works only if there is a full trace-back table. Once
compiled and linked, simply run the code. Each MPI task writes an output
file, mpi_profile.0, ..., in the working directory when the application
calls MPI_Finalize().

With all of these libraries you can choose to bind MPI tasks to
processors by setting the environment variable BIND_TASKS to any value.
When BIND_TASKS is set, the wrapper for MPI_Init() will attempt to bind
the MPI tasks to processors in a way that spreads the tasks out over the
available CPUs as much as possible.

The main objective of the mpitrace library was to provide a very
low-overhead elapsed-time measurement of MPI routines for applications
written in any mixture of Fortran, C, and C++. The overhead of the
current version is about 1 microsecond per call. The read_real_time()
routine is used to measure elapsed time, with a direct method to convert
timebase structures into seconds; this is much faster than using rtc()
or the time_base_to_time() conversion routine.

The main objective of the mpiprof library was to provide an elapsed-time
profile of MPI routines that includes some call-graph information, so
that one can identify communication time on a per-subroutine basis. For
example, if an application has MPI calls in routines "main", "exchange",
and "transpose", the profile shows how much communication time was spent
in each of these routines, including a detailed breakdown by MPI
function. This provides a more detailed picture of message-passing time
at the expense of a bit more overhead: about 5 microseconds per call.

In some applications there are message-passing wrappers, and one would
like the profile to indicate the name of the routine that called the
wrapper, not the name of the routine that called the MPI function; a
sketch of such a wrapper is shown below.
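As an illustration (not taken from the libraries' documentation), here is
a minimal C sketch of the kind of wrapper this refers to; the routine
names my_send() and exchange() are made up for the example. With the
default trace-back level, mpiprof charges the MPI_Send() time to
my_send(), while what one usually wants is to see it charged to
exchange(), the routine that called the wrapper.

  #include <mpi.h>

  /* Application-wide wrapper around MPI_Send(); with the default
     trace-back level the communication time is attributed here. */
  void my_send(double *buf, int count, int dest)
  {
      MPI_Send(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
  }

  /* Boundary exchange; with TRACEBACK_LEVEL=2 the time spent in the
     MPI_Send() above is attributed to this routine instead. */
  void exchange(double *buf, int count, int neighbor)
  {
      my_send(buf, count, neighbor);
  }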
For such a wrapper, one can set the environment variable
TRACEBACK_LEVEL=2 and then run the application (which must be compiled
with either -g or -qtbtable=full, and linked with the mpiprof library);
the communication time is then attributed to the routine that called the
wrapper. It may also be useful to try higher levels, such as
TRACEBACK_LEVEL=3, which associates the message-passing time with the
great-grandparent in the call chain.

Note: the current versions of these libraries are not thread-safe, so
they should be used only in single-threaded applications, or in
applications where only one thread makes MPI calls.
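As a quick check that any of the three libraries is set up correctly, a
minimal single-threaded test program along the following lines can be
linked and run. The program is not part of the distribution, and the
mpcc_r build line in the comment is only an assumption about a typical
AIX compiler driver; adjust it for the local installation.

  /* Minimal single-threaded test program (not part of the libraries).
     Possible build line, assuming an mpcc_r-style compiler driver:

        mpcc_r -g -qtbtable=full test.c -L/usr/local/lib -lmpiprof

     For mpitrace or mpihpm, link with -lmpitrace or -lmpihpm -lpmapi
     instead. Task binding (BIND_TASKS) and, for mpihpm, the counter
     group selected by HPM_GROUP take effect inside MPI_Init(); the
     profile files are written in MPI_Finalize(), so the program must
     reach MPI_Finalize() for any output to appear. */

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, ntasks;
      double value = 1.0, sum = 0.0;

      MPI_Init(&argc, &argv);       /* measurement starts here */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

      /* one collective call so the profile has something to report */
      MPI_Allreduce(&value, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (rank == 0)
          printf("sum over %d tasks = %f\n", ntasks, sum);

      MPI_Finalize();               /* mpi_profile.<taskid> written here */
      return 0;
  }

After the run, each task should have left an mpi_profile.<taskid> file
(or mpi_profile_group<N>.<taskid> for mpihpm) in the working directory.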