Hardware Performance Monitor (HPM) Toolkit

(C) COPYRIGHT International Business Machines Corp. 2004 All Rights Reserved.

 

Luiz DeRose

Advanced Computing Technology Center

IBM Research

laderose@us.ibm.com
Phone: +1-914-945-2828
Fax: +1-914-945-4269

Version 2.5.4 - March 22, 2004


LICENSE TERMS:

The Hardware Performance Monitor (HPM) Toolkit is distributed under a nontransferable, nonexclusive, and revocable license. The HPM software is provided "AS IS". IBM MAKES NO WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  IBM has no obligation to defend or indemnify against any claim of infringement, including, but not limited to, patents, copyright, trade secret, or intellectual property rights of any kind. IBM is under no obligation to maintain, correct, or otherwise support this software. IBM does not represent that the HPM Toolkit will be made generally available. IBM does not represent that any software made generally available will be similar to or compatible with the HPM Toolkit.


Table of Contents

1. The HPM Toolkit
2. HPMCOUNT
3. LIBHPM
   3.1. Event Sets
        3.1.1. Power4
        3.1.2. Power3
   3.2. Functions
   3.3. Output
        3.3.1. Overhead and Measurement Error Issues
   3.4. Examples of Use
        3.4.1. C and C++
        3.4.2. Fortran
        3.4.3. Multi-threaded Program Instrumentation Issues
        3.4.4. Compiling and Linking
4. Summary of Environment Flags and Files
5. PeekPerf
6. HPMSTAT
   6.1. Usage
   6.2. Limitations and Security Considerations
7. Derived Metrics
8. Release History


1. The HPM Toolkit

The HPM Toolkit was developed for performance measurement of applications running on IBM systems with Power3 or Power4 processors, under AIX 4.3.3 or AIX 5L.

The HPM Toolkit consists of:

    • A utility (hpmcount), which starts an application and, at the end of execution, reports wall clock time, hardware performance counter information, derived hardware metrics, and resource utilization statistics.
    • An instrumentation library (libhpm), which provides instrumented programs with a summary output containing the above information for each instrumented region in a program (resource utilization statistics are provided only once, for the whole instrumented section of the program). This library supports serial and parallel (MPI, threaded, and mixed-mode) applications written in Fortran, C, and C++.
    • A graphical user interface (PeekPerf), for visualization of the performance files generated by libhpm.
    • A utility (hpmstat), which collects system-level hardware performance counter information.

For more information on the resource utilization statistics, please refer to the “getrusage” man pages.

Requirements:

o        On AIX 5L

        The bos.pmapi fileset, which is provided with the AIX distribution but not installed by default.

o        On AIX 4.3.3

        The PMAPI kernel extension, available at http://www.alphaworks.ibm.com/tech/pmapi


2. HPMCOUNT

Usage:

Sequential and shared memory programs:

> hpmcount     [-o <filename>] [-n] [-s <set>] [-g <group>] [-e ev[,ev]*] <program>

MPI programs:
   > poe hpmcount [-a] [-o <filename>] [-n] [-s <set>] [-g <group>] [-e ev[,ev]*] <program>

 

or:

 

> hpmcount [-h][-c][-l]    

 

 where:

<program> is the program to be executed.
  -h displays a help message.

-a aggregates counts on POE applications.

With this flag, a single performance file is generated for all tasks.

This flag only works with POE and/or LoadLeveler.

This flag requires the availability of a parallel file system (e.g., GPFS) on the system.

Notice: the program may hang if a parallel file system is not available and this flag is set.

-c list events from all counters.

-l list all groups (on POWER 4 systems)
  -o <filename> generates an output file: <filename>.<pid>.

On parallel programs, this flag creates one file for each process (unless the “-a” flag is set).
By default, the output goes to stdout.

-n for "no output to stdout".

This flag is only active when the -o flag is used. 

-k includes counts of system activity on behalf of the process

-e ev0,ev1,... (POWER 3 only)

list of event numbers, separated by commas.

ev<i> corresponds to the event selected for counter <i>.

 

The event numbers can be obtained by running hpmcount -c (on Power3 systems).

-g <group> (POWER 4 only). 

Valid groups are from 0 to 60. The description of groups is available in /usr/pmapi/lib/POWER4.gps. The default group is 60. Groups considered interesting for application performance analysis are:

        60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).

        56, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses

        5, for counts of loads from L2, L3, and memory.

        58, for counts of cycles, instructions, loads from L3, and loads from memory.

        53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).

-s predefined set of events.

On Power 4 systems, -s is the same as -g.

On Power 3 systems, the available sets are:

 

(Each set lists the events assigned to counters 1 through 8, in order.)

Event set 1 (default): Cycles; Inst. completed; TLB misses; Stores completed;
                       Loads completed; FPU0 ops; FPU1 ops; FMAs executed

Event set 2:           Cycles; Inst. completed; TLB misses; Stores dispatched;
                       L1 store misses; Loads dispatched; L1 load misses; LSU idle

Event set 3:           Cycles; Loads dispatched; L1 load misses; L2 misses;
                       Stores dispatched; L2 store misses; Number of write backs; LSU idle

Event set 4:           Cycles; Inst. completed; Cycles w/ 0 inst. comp.; Stores completed;
                       Loads completed; FXU0 ops; FXU1 ops; FXU2 ops

Event set 5:           Cycles; Inst. completed; I cache misses; Branches predicted;
                       Branches completed; Cond. branches; Br. mispredicted; TLB misses

Event set 6:           Cycles; Inst. dispatched; Inst. completed; I cache misses;
                       I cache hits; Cond. branches; Br. dispatched; TLB misses

 

Notice that unless the “-a” flag is provided, or the environment variable HPM_AGGREGATE_COUNTERS is set, a parallel program will generate one output for each task. Thus, if the “-o” flag is not used, on AIX systems it is recommended to set the environment variable MP_LABELIO to YES, in order to correlate each line of the output with the corresponding task. Another option is to set the environment variable MP_STDOUTMODE to one of the task IDs (e.g., 0), which discards the output from the other tasks; in that case, only the output from the selected task appears in stdout.

 

Also notice that on AIX systems, programs compiled with the “mp”-prefixed compilers (e.g., mpxlf) are MPI programs, even if they are sequential, and need to be executed as “poe hpmcount … program”.
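For example, the following invocation counts the events of group 5 on a Power4 system and writes the output of each task to a file a.out_perf.<pid> instead of stdout (the program name a.out is just a placeholder):

   > hpmcount -g 5 -o a.out_perf -n a.out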

 


3. LIBHPM

Libhpm supports multiple instrumentation sections, nested instrumentation, and multiple activations of each instrumented section. When nested instrumentation is used, exclusive duration is generated for the outer sections. Average and standard deviation are provided when an instrumented section is activated multiple times.

Libhpm supports OpenMP and threaded applications. In this case, the thread-safe version of the library (libhpm_r) should be used. Both 32- and 64-bit applications are supported, as long as all modules are compiled in the same mode (32- or 64-bit).

Notice that libhpm collects information and performs summarization during run-time. Thus, there could be a considerable overhead if instrumentation sections are inserted inside inner loops.

Libhpm uses the same set of hardware counter events used by hpmcount. The event set to be used can be selected via the environment variable HPM_EVENT_SET, as shown below.
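For example, using ksh:

   export HPM_EVENT_SET=5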

3.1. Event Sets

3.1.1. Power4

On Power 4 systems, HPM_EVENT_SET should be set to a group from 0 to 60. The default is group 60. The description of groups is available in /usr/pmapi/lib/POWER4.gps. Groups considered interesting for application performance analysis are:

        60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).

        56, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses

        5, for counts of loads from L2, L3, and memory.

        58, for counts of cycles, instructions, loads from L3, and loads from memory.

        53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).

3.1.2. Power3

On Power 3 systems, HPM_EVENT_SET can be set to a value between 1 and 6. The default is 1. The six event sets on the Power3 are:

(Each set lists the events assigned to counters 1 through 8, in order.)

Event set 1 (default): Cycles; Inst. completed; TLB misses; Stores completed;
                       Loads completed; FPU0 ops; FPU1 ops; FMAs executed

Event set 2:           Cycles; Inst. completed; TLB misses; Stores dispatched;
                       L1 store misses; Loads dispatched; L1 load misses; LSU idle

Event set 3:           Cycles; Loads dispatched; L1 load misses; L2 misses;
                       Stores dispatched; L2 store misses; Number of write backs; LSU idle

Event set 4:           Cycles; Inst. completed; Cycles w/ 0 inst. comp.; Stores completed;
                       Loads completed; FXU0 ops; FXU1 ops; FXU2 ops

Event set 5:           Cycles; Inst. completed; I cache misses; Branches predicted;
                       Branches completed; Cond. branches; Br. mispredicted; TLB misses

Event set 6:           Cycles; Inst. dispatched; Inst. completed; I cache misses;
                       I cache hits; Cond. branches; Br. dispatched; TLB misses

3.2. Functions

The following instrumentation functions are provided:

     hpmInit( taskID, progName )
  f_hpminit( taskID, progName )
 

    • taskID is an integer value indicating the node ID.
    • progName is a string with the program name.

     hpmStart( instID, label )
  f_hpmstart( instID, label )
 

    • instID is the instrumented section ID. It should be > 0 and <= 100. To run a program with more than 100 instrumented sections, the user should set the environment variable HPM_NUM_INST_PTS. In this case, instID should be less than the value set for HPM_NUM_INST_PTS.
    • Label is a string containing a label, which is displayed by PeekPerf.

     hpmStop( instID )
  f_hpmstop( instID )
 

    • For each call to hpmStart there must be a corresponding call to hpmStop with a matching instID; this pairing must hold throughout program execution.


     hpmTstart( instID, label )
  f_hpmtstart( instID, label )

     hpmTstop( instID )
  f_hpmtstop( instID )
 

    • In order to instrument threaded applications, one should use the pair hpmTstart and hpmTstop to start and stop the counters independently on each thread. Notice that if two distinct threads use the same instID, the output will indicate multiple calls and the counts will be accumulated. See the Section on Multi-threaded issues for examples.

     hpmGetTimeAndCounters( numCounters, time, values )
  f_GetTimeAndCounters ( numCounters, time, values )

    • Every time this function is called, it returns the time in seconds and the counts accumulated since the call to hpmInit (see the sketch below).
      • numCounters is an integer indicating the number of counters to be accessed.
      • time is a double precision float.
      • values is a “long long” vector of size “numCounters”.
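A minimal C sketch of a call to this function, assuming a Power3 system (the Power3 has 8 counters) and that the output arguments are passed by reference, as the argument descriptions above suggest:

        double time;
        long long values[8];   /* one entry per hardware counter (8 on the Power3) */
        hpmGetTimeAndCounters( 8, &time, values );
        /* time now holds the seconds elapsed since hpmInit;  */
        /* values holds the counts accumulated since hpmInit  */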


     hpmGetCounters( values )
  f_hpmGetCounters ( values )

      • Similar to hpmGetTimeAndCounters, but it only returns the total counts accumulated since the call to hpmInit; to minimize intrusion and overhead, it does not perform any check on the vector size.

     hpmTerminate( taskID )
  f_hpmterminate( taskID )
 

    • This function generates the output. If the program exits without calling hpmTerminate, no performance information is generated.

3.3. Output

A summary report for each task will be written by default to the file perfhpm<taskID>.<pid>. Additionally, a set of performance files named hpm<taskID>_<progName>_<pid>.viz will be generated, to be used as input for PeekPerf. The generation of the “.viz” file can be suppressed with the environment flag HPM_VIZ_OUTPUT=FALSE.

Users can define the output file name with the environment flag HPM_OUTPUT_NAME. Libhpm will still add the extensions _<taskID>.hpm and _<taskID>.viz for the performance and visualization files, respectively. Using this environment flag, one can, for example, set up the output file name to include the date and time. For example, using ksh:

MYDATE=$(date +"%Y%m%d:%H%M%S")
export HPM_OUTPUT_NAME=myprogram_$MYDATE

In this example, the output file for task 27 will have the name: myprogram_yyyymmdd:HHMMSS_0027.hpm

3.3.1. Overhead and Measurement Error Issues

Any software instrumentation is expected to incur some overhead. Since it is not possible to eliminate the overhead, the goal is to minimize it. In the HPM Toolkit, most of the overhead is due to time measurement, which unfortunately tends to be an expensive operation on most systems. A second source of overhead is run-time accumulation and storage of performance data. Notice that libhpm collects information and performs summarization at run time; hence, there can be considerable overhead if instrumentation sections are inserted inside inner loops.
 

Several issues were considered in order to reduce measurement error. First, most of the library operations are executed before starting the counters, when returning control to the program, or after stopping the counters, when the program calls a “stop” function. However, even at the library level, a few operations must be executed within the counting process, for example, releasing a lock.

Second, since timing collection and capture of hardware counter information are two distinct operations, their order had to be fixed: either the timer brackets the counters, or the counters bracket the timer. Since the cost of timing is about one order of magnitude higher than the cost of counting, the timer call precedes the PMAPI call that starts the counters in the HPM “start” function, while the first two operations executed by the HPM “stop” function are stopping the counters, followed by calling the timer. Thus there is a small error in the time measurement, but minimal error in the counting process.

Finally, to access and read the counters, the library calls lower-level routines from the operating system, so some instructions executed by the kernel are always counted as part of the program. In order to compensate for this measurement error, the HPM Toolkit uses the hardware counters during the initialization and finalization of the library to estimate the cost of one call to the start and stop functions. This estimated overhead is subtracted from the values obtained in each instrumented code section (that is, reported count = measured count - estimated start/stop cost). With this approach, the measurement error becomes close to zero. However, since this is a statistical approximation, in some situations this approach fails. In that case, the following message is printed on stderr: “WARNING: Measurement error for <event name> not removed”, which indicates that the estimated overhead was not subtracted from the measured values. One can deactivate the procedure that attempts to remove measurement errors by setting the environment variable HPM_WITH_MEASUREMENT_ERROR to TRUE (1).

3.4. Examples of Use

3.4.1. C and C++

   declaration:
        #include "libhpm.h"
   use:
        hpmInit( taskID, "my program" );
        hpmStart( 1, "outer call" );
        do_work();
        hpmStart( 2, "computing meaning of life" );
        do_more_work();
        hpmStop( 2 );
        hpmStop( 1 );
        hpmTerminate( taskID );

The syntax for C and C++ is the same. However, the include files are different, since the libhpm routines must be declared as having extern "C" linkage in C++.

3.4.2. Fortran

Fortran programs should call the functions with the prefix “f_”. Also, notice that the following declaration is required in all source files that contain instrumentation calls.

   declaration:
        #include "f_hpm.h"
   use:
        call f_hpminit( taskID, "my program" )
        call f_hpmstart( 1, "Do Loop" )
        do …
          call do_work()
          call f_hpmstart( 5, "computing meaning of life" )
          call do_more_work()
          call f_hpmstop( 5 )
        end do
        call f_hpmstop( 1 )
        call f_hpmterminate( taskID )

3.4.3. Multi-Threaded Program Instrumentation Issues

When placing instrumentation inside of parallel regions, one should use different ID numbers for each thread, as shown in the following Fortran example:
 

!$OMP PARALLEL
!$OMP&PRIVATE (instID)
      instID = 30+omp_get_thread_num()
      call f_hpmtstart( instID, "computing meaning of life" )
!$OMP DO
      do ...
        do_work()
      end do
      call f_hpmtstop( instID )
!$OMP END PARALLEL


Notice that the functions hpmTstart and hpmTstop are required for threaded programs. Also, the parameter instID must always be a variable or a literal number; it cannot be an expression. This is because the include file contains a set of "define" statements that are used during the pre-processing phase that collects line numbers and file names. Finally, notice that the library accepts the use of the same instID on different threads; however, the counts of all instances with the same instID will be accumulated.

3.4.4. Compiling and Linking

In order to use libhpm, one should add libpmapi.a, libhpm.a (or libhpm_r.a for threaded programs), and libm to the link step:

#
HPM_DIR = <<<ENTER HPM Home directory>>>
HPM_INC = -I$(HPM_DIR)/include
HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
FFLAGS  = -qsuffix=cpp=f  <<<Other Flags>>>
FF      = xlf_r  # Fortran compiler (xlf_r assumed; substitute your own)

my.x :  my.f
        $(FF) $(HPM_INC) $(FFLAGS) my.f $(HPM_LIB) -o my.x

The flag “-qsuffix=cpp=f” is only required for the compilation of Fortran programs with extension “.f”, on AIX systems.


4. Summary of Environment Flags and Files

The following environment flags can be used on libhpm and hpmcount:

    • HPM_EVENT_SET
      • Used to select one of the event sets on Power 3 systems, or to select a group of events on Power 4 systems.
      • On Power 3 systems, an integer between 1 and 6.
      • On Power 4 systems, an integer between 0 and 60.
    • HPM_DIV_WEIGHT
      • Provides a weight to be used to compute “weighted flips” on Power 4 systems.
      • On Power 4 systems, integer > 1.

In addition, users can provide estimates of memory, cache, and TLB miss latencies for the computation of derived metrics, with the following environment variables (please notice that not all flags are valid on all systems):

    • HPM_MEM_LATENCY – estimated latency for a memory load.
    • HPM_AVG_L3_LATENCY – estimated average latency for an L3 load.
    • HPM_L3_LATENCY – estimated latency for an L3 load within a MCM.
    • HPM_L35_LATENCY – estimated latency for an L3 load outside of the MCM.
    • HPM_AVG_L2_LATENCY – estimated average latency for an L2 load.
    • HPM_L2_LATENCY – estimated latency for an L2 load from the processor.
    • HPM_L25_LATENCY – estimated latency for an L2 load from the same MCM.
    • HPM_L275_LATENCY – estimated latency for an L2 load from another MCM.
    • HPM_TLB_LATENCY – estimated latency for a TLB miss.

When computing derived metrics that take into consideration estimated latencies for L2 or L3, the HPM Toolkit will use the provided “average latency” only if the other latencies for the same cache level are not provided. For example, it will only use the value set in HPM_AVG_L3_LATENCY if at least one of HPM_L3_LATENCY and HPM_L35_LATENCY is not set.

Users can also provide the estimated memory latencies, as well as the event set and/or a divide weight, with the file HPM_flags.env. Each line in the file specifies one value, which takes precedence over the corresponding environment variable. The syntax is: <flag name> <value>.

HPM_flags.env example:

HPM_MEM_LATENCY 400
HPM_L3_LATENCY 102
HPM_L35_LATENCY 150
HPM_L2_LATENCY 12
HPM_L25_LATENCY 72
HPM_L275_LATENCY 108
HPM_TLB_LATENCY 700
HPM_EVENT_SET 5

On hpmcount the following environment variables can also be used:

    • HPM_AGGREGATE_COUNTERS
      • Used to aggregate counts on POE applications (forces the command line argument “-a”).
      • With this flag, a single performance file is generated for all tasks.
      • This flag only works with POE and/or LoadLeveler.
      • This flag requires the availability of a parallel file system (e.g., GPFS) on the system.
      • Notice: the program may hang if a parallel file system is not available and this flag is set.
    • HPM_LOG_DIR <directory>
      • When this flag is set, hpmcount will write a file hpm_log.<id> with the performance data in the provided directory, in addition to the regular output (see the example below).
      • On POE applications, <id> is a POE ID, provided by MP_PARTITION. Otherwise, it is the pid.
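For example, using ksh (the directory name is just an illustration):

   export HPM_LOG_DIR=/tmp/hpmlogs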

On libhpm the following environment variables can also be used:

    • HPM_NUM_INST_PTS
      • Used to override the default limit of 100 instrumentation sections in the application.
      • Integer value > 0.
    • HPM_WITH_MEASUREMENT_ERROR
      • Used to deactivate the procedure that attempts to remove measurement errors.
      • True or False (1 or 0).
    • HPM_VIZ_OUTPUT
      • Indicates whether the “.viz” file (input for PeekPerf) should be generated.
      • True or False (1 or 0).
    • HPM_OUTPUT_NAME
      • Defines an output file name different from the default.
      • String.

On Power 3 systems, users can also specify an event set with the file: libHPMevents. This file takes precedence over the environment variable. Each line in the file specifies one event from the hardware counters. Only one event from each counter can be used (the Power3 has 8 counters, the 604e has 4 counters). Each line should contain:

    • Counter number (e.g., from 0 to 7 on the Power3)
    • Event number (e.g., from 0 to 15 for counter 7 on the Power3)
    • Mnemonic (e.g., PM_FPU0_CMPL#)
    • Description (e.g., FPU 0 instructions#)

libHPMevents example:

  3  1 PM_CYC#        Cycles#
  4  5 PM_FPU0_CMPL#  FPU 0 instructions#
  1 35 PM_FPU1_CMPL#  FPU 1 instructions#
  0  5 PM_IC_MISS#    I cache misses#
  2  5 PM_LD_MISS_L1# Load misses in L1#
  7  0 PM_TLB_MISS#   TLB misses#
  5  5 PM_CBR_DISP#   Branches#
  6  3 PM_MPRED_BR#   Mispredicted branches#

There are some consistency checks for this file, but in general the user is expected to know enough about the hardware counters to create and use it.


5. PeekPerf

PeekPerf is an extension of hpmviz. It takes as input the performance files (“.viz”) generated by libhpm. If the performance files are not provided on the command line, PeekPerf will display a dialog box for user input. Users can select a single file by left clicking on a file name, or multiple files by using the <Shift> and/or <Ctrl> keys. The <Shift> key allows the selection of a range of files (from the last one selected to the current selection), while the <Ctrl> key allows the selection of multiple files in any order.

 

Usage:

   > peekperf [<performance files>]

or, for installations with hpmviz:

   > hpmviz [<performance files>]

The main window of the PeekPerf graphical user interface is divided into two panes. For each instrumented section, identified by its label, the left pane displays the inclusive duration (i.e., the total wall clock time spent executing the corresponding code region), the exclusive duration (i.e., the wall clock time of the instrumented code region, excluding the time spent in inner instrumented regions), and the count. The instrumented sections are sorted by “Label”. Left clicking on any of the column tabs will sort the data in the corresponding column; the first click sorts in ascending order, while the second sorts in descending order.

Right clicking on an instrumented section brings up a “metrics” window displaying the node ID, thread ID, count, exclusive duration, inclusive duration, and the derived hardware metrics. This window can be closed by typing <Ctrl>W or by clicking the “Close” button. There are also two menu options in the metrics window: “Metrics Options” and “Precision”. The “Metrics Options” menu brings up a metrics list that allows the user to select the metrics to be displayed; clicking on the top of this list turns it into an X Windows dialog box. The “Precision” menu allows the user to indicate to PeekPerf the precision used when running the program (double or single).

Some values in the metrics display may be highlighted in red, indicating that the metric value is below a threshold in a predefined range of average values for that metric. Similarly, a number in light gray indicates that the metric value is above such a threshold. Notice that some of the predefined ranges depend on the precision used in the program; the default precision assumed is "double", but the user can change it to "single" with the menu option described above. Any of the columns in the metrics display can be sorted by clicking the corresponding tab; the first click sorts the values in ascending order, while the second sorts in descending order.

Left clicking on an instrumented section in the main window brings up the corresponding section of the source code, highlighted, in the right pane. If the corresponding source file is not available in the directory where PeekPerf is being executed, a dialog box will be displayed so the user can select the source file. At the top of the source code pane there is a set of tabs, one for each instrumented module. The user can select the module to be displayed by clicking on the corresponding tab.

The “File” menu in the main window allows one to open a new set of performance files, close the current data, close all data, or quit PeekPerf. The “open data”, “close data”, and “quit” operations can also be selected with the keys <Ctrl>O, <Ctrl>C, and <Ctrl>Q, respectively. The “open” command will bring up the dialog box for the selection of performance files.


6. HPMSTAT

Hpmstat is a utility for system-wide hardware performance monitoring. It requires “root” privileges (but it can be used by non-root users when the set-user-ID and/or set-group-ID bit is set). When activated without command line parameters, hpmstat counts user and kernel activity for 60 seconds and presents the raw counts and derived metrics on stdout.

6.1. Usage:

> hpmstat [-o <filename>] [-n] [-k] [-u] [-I|-U <Interval>] [-C <Count>] [-s <set>] [-g <group>] [-e ev[,ev]*]

 

or:

 

> hpmstat [-h][-c][-l]    

 

 where:

-I <Interval> indicates the counting time interval in seconds (default is 1 second)

-U <Interval> indicates the counting time interval in microseconds

-C <Count> Number of iterations to count (default is 1 for “-I” and infinity for “-U”).

-k overrides the default, counting system (kernel) activity only

-u overrides the default, counting user activity only
  -h displays a help message.

-c list events from all counters.

-l list all groups (on POWER 4 systems)
  -o <filename> generates an output file: <filename>.<pid>.

On parallel programs, this flag creates one file for each process.
By default, the output goes to stdout.

-n for "no output to stdout".

This flag is only active when the -o flag is used. 

-e ev0,ev1,... (POWER 3 only)

list of event numbers, separated by commas.

ev<i> corresponds to the event selected for counter <i>.

 

The event numbers can be obtained by running hpmcount -c (on Power3 systems).

-g <group> (POWER 4 only). 

Valid groups are from 0 to 60. The description of groups is available in /usr/pmapi/lib/POWER4.gps. The default group is 60. Groups considered interesting for application performance analysis are:

        60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).

        56, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses

        5, for counts of loads from L2, L3, and memory.

        58, for counts of cycles, instructions, loads from L3, and loads from memory.

        53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).

-s predefined set of events.

On Power 4 systems, -s is the same as -g.

On Power 3 systems, the available sets are:

 

(Each set lists the events assigned to counters 1 through 8, in order.)

Event set 1 (default): Cycles; Inst. completed; TLB misses; Stores completed;
                       Loads completed; FPU0 ops; FPU1 ops; FMAs executed

Event set 2:           Cycles; Inst. completed; TLB misses; Stores dispatched;
                       L1 store misses; Loads dispatched; L1 load misses; LSU idle

Event set 3:           Cycles; Loads dispatched; L1 load misses; L2 misses;
                       Stores dispatched; L2 store misses; Number of write backs; LSU idle

Event set 4:           Cycles; Inst. completed; Cycles w/ 0 inst. comp.; Stores completed;
                       Loads completed; FXU0 ops; FXU1 ops; FXU2 ops

Event set 5:           Cycles; Inst. completed; I cache misses; Branches predicted;
                       Branches completed; Cond. branches; Br. mispredicted; TLB misses

Event set 6:           Cycles; Inst. dispatched; Inst. completed; I cache misses;
                       I cache hits; Cond. branches; Br. dispatched; TLB misses
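
For example, the following invocation counts six consecutive 10-second intervals of user and kernel activity using group 60 (the default) on a Power4 system:

   > hpmstat -I 10 -C 6 -g 60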

 

6.2. Limitations and Security Considerations

As noted above, hpmstat requires “root” privileges, unless the set-user-ID and/or set-group-ID bit is set.

Hpmstat uses the PMAPI system-level API, as opposed to libhpm and hpmcount, which use the thread-level API. Because the system-level API would report bogus data if the thread-level API were in use, system-level API calls are not allowed at the same time as thread-level API calls. Thus, the allocation of a thread context takes the system-level API lock, which is not released until the last context has been deallocated. Hence, hpmstat counts will not be accurate if a program instrumented with libhpm, or run under hpmcount, is active within the window of time in which hpmstat is active.


7. Derived Metrics Description

In addition to presenting the raw counter data, the HPM Toolkit also computes derived metrics, depending on the hardware events that were selected to be counted. The following derived metrics are supported (please notice that not all metrics are supported on all systems):

    • Total time in user mode (User time):

User time = Cycles / Processor frequency

    • Utilization rate:

User time / Wall clock time

    • Instructions per cycle:

Instructions completed / Cycles

    • MIPS:

0.000001 * Instructions completed / Wall clock time

    • Instructions per I Cache Miss:

Instructions completed / Instruction cache misses

    • Percentage of instructions dispatched that completed:

100 * Instructions completed / Instructions dispatched

    • Load and store operations (Total LS):

Total LS = Loads + Stores

    • Instructions per load/store:

Instructions completed / Total LS

    • Average number of loads per load miss:

Loads / Load misses in L1

    • Average number of L1 load misses per L2 load miss:

Load misses in L1 / Load misses in L2

    • Average number of stores per store miss:

Stores / Store misses in L1

    • Average number of loads per TLB miss:

Loads / TLB misses

    • Average number of load/store per TLB miss:

Total LS / TLB misses

    • Average number of load/stores per L1 miss:

Total LS / (Load misses in L1 + Store misses in L1)

    • Average number of load/stores per L2 miss:

Total LS / (Load misses in L2 + Store misses in L2)

    • L1 cache hit rate:

100 * ( 1 - (Load misses in L1 + Store misses in L1) / Total LS )

    • L2 cache hit rate:

100 * ( 1 - (Load misses in L2 + Store misses in L2) / Total L1 misses )

    • Memory traffic:

Power3: (L2 misses + Write backs) * Cache Line Size / (1024 * 1024)

Power4: Data loaded from memory * 128 / (1024 * 1024)

    • Memory bandwidth:

Memory traffic / Wall clock time

    • Snoop hit ratio:

100 * Snoop hit occurred / Snoop requests

    • Hardware float point instructions per cycle:

( FPU 0 + FPU 1 ) / Cycles

    • Hardware float point instructions / user time:

( FPU 0 + FPU 1 ) / User time

    • Float point instructions plus FMA ( flip ):

Power3: flip = FPU 0 instructions + FPU 1 instructions + FMAs executed

Power4: flip = FPU 0 instructions + FPU 1 instructions + FMAs executed – FPU Stores

    • Float point instructions plus FMAs rate (Mflip/sec):

0.000001 * flip / Wall clock time
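
As a worked example (with illustrative numbers): a run that performs 2.5e9 flips in 10 seconds of wall clock time achieves 0.000001 * 2.5e9 / 10 = 250 Mflip/sec.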

    • Mflip / User time :

0.000001 * flip / User time

    • Weighted float point instructions (wflip):

wflip = flip + (HPM_DIV_WEIGHT – 1) * Divides

    • Weighted float point instructions rate (M Wflip/s):

M Wflip/s = 0.000001 * wflip / Wall clock time
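
For example, with HPM_DIV_WEIGHT set to 4 (an illustrative value), each divide contributes 4 to the weighted count, i.e., wflip = flip + 3 * Divides.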

    • Computation intensity:

flip / Total LS

    • FMA percentage:

100 * FMAs executed * 2 / flip

    • Fixed point instructions:

FXU 0 instructions + FXU 1 Instructions + FXU 2 Instructions

    • Fixed point operations per Cycle:

Fixed point instructions / Cycles

    • Fixed point operations per load/store:

Fixed point instructions / Total LS

    • Branches mispredicted percentage:

100 * Branches mispredicted / Branches

    • Percentage of TLB misses per cycle:

100 * TLB Misses / Cycles

    • Estimated latency from TLB misses:

User estimated TLB Miss latency * TLB Misses / Processor frequency

    • Power 4 specific metrics (note that latencies are obtained via user input with environment flags):

         Percentage of loads from memory per cycle:

100 * Total loads from memory / Cycles

         Estimated latency from loads from memory:

Memory latency * loads from memory / Processor frequency

         Total loads from L3 (L3 loads):

Data loaded from L3 + Data loaded from L3.5

         L3 traffic:

L3 loads * Cache Line Size / (1024 * 1024)

         L3 bandwidth:

L3 traffic / wall clock time

         L3 Load miss rate:

Total loads from memory / (total loads from L3 + total loads from memory)

         Percentage of L3 loads per cycle:

100 * L3 loads / Cycles

         Estimated latency from loads from L3:

( (L3 latency * L3 data loads) + (L3.5 latency * L3.5 data loads) ) / Processor frequency

or

Average L3 latency * Total loads from L3 / Processor frequency

         Total loads from L2 (L2 loads):

Sum (data loaded from (L2, L2.5(shared), L2.5(mod), L2.75(shared), and L2.75(mod))).

         L2 traffic:

L2 loads * Cache Line Size / (1024 * 1024)

         L2 bandwidth:

L2 traffic / wall clock time

         L2 Load miss rate:

(loads from memory + L3 loads) / (L2 loads + L3 loads + loads from memory)

         Percentage of L2 loads per cycle:

100 * L2 loads / Cycles

         Estimated latency from loads from L2:

( (L2 lat. * L2 loads) + (L2.5 lat. * L2.5 loads) + (L2.75 lat. * L2.75 loads) ) / Processor frequency

or

Average L2 latency * Total loads from L2 / Processor frequency

    • Power3 specific metrics:

         Percentage of cycles LSU is idle:

100 * LSU idle / Cycles

         Percentage of cycles with zero instructions completed:

100 * Zero instructions completed / Cycles

         Average number of loads per L2 miss:

Loads / Master generated load op not retried

         Average number of stores per L2 miss:

Stores / Master generated store op not retried


8. Release History

Version 2.5.4 (03/22/2004)

    • Extensions and bug fixes:
      • Added flag “-U” to hpmstat to provide the ability to count at microsecond intervals.
      • Added cleanup of “.hpm_datafile_*” files at the end of hpmcount (with the “-a” flag).
      • Added include file “f_hpm_i8.h” for support of Fortran programs compiled with -qintsize=8.
      • Changed “char *” in hpmInit and hpmStart function prototypes to “const char *”.
      • Fixed problem with pm_cycles() returning invalid values.

 

Version 2.5.3 (01/31/2004)

    • Extensions on hpmcount:
      • Added aggregation feature for POE programs (command line flag: “-a” or environment variable HPM_AGGREGATE_COUNTERS).
      • Added “Log” feature with environment variable HPM_LOG_DIR.
    • New Interpretations:
      • Derived metrics based on bytes now use base 2 for the computation of Mega, while others (e.g., Mflips, counts) use base 10.
      • L3 cache line size on Power4 is now interpreted as being 128 bytes. The corresponding derived metrics use this new interpretation.
    • Bug fixes:
      • Fixed divide-by-zero error when pm_cycles() fails.
      • Added message indicating that pm_cycles() failed.

 

Version 2.5.2 (08/01/2003)

    • Added missing external symbols to libhpm

 

Version 2.5.1 (06/02/2003)

    • Hpmstat utility to collect system-wide hardware performance counters
    • Libhpm and hpmviz support for BGLsim
    • Added “-k” flag to hpmcount, to include system activity counts
    • Renamed column heading metrics on hpmviz, to avoid excessive window overflow
    • Renamed Power4 bandwidth and traffic metrics to reflect when they refer to load activity only
    • Added RPM distribution for AIX 5

 

Version 2.4.4 (03/01/2003)

    • General updates:
      • New derived metrics:

        Flips / user time

        HW floating point / user time

 

Version 2.4.3 (01/07/2003)

    • General updates:
      • New derived metrics:

        Fixed point operations per Cycle

        Fixed point operations per load/store

      • New functions:

        hpmGetTimeAndCounters

        hpmGetCounters

      • Support for Linux on Intel Pentium 3
      • Limited support for Linux on Intel Itanium (libhpm and hpmviz)
    • Bug fixes:
      • Fixed problem of support of 64-bit applications on Power 3 systems with AIX5L

        New tar file being generated for Power3 with AIX5L

      • User input latencies for Level 2 now printed correctly
      • hpmcount

        Fixed problem of POE application generating message about added counters on all tasks

      • libhpm / hpmviz

        Fixed problem that was causing the “.viz” file to be created with invalid IDs

 

Version 2.4.2 (07/08/2002)

    • General updates:
      • Command line flag “-c” added to hpmcount.
      • Command line flag “-l” added to hpmcount on Power4 systems.

 

Version 2.4.1 (04/28/2002)

    • General updates (for both Power 3 and Power 4 systems):
      • All environment flags now have the prefix HPM_ (instead of LIBHPM_).
      • New environment flag: HPM_TLB_LATENCY – user estimated latency for a TLB miss.
      • User estimated latencies and event set can be provided with the file HPM_latencies.dat.
      • New derived metrics:
        • Estimated latency from TLB misses.
        • Percentage of TLB misses per cycle.
        • Memory traffic.
        • Memory bandwidth.
      • New hpmcount flag (“-n”) for "no output to stdout".
        This flag is only active when the -o flag is used.
      • Bug fixes:
        • hpmcount always returning a return code 0.
        • Error message regarding stack empty on hpm_tstop referring to hpm_stop.
    • Limited Power 4 support (requires bos.pmapi.5.1.0.16 or newer PTF)
      • Hpmviz is not yet supported on Regattas. Performance data (.viz files) generated on Regatta systems can be visualized on Power 3 systems (under AIX 4.3.3).
      • Individual events can no longer be selected. Only groups of events.
      • Groups 0 to 60 are supported (the description of groups is available in /usr/pmapi/lib/POWER4.gps).
      • Default group is 60.
      • New environment flags for specification of estimated latencies (in cycles):
        • HPM_MEM_LATENCY – estimated latency for a memory load.
        • HPM_AVG_L3_LATENCY – estimated average latency for an L3 load.
        • HPM_L3_LATENCY – estimated latency for an L3 load within a MCM.
        • HPM_L35_LATENCY – estimated latency for an L3 load outside of the MCM.
        • HPM_AVG_L2_LATENCY – estimated average latency for an L2 load.
        • HPM_L2_LATENCY – estimated latency for an L2 load from the processor.
        • HPM_L25_LATENCY – estimated latency for an L2 load from the same MCM.
        • HPM_L275_LATENCY – estimated latency for an L2 load from another MCM.
        • HPM_TLB_LATENCY – estimated latency for a TLB miss.
      • New environment flag (HPM_DIV_WEIGHT) to specify the weight of a divide in the weighted flips metric
      • New derived metrics:
        • Weighted float point instructions (wflip)
        • Weighted float point instructions rate
        • Percentage of loads from memory per cycle
        • Estimated latency from loads from memory
        • Total loads from L3
        • L3 Load miss rate
        • L3 traffic
        • L3 bandwidth
        • Percentage of L3 loads per cycle
        • Estimated latency from loads from L3
        • Total loads from L2
        • L2 Load miss rate
        • L2 traffic
        • L2 bandwidth
        • Percentage of L2 loads per cycle
        • Estimated latency from loads from L2
      • hpmcount updates:
        • New command line flag (“-g”) for selection of groups.
        • Command line flag “-s” has the same effect as “-g”.
        • Command line flag “-e” no longer supported.
        • Environment variable HPM_EVENT_SET can also be used with hpmcount
      • libhpm:
        • Environment flag HPM_EVENT_SET specifies group number.
        • New environment flags for specification of estimated latencies (in cycles):
          • HPM_MEM_LATENCY – estimated latency for a memory load.
          • HPM_AVG_L3_LATENCY – estimated average latency for an L3 load.
          • HPM_L3_LATENCY – estimated latency for an L3 load within a MCM.
          • HPM_L35_LATENCY – estimated latency for an L3 load outside of the MCM.
          • HPM_AVG_L2_LATENCY – estimated average latency for an L2 load.
          • HPM_L2_LATENCY – estimated latency for an L2 load from the processor.
          • HPM_L25_LATENCY – estimated latency for an L2 load from the same MCM.
          • HPM_L275_LATENCY – estimated latency for an L2 load from another MCM.
      • Known Problems
        • This software does not work (was not tested) with LPAR.
        • The system call “pmcycles”, which is used to provide the processor cycle time, does not always work on Power 4 systems. The workaround implemented is to call pmcycles and, if it returns a value of less than 1 GHz, set it to 1.3 GHz. One can override the clock cycle value with the environment variable HPM_CLOCK_CYCLE; when using this environment variable, the clock rate should be given in MHz.
        • Every time the system is rebooted, one has to run pmcycles -m in order for the hardware counters to work properly.
    • Updates on Power 3 systems:
      • New Power3 hardware counters events interpretation:
        • "Master generated load operation is not retried" (PM_BIU_LD_NORTRY) now interpreted as L2 misses.
        • "Master generated store operation is not retried" (PM_BIU_ST_NORTRY) now interpreted as Write backs.

 

Version 2.3.1 (11/01/2001)

    • New ".viz" format (2.0)
    • Redesign of the hpmviz interface, which now has the following new features
      • Ascending and descending sorting of all metrics
      • Tabs for modules allowing quick access to source files
      • Performance data from different runs can be visualized in the same session
      • Tabs to switch between performance data from different runs
      • Semi-automatic conversion of .viz format 1.0 to .viz format 2.0
    • HPMviz modifications
      • Metric values below threshold range are now highlighted with a red background
      • Metric values above threshold range are displayed in light gray
    • LIBHPM modifications
      • Labels of instrumented sections are now included in the output
      • The generation of the ".viz" file can be disabled with environment flag
      • User can define output file name with environment flag: LIBHPM_OUTPUT_NAME
    • New derived metrics:
      • Percentage of instructions dispatched that completed
      • Percentage of cycles with zero instructions completed
      • Average number of L1 load misses per L2 load miss
      • Average number of L1 store misses per L2 store miss
    • HPMviz no longer supported features:
      • Vertical layout is no longer supported
      • The source code pane is no longer an editor
      • Different font size is no longer supported
    • HPMviz known problems and temporary solutions:
      • The scroll bar does not appear automatically in the metrics window (when the window is too large)
        • Temporary fix: manually resize the display window or any metric column for the scroll bar to appear
      • Sometimes, when starting hpmviz <.viz file>, the labels pane appears too wide, while the source pane appears too narrow
        • Temporary fix: manually resize the panes by grabbing and moving the line that divides them
      • Sometimes, the first label selection does not highlight the corresponding section in the source code pane
        • Temporary fix: select the label a second time; the highlighting starts to work with the second selection
    • Bug fixes:
      • Fixed problem of not being able to access the last event from each counter.
      • Fixed problems with scroll bars

 

Version 2.2.3 (02/09/2001)

    • New derived metrics:
      • Total time in user mode
      • Utilization rate
    • Performance file (text) generated by libhpm is now named: perfhpm<taskID>.<pid>
    • Hpmcount modifications:
      • When -o flag is used, hpmcount displays "added counters" information only on node 0.
      • The file format with the -o flag is now <parameter name>_<taskID>.<pid>
      • Version and revision numbers are now displayed with -h flag.

 

Version 2.2.1:

    • HTML documentation added (this one)
    • File format ".viz" modified to support new features in hpmviz.
    • Compiler flag: "-qnullterm" no longer required.
    • New derived metrics:
      • MIPS
      • Snoop hit rate
      • Hardware float point instructions per cycle
    • Derived metrics that deal with floating point and fixed point instructions were renamed. New names:
      • Float point instructions plus FMA ( flip )
      • Float point instructions plus FMAs rate (Mflip/s)
      • Fixed point instructions
    • hpmviz modifications
      • File name, line number, and ID removed from the left pane of main window.
      • Inclusive duration added to main window.
      • Option to select metrics to be displayed added to metrics window.
      • Close window renamed from <Ctrl>C to <Ctrl>W.
      • Option of vertical layout added.
      • Added setting for the program's precision (single/double).
      • Added colors to metrics to indicate good and bad values

 

Version 2.1:

    • Added Graphical User Interface: hpmviz


Version 1.1:

    • Initial Release
