


A number of numerical libraries are available for use by programmers. Some are provided by IBM and some are in the public domain. This section gives a very brief summary of what is available and indicates where more information can be obtained.

IBM Provided Libraries


ESSL

ESSL (Engineering and Scientific Subroutine Library) contains a large number of highly tuned serial numerical routines. It includes the BLAS for basic linear algebra, and also routines that cover such areas as linear equation solving (both dense and sparse), eigensolvers, Fourier analysis, quadrature, interpolation, random number generation and sorting. In many ways it therefore plays a similar role to libsci on Cray machines. A PDF document detailing the full capabilities of ESSL and the interfaces to the routines contained in it is available here:

Note that to use ESSL, -lessl must be included on the link line. Further, unlike libsci, ESSL typically uses its own proprietary interfaces to these routines, and in particular contains only a very small subset of the LAPACK library.

NB: -lessl is a 32- and 64-bit thread-safe library.

Parallel ESSL

Parallel versions of ESSL come in two forms.

The first is the ESSL SMP library. This contains a threaded subset of the ESSL routines, and so can be used to parallelise operations within a shared memory partition. The number of threads employed by the ESSL SMP library is set by the environment variable OMP_NUM_THREADS which has a current default value of 32 on HPCx.

To link in the ESSL SMP library use -lesslsmp.

The second is the PESSL distributed data version. This contains a subset of the standard PBLAS and ScaLAPACK routines for performing linear algebra (all the `work' routines are included, but some of the utilities are missing), and also routines for FFTs, Fourier analysis and random number generation. (If one requires PESSL, then one must also link in a BLACS library. See the following section for details on BLACS.)

To link use -lpesslsmp -lblacssmp. Both of these are 32- and 64-bit thread-safe libraries. NB: Because these libraries also include thread-based parallelism, a purely MPI code that links with them should set OMP_NUM_THREADS=1 in its LoadLeveler script.

There exists another version of PESSL, namely -lpessl, which is not thread-safe and only caters for 32-bit addressing. This library is not affected by the OMP_NUM_THREADS environment variable.

If the code is a mixed mode code, i.e. MPI between LPARs and OpenMP inside LPARs, one would link with -lpesslsmp -lblacssmp -lesslsmp.
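The choice between the two thread-safe link lines can be captured in a small shell helper. This is only a sketch: pessl_link_flags is a hypothetical function name, not part of any IBM tool, and the library names are simply those quoted above.

```shell
# Sketch: select the PESSL link flags for a given programming model.
# pessl_link_flags is a hypothetical helper; the flags come from the text above.
pessl_link_flags() {
    case "$1" in
        mpi)   echo "-lpesslsmp -lblacssmp" ;;            # pure MPI code
        mixed) echo "-lpesslsmp -lblacssmp -lesslsmp" ;;  # MPI between LPARs, OpenMP inside
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# A pure MPI code linked this way should also run with one thread per task.
export OMP_NUM_THREADS=1
echo "link flags: $(pessl_link_flags mpi)"
```

For a mixed mode code, call pessl_link_flags mixed and set OMP_NUM_THREADS to the number of threads wanted inside each LPAR instead.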

Full information for PESSL can be found in PDF format at:


BLACS

BLACS, or Basic Linear Algebra Communications Subroutines, are similar to MPI and are built on the same layer as MPI, so their performance should be just as good. However, MPI is a more powerful and more versatile communications library.

Two implementations of BLACS are available on HPCx.

Firstly, there is the IBM implementation of BLACS. There are two IBM BLACS libraries, namely BLACS and BLACSSMP. BLACS, -lblacs, currently supports only 32-bit addressing and is incomplete. BLACSSMP, -lblacssmp, caters for both 32- and 64-bit addressing and is thread-safe.

Secondly, there is the public domain implementation of BLACS, which is built on top of MPI and is located in /usr/local/lib. To use it, link with either -lblacsCinit or -lblacsF77init in addition to -lblacs.

The public domain BLACS supports both 32- and 64-bit addressing and is thread-safe.


MASS

The highly-optimised `mathematical intrinsics' are available via the MASS package. Please visit the following web page for more information.

NB: The current MASS libraries are now included, by default, as part of the compiler process.

However, you may wish to use an alternative MASS library. To do this, ensure -lmass appears before -lm when linking.
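The ordering requirement can be checked mechanically before linking. The following is a minimal sketch: mass_before_libm is a hypothetical helper name, and it assumes that when -lm is not given explicitly the compiler driver appends the maths library after the user's flags.

```shell
# Sketch: succeed only if -lmass occurs before -lm in a list of link flags.
# mass_before_libm is a hypothetical helper, not part of the MASS distribution.
mass_before_libm() {
    seen_mass=no
    for flag in "$@"; do
        case "$flag" in
            -lmass) seen_mass=yes ;;
            -lm)
                # -lm reached: the order is right only if -lmass was already seen
                if [ "$seen_mass" = yes ]; then return 0; else return 1; fi
                ;;
        esac
    done
    # No explicit -lm (assumed to be appended implicitly): -lmass just has to be present
    [ "$seen_mass" = yes ]
}

if mass_before_libm -O3 -lmass -lm; then
    echo "link order OK"
fi
```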

The MASS.readme file is located at /usr/lpp/mass/MASS.readme on HPCx. The current default version is 3.3, but versions 3.0 and 4.2 are also available.

Other Libraries


LAPACK

LAPACK contains a very large number of routines that perform serial dense linear algebra, and has been ported to a very large number of machines. It achieves high performance by using the BLAS library, which in the case of IBM (and many other manufacturers) has been highly optimised. It is also the interface that libsci uses, so use of LAPACK may aid in porting codes to the new machine, but please see the note below.

More information on LAPACK may be found at

and to use it you need -lessl -llapack in your link line. This ordering ensures that the faster ESSL routines replace the slower LAPACK routines, although the argument list of an ESSL routine may not match the argument list of the LAPACK routine of the same name. Linking with -llapack -lessl will also work.

NB. -llapack is a 32- and 64-bit thread-safe library.

You also need to add /usr/local/lib to your link path.
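Putting the pieces together, a link line for LAPACK might be assembled as below. This is a sketch only: the source file prog.f and executable prog are hypothetical names, xlf90_r is the compiler used in the examples later in this section, and the command is echoed rather than executed.

```shell
# Sketch: build a LAPACK link line with /usr/local/lib on the link path.
# prog.f and prog are hypothetical names used only for illustration.
LIBDIR=/usr/local/lib
LAPACK_LINK="-L$LIBDIR -lessl -llapack"   # ESSL first, so its tuned routines are preferred

# Print the command instead of running it
echo "xlf90_r -o prog prog.f $LAPACK_LINK"
```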


ScaLAPACK

ScaLAPACK is the distributed memory version of LAPACK. As mentioned above, some of the ScaLAPACK routines are included in PESSL, but to aid in porting, the public domain version is also provided. More information may be found at

To use the public domain version of ScaLAPACK, use

-lessl -lblacs -lblacsF77init -lscalapack

You also need to add /usr/local/lib to your link path. Note that the default version of ScaLAPACK on HPCx is a 32- and 64-bit thread-safe library.

If the ScaLAPACK routine is included in IBM's own PESSL, then you can use this library instead. See the section on PESSL above for details.


PLAPACK

Release 3.2 of the Parallel Linear Algebra Package (PLAPACK) is installed on HPCx.

The header files are located at /usr/local/packages/plapack/INCLUDE and are included by adding -I/usr/local/packages/plapack/INCLUDE to your compile line.

The libraries themselves are located in /usr/local/packages/plapack and are included by adding -L/usr/local/packages/plapack -lPLAPACK to your compile line.

More information and documentation about PLAPACK can be found at

A report about the QR and the MR3-algorithm based eigensolvers coming with PLAPACK is at

If you want to use the beta version of the MR3-algorithm based eigensolver, we recommend copying /usr/local/packages/plapack/ParEig-1.2.tgz into a local directory where you can change the code to suit your needs. The README files in the directories Tridiag/ and Dense/ explain in detail how to use it. In particular, you may want to change the variables PRINT, CHECK and TIME in Tridiag/global.h and mpiexec, nprows, npcols, nb_distr, nb_alg, nb_alg2, n and right in Dense/test_sym_eig.c, and adapt the code to read in your specific matrix. In addition, set HOME = /usr/local/packages/plapack and PLAPACK_ROOT = $(HOME) in Dense/Makefile. Please cite

A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations.
Paolo Bientinesi, Inderjit S. Dhillon, Robert A. van de Geijn.
Accepted for publication on SIAM Journal on Scientific Computing, 2003

when using the MR3-algorithm based Eigensolver.

An example makefile and C routines showing how to use the QR eigensolver can be copied from /usr/local/packages/plapack/QR_example.tar.gz. No example input file is included.


FFTW

FFTW (Fastest Fourier Transform in the West) is a set of self-optimising Fourier transform routines which can be faster than those provided in ESSL/PESSL. Serial, threaded and distributed data versions are available. For more information see

Since the interface is incompatible between FFTW version 2.x and version 3.x, we presently have the versions 2.1.5 and 3.0.1 installed on the service.

Version 2.1.5 of the FFTW library has been installed in both single and double precision for the serial, threaded and MPI versions. Further, these libraries are installed for both 32- and 64-bit compilations.

The header files are located at /usr/local/packages/fftw/include and are included by adding -I/usr/local/packages/fftw/include to your compile line.

The libraries themselves are located in /usr/local/packages/fftw/lib and are included by adding -L/usr/local/packages/fftw/lib to your compile line.

The information files are located in /usr/local/packages/fftw/info.

For version 3.0.1, the 32- and 64-bit libraries are installed in two different directories, namely /usr/local/packages/fftw/fftw3_32 and /usr/local/packages/fftw/fftw3_64 respectively.

So, to employ, say, the double-precision serial 32-bit FFTW library, one would compile with

xlf90_r code.f -I/usr/local/packages/fftw/include \
-L/usr/local/packages/fftw/fftw3_32/lib -ldfftw

Similarly, to employ the 64-bit single-precision serial FFTW library in an MPI code, one would compile using

mpxlf90_r -q64 code.f -I/usr/local/packages/fftw/include \
-L/usr/local/packages/fftw/fftw3_64/lib -lsfftw

The required include files are located in /usr/local/packages/fftw/fftw3_32/include and /usr/local/packages/fftw/fftw3_64/include.
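Because the version 3.0.1 include and library directories differ only in the addressing mode, the matching pair of flags can be generated from a single argument. The sketch below assumes the directory layout described above; fftw3_paths is a hypothetical helper name, and the library name to append (for example one of the -l options shown earlier) is deliberately left to the caller.

```shell
# Sketch: emit matching -I and -L flags for the 32- or 64-bit FFTW 3.0.1 install.
# fftw3_paths is a hypothetical helper; the directories are those listed above.
FFTW_ROOT=/usr/local/packages/fftw

fftw3_paths() {
    case "$1" in
        32|64) echo "-I$FFTW_ROOT/fftw3_$1/include -L$FFTW_ROOT/fftw3_$1/lib" ;;
        *)     echo "addressing mode must be 32 or 64" >&2; return 1 ;;
    esac
}

fftw3_paths 64
```

Keeping both flags in one place avoids accidentally mixing 32-bit headers with 64-bit libraries; remember that a 64-bit build also needs -q64 on the compile line.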


HSL

HSL, formerly the Harwell Subroutine library, is a collection of ISO Fortran codes for large scale scientific computation written and maintained by the Numerical Analysis Group at Rutherford-Appleton Laboratory. A large range of problems are addressed, but unlike many of the libraries mentioned elsewhere, sparse equation solving is a particular forte of HSL. More information is available from

To link you need -lhsl2004, given that /usr/local/lib is added to your link path.

Parallel HDF5

Parallel HDF5 (Hierarchical Data Format) is software for scientific data management. It includes I/O libraries and tools for analyzing, visualising, and converting scientific data. Parallel HDF5 uses MPI-IO calls for parallel file access. For more information see

The libraries are in /usr/local/packages/hdf5/lib. In addition, an improved version of the gzip library has been installed at /usr/local/packages/hdf5/zlib/lib. If you are using gzip compression within HDF5 you are advised to link to this version of the gzip library rather than the default system one. The current version is 1.6.4 which is compiled in 64-bit mode only. The complete information about the installation options can be found in /usr/local/packages/hdf5/libhdf5.settings and /usr/local/packages/hdf5/libhdf5_fortran.settings

Users may call this library from within a serial code. However, the code must then be compiled with the parallel version of the compiler, i.e. a code that uses xlf90_r or xlc_r must be built with mpxlf90_r or mpxlc_r, respectively, in order to use this parallel library.

Example Makefile:

MF=     Makefile
FC=     mpxlf90_r
FFLAGS= -qsuffix=f=f90 -q64 -O3 -qarch=pwr4 -qtune=pwr4 \
        -I/usr/local/packages/hdf5/include \
        -L/usr/local/packages/hdf5/lib \
        -I/usr/local/packages/hdf5/lib \
        -L/usr/local/packages/hdf5/zlib/lib \
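        -lhdf5_fortran -lhdf5 -lgpfs -lz
# Reuse FFLAGS at link time so that -q64 and the HDF5 -L/-l options reach the linker
LFLAGS= $(FFLAGS)
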
EXE=    prog.exe
SRC=    prog_withHDF5.f90

# No need to edit below this line

.SUFFIXES: .f90 .o

OBJ=    $(SRC:.f90=.o)

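# Recipe lines below must begin with a tab character

# Compile each .f90 source file into an object file
.f90.o:
	$(FC) $(FFLAGS) -c -o $@ $<

all:    $(EXE)

$(EXE): $(OBJ)
	$(FC) $(LFLAGS) -o $@ $(OBJ)

# Rebuild the objects whenever the Makefile changes
$(OBJ): $(MF)

tar:
	tar cvf $(EXE).tar $(MF) $(SRC)

clean:
	rm -f $(OBJ) $(EXE) core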

The HDF5 module can then be used by including use HDF5 in the code.

Some useful HDF5 tools are located in /usr/local/packages/hdf5/bin.

MPI splitting library

This library allows users to run many small jobs using the same binary from one batch script. The MPI splitting library creates a split communicator for all MPI calls. To use this library, the target code only needs to be linked to the splitting library to override the default MPI library. Several environment variables, set in the batch job file, then control the behaviour of the split communicator.

To recompile and link against this library, use the following command for C:

mpcc_r -q64 -o main main.c -L/usr/local/packages/mpisplit/lib -lmpi_split

and for Fortran:

mpxlf90_r -qsuffix=f=f90 -q64 -o fmain fmain.f90 -L /usr/local/packages/mpisplit/lib -lfort_mpi_split

To use the library, several environment variables need to be set.

SMPI_GROUP_SIZE is the number of processors in each sub-group. It should satisfy #cpus mod group_size = 0. If this condition is not met, the group size is set to #cpus.

As there will be several instances of the application running, each instance needs its own input and output files. SMPI_DIR_PREFIX sets the prefix of the I/O directory for each group of processors. The group number (preceded by a ``.'') is automatically appended to the prefix, so if SMPI_DIR_PREFIX=Group is set, the separate directories for the groups are Group.0, Group.1, etc.

The remaining two environment variables deal with stdout and stderr. SMPI_STDOUT_REDIRECT can be set to one of three values. The default is NO, in which case all the output from each instance of the main program is written to stdout. The second option is GROUP, where a separate stdout file for each group is written to the SMPI_DIR_PREFIX directory. The third option is ALL, in which case each processor writes to a separate file, again in the SMPI_DIR_PREFIX directory but with the rank of the processor appended. The same options apply to stderr.
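The grouping rule can be previewed before submitting a job. The sketch below only mirrors the rules described in this section, namely the divisibility check, the fallback to a single group and the naming of the per-group directories. split_layout is a hypothetical helper name and is not part of the splitting library.

```shell
# Sketch: list the per-group directories implied by a CPU count, group size and prefix.
# split_layout is a hypothetical helper that mirrors the rules described in this section.
split_layout() {
    cpus=$1
    group_size=$2
    prefix=$3
    # If the group size does not divide the CPU count, fall back to one group of all CPUs
    if [ "$group_size" -le 0 ] || [ $((cpus % group_size)) -ne 0 ]; then
        group_size=$cpus
    fi
    ngroups=$((cpus / group_size))
    g=0
    while [ "$g" -lt "$ngroups" ]; do
        echo "$prefix.$g"
        g=$((g + 1))
    done
}

# 32 CPUs in groups of 8 give Group.0, Group.1, Group.2 and Group.3
split_layout 32 8 Group
```

Each listed directory should hold the input files for one instance of the application.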

Below is a sample batch script file, with 32 CPUs split into 4 groups of 8, with a directory Group.n for the I/O files of each group, and with stdout and stderr written separately for each group into that directory.

#@ shell = /bin/ksh
#@ job_name = hello
#@ job_type = parallel
#@ cpus = 32
#@ node_usage = not_shared
#@ network.MPI = csss,shared,US
#@ bulkxfer = yes
#@ wall_clock_limit = 00:10:00
#@ account_no = z001
#@ output = $(job_name).$(schedd_host).$(jobid).out
#@ error  = $(job_name).$(schedd_host).$(jobid).err
#@ notification = never
#@ queue

# suggested environment settings:
export MP_EAGER_LIMIT=65536

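# Above lines are common settings in every batch file
# Following lines are Splitting Library specific

export SMPI_GROUP_SIZE=8
export SMPI_DIR_PREFIX=Group
export SMPI_STDOUT_REDIRECT=GROUP
# Assumption: the stderr variable is not named in this guide;
# SMPI_STDERR_REDIRECT is inferred by analogy with SMPI_STDOUT_REDIRECT.
export SMPI_STDERR_REDIRECT=GROUP
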
poe ./main

Libraries and porting to the IBM

In this section a small number of porting issues are very briefly addressed. Mostly they are about porting from Cray systems to the IBM, but some are more general.

Andrew Turner