A number of numerical libraries are available to programmers; some are provided by IBM, others are in the public domain. This section gives a very brief summary of what is available, and indicates where more information can be obtained.
ESSL (Engineering and Scientific Subroutine Library) contains a large number of highly tuned serial numerical routines. It includes the BLAS for basic linear algebra, together with routines covering areas such as dense and sparse linear equation solving, eigensolvers, Fourier analysis, quadrature, interpolation, random number generation and sorting. As such, in many ways it plays a similar role to libsci on Cray machines. A PDF document detailing the full capabilities of ESSL and the interfaces to its routines is available here:
Note that to use ESSL, -lessl must be included on the link line. Further, unlike libsci, ESSL typically uses its own proprietary interfaces to these routines, and in particular contains only a very small subset of the LAPACK library.
NB: -lessl is a 32- and 64-bit thread-safe library.
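Since ESSL includes the standard BLAS interfaces, a call to a BLAS routine such as DGEMM looks the same as with any other BLAS implementation. The following is a minimal sketch (the matrix size is illustrative); link with -lessl:

```fortran
program essl_example
! Sketch: multiply two matrices with the BLAS routine DGEMM,
! here supplied by ESSL (link with -lessl).
  implicit none
  integer, parameter :: n = 100
  double precision   :: a(n,n), b(n,n), c(n,n)

  call random_number(a)
  call random_number(b)

  ! c := 1.0*a*b + 0.0*c
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

  write(*,*) 'c(1,1) = ', c(1,1)
end program essl_example
```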
Parallel versions of ESSL come in two forms.
The first is the ESSL SMP library. This contains a threaded subset of the ESSL routines, and so can be used to parallelise operations within a shared memory partition. The number of threads employed by the ESSL SMP library is set by the environment variable OMP_NUM_THREADS which has a current default value of 32 on HPCx.
To link in the ESSL SMP library use -lesslsmp.
The second is the PESSL distributed data version. This contains a subset of the standard PBLAS and ScaLAPACK routines for performing linear algebra (all the "work" routines are included, but some of the utilities are missing), and also routines for FFTs, Fourier analysis and random number generation. (If one requires PESSL, then one must also link in a BLACS library. See the following section for details on BLACS.)
To link use -lpesslsmp -lblacssmp. Both of these are 32- and 64-bit thread-safe libraries. NB: if your code is purely MPI and you link with these libraries, set OMP_NUM_THREADS=1 in your LoadLeveler script, as the libraries also include thread-based parallelism.
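For a pure-MPI code linked against these libraries, the relevant lines of the LoadLeveler script might look like the following sketch (the executable name is illustrative):

```shell
# Pure-MPI run: disable the thread-based parallelism inside
# -lpesslsmp / -lblacssmp before launching the job.
export OMP_NUM_THREADS=1
poe ./myprog
```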
There exists another version of PESSL, namely -lpessl, which is not thread-safe and only caters for 32-bit addressing. This library is not affected by the OMP_NUM_THREADS environment variable.
If the code is a mixed mode code, i.e. MPI between LPARs and OpenMP inside LPARs, one would link with -lpesslsmp -lblacssmp -lesslsmp.
Full information for PESSL can be found in PDF format at:
BLACS, the Basic Linear Algebra Communication Subprograms, are built on the same communications layer as MPI, so their performance should be just as good. MPI, however, is a more powerful and more versatile communications library.
Two implementations of BLACS are available on HPCx.
Firstly, the IBM implementation of BLACS. There are two IBM BLACS libraries, namely BLACS and BLACSSMP. BLACS, -lblacs, currently supports only 32-bit addressing and is incomplete. BLACSSMP, -lblacssmp, caters for both 32- and 64-bit addressing and is thread-safe.
Secondly, the public-domain implementation of BLACS, which is built on top of MPI and located in /usr/local/lib. As well as -lblacs, the user also needs to link with either -lblacsCinit or -lblacsF77init.
The public domain BLACS supports both 32- and 64-bit addressing and is thread-safe.
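Either implementation is initialised in the same way. A minimal sketch (a 2x2 process grid is assumed for illustration; run on 4 MPI processes and link with one of the BLACS libraries above):

```fortran
program blacs_example
! Sketch: set up a 2x2 BLACS process grid, report each process's
! position in it, then release the grid.
  implicit none
  integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol

  call blacs_pinfo(iam, nprocs)        ! my rank and the total process count
  call blacs_get(-1, 0, ictxt)         ! obtain the default system context
  nprow = 2
  npcol = 2
  call blacs_gridinit(ictxt, 'Row', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
  write(*,*) 'process', iam, 'is at grid position', myrow, mycol
  call blacs_gridexit(ictxt)
  call blacs_exit(0)
end program blacs_example
```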
Highly optimised versions of the mathematical intrinsic functions are available via the MASS (Mathematical Acceleration SubSystem) package. Please visit the following web page for more information: http://techsupport.services.ibm.com/server/mass?fetch=home.html
NB: The current MASS libraries are now included, by default, as part of the compiler process.
However, you may wish to use an alternative version of the MASS library. To do this, ensure that -lmass appears before -lm when linking.
The MASS.readme file is located at /usr/lpp/mass/MASS.readme on HPCx. The current default version is 3.3, however, versions 3.0 and 4.2 are also available.
LAPACK contains a very large number of routines for serial dense linear algebra and has been ported to a very large number of machines. It achieves high performance by using the BLAS library, which IBM (like many other manufacturers) has highly optimised. It is also the interface that libsci uses, so using LAPACK may aid in porting codes to the new machine, but please see the note below.
More information on LAPACK may be found at
and to use it you need -lessl -llapack on your link line. This ordering ensures that the faster ESSL routines replace the slower LAPACK routines where they are available, although the argument list of an ESSL routine may not match that of the corresponding LAPACK routine. Linking with -llapack -lessl will also work.
NB. -llapack is a 32- and 64-bit thread-safe library.
You also need to add /usr/local/lib to your link path.
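For example, solving a dense linear system Ax = b with the LAPACK driver DGESV might look like the following sketch (the system size is illustrative; link with -lessl -llapack as described above):

```fortran
program lapack_example
! Sketch: solve A x = b with the LAPACK driver routine DGESV.
  implicit none
  integer, parameter :: n = 4
  double precision   :: a(n,n), b(n)
  integer            :: ipiv(n), info

  call random_number(a)
  call random_number(b)

  ! On exit, b is overwritten with the solution x and
  ! a with its LU factorisation.
  call dgesv(n, 1, a, n, ipiv, b, n, info)

  if (info /= 0) write(*,*) 'DGESV failed, info = ', info
end program lapack_example
```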
ScaLAPACK is the distributed memory version of LAPACK. As mentioned above, some of the ScaLAPACK routines are included in PESSL (see above), but to aid in porting, the public domain version is provided. More information may be found at
To use the public domain version of ScaLAPACK, use
-lessl -lblacs -lblacsF77init -lscalapack
You also need to add /usr/local/lib to your link path. Note that the default version of ScaLAPACK on HPCx is a 32- and 64-bit thread-safe library.
If the ScaLAPACK routine is included in IBM's own PESSL, then you can use this library instead. See the section on PESSL above for details.
The PLAPACK header files are located at /usr/local/packages/plapack/INCLUDE and are included by adding -I/usr/local/packages/plapack/INCLUDE to your compile line.
The libraries themselves are located in /usr/local/packages/plapack and are included by adding -L/usr/local/packages/plapack -lPLAPACK to your compile line.
More information and documentation about PLAPACK can be found at
A report about the QR and the MR3-algorithm based eigensolvers coming with PLAPACK is at
If you want to use the beta version of the MR3-algorithm based eigensolver, we recommend copying /usr/local/packages/plapack/ParEig-1.2.tgz into a local directory where you can change the code to your needs. The README files in the directories Tridiag/ and Dense/ explain in detail how to use it. In particular, you may want to change the variables PRINT, CHECK and TIME in Tridiag/global.h, and mpiexec, nprows, npcols, nb_distr, nb_alg, nb_alg2, n and right in Dense/test_sym_eig.c, and also adapt the code to read in your specific matrix. In addition, set HOME = /usr/local/packages/plapack and PLAPACK_ROOT = $(HOME) in Dense/Makefile. Please cite
A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple
Relatively Robust Representations.
Paolo Bientinesi, Inderjit S. Dhillon, Robert A. van de Geijn.
Accepted for publication in SIAM Journal on Scientific Computing, 2003
when using the MR3-algorithm based Eigensolver.
An example makefile and C routines showing how to use the QR eigensolver can be copied from /usr/local/packages/plapack/QR_example.tar.gz; note that no example input file is included.
FFTW (Fastest Fourier Transform in the West) is a set of self-optimising Fourier transform routines which can be faster than those provided in ESSL/PESSL. Serial, threaded and distributed data versions are available. For more information see
Since the interfaces of FFTW version 2.x and version 3.x are incompatible, versions 2.1.5 and 3.0.1 are both presently installed on the service.
Version 2.1.5 of the FFTW library has been installed in both single and double precision for the serial, threaded and MPI versions. Further, the libraries are installed for both 32- and 64-bit compilations.
The header files are located at /usr/local/packages/fftw/include and are included by adding -I/usr/local/packages/fftw/include to your compile line.
The libraries themselves are located in /usr/local/packages/fftw/lib and are included by adding -L/usr/local/packages/fftw/lib to your compile line.
The information files are located in /usr/local/packages/fftw/info.
For version 3.0.1 the 32- and 64-bit libraries are installed in two different directories, namely /usr/local/packages/fftw/fftw3_32 and /usr/local/packages/fftw/fftw3_64 respectively.
So, to employ, say, the double-precision serial 32-bit FFTW library, one would compile with
xlf90_r code.f -I/usr/local/packages/fftw/include \
        -L/usr/local/packages/fftw/fftw3_32/lib -ldfftw
Similarly, to employ the 64-bit single-precision serial FFTW library in an MPI code, one would compile using
mpxlf90_r -q64 code.f -I/usr/local/packages/fftw/include \
        -L/usr/local/packages/fftw/fftw3_64/lib -lsfftw
The required include files are located in /usr/local/packages/fftw/fftw3_32/include and /usr/local/packages/fftw/fftw3_64/include.
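As an illustration, a one-dimensional complex-to-complex transform with the FFTW 3 Fortran calling sequence might be sketched as follows (the transform length is illustrative; the two named constants are normally taken from the fftw3.f include file, and their values are reproduced here for self-containedness):

```fortran
program fftw_example
! Sketch: 1-d complex-to-complex forward transform using the
! FFTW 3 Fortran interface. Compile with the include and library
! paths shown above.
  implicit none
  integer, parameter :: n = 128
  ! Values as defined in fftw3.f:
  integer, parameter :: FFTW_FORWARD = -1, FFTW_ESTIMATE = 64
  double complex     :: in(n), out(n)
  integer*8          :: plan

  in = (1.0d0, 0.0d0)
  call dfftw_plan_dft_1d(plan, n, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
  call dfftw_execute(plan)
  call dfftw_destroy_plan(plan)
  write(*,*) 'out(1) = ', out(1)
end program fftw_example
```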
HSL, formerly the Harwell Subroutine Library, is a collection of ISO Fortran codes for large-scale scientific computation written and maintained by the Numerical Analysis Group at Rutherford Appleton Laboratory. A large range of problems is addressed, but unlike many of the libraries mentioned elsewhere, sparse equation solving is a particular forte of HSL. More information is available from
To link you need -lhsl2004, provided /usr/local/lib has been added to your link path.
The HDF5 libraries are in /usr/local/packages/hdf5/lib. In addition, an improved version of the gzip library has been installed at /usr/local/packages/hdf5/zlib/lib. If you are using gzip compression within HDF5, you are advised to link to this version of the gzip library rather than the default system one. The current HDF5 version is 1.6.4, which is compiled in 64-bit mode only. Complete information about the installation options can be found in /usr/local/packages/hdf5/libhdf5.settings and /usr/local/packages/hdf5/libhdf5_fortran.settings.
Users may call this library from within a serial code; however, the associated compiler must be the parallel version, i.e. if the code uses xlf90_r or xlc_r, then to use this parallel library one will need to employ mpxlf90_r or mpxlc_r, respectively.
MF=	Makefile
FC=	mpxlf90_r
FFLAGS=	-qsuffix=f=f90 -q64 -O3 -qarch=pwr4 -qtune=pwr4 \
	-I/usr/local/packages/hdf5/include \
	-L/usr/local/packages/hdf5/lib \
	-I/usr/local/packages/hdf5/lib \
	-L/usr/local/packages/hdf5/zlib/lib \
	-lhdf5_fortran -lhdf5 -lgpfs -lz
LFLAGS=	$(FFLAGS)
EXE=	prog.exe
SRC=	prog_withHDF5.f90
#
# No need to edit below this line
#
.SUFFIXES:
.SUFFIXES: .f90 .o

OBJ=	$(SRC:.f90=.o)

.f90.o:
	$(FC) $(FFLAGS) -c -o $@ $<

all:	$(EXE)

$(EXE):	$(OBJ)
	$(FC) $(LFLAGS) -o $@ $(OBJ)

$(OBJ):	$(MF)

tar:
	tar cvf $(EXE).tar $(MF) $(SRC)

clean:
	rm -f $(OBJ) $(EXE) core
The HDF5 module can then be used by including use HDF5 in the code.
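A minimal Fortran sketch of the HDF5 calling sequence, which simply creates and closes a file (the filename is illustrative; compile with the flags from the makefile above):

```fortran
program hdf5_example
! Sketch: initialise the HDF5 Fortran interface, create a file,
! close it, and shut the interface down again.
  use hdf5
  implicit none
  integer(hid_t) :: file_id
  integer        :: hdferr

  call h5open_f(hdferr)                  ! initialise the library
  call h5fcreate_f('example.h5', H5F_ACC_TRUNC_F, file_id, hdferr)
  call h5fclose_f(file_id, hdferr)
  call h5close_f(hdferr)                 ! shut the library down
end program hdf5_example
```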
Some useful HDF5 tools are located in /usr/local/packages/hdf5/bin.
To recompile and link to this library, users should link with the following commands. For C:
mpcc_r -q64 -o main main.c -L/usr/local/packages/mpisplit/lib -lmpi_split
and for Fortran:
mpxlf90_r -qsuffix=f=f90 -q64 -o fmain fmain.f90 -L /usr/local/packages/mpisplit/lib -lfort_mpi_split
To use the library, several environment variables need to be set. SMPI_GROUP_SIZE is the number of processors in each sub-group. It should satisfy #cpus mod group_size = 0; if this condition is not met, then the group size is set to #cpus. As there will be several instances of the application running, each instance will need its own input and output files. SMPI_DIR_PREFIX sets the prefix of the I/O directory for each group of processors. The group number (preceded by a ".") is automatically appended to the prefix, so if SMPI_DIR_PREFIX=Group is set, then the separate directories for each group would be Group.1, etc. The remaining two environment variables deal with stdout and stderr. SMPI_STDOUT_REDIRECT can be set to one of three values. The default is NO, in which case all the output from each instance of the main program is written to stdout. The second option is GROUP, where a separate stdout file for each group is written to the group directory. The third option is ALL, in which case each processor writes to a separate file, again in the group directory, but with the rank of the processor appended. SMPI_STDERR_REDIRECT behaves similarly for stderr.
Below is a sample batch script file, with 32 CPUs split into 4 groups of 8, with a directory Group.n for each group's I/O files, and the stdout and stderr written separately for each group into that directory.
#@ shell = /bin/ksh
#
#@ job_name = hello
#
#@ job_type = parallel
#@ cpus = 32
#@ node_usage = not_shared
#
#@ network.MPI = csss,shared,US
#@ bulkxfer = yes
#
#@ wall_clock_limit = 00:10:00
#@ account_no = z001
#
#@ output = $(job_name).$(schedd_host).$(jobid).out
#@ error = $(job_name).$(schedd_host).$(jobid).err
#@ notification = never
#
#@ queue

# suggested environment settings:
export MP_EAGER_LIMIT=65536
export MP_SHARED_MEMORY=yes
export MEMORY_AFFINITY=MCM
export MP_TASK_AFFINITY=MCM
#
# Above lines are common settings in every batch file
#-----------------------------------------------------
# Following lines are Splitting Library specific
#
export SMPI_GROUP_SIZE=8
export SMPI_DIR_PREFIX=Group
export SMPI_STDOUT_REDIRECT=GROUP
export SMPI_STDERR_REDIRECT=GROUP

poe ./main
In this section a small number of porting issues are very briefly addressed. Mostly they are about porting from Cray systems to the IBM, but some are more general.
libblas.a is provided on IBM systems, and while it does contain a BLAS library, it is actually only the compiled version of the public-domain source, and as such has very low performance. Instead use -lessl.
As noted above, ESSL (BLAS aside) does not have standard interfaces to its routines.
On Cray systems the default real is 64 bits, which corresponds (in FORTRAN 77 terms) to double precision on IBM systems. As far as libraries are concerned, this affects porting in two ways.
call saxpy( n, 1.0, a, 1, b, 1 )

on a Cray becomes something like

call daxpy( n, 1.0d0, a, 1, b, 1 )
on the IBM. This can be done explicitly, as above, but a better method is to use Fortran 90 parameterised types.
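A sketch of the parameterised-type approach: the kind parameter is defined once, so requesting the same precision on both machines means only the choice of BLAS routine needs attention when porting (the array size and kind name here are illustrative):

```fortran
program kinds_example
! Sketch: use a named kind parameter instead of the default real,
! so the working precision is the same on Cray and IBM systems.
  implicit none
  ! At least 12 significant digits: a 64-bit real on both machines.
  integer, parameter :: wp = selected_real_kind(12)
  real(wp) :: a(10), b(10)

  a = 1.0_wp
  b = 2.0_wp
  ! With a and b declared real(wp), the double-precision BLAS
  ! routine is the correct choice on the IBM (link with -lessl).
  call daxpy(10, 1.0_wp, a, 1, b, 1)
  write(*,*) b(1)
end program kinds_example
```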
As noted above, on IBM systems you must always explicitly specify the libraries you are using on the link line, which is not always the case on Cray systems.