
FAQ


This FAQ covers the general computing environment and includes some general tips. The administration FAQ can be found here.

General User FAQ

Accessing HPCx

Compiling and Linking Codes on HPCx

Submitting Batch Jobs on HPCx

Debugging Codes on HPCx

Profiling Codes on HPCx

Software Availability on HPCx

Accounting on HPCx

The GRID on HPCx



Q: What is the full address of the HPCx computing platform?
A: login.hpcx.ac.uk.
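
For example, you would normally connect with ssh (username is a placeholder for your own user ID):

  ssh username@login.hpcx.ac.uk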


Q: I've forgotten my password for the HPCx web pages and for hpcx itself. Can you help?
A: You can administer this yourself. For the web pages, point your browser at https://www.hpcx.ac.uk/index.jsp, enter your email address, leave the password field empty and press the login button; your password will then be emailed to you. For hpcx itself, you may request a new password via the web pages by choosing the relevant account and machine from your main page and then selecting "New Password".


Q: Help, my gnome-terminal display is garbled
A: You will need to switch off your Num Lock.


Q: Are there example Makefile and LoadLeveler scripts available?
A: Click here to download a tarball containing a standard Makefile and LoadLeveler scripts.


Q: What is SMT and how do I use it?
A: By enabling simultaneous multi-threading, or SMT, each processor can support two instruction streams, allowing two threads or tasks to execute simultaneously on a single processor in an attempt to reduce the processor's idle time. These streams appear as logical processors, so in effect SMT doubles the number of processors available on the nodes your job runs on.

Enabling SMT can result in a significant performance improvement, but this depends on the individual application. It is an advanced feature that can be enabled within the LoadLeveler job script. See the Batch Processing section of the User Guide for details of how to use SMT, and this Technical Report for a more detailed study of SMT on HPCx.

Important Timings Note: When using SMT, you should be careful how you interpret the timing calls in your code. The "User" and "System" times commonly reported accumulate only when the thread concerned is actually running, so a processor running two threads will report only about half of its true execution time with a CPU timer. Make sure you use a "wallclock" or "elapsed time" timer, which measures the amount of real time spent running your application. For this purpose we recommend the MPI_Wtime() and irtc() routines, both of which are good wallclock timers.
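
For example, a minimal wallclock timing sketch in C using MPI_Wtime() (the work being timed is left as a placeholder comment):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    double t0, t1;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();            /* wallclock time before the region of interest */
    /* ... the work being timed goes here ... */
    t1 = MPI_Wtime();            /* wallclock time afterwards */

    printf("Elapsed (wallclock) time: %f seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}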


Q: Why do you recommend using only re-entrant compilers, i.e. mpxlf90_r rather than mpxlf90?
A: Several reasons:


Q: How do I compile and link my code in 64-bit mode?
A: This is achieved by compiling and linking your program with the -q64 flag set. Further information is available in the Compilation section of the User Guide.
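
For example, using the re-entrant MPI Fortran compiler mentioned above (source and executable names are placeholders):

  mpxlf90_r -q64 -c mycode.f90
  mpxlf90_r -q64 -o mycode mycode.o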


Q: Can I put both 32-bit and 64-bit versions of the same routine into a single library?
A: Yes, click here for detailed instructions. There is a separate tarball of a basic example including a Makefile here.
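
As a rough sketch (assuming AIX ar's -X32_64 option; routine and library names are placeholders), the idea is to compile the routine twice and place both objects in one archive:

  xlc_r -q32 -c myroutine.c -o myroutine32.o
  xlc_r -q64 -c myroutine.c -o myroutine64.o
  ar -X32_64 rv libmyroutine.a myroutine32.o myroutine64.o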


Q: How can I get some warning that my batch job is going to be killed?
A: It is quite straightforward to set up an alarm system within LoadLeveler such that you are notified when a certain time has elapsed. This was described in the Summer 2004 issue of Capability Computing ("Time and LoadLeveler wait for no man ..."). Here is the text of that article, along with an example C code, Makefile, LoadLeveler script and sample output.

Since the article was published we have developed Fortran interfaces to the same library that can be accessed either in a Fortran77 style:

  include 'fhpcxalarm.h'
or via a Fortran90 module:
  use fhpcxalarm

In both cases you need to set the include and link paths appropriately using -I and -L. Example Fortran codes and Makefiles for both cases are available in a single tar file, which also includes the LoadLeveler script and the original C versions for completeness. The only slight subtlety is that you must specify a different include path if you are using the module and compiling in 32-bit mode (-q32), as I have set the default to be 64-bit mode. See the comments in Makefile.f90 for details.

To differentiate the two methods of calling the library I have named the associated test files testalarm.f and testalarm.f90. However, you should note that these are both Fortran90 codes compiled with the f90 compiler - it is purely a matter of preference which method you choose to access the library from your own programs. Of course, if you are compiling with a true f77 compiler then you will not have access to the 'use' statement and the only option is to include the header.

The test code takes about a second to perform an iteration, so will not manage to complete all 25 (the value of MAXLOOP) iterations with the specified 20-second wall_clock_limit. However, as the soft limit is set to 10 seconds it is able to exit gracefully, receiving the alarm call after 10 iterations.

Note that although this test code runs under poe in a parallel queue, it is actually a serial program. When implementing the alarm system in a parallel program it is important to consider the possibility that one process might receive the signal at a slightly different time from another. You probably want to terminate the code when any of the processes is signalled. The best way to do this is probably to combine the alarm variables across all processes using a global logical or, ie an MPI_Allreduce operation with MPI_Op = MPI_LOR.
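
As a minimal sketch of that reduction in C (the per-process alarm flag handling around it is assumed to follow the article's example code):

#include <mpi.h>

/* Returns 1 on every process if any process has received the alarm signal,
   0 otherwise. local_alarm is the per-process flag set by the signal handler. */
int any_process_alarmed(int local_alarm)
{
    int global_alarm = 0;

    /* Combine the per-process flags with a global logical OR */
    MPI_Allreduce(&local_alarm, &global_alarm, 1, MPI_INT, MPI_LOR, MPI_COMM_WORLD);

    return global_alarm;
}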


Q: What can MP_EAGER_LIMIT do for me?
A: Setting export MP_EAGER_LIMIT=0 forces all messages to use the rendezvous protocol, which can help when debugging by exposing code that relies unsafely on message buffering; raising it, e.g. export MP_EAGER_LIMIT=65536, allows larger messages to be sent eagerly and may help performance. Click here for further information.


Q: How can I find out the hostnames of the nodes that I use in my parallel job?
A: The environment variable LOADL_PROCESSOR_LIST will give you a list of all the nodes that your job is using.
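
For example, adding the following line to the command section of your LoadLeveler script writes the host list to the job's standard output:

  echo "Running on nodes: $LOADL_PROCESSOR_LIST"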


Q: How can I run a task farm on HPCx, ie execute many simultaneous runs of an existing serial program?
A: This can be done using the taskfarm utility which is located in /usr/local/packages/bin/. Further information can be found in the short task farming guide, or in the task farming presentation given at the HPCx User Group meeting on 21 July 2005.


Q: How can I run many OpenMP jobs simultaneously on different SMP nodes of HPCx?
A: This can be done using the taskfarm utility which is located in /usr/local/packages/bin/. The taskfarm utility is described in general in the short task farming guide, and the associated task farming presentation (given at the HPCx User Group meeting on 21 July 2005) also explains how to use it with OpenMP programs. However, as it is slightly complicated, it is explained in detail in the short OpenMP task farming guide.


Q: How do I perform network file-transfer within a batch job?
A: The normal compute nodes of HPCx are not directly connected to the internet, so it is not possible to perform network file transfers within a simple batch job. However, the node where the serial queues run does have full network connectivity, so you can split your batch job into a series of serial and parallel job-steps and perform the file transfers within the serial step, as sketched below. There are more detailed instructions in the User Guide.
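
As a rough sketch only (class names, resource keywords, limits and the file being transferred are placeholders; see the User Guide for the exact HPCx settings), a two-step script marks each step with its own directives and queue statement, makes the parallel step depend on the serial one, and branches on the step name in the script body:

  # @ job_name         = fetch_and_run
  # @ step_name        = fetch
  # @ job_type         = serial
  # @ class            = serial
  # @ wall_clock_limit = 00:10:00
  # @ queue
  # @ step_name        = compute
  # @ dependency       = (fetch == 0)
  # @ job_type         = parallel
  # @ wall_clock_limit = 01:00:00
  # @ queue

  # The same script body runs for each step; branch on the step name.
  case $LOADL_STEP_NAME in
    fetch)   scp remote.host:mydata.dat . ;;
    compute) poe ./myprog ;;
  esac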


Q: What parallel debuggers are available on HPCx?
A: IBM provide pdbx, a parallel version of the standard dbx command-line debugger. Information on how to run this is available in the Tools section of the User Guide. We also support Etnus TotalView, the widely used GUI-based parallel debugger. Before running TotalView you must go through some setup stages which are detailed here.


Q: My job has failed with the error number 0031-250 from each task. What has happened?
A: It looks like this:
ERROR: 0031-250 task 1: Terminated
ERROR: 0031-250 task 3: Terminated
ERROR: 0031-250 task 2: Terminated
ERROR: 0031-250 task 0: Terminated
Every task has been terminated. This may be due to a specific error (e.g. look for an error message from MPI) or simply to the job hitting the LoadLeveler time limit. In the latter case, resubmit with a larger time limit (the LoadLeveler wall_clock_limit option), reduce the amount of work your job is doing, or investigate whether the code has deadlocked.


Q: Where can I find further information about a particular error number?
A: There is a full list of error numbers in the IBM documentation. The first part of the error number tells you which software product the error message came from. For example, messages beginning 0031 (such as the 0031-250 errors above) come from the Parallel Environment (PE), and more information may be found in the PE for AIX: Messages manual.

LoadLeveler error messages start with 2512, 2539 and 2544 and detailed information can be found in the LoadLeveler: Diagnosis and Messages guide.


Q: My job prints an error message and then stops. How can I find out where it stops?
A: A number of common errors, e.g. from MPI routines, print a message and stop, making it very difficult to locate the point where the error occurred. When debugging a large, complex and unfamiliar code, it could be that the program contains a stop statement somewhere deep in the code.

You can arrange for a code to abort even if it performs a stop or otherwise ends normally. The following routine, abortatexit.c, provides a simple interface, callable from Fortran, which uses the atexit Unix system call to register the abort function to be called at normal program termination. To use it, simply call abortatexit() at the start of the program.

/* abortatexit.c */

#include <stdlib.h>

/* Allow for different interfaces for a C routine called from Fortran */
#if defined CRAY_T3E || defined NT32
int ABORTATEXIT  ()
#else
#if defined IBM || defined HP
int abortatexit  ()
#else
int abortatexit_ ()
#endif
#endif
{
  /* Register the abort function to be called on normal program exit */
  atexit(abort);

  return 0;
}
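
From Fortran, the call is then simply (a minimal sketch; the program name is a placeholder):

  program myprog
    call abortatexit()
    ! ... rest of the program ...
  end program myprog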

Clearly this call should be commented out or otherwise disabled when the debugging process is complete.

Q: How can I debug dynamic memory allocation problems?
A: There are two tools available for debugging memory problems on HPCx: either use the memory debugging features of TotalView, or use the IBM-supplied memory debugging library libhmd. Both methods require users to re-link their programs with the supplied memory debugging libraries.


Q: When using the tape archive, a file 'dsmerror.log' is produced. What went wrong?
A: Nothing. The file 'dsmerror.log' is produced with every invocation of the archiving process and you can safely ignore it. As long as the file(s) you wanted to archive are listed when you query them, there is no reason to question the success of the archival process.


Q: How can I force the output to flush to the files during runtime?
A: As an alternative to calling flush to ensure that output is written to file before the wall clock limit is reached, use export XLFRTEOPTS=buffering=disable_all, which flushes ALL output. To flush only stdout and stderr (for performance reasons), use export XLFRTEOPTS=buffering=disable_preconn.
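
For example, set the option in your LoadLeveler script before launching the executable (the executable name is a placeholder):

  export XLFRTEOPTS=buffering=disable_preconn
  poe ./myprog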


Q: How do I profile a code on HPCx?
A: Details of how to use gprof, xprofiler and Vampir can be found in the Tools section of the User Guide.


Q: How can I profile a code, using gprof or xprofiler, and include the calls to LAPACK?
A: You'll need to link against the profile-enabled version of LAPACK, namely -llapack_profile, which resides in /usr/local/lib.
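
For example, a gprof-instrumented compile and link might look like this (source and executable names are placeholders; -pg enables profiling):

  xlf90_r -pg -c mycode.f90
  xlf90_r -pg -o mycode mycode.o -L/usr/local/lib -llapack_profile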


Q: How can I profile an OpenMP or mixed-mode code?
A: The Paraver tool can be used to profile MPI, OpenMP or mixed MPI-OpenMP codes on HPCx. A guide to its use on HPCx can be found here.


Q: Is VASP available on HPCx and if so, where is it?
A: VASP is not currently publicly available on HPCx; however, if you have access to this software you are welcome to install it in your own home space. Here are some example installation makefiles which you may find useful. The file makefile.lib is for the library and makefile.4.4.3 is for the actual VASP directory. These makefiles will not necessarily work with different versions of VASP, but may contain the essential pieces of information.


Q: Are there any graphics libraries available on HPCx ?
A: There are two versions of OpenGL installed on HPCx: IBM's native implementation and the Mesa software implementation (version 6.2, under /usr/local/packages/mesa/mesa6.2).

Note that the native implementation is much faster, but depending on the capabilities of your local terminal you may not be able to use it. This is why we also supply Mesa, which should work in all situations.

Here is a very simple OpenGL test code, that you can download and compile on HPCx using this Makefile (or you can try a pre-built executable), to see how to use the GL library. You should build and run as follows:

  user@hpcx$ make
          xlc_r -q64 -O3 -c ogltest.c
          xlc_r -q64 -O3 -o ogltest ogltest.o -lGL -lGLU -lglut -lXmu -lX11 -lm
  Target "all" is up to date.
  user@hpcx$ ./ogltest

and you should see output like this popping up in a window.

If you get the following error

  user@hpcx$ ./ogltest
  GLUT: Fatal Error in ogltest: OpenGL GLX extension not supported by display: l1f51:37.0

it means that the terminal you are logging in from does not have sufficient graphics capabilities. For example, if you are using a Windows PC and running eXceed then you may have to install and/or enable the GLX extensions by hand - please speak to your local system administrator.

If you are unable to use IBM's OpenGL then you should uncomment lines 9, 10 and 11 of the Makefile so that they now read:

INCGL=	-I/usr/local/packages/mesa/mesa6.2/include
LIBGL=	-L/usr/local/packages/mesa/mesa6.2/lib
LIBS=	-lMesaGL -lMesaGLU -lMesaglut -lXmu -lX11 -lm

Now build and run as before:

  user@hpcx$ make
          xlc_r -q64 -O3 -I/usr/local/packages/mesa/mesa6.2/include -c ogltest.c
          xlc_r -q64 -O3 -I/usr/local/packages/mesa/mesa6.2/include -L/usr/local/packages/mesa/mesa6.2/lib -o ogltest ogltest.o -lMesaGL -lMesaGLU -lMesaglut -lXmu -lX11 -lm
  Target "all" is up to date.
  user@hpcx$ ./ogltest
and you should now see the test picture, or you can use the pre-built Mesa executable.

There is more information on the WWW regarding Mesa.


Q: How do I access different versions of the MASS library?
A: To use a specific version of the MASS library you have to link explicitly against the appropriate library. The libraries can be found in the directories /usr/lpp/massX.Y, where X.Y is the version number. The default MASS library is held in /usr/lpp/mass. NB the most up-to-date MASS library is now included as part of IBM's compilers.
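
For example, to link explicitly against a particular version (a sketch; X.Y stands for the version number as above and the source name is a placeholder):

  xlf90_r -O3 -o mycode mycode.f90 -L/usr/lpp/massX.Y -lmass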


Q: If I specify fewer than 16 processes on each 16-way node (under-population of nodes with tasks), at what rate will I be charged?
A: You will be charged at the full rate of 16 processes per node, regardless of the number of processors you actually use. Application nodes are NOT shared with other users.


Q: How do I get a gridmap entry?
A: You can request this from the admin site https://www.hpcx.ac.uk. Log in and go to the Your User Accounts section. Click the View button for the account you want a gridmap entry for, then use the Add button in the Globus Certificates section.

Q: How do I submit a batch job via Globus?
A: Use globus-job-submit or globusrun in the usual way, but send the jobs to the LoadLeveler jobmanager, e.g.

globus-job-submit login.hpcx.ac.uk:2119/jobmanager-loadleveler -np 4 -maxtime 2 -project e15 /bin/date
Note that you must specify maxtime and project just as you would for a manual LoadLeveler submission.

Q: How do I make network connections from the back-end?
A: The back-end nodes are not directly connected to the internet, so it is not possible to make socket connections directly from the parallel batch queues (serial jobs do have internet access). However, grid-enabled applications such as those using MPICH-G2 can use a SOCKS server running on the login node to forward network connections to and from the back-end. To enable this you need to add the socks wrapper object file when linking your application, e.g.

/usr/local/packages/globus/mpich-1.2.7_32/bin/mpicc -L/usr/local/packages/dante/lib.32  -o ring /usr/local/packages/dante/lib.32/socks_wrapper.o ring.c -lsocks
Use the 32-bit or 64-bit versions of these libraries as appropriate.
