Time and LoadLeveler wait for no man ...

In this issue, I outline how you can arrange for your program to be
warned some time in advance that it is about to run out of time in the
batch queue.

Q: I try always to specify the wall_clock_limit in my LoadLeveler script
   to be as close as possible to the actual runtime. However, if I set
   it even slightly too small then my job is killed before it completes
   and I get no output. What should I do?

A: A brief answer is given below: for full details, see "How can I get
   some warning that my batch job is going to be killed?" at

You're right that it's good to specify as small a wall_clock_limit as
possible. This is the only information that LoadLeveler has about the
runtime when it schedules jobs, and there will be times when it is
specifically looking for short jobs to fill up temporary gaps in the
queues. However, LoadLeveler also takes the limits very seriously and,
as you have noticed, will kill jobs that exceed the limit by even a

Below I describe a way of setting an alarm call so that your job is
notified some specified time (eg several minutes) in advance of being
killed so you can save data to disk and exit gracefully.

First a note on how this should be used. It's very unlikely that your
program will be able to checkpoint itself at arbitrary places in the
code. However, it is very possible that you have some large outer loop
(eg over timesteps or iterations) and that you can, in principle,
checkpoint at the end of any loop. In such a case you would test for the
alarm at the end of each loop and exit if it has gone off. If your main
loop takes about five minutes, and it takes two minutes to save to disk,
then setting an alarm value of eight minutes should be safe. Even in the
worst case when the alarm goes off immediately after you have tested for
it, you will have time to execute one more complete loop and still
checkpoint safely.

The procedure relies on the fact that you can specify two
wall_clock_limit values to LoadLeveler: a hard one and a soft one. If you
only set a single limit it is taken to be a hard limit, at which point
the system terminates your program. However, you can also specify an
additional soft limit (less than the hard one) at which point the system
issues a signal, SIGXCPU, indicating that this soft limit has been
reached. By default this signal is ignored, but a user can choose to
trap it and trigger a bespoke signal handler. The simplest approach is
to set some special alarm variable whose value can be tested from user

The normal LoadLeveler syntax is

 #@ wall_clock_limit = hardlimit

with the limit specified as hours:minutes:seconds. For example, with

 #@ wall_clock_limit = 01:30:00

your program will be terminated after 90 minutes. The full syntax is

 #@ wall_clock_limit = hardlimit [, softlimit]

eg with

 #@ wall_clock_limit = 01:30:00, 01:25:00

your program would be sent notification (via SIGXCPU) after 85 minutes,
five minutes before it is terminated.

I have implemented a couple of simple routines, HPCxAlarmSet and
HPCxAlarm, so that you can use this facility without having to bother
about the details of signal handlers under AIX.

There is one slight subtlety regarding which program is actually sent
the signal. In the normal situation your job comprises a script which
LoadLeveler executes. Unfortunately, in this case the system sends the
SIGXCPU signal to the script and NOT the user program, and this can
cause problems.

The trick is to run a job without any associated script. The LoadLeveler
parameters "executable" and "arguments" allow you to do this.

If the last lines of your current script are:

  #@ queue
  poe ./a.out

then you should replace them by

  #@ executable = /usr/bin/poe
  #@ arguments  = ./a.out
  #@ queue

Note that if your current script doesn't contain the explicit call to
"poe" (eg you simply have "./a.out") you must still specify
"/usr/bin/poe" as the executable: for parallel jobs, the system is
actually using poe automatically to launch your program at runtime.

To use the alarm system in C programs, you need to include the header
"hpcxalarm.h" and link against "libhpcxalarm.a". These files are located
in /usr/local/packages/include/ and /usr/local/packages/lib/
respectively. The same library should work for both 32-bit and 64-bit

I am currently working on a Fortran interface: keep an eye on the FAQ
entry noted above for news of any progress!

A typical C code would look something like:

#include <stdio.h>
#include "hpcxalarm.h"

#define MAXLOOP 100     /* Illustrative value only */

int main(void)
{
  int loop;
  HPCxAlarmSet();       /* Set up the alarm */

  for (loop=0; loop < MAXLOOP; loop++)
  {
    ...                 /* Do all the complicated work for this loop */

    if ( HPCxAlarm() )  /* Has the alarm gone off yet? */
    {
      printf("WARNING: Alarm Call Received!\n");
      loop++;           /* Count the loop just completed */
      break;
    }
  }

  printf("Checkpointing after completing %d iterations\n", loop);
  return 0;
}

A working test program (including a Makefile and LoadLeveler script) and
sample output are also available.

This library can also be called from within Fortran programs, either in
a Fortran77 style:

  include 'fhpcxalarm.h'

or via a Fortran90 module:

  use fhpcxalarm

In both cases you need to set the include and link paths appropriately
using -I and -L. Example Fortran codes and Makefiles for both cases are
available in a single tar file, which also includes the C versions for
completeness.

If you want to use this from within a parallel program, see the note at
the end of the FAQ entry.