http://www.hpcx.ac.uk/support/documentation/IBMdocuments/sg246038.pdf
http://www.hpcx.ac.uk/support/documentation/IBMdocuments/a2278810.pdf
http://www.hpcx.ac.uk/support/documentation/IBMdocuments/a2278820.pdf
LoadLeveler provides the facility to build, submit and process batch jobs on the HPCx system. Users can interact with LoadLeveler either using the command line, or using the LoadLeveler GUI xloadl (see here).
The following table gives the runtime limits the present LoadLeveler setup allows, depending on the number of tasks/threads. The table assumes 16 tasks/threads are allocated per node.
| max. tasks/threads | max. runtime | Comment |
| 16 | 48h | Largest OpenMP job |
| 128 | 48h | Largest Capacity job |
| 1024 | 48h | Capability jobs |
| 1536 | 1h | largest partition available |
| serial | 12h | special serial queue |
There are special queues offering a higher priority for testing and
debugging jobs using up to 128 tasks. These queues are available from
9:00 to 18:00 Monday to Friday. To access these queues you have to
ask for a wall_clock_limit of 20 minutes or less.
Users should note that capacity and capability jobs are run in different regions of the machine. Capability jobs often achieve a faster turn-around time. Users are therefore encouraged to run on 256 processors or more. If you require help to achieve scalability to this level then please contact the helpdesk (helpdesk@hpcx.ac.uk). Users of OpenMP should note that it is possible to run many separate OpenMP jobs simultaneously within a single script, enabling them to submit jobs using more than 16 processors (the limit for a single OpenMP program). For details on how to do this see the FAQ entry http://www.hpcx.ac.uk/support/FAQ/#omptaskfarm.
Simultaneous Multithreading (SMT) is a new feature of the Power5 processor that allows two task/threads to be executed simultaneously by a single CPU. If SMT is enabled, the numbers of tasks/threads in the table above should be multiplied by two. See here for further details.
Every LoadLeveler job must be submitted through a job command file. This describes the job to be submitted and contains a number of LoadLeveler keyword statements which specify the various resources needed by the job.
Job command files can either be created using a text editor by using the xloadl tool. Job command files have no specific naming convention, and may be given whatever name you wish. LoadLeveler keywords in the file are case insensitive.
This is a simple LoadLeveler script for running on 256 processors. It
also contains our presently recommended setting of the environment
variables for production jobs. These setting and values
proved to give good performance for a wide variety of applications.
Tweaking individual settings might lead to further improvements for
some applications.
For a development
job MP_EAGER_LIMIT should be changed to 0.
#@ shell = /bin/ksh # #@ job_name = myrun # #@ job_type = parallel #@ cpus = 256 #@ node_usage = not_shared # #@ network.MPI = csss,shared,US #@ bulkxfer = yes # #@ wall_clock_limit = 00:20:00 #@ account_no = z001 # #@ output = $(job_name).$(schedd_host).$(jobid).out #@ error = $(job_name).$(schedd_host).$(jobid).err #@ notification = never # #@ queue # suggested environment settings: export MP_EAGER_LIMIT=65536 export MP_SHARED_MEMORY=yes export MEMORY_AFFINITY=MCM export MP_TASK_AFFINITY=MCM poe ./my_executable
Note that all the lines in this command file are required, unless stated below. The meaning of each line in this command file is as follows:
shell = /bin/ksh
Specifies the shell to be used for the job.
job_name = my_run
This allows you to give a name to the job. This is not mandatory, but is
useful for identifying output files (see later).
job_type = parallel
This informs LoadLeveler that this is a parallel job which requires
scheduling on multiple processors.
cpus = 256
Specify the number of CPUs you need. This can be
any number, however since the machine is partitioned into LPARs of 16
CPUs, we recommend multiples of 16, e.g. 16, 32, 48, 64, .... If you
select a number of CPUs different from the above, you will be
charged for the next full multiple of 16. To be specific, if you ask
for 59 CPUs, you will be charged for
4 x 16 = 64 CPUs.
This is because your jobs is granted exclusive access to the
frames it occupies.
node_usage = not_shared
This ensures that your job will have exclusive use of the LPAR.
network.MPI = csss,shared,US
By default, MPI jobs on HPCx use IP communications not the dedicated
US user-space communications.
As US has better latency, we recommend that this should be used in MPI
scripts by default. However, if you use a single LPAR you may see something
similar to the following error:
US is not valid for single nodes. Delete "<whatever #@network is given>" or use IPIf this is the case then do not include this line as US communications do not function on a single LPAR. By leaving out the line, IP communications will be used and this error should not occur.
bulkxfer = yes
Switches on Remote Direct
Memory Access (RDMA) to boost the maximum bandwidth available via the
switch network. We expect this to be beneficial to most codes, but
a few applications might see a decrease in performance. We recommend
comparing the performance of your application with and without this
line in the script.
wall_clock_limit = 00:20:00
Specifies a wall clock limit of 20 minutes for the job. The wall
clock limit has the format hh:mm:ss or mm:ss. The time choosen
has to be shorter than the time limit for the number of tasks you are using.
See here for details.
account_no = z001
You must set account_no to a valid budget for your project (e.g.
y001, y002, ...). You can determine the budgets you have access to with the
budgets command:
$ budgets z001: 1000000 AU 416666:40:0
output = $(job_name).$(schedd_host).$(jobid).out
error = $(job_name).$(schedd_host).$(jobid).err
These lines specify the files to which stdout and stderr from your job
will be redirected. There is no default, so you must set something here.
The use of $(schedd_host).$(jobid) is recommended as this matches
the hostid/jobid reported by LoadLeveler (see below).
notification = never
Suppresses email
notification of job completion. To receive email notification, please
see below.
queue
This line tells LoadLeveler to enqueue the job: this is essential!
export MP_EAGER_LIMIT=65536
Tweak the MPI library.
Most applications perform best for a value of 65536, however you might
want to check your application. When running on more than 256 CPUs,
poe will reduce the MP_EAGER_LIMIT to a lower value.
Ignore the corresponding warning in your error file.
For development work set export MP_EAGER_LIMIT=0.
If your code doesn't work for MP_EAGER_LIMIT=0, there is a
problem with the way it uses MPI, which needs mending.
export MP_SHARED_MEMORY=yes
Use shared memory inside your logical partition (frame). Don't change.
export MEMORY_AFFINITY=MCM
Use the memory closest to the cpu.
export TASK_AFFINITY=MCM
Ensure that processes do not migrate between MCMs.
poe ./my_executable
This line executes an MPI executable callel my_executable in the
current directory. poe is the MPI job launcher: note that you
do not have to specify the number of processes here: it is
automatically derived from the number of
requested with the LoadLeveler keyword cpus.
By default LoadLeveler uses your login shell to interpret the command file. You may need to modifiy this line to use the syntax of your login shell for setting environment variables. Alternatively, you can specify a different shell to interpret the command file by adding a line such as:
#@ shell = /bin/ksh
Adding the optional line
#@ tasks_per_node = 15above the ``
#@ queue'', will place 15 MPI-tasks on a single LPAR.
In this case you should specify multiples of 15 for the required
number of CPUs (LoadLeveler keyword cpus).
Remember you still have to pay for 16 CPUs per LPAR.
#@ shell = /bin/ksh # #@ job_name = my_openmp_run # #@ job_type = parallel #@ cpus = 1 #@ node_usage = not_shared # #@ wall_clock_limit = 00:20:00 #@ account_no = z001 # #@ output = $(job_name).$(schedd_host).$(jobid).out #@ error = $(job_name).$(schedd_host).$(jobid).err #@ notification = never # #@ queue # export XLSMPOPTS=spins=0:yields=0 export OMP_NUM_THREADS=16 ./my_omp_executable
Comment:
The lines
#@ job_type = parallel #@ cpus = 1 #@ node_usage = not_sharedensure LoadLeveler dedicates a full frame with 16 processors to your job but doesn't place more than 1 task. Obviously your account gets charged for all 16 processors of the frame dedicated to your job. The command export XLSMPOPTS=spins=0:yields=0 forces the threads to adopt busy-waiting where they keep executing in a tight loop looking for new work instead of going to sleep.
Please email support@hpcx.ac.uk if you are interested in running mixed MPI/OpenMP applications.
#@ core_limit = 10mbto the header of your LoadLeveler script. The specified size applies to each MPI task individually. The files get truncated to this size. For large core files this might need adjusting.
Simultaneous Multi-Threading is now a user configurable option within
LoadLeveler, and is available for all jobs using
``#@ node_usage = not_shared''. This excludes serial and
interactive jobs. SMT allows two separate threads or tasks to be executed
simultaneously on each physical processor, splitting it into two
logical processors.
To enable SMT for a job, you must include the SMT feature keyword in your job submission script:
#@ requirements = ( Feature == "SMT" )
SMT will be enabled prior to the beginning of your job, and disabled at completion, making 32 processors available per node. For MPI jobs, you should therefore increase your tasks per node by setting:
#@ tasks_per_node = 32
Similarly, for Mixed or OpenMP jobs,
export OMP_NUM_THREADS=32
As usual, nodes can be underpopulated if you require this (MPI tasks or OpenMP threads less than 32).
Many applications stand to benefit significantly from SMT, but please note that this is an advanced feature, and the performance of some applications may not improve. For more details of SMT and its potential performance benefits, see:
http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0604.pdf
Important Note: When using SMT, you should be careful how you
interpret timing calls to your code. The "User" and "System" times
commonly reported accumulate only when the thread concerned is
actually running, hence a processor running two threads will report
only about half of its true execution time with a CPU timer. It is
important to make sure you use a "wallclock" or "elapsed time" timer,
which measures the amount of real time spent running your
application. For this purpose, we recommend the
MPI_Wtime() and irtc() routines, both of which are
good wallclock timers.
Please email support@hpcx.ac.uk if you experience problems when using SMT.
#@ shell = /bin/ksh # #@ job_name = my_serial_run # #@ job_type = serial #@ node_usage = shared # #@ wall_clock_limit = 00:20:00 #@ account_no = z001 # #@ output = $(job_name).$(schedd_host).$(jobid).out #@ error = $(job_name).$(schedd_host).$(jobid).err #@ notification = never # #@ queue # ./my_serial_executable
This job will run in a serial class: see Section 8.3.
For serial jobs ``#@ node_usage = shared'' has to be specified.
#@ wall_clock_limit = 00:10:00 #@ account_no = z001 #@ job_name = hello #@ output = $(job_name).$(jobid).$(stepid).out #@ error = $(job_name).$(jobid).$(stepid).err #@ job_type = serial #@ node_usage = shared #@ executable = serial_script1.sh #@ step_name = serial_1 #@ queue #@ executable = parallel_script.sh #@ node_usage = not_shared #@ dependency = serial_1 == 0 #@ job_type = parallel #@ step_name = parallel_1 #@ cpus=16 #@ queue #@ node_usage = shared #@ executable = serial_script2.sh #@ dependency = parallel_1 == 0 #@ job_type = serial #@ step_name = serial_2 #@ queueThe LoadLeveler script consists of a series of job-steps seperated by the queue keyword. Any keyword that is not explicitly overridden is inherited by the following steps. The commands to be run at each step needs to be places in a seperate script. In the above example each step only continues if the previous step exited with a zero return value.
Job allocation and queueing policy are controlled to some extent by machine configuration which can change to accommodate changing user demand, so only a brief overviw of a `typical' configuration is given here
LoadLeveler execution queues are called classes.
Jobs submitted via llsubmit (see below) are automatically assigned to the appropriate class based on the job type, number of CPUs and wall clock limit requested in the LoadLeveler command file. This also applies to serial jobs.
Any given job will be allocated to the class whose number of CPUs and wall clock time is the next above that which has been requested in the LoadLeveler script file.
Allocation of processors for parallel jobs is by node (i.e. in multiples of 16), and is exclusive, even if not all the processors on the node are used by a job. The relevant budget will be charged with the wall clock time multiplied by the number of nodes. Regardless of how many processors in each node are actually used each node will be charges as 16 CPUs. The same conditions apply to all parallel jobs.
LoadLeveler is operated with a backfilling policy in operation. This means that for jobs requesting large number of nodes, LoadLeveler will identify a time slot in the future for the job to run in. It will then prevent smaller jobs from running which would impinge on this time slot. Therefore, small jobs may not run when there appear to be sufficient free processors available. This policy means that large jobs will not get ``frozen out'' because sufficient processor are never available at the same time.
Of course, this sometimes has an impact on the throughput of smaller jobs, but users should never ask for a longer wall-clock time than a job needs in the hope that this will speed up the throughput of the job. It may well do so occasionally, but overall, the ``optimal'' strategy is always to ask for just as much wall clock time as it is believed the job will need.
The available classes can be viewed using the llclass command.
The column
``Description'' informs you about the time and
CPU limit of each class.
Jobs are submitted to the system via the llsubmit command, or via the xloadl GUI (see Section 8.9).
For example:
$ llsubmit hello.ll
where hello.ll is the LoadLeveler command file. The filter
does a number of checks on your command file. If it spots a problem,
you will receive an error message and the job will not be submitted.
The llq command displays information on the current status of
LoadLeveler jobs, as does the xloadl GUI.
The status column (ST) provides details of the current status of all the jobs on the system. The most common states are: R (Running), I (Idle), ST (Starting) and C (completed). Queueing jobs will usually be idle, signifying that they are being considered to run, but a machine has not yet been selected to run the job on.
You may find out further details on one of your own specific jobs using the -s flag.
To cancel a job you should use the llcancel command:
$ llcancel l3f42.124.0 llcancel: Cancel command has been sent to the central manager.
Use the llstat command to display information about the overall status of the system, including which jobs are running on which frames.
The llstatus command can be used to display a much more comprehensive (and arguably, less comprehensible) summary of the machines state.
All the sample LoadLeveler script files in this section contain the line:
#@ notification = never
This means that users are not informed (by email) when a batch job completes.
To receive an email message on a remote machine when a batch job completes, edit the LoadLeveler script and replace:
#@ notification = neverwith:
#@ notify_user = email-address@another.machine.somewhere.else #@ notification = completewhere email-address@another.machine.somewhere.else is the address to which notification is to be sent. One can also use always instead of complete.
The main xloadl window has three panels. The top panel show the current queue status (equivalent to the output from llq), the centre panel shows the machine status (equivalent to the output from llstatus), and the bottom panel shows messages from LoadLeveler, including the output from LoadLeveler commands executed via xloadl.
To submit a job, click on File -> Submit a job... in any of the panels. This pops up a new window, which allows you to select, edit and submit a command file.
To view job details, click on the job in the top panel and then click on Actions -> Details.
To cancel a job, select Actions -> Cancel in the top panel.