Batch Processing

Batch processing on the HPCx system is controlled using LoadLeveler, a batch job scheduling application produced by IBM. IBM's LoadLeveler documentation is available at:

http://www.hpcx.ac.uk/support/documentation/IBMdocuments/sg246038.pdf
http://www.hpcx.ac.uk/support/documentation/IBMdocuments/a2278810.pdf
http://www.hpcx.ac.uk/support/documentation/IBMdocuments/a2278820.pdf


Batch System Overview

LoadLeveler provides the facility to build, submit and process batch jobs on the HPCx system. Users can interact with LoadLeveler either from the command line or through the LoadLeveler GUI, xloadl (see Using the xloadl GUI, below).

The following table gives the runtime limits allowed by the present LoadLeveler setup, depending on the number of tasks/threads. The table assumes that 16 tasks/threads are allocated per node.

  max. tasks/threads   max. runtime   Comment
  16                   48h            largest OpenMP job
  128                  48h            largest capacity job
  1024                 48h            capability jobs
  1536                 1h             largest partition available
  serial               12h            special serial queue

There are special queues offering a higher priority for testing and debugging jobs using up to 128 tasks. These queues are available from 9:00 to 18:00, Monday to Friday. To access them, request a wall_clock_limit of 20 minutes or less.
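For example, a command file whose header contains the following lines (with the other keywords as in the examples later in this section) would be eligible for these queues:

  #@ cpus = 128
  #@ wall_clock_limit = 00:15:00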

Users should note that capacity and capability jobs run in different regions of the machine. Capability jobs often achieve a faster turn-around time, so users are encouraged to run on 256 processors or more. If you require help in achieving scalability at this level, please contact the helpdesk (helpdesk@hpcx.ac.uk).

Users of OpenMP should note that it is possible to run many separate OpenMP jobs simultaneously within a single script, enabling them to submit jobs using more than 16 processors (the limit for a single OpenMP program). For details on how to do this, see the FAQ entry http://www.hpcx.ac.uk/support/FAQ/#omptaskfarm.

Simultaneous Multithreading (SMT) is a new feature of the Power5 processor that allows two tasks/threads to be executed simultaneously by a single CPU. If SMT is enabled, the numbers of tasks/threads in the table above should be multiplied by two. See Enabling Simultaneous Multi-Threading (SMT), below, for further details.

Creating a job command file

Every LoadLeveler job must be submitted through a job command file. This describes the job to be submitted and contains a number of LoadLeveler keyword statements which specify the various resources needed by the job.

Job command files can be created either with a text editor or with the xloadl tool. They have no specific naming convention and may be given whatever name you wish. LoadLeveler keywords in the file are case insensitive.


MPI job command files

This is a simple LoadLeveler script for running on 256 processors. It also contains our presently recommended environment variable settings for production jobs. These settings have proved to give good performance for a wide variety of applications, though tweaking individual settings might lead to further improvements for some codes. For a development job, MP_EAGER_LIMIT should be changed to 0.

  #@ shell = /bin/ksh
  #
  #@ job_name = myrun
  #
  #@ job_type = parallel
  #@ cpus = 256
  #@ node_usage = not_shared
  #
  #@ network.MPI = csss,shared,US
  #@ bulkxfer = yes
  #
  #@ wall_clock_limit = 00:20:00
  #@ account_no = z001
  #
  #@ output = $(job_name).$(schedd_host).$(jobid).out
  #@ error  = $(job_name).$(schedd_host).$(jobid).err
  #@ notification = never
  #
  #@ queue
  
  # suggested environment settings:
  export MP_EAGER_LIMIT=65536
  export MP_SHARED_MEMORY=yes
  export MEMORY_AFFINITY=MCM
  export MP_TASK_AFFINITY=MCM

  poe ./my_executable

Note that all the lines in this command file are required unless stated otherwise below. The meaning of each line is as follows:

shell = /bin/ksh
Specifies the shell to be used for the job.

job_name = myrun
This allows you to give a name to the job. This is not mandatory, but is useful for identifying output files (see later).

job_type = parallel
This informs LoadLeveler that this is a parallel job which requires scheduling on multiple processors.

cpus = 256
Specify the number of CPUs you need. This can be any number; however, since the machine is partitioned into LPARs of 16 CPUs, we recommend multiples of 16, e.g. 16, 32, 48, 64, .... If you select a number of CPUs that is not a multiple of 16, you will be charged for the next full multiple of 16: for example, if you ask for 59 CPUs, you will be charged for 4 x 16 = 64 CPUs. This is because your job is granted exclusive access to the frames it occupies.

node_usage = not_shared
This ensures that your job will have exclusive use of the LPAR.

network.MPI = csss,shared,US
By default, MPI jobs on HPCx use IP communications rather than the dedicated US (user-space) communications. As US has lower latency, we recommend using it in MPI scripts by default. However, if you use a single LPAR you may see an error similar to the following:

  US is not valid for single nodes. Delete "<whatever #@network is given>"
  or use IP

If this is the case, do not include this line, as US communications do not function on a single LPAR. By leaving out the line, IP communications will be used and the error should not occur.
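As an illustration, a minimal command file for a single-LPAR (16-CPU) MPI job is the same as the example above with the network.MPI line omitted (the job name here is illustrative):

  #@ shell = /bin/ksh
  #@ job_name = single_lpar_run
  #@ job_type = parallel
  #@ cpus = 16
  #@ node_usage = not_shared
  #@ wall_clock_limit = 00:20:00
  #@ account_no = z001
  #@ output = $(job_name).$(schedd_host).$(jobid).out
  #@ error  = $(job_name).$(schedd_host).$(jobid).err
  #@ notification = never
  #@ queue

  poe ./my_executable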

bulkxfer = yes
Switches on Remote Direct Memory Access (RDMA) to boost the maximum bandwidth available via the switch network. We expect this to be beneficial to most codes, but a few applications might see a decrease in performance. We recommend comparing the performance of your application with and without this line in the script.

wall_clock_limit = 00:20:00
Specifies a wall clock limit of 20 minutes for the job. The wall clock limit has the format hh:mm:ss or mm:ss. The time chosen must be shorter than the time limit for the number of tasks you are using; see the table in the Batch System Overview, above, for details.

account_no = z001
You must set account_no to a valid budget for your project (e.g. z001, z002, ...). You can determine the budgets you have access to with the budgets command:

  $ budgets
  z001:    1000000 AU      416666:40:0

output = $(job_name).$(schedd_host).$(jobid).out
error = $(job_name).$(schedd_host).$(jobid).err
These lines specify the files to which stdout and stderr from your job will be redirected. There is no default, so you must set something here. The use of $(schedd_host).$(jobid) is recommended as this matches the hostid/jobid reported by LoadLeveler (see below).

notification = never
Suppresses email notification of job completion. To receive email notification, please see below.

queue
This line tells LoadLeveler to enqueue the job: this is essential!

export MP_EAGER_LIMIT=65536
Tunes the MPI library's eager-send message-size threshold. Most applications perform best with a value of 65536, but you may want to check your own application. When running on more than 256 CPUs, poe will reduce MP_EAGER_LIMIT to a lower value; ignore the corresponding warning in your error file. For development work, set MP_EAGER_LIMIT=0. If your code does not work with MP_EAGER_LIMIT=0, there is a problem with the way it uses MPI, which needs mending.

export MP_SHARED_MEMORY=yes
Use shared memory inside your logical partition (frame). Don't change.

export MEMORY_AFFINITY=MCM
Use the memory closest to the CPU.

export MP_TASK_AFFINITY=MCM
Ensure that processes do not migrate between MCMs.

poe ./my_executable
This line runs an MPI executable called my_executable in the current directory. poe is the MPI job launcher; note that you do not have to specify the number of processes here, as it is derived automatically from the number requested with the LoadLeveler keyword cpus.

By default, LoadLeveler uses your login shell to interpret the command file. You may need to modify the export lines above to use the syntax of your login shell for setting environment variables. Alternatively, you can specify a different shell to interpret the command file by adding a line such as:

#@ shell = /bin/ksh
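For example, if your login shell is csh and you omit the #@ shell line, the environment settings shown earlier would need csh syntax:

  # csh equivalents of the export lines above
  setenv MP_EAGER_LIMIT 65536
  setenv MP_SHARED_MEMORY yes
  setenv MEMORY_AFFINITY MCM
  setenv MP_TASK_AFFINITY MCM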

MPI job with fewer than 16 tasks per LPAR

Some applications perform better when leaving a number of processors on your LPAR unused. This leaves these processors free to deal with the needs of the operating system and threads originating from the MPI library. Keep in mind that you will be charged for these processors.

Adding the optional line

  #@ tasks_per_node = 15
above the ``#@ queue'' will place 15 MPI tasks on each LPAR. In this case you should specify a multiple of 15 for the required number of CPUs (LoadLeveler keyword cpus). Remember that you are still charged for 16 CPUs per LPAR.
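For example, to run 60 MPI tasks with 15 per LPAR, occupying (and being charged for) four full 16-CPU LPARs:

  #@ cpus = 60
  #@ tasks_per_node = 15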

OpenMP job command files

The HPCx Phase 2a system is operated as a cluster of 16-way SMPs. This makes it an attractive resource for running shared-memory applications using OpenMP. Below is a simple sample script for running an OpenMP code with 16 threads.

  #@ shell = /bin/ksh
  #
  #@ job_name = my_openmp_run
  #
  #@ job_type = parallel
  #@ cpus = 1
  #@ node_usage = not_shared
  #
  #@ wall_clock_limit = 00:20:00
  #@ account_no = z001
  #
  #@ output = $(job_name).$(schedd_host).$(jobid).out
  #@ error  = $(job_name).$(schedd_host).$(jobid).err
  #@ notification = never
  #
  #@ queue
  #
  
  export XLSMPOPTS=spins=0:yields=0
  export OMP_NUM_THREADS=16

  ./my_omp_executable

Comment:
The lines

  #@ job_type = parallel
  #@ cpus = 1
  #@ node_usage = not_shared
ensure that LoadLeveler dedicates a full frame with 16 processors to your job but places no more than one task on it. Your account is, of course, charged for all 16 processors of the frame dedicated to your job. The command export XLSMPOPTS=spins=0:yields=0 forces the threads to busy-wait: instead of going to sleep, they keep executing in a tight loop looking for new work.

Mixed MPI/OpenMP job command files

Please email support@hpcx.ac.uk if you are interested in running mixed MPI/OpenMP applications.
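Purely as an illustrative sketch, and not a supported recipe (please check the details with support@hpcx.ac.uk), a mixed job might combine the MPI keywords described above with an OpenMP thread count. Here four MPI tasks per LPAR each run four OpenMP threads, and the executable name is hypothetical:

  #@ job_type = parallel
  #@ cpus = 16
  #@ tasks_per_node = 4
  #@ node_usage = not_shared
  #@ queue

  # 4 MPI tasks per LPAR x 4 threads each = 16 CPUs per LPAR (assumed)
  export OMP_NUM_THREADS=4
  poe ./my_mixed_executable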


Enabling core files

By default a failing job on HPCx will not produce a core file. To enable the production of core files, you need to add a line similar to
  #@ core_limit = 10mb
to the header of your LoadLeveler script. The specified size applies to each MPI task individually; core files are truncated to this size, so for large core files the limit might need increasing.


Enabling Simultaneous Multi-Threading (SMT)

Simultaneous Multi-Threading is now a user-configurable option within LoadLeveler and is available for all jobs using ``#@ node_usage = not_shared''; this excludes serial and interactive jobs. SMT allows two separate threads or tasks to be executed simultaneously on each physical processor, splitting it into two logical processors.

To enable SMT for a job, you must include the SMT feature keyword in your job submission script:

  #@ requirements = ( Feature == "SMT" )

SMT will be enabled prior to the beginning of your job, and disabled at completion, making 32 processors available per node. For MPI jobs, you should therefore increase your tasks per node by setting:

  #@ tasks_per_node = 32

Similarly, for Mixed or OpenMP jobs,

  export OMP_NUM_THREADS=32

As usual, nodes can be underpopulated if required (fewer than 32 MPI tasks or OpenMP threads per node).
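Putting these keywords together, the header of a fully populated two-LPAR SMT MPI job might include (the cpus value is illustrative):

  #@ requirements = ( Feature == "SMT" )
  #@ tasks_per_node = 32
  #@ cpus = 64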

Many applications stand to benefit significantly from SMT, but please note that this is an advanced feature, and the performance of some applications may not improve. For more details of SMT and its potential performance benefits, see:

http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0604.pdf

Important Note: When using SMT, be careful how you interpret timing calls in your code. The "User" and "System" times commonly reported accumulate only while the thread concerned is actually running, so a processor running two threads will report only about half of its true execution time through a CPU timer. Make sure you use a "wallclock" (elapsed-time) timer, which measures the amount of real time spent running your application. For this purpose we recommend the MPI_Wtime() and irtc() routines, both of which are good wallclock timers.

Please email support@hpcx.ac.uk if you experience problems when using SMT.

Serial job command files

  #@ shell = /bin/ksh
  #
  #@ job_name = my_serial_run
  #
  #@ job_type = serial
  #@ node_usage = shared
  #
  #@ wall_clock_limit = 00:20:00
  #@ account_no = z001
  #
  #@ output = $(job_name).$(schedd_host).$(jobid).out
  #@ error  = $(job_name).$(schedd_host).$(jobid).err
  #@ notification = never
  #
  #@ queue
  #
  
  ./my_serial_executable

This job will run in a serial class: see Section 8.3.

For serial jobs, ``#@ node_usage = shared'' must be specified.

Chaining serial and parallel jobs

LoadLeveler allows a batch job to be broken into a series of dependent job-steps. Each job-step progresses through the queues independently but will not start until all of the job-steps it depends on have completed. This can be very useful for adding serial pre- or post-processing steps to a parallel job. It is particularly important on HPCx if you wish to copy data to and from a remote system, because only the login node and the node that runs the serial classes can make network connections to remote systems.

  #@ wall_clock_limit = 00:10:00
  #@ account_no = z001
  #@ job_name = hello
  #@ output = $(job_name).$(jobid).$(stepid).out
  #@ error = $(job_name).$(jobid).$(stepid).err
  #@ job_type = serial
  #@ node_usage = shared
  #@ executable = serial_script1.sh
  #@ step_name = serial_1
  #@ queue
  #@ executable = parallel_script.sh
  #@ node_usage = not_shared
  #@ dependency = serial_1 == 0
  #@ job_type = parallel
  #@ step_name = parallel_1
  #@ cpus = 16
  #@ queue
  #@ node_usage = shared
  #@ executable = serial_script2.sh
  #@ dependency = parallel_1 == 0
  #@ job_type = serial
  #@ step_name = serial_2
  #@ queue
The LoadLeveler script consists of a series of job-steps separated by the queue keyword. Any keyword that is not explicitly overridden is inherited by the following steps. The commands to be run at each step need to be placed in a separate script. In the above example, each step only continues if the previous step exited with a zero return value.
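As a hypothetical sketch of one such step script (host and file names are illustrative), serial_script1.sh might stage input data from a remote system and signal success through its exit status:

  #!/bin/ksh
  # Stage input data from a remote system; only the login node and the
  # serial nodes can make such network connections.
  scp user@remote.host.ac.uk:/path/to/input.dat $HOME/input.dat
  # The dependency serial_1 == 0 means the parallel step runs only if
  # this script exits with status 0.
  exit $?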


Batch job allocation and queueing policy

Job allocation and queueing policy are controlled to some extent by the machine configuration, which can change to accommodate changing user demand, so only a brief overview of a `typical' configuration is given here.

LoadLeveler execution queues are called classes.

Jobs submitted via llsubmit (see below) are automatically assigned to the appropriate class based on the job type, number of CPUs and wall clock limit requested in the LoadLeveler command file. This also applies to serial jobs.

Any given job is allocated to the smallest class whose CPU and wall clock limits accommodate what has been requested in the LoadLeveler command file.

Allocation of processors for parallel jobs is by node (i.e. in multiples of 16) and is exclusive, even if not all the processors on a node are used by the job. The relevant budget is charged with the wall clock time multiplied by the number of nodes: regardless of how many processors in each node are actually used, each node is charged as 16 CPUs. The same conditions apply to all parallel jobs.

LoadLeveler is operated with a backfilling policy. This means that for jobs requesting large numbers of nodes, LoadLeveler will identify a time slot in the future for the job to run in, and will then prevent smaller jobs from running if they would impinge on this time slot. Small jobs may therefore not run even when there appear to be sufficient free processors available. This policy ensures that large jobs do not get ``frozen out'' because sufficient processors are never available at the same time.

Of course, this sometimes has an impact on the throughput of smaller jobs, but users should never ask for a longer wall clock time than a job needs in the hope that this will speed up its throughput. It may occasionally do so, but overall the ``optimal'' strategy is to ask for just as much wall clock time as the job is believed to need.

The available classes can be viewed using the llclass command. The ``Description'' column gives the time and CPU limits of each class.

Job submission - the llsubmit command

Jobs are submitted to the system via the llsubmit command, or via the xloadl GUI (see Section 8.9).

For example:

$ llsubmit hello.ll

where hello.ll is the LoadLeveler command file. A submission filter performs a number of checks on your command file; if it spots a problem, you will receive an error message and the job will not be submitted.

Job status information - the llq command

The llq command displays information on the current status of LoadLeveler jobs, as does the xloadl GUI.

The status column (ST) gives the current status of all the jobs on the system. The most common states are: R (Running), I (Idle), ST (Starting) and C (Completed). Queued jobs will usually be in the Idle state, signifying that they are being considered to run but a machine has not yet been selected to run them on.

You can find out further details of one of your own jobs using the -s flag.
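For example (the job-step identifier is illustrative, in the same hostname.jobid.stepid form used by llcancel below):

  $ llq -s l3f42.124.0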

Cancelling a Job - the llcancel command

To cancel a job you should use the llcancel command:

  $ llcancel l3f42.124.0
  llcancel: Cancel command has been sent to the central manager.

Machine Status information - the llstat command

Use the llstat command to display information about the overall status of the system, including which jobs are running on which frames.

The llstatus command can be used to display a much more comprehensive (and, arguably, less comprehensible) summary of the machine's state.

Notification of job completion

All the sample LoadLeveler script files in this section contain the line:

  #@ notification = never

This means that users are not informed (by email) when a batch job completes.

To receive an email message on a remote machine when a batch job completes, edit the LoadLeveler script and replace:

  #@ notification = never
with:
  #@ notify_user = email-address@another.machine.somewhere.else
  #@ notification = complete
where email-address@another.machine.somewhere.else is the address to which notification is to be sent. One can also use always instead of complete.


Using the xloadl GUI

The main xloadl window has three panels. The top panel shows the current queue status (equivalent to the output from llq), the centre panel shows the machine status (equivalent to the output from llstatus), and the bottom panel shows messages from LoadLeveler, including the output from LoadLeveler commands executed via xloadl.

To submit a job, click on File -> Submit a job... in any of the panels. This pops up a new window, which allows you to select, edit and submit a command file.

To view job details, click on the job in the top panel and then click on Actions -> Details.

To cancel a job, select Actions -> Cancel in the top panel.


Andrew Turner
2010-01-14