Q: My program crashed, or is giving nonsensical numerical
results. How can I find out what happened and where?
A: A number of ways of tracking down program errors are described
in the text below:
Imagine that you have run a program on HPCx and it has crashed, or
started to produce floating-point errors. Naturally you want to know
where the program was when the problem happened. In this short note we
cover how to find out this information from the core file (including how
to force a core to be produced), and how to get the program itself to
print out the location of the crash.
Examining core files
Often, a program will automatically dump core when it crashes. The
Fortran program memerror.f90 is designed to
perform an illegal memory access. If you compile and run as follows:
user@l1f01$ xlf90_r -qsuffix=f=f90 -g -q64 -o memerror memerror.f90
** memerror === End of Compilation 1 ===
** segv === End of Compilation 2 ===
1501-510 Compilation successful for file memerror.f90.
user@l1f01$ ./memerror
Calling segv ...
Segmentation fault (core dumped)
you see that it dumps a core. You can examine this core using the
utility program coretrace, located in /usr/local/packages/trace/. You
must specify the name of the executable, eg
user@l1f01$ /usr/local/packages/trace/coretrace memerror
Reading core file 'core' associated with serial executable memerror
reading symbolic information ...
[using memory image in core]
Segmentation fault in segv at line 35 in file "memerror.f90"
35 array(i) = array(i) + float(i)
segv(0x0, 0x0), line 35 in "memerror.f90"
memerror(), line 17 in "memerror.f90"
If you look at the loop bounds in memerror.f90, you can see that the
error results from "i" being far outside the extent of the array.
Note that coretrace is simply a wrapper around the familiar dbx
debugger. You can call coretrace from within a LoadLeveler batch script,
whereas dbx itself normally operates in an interactive mode. However,
you should use dbx directly if you want to find out more detailed
information. For convenience, you may wish to add
/usr/local/packages/trace to your default PATH.
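As an illustration of calling coretrace from a script, here is a minimal sketch of a helper that runs an executable and then traces any core file left in the current directory. The function name run_and_trace is hypothetical and is not part of the HPCx installation. The sketch assumes the core is written to the working directory as "core", as in the serial example above. It does not remove stale core files from earlier runs, so you may want to delete those first.

```shell
# Make coretrace available without typing the full path
PATH="$PATH:/usr/local/packages/trace"
export PATH

# Hypothetical helper (not part of HPCx): run ./<exe>, then call
# coretrace on it if a file named "core" exists afterwards.
run_and_trace() {
    exe=$1
    "./$exe"
    status=$?                  # remember how the program exited
    if [ -f core ]; then
        coretrace "$exe"       # print the traceback from the core file
    fi
    return "$status"           # pass the program's exit status back
}
```

Inside a batch script you would then call "run_and_trace memerror" in place of "./memerror".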
Parallel code
If an error such as a segmentation fault occurs in a parallel program
then all the processes that encountered the error will dump individual
cores in separate directories labelled by their rank, ie coredir.0,
coredir.1, etc. The coretrace program recognises parallel executables
and, by default, looks for the core file from rank 0. If you want to
look at a different process then simply supply this as an extra
argument, eg for rank 1:
user@l1f01$ ./coretrace memerror 1
Reading core file 'coredir.1/core' associated with parallel executable memerror
...
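If several ranks have dumped core, you may want a traceback from each of them. The following is a rough sketch only: the function name trace_all_ranks is hypothetical, and it assumes the coredir.N/core layout described above and that coretrace is on your PATH.

```shell
# Hypothetical helper: call coretrace once for every rank that has
# left a core file in a coredir.<rank> directory.
trace_all_ranks() {
    exe=$1
    for dir in coredir.*; do
        # Skip directories without a core file (this also covers the
        # case where no coredir.* directory exists at all)
        [ -f "$dir/core" ] || continue
        rank=${dir#coredir.}   # strip the "coredir." prefix
        echo "=== rank $rank ==="
        coretrace "$exe" "$rank"
    done
}
```

For example, "trace_all_ranks memerror" would run coretrace for each rank in turn. Note that the directories are visited in the shell's glob order, so coredir.10 comes before coredir.2.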
Floating-Point Errors
It is very useful to be able to identify exactly when the first
floating-point error occurred. The Fortran program numerror.f90 is designed to perform a division
by zero. If you compile and run as follows:
user@l1f01$ xlf90_r -qsuffix=f=f90 -g -q64 -o numerror numerror.f90
** numerror === End of Compilation 1 ===
** divzero === End of Compilation 2 ===
1501-510 Compilation successful for file numerror.f90.
user@l1f01$ ./numerror
Calling divzero ...
... finished
x(1) = INF
you may be surprised to see that the program does not crash when it does
a division by zero. In order to ensure that a core is dumped you must
compile with extra options:
xlf90_r -qflttrap=overflow:underflow:zerodivide:invalid:enable -qsuffix=f=f90 -g -q64 -o numerror numerror.f90
This turns trapping on for all floating-point exceptions other than
inexact (which is safe to ignore unless you really care that 1/3 isn't
exactly representable in binary!). The program now runs as follows:
user@l1f01$ ./numerror
Calling divzero ...
Trace/BPT trap (core dumped)
and you can locate the error as before:
user@l1f01$ ./coretrace numerror
Reading core file 'core' associated with serial executable numerror
reading symbolic information ...
[using memory image in core]
Trace/BPT trap in divzero at line 32 in file "numerror.f90"
32 array(i) = 1.0/array(i)
divzero(0x0, 0x0), line 32 in "numerror.f90"
numerror(), line 14 in "numerror.f90"
Traceback from within the program
The default error handler is called xl__dump, which causes the program
to dump core. You can replace this with xl__trce by compiling with the
-qsigtrap option:
xlf90_r -qsigtrap -qflttrap=overflow:underflow:zerodivide:invalid:enable -qsuffix=f=f90 -g -q64 -o numerror numerror.f90
This now produces a traceback, using the handler xl__trce, rather than
dumping a core:
user@l1f01$ ./numerror
Calling divzero ...
Signal received: SIGTRAP - Trace trap
Signal generated for floating-point exception:
FP division by zero
Instruction that generated the exception:
fdiv fr01,fr01,fr02
Source Operand values:
fr01 = 1.00000000000000e+00
fr02 = 0.00000000000000e+00
Traceback:
Offset 0x00000098 in procedure divzero, near line 32 in file numerror.f90
Offset 0x0000012c in procedure numerror, near line 14 in file numerror.f90
--- End of call chain ---
You can install other handlers using this same option, eg to write a
traceback and dump a core as well:
-qsigtrap=xl__trcedump
For full details see the AIX Fortran manual.
Trapping other errors
Unfortunately, it is not possible to produce a traceback for other
errors such as segmentation violations using only compiler options.
However, IBM provides a simple way in Fortran to tell the operating
system which routine to call when a signal arrives, along with a set of
signal-handling routines that print tracebacks.
You should uncomment the two lines at the top of memerror.f90
include 'fexcp.h'
call signal(11, xl__trce)
and compile as before
user@l1f01$ xlf90_r -qsuffix=f=f90 -g -q64 -o memerror memerror.f90
This tells the operating system to call the xl__trce function (which we
have already encountered) whenever it receives signal 11 (a segmentation
fault). Running the program now gives:
user@l1f01$ ./memerror
Calling segv ...
Signal received: SIGSEGV - Segmentation violation
Traceback:
Offset 0x00000078 in procedure segv, near line 35 in file memerror.f90
Offset 0x00000150 in procedure memerror, near line 17 in file memerror.f90
--- End of call chain ---
which is the same information we got previously when we examined the
core file. You should note, however, that these routines are specific to
IBM so the code is non-portable.
C Programs
You will find that coretrace works equally well with serial or parallel
cores from C programs. You can also dump core on encountering
floating-point exceptions using the same -qflttrap options as for
Fortran. However, it is not so easy to change which handler is called -
we are currently looking at simple ways to control this.
General Notes
Performance
There is a performance impact when you trap floating-point errors as
they have to be monitored continuously. This can be reduced by trapping
only the specific error you are interested in. If performance still
isn't acceptable, you can check for errors only at the end of a routine
(rather than after every statement) using the 'imprecise' option. For
example, to check solely for division by zero and only at the end of a
routine:
xlf90_r -qsigtrap -qflttrap=zerodivide:imprecise:enable -qsuffix=f=f90 -g -q64 -o numerror numerror.f90
which produces:
user@l1f01$ ./numerror
Calling divzero ...
Signal received: SIGTRAP - Trace trap
Signal generated for floating-point exception:
FP division by zero
Traceback:
Offset 0x000000dc in procedure divzero, near line 36 in file numerror.f90
Offset 0x00000108 in procedure numerror, near line 14 in file numerror.f90
--- End of call chain ---
where line 36 is the return statement of the divzero subroutine and not
the actual line where the error occurred.
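Since the set of trapped exceptions is simply a colon-separated list ending in "enable", it can be convenient to build the option in a script when experimenting with different combinations. The helper below is a sketch only: the name flttrap_flags is hypothetical, and it does not check that the suboption names you pass are valid.

```shell
# Hypothetical helper: join the given -qflttrap suboptions with ':'
# and append ':enable'. The body runs in a subshell so the change
# to IFS does not leak into the caller.
flttrap_flags() (
    IFS=:
    echo "-qflttrap=$*:enable"
)

# Possible use (not executed here):
#   xlf90_r -qsigtrap $(flttrap_flags zerodivide imprecise) \
#       -qsuffix=f=f90 -g -q64 -o numerror numerror.f90
```

Called as "flttrap_flags zerodivide imprecise", this prints the same -qflttrap option used in the example above.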
Accuracy of information
You will always get more information if you compile with "-g". Without
this, you will get the name of the routine where the error occurred but
not the line number or file name.
At higher levels of optimisation where code can be significantly
rearranged and routines may be inlined, locating the exact source of an
error can be more problematic. For example, enabling trap handling in
numerror.f90 at optimisation level -O4 we get
user@l1f01$ ./numerror
Calling divzero ...
Signal received: SIGTRAP - Trace trap
Signal generated for floating-point exception:
FP division by zero
Traceback:
Offset 0x00000128 in procedure numerror, near line 32 in file numerror.f90
--- End of call chain ---
The procedure name is now reported as the main program (presumably due
to inlining) but the line number is still correct.
Signal numbers
If you want to find out the numbers of the various signals then you can
list them from the "kill" command:
user@l1f01$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGEMT 8) SIGFPE
9) SIGKILL 10) SIGBUS 11) SIGSEGV 12) SIGSYS
13) SIGPIPE 14) SIGALRM 15) SIGTERM 16) SIGURG
17) SIGSTOP 18) SIGTSTP 19) SIGCONT 20) SIGCHLD
21) SIGTTIN 22) SIGTTOU 23) SIGIO 24) SIGXCPU
25) SIGXFSZ 27) SIGMSG 28) SIGWINCH 29) SIGPWR
30) SIGUSR1 31) SIGUSR2 32) SIGPROF 33) SIGDANGER
34) SIGVTALRM 35) SIGMIGRATE 36) SIGPRE 37) SIGVIRT
38) SIGALRM1 39) SIGWAITING 60) SIGKAP 61) SIGRETRACT
62) SIGSOUND 63) SIGSAK
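The "kill -l" command can also translate a single number back into a signal name, which is handy when all you have is the exit status of a failed run. The sketch below uses a hypothetical function describe_exit and assumes the common shell convention that a process killed by signal N is reported with exit status 128+N. Some shells (for example ksh93) use a different offset, so treat this as illustrative.

```shell
# Hypothetical helper: turn a shell exit status into a short message.
# Assumes the 128+N convention for processes killed by signal N.
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # kill -l accepts an exit status and prints the signal name
        echo "killed by SIG$(kill -l "$status")"
    else
        echo "exited with status $status"
    fi
}
```

You might call it straight after a run, eg "./memerror; describe_exit $?".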



http://www.hpcx.ac.uk/support/FAQ/crash/ | contact email - www@hpcx.ac.uk | © UoE HPCX Ltd