
Where and why did my program crash?

Q: My program crashed, or is giving nonsensical numerical results. How can I find out what happened and where?
A: A number of ways of tracking down program errors are described in the text below:

Imagine that you have run a program on HPCx and it has crashed, or started to produce floating-point errors. Naturally you want to know where the program was when the problem happened. This short note covers how to recover that information from the core file (including how to force a core file to be produced), and how to get the program itself to print the location of the crash.

Examining core files

Often, a program will automatically dump core when it crashes. The Fortran program memerror.f90 is designed to perform an illegal memory access. If you compile and run as follows:
  user@l1f01$ xlf90_r -qsuffix=f=f90 -g -q64 -o memerror memerror.f90
  ** memerror   === End of Compilation 1 ===
  ** segv   === End of Compilation 2 ===
  1501-510  Compilation successful for file memerror.f90.
  user@l1f01$ ./memerror 
   Calling segv ...
  Segmentation fault (core dumped)
you see that it dumps a core. You can examine this core using the utility program coretrace located in /usr/local/packages/trace/ (here's a copy of the coretrace script if you're interested). You must specify the name of the executable, eg
  user@l1f01$ /usr/local/packages/trace/coretrace memerror
  Reading core file 'core' associated with serial executable memerror
  reading symbolic information ...
  [using memory image in core]

  Segmentation fault in segv at line 35 in file "memerror.f90"
     35       array(i) = array(i) + float(i)
  segv(0x0, 0x0), line 35 in "memerror.f90"
  memerror(), line 17 in "memerror.f90"
You can see from the loop bounds that the error results from "i" being far outside the extent of the array.

Note that coretrace is simply a wrapper around the familiar dbx debugger. You can call coretrace from within a LoadLeveler batch script, whereas dbx itself normally operates in an interactive mode. However, you should use dbx directly if you want to find out more detailed information. For convenience, you may wish to add /usr/local/packages/trace to your default PATH.
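Because coretrace runs non-interactively, a batch script can call it straight after the program finishes. The following is a minimal sketch of that idea, not a site-provided script: the function name trace_if_core and the CORETRACE override variable are hypothetical, and it assumes the core file is called "core" and sits in the working directory, as in the serial example above.

```shell
#!/bin/sh
# Location of the wrapper as given in the text. CORETRACE is a hypothetical
# override variable, useful for trying the script on another machine.
CORETRACE=${CORETRACE:-/usr/local/packages/trace/coretrace}

# trace_if_core EXE: run coretrace on EXE only if a file named "core"
# exists in the current directory; otherwise just report that none was found.
trace_if_core() {
    exe=$1
    if [ -f core ]; then
        "$CORETRACE" "$exe"
    else
        echo "no core file found for $exe"
    fi
}
```

In a batch script you would run ./memerror and then call "trace_if_core memerror", so that any traceback ends up in the job output.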

Parallel code

If an error such as a segmentation fault occurs in a parallel program then all the processes that encountered the error will dump individual cores in separate directories labelled by their rank, ie coredir.0, coredir.1, etc. The coretrace program recognises parallel executables and, by default, looks for the core file from rank 0. If you want to look at a different process then simply supply this as an extra argument, eg for rank 1:
  user@l1f01$ /usr/local/packages/trace/coretrace memerror 1
  Reading core file 'coredir.1/core' associated with parallel executable memerror
  ...
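
When many ranks have dumped core it can be convenient to trace them all in one go. The sketch below assumes only the coredir.N/core layout described above; the function names core_ranks and trace_all_ranks and the CORETRACE override variable are hypothetical.

```shell
#!/bin/sh
# core_ranks: print the rank of every coredir.N directory (in the current
# directory) that contains a core file, one per line, in numeric order.
core_ranks() {
    for d in coredir.*; do
        [ -f "$d/core" ] && echo "${d#coredir.}"
    done | sort -n
}

# trace_all_ranks EXE: call coretrace once for each rank that dumped core,
# passing the rank as the extra argument described in the text.
trace_all_ranks() {
    for rank in $(core_ranks); do
        "${CORETRACE:-/usr/local/packages/trace/coretrace}" "$1" "$rank"
    done
}
```

For example, "trace_all_ranks memerror" run in the job's working directory would produce one traceback per failed process.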

Floating-Point Errors

It is very useful to be able to identify exactly when the first floating-point error occurred. The Fortran program numerror.f90 is designed to perform a division by zero. If you compile and run as follows:
  user@l1f01$ xlf90_r -qsuffix=f=f90 -g -q64 -o numerror numerror.f90
  ** numerror   === End of Compilation 1 ===
  ** divzero   === End of Compilation 2 ===
  1501-510  Compilation successful for file numerror.f90.
  user@l1f01$ ./numerror 
   Calling divzero ...
   ... finished
   x(1) =  INF
you may be surprised to see that the program does not crash when it does a division by zero. In order to ensure that a core is dumped you must compile with extra options:
  xlf90_r -qflttrap=overflow:underflow:zerodivide:invalid:enable -qsuffix=f=f90 -g -q64 -o numerror numerror.f90 
This turns trapping on for all floating-point exceptions other than inexact (which is safe to ignore unless you really care that 1/3 isn't exactly representable in binary!). The program now runs as follows:
  user@l1f01$ ./numerror 
   Calling divzero ...
  Trace/BPT trap (core dumped)
and you can locate the error as before:
  user@l1f01$ /usr/local/packages/trace/coretrace numerror
  Reading core file 'core' associated with serial executable numerror
  reading symbolic information ...
  [using memory image in core]

  Trace/BPT trap in divzero at line 32 in file "numerror.f90"
     32       array(i) = 1.0/array(i)
  divzero(0x0, 0x0), line 32 in "numerror.f90"
  numerror(), line 14 in "numerror.f90"

Traceback from within the program

The default error handler is called xl__dump, which causes the program to dump core. You can replace this with xl__trce by compiling with the -qsigtrap option:
  xlf90_r -qsigtrap -qflttrap=overflow:underflow:zerodivide:invalid:enable -qsuffix=f=f90 -g -q64 -o numerror numerror.f90
This now produces a traceback, using the handler xl__trce, rather than dumping a core:
  user@l1f01$ ./numerror 
   Calling divzero ...

    Signal received: SIGTRAP - Trace trap
      Signal generated for floating-point exception:
        FP division by zero

    Instruction that generated the exception:
      fdiv fr01,fr01,fr02
      Source Operand values:
        fr01 =   1.00000000000000e+00
        fr02 =   0.00000000000000e+00

    Traceback:
      Offset 0x00000098 in procedure divzero, near line 32 in file numerror.f90
      Offset 0x0000012c in procedure numerror, near line 14 in file numerror.f90
      --- End of call chain ---
You can install other handlers using this same option, eg to write a traceback and dump a core as well:
  -qsigtrap=xl__trcedump
For full details see the AIX Fortran manual.
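
The three handler choices discussed so far differ only in the -qsigtrap setting, which a small helper can make explicit. This is an illustrative sketch: the function name xlf_cmd and its mode names (dump, trace, tracedump) are hypothetical, while the compiler flags are exactly those used above. The function only prints the command line, it does not run the compiler.

```shell
#!/bin/sh
# xlf_cmd MODE SRC: print (do not run) the xlf90_r command line for one of
# the handler choices described above. MODE is dump, trace or tracedump.
xlf_cmd() {
    mode=$1
    src=$2
    case $mode in
        dump)      handler="" ;;                        # default xl__dump: core file only
        trace)     handler="-qsigtrap" ;;               # xl__trce: traceback only
        tracedump) handler="-qsigtrap=xl__trcedump" ;;  # traceback and core file
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # $handler is deliberately unquoted so that an empty value disappears.
    echo xlf90_r $handler \
        -qflttrap=overflow:underflow:zerodivide:invalid:enable \
        -qsuffix=f=f90 -g -q64 -o "${src%.f90}" "$src"
}
```

Running "xlf_cmd trace numerror.f90" prints the same compile line as shown earlier in this section, which you can then paste or pass to the shell.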

Trapping other errors

Unfortunately, it is not possible to produce a traceback for other errors, such as segmentation violations, using compiler options alone. However, IBM provides a simple way in Fortran to tell the operating system which routine to call when a given signal is received, along with a set of signal-handling routines that print tracebacks. To use them, uncomment the two lines at the top of memerror.f90
  include 'fexcp.h'
  call signal(11, xl__trce)
and compile as before
  user@l1f01$ xlf90_r -qsuffix=f=f90 -g -q64 -o memerror memerror.f90
This tells the operating system to call the xl__trce function (which we have already encountered) whenever it receives signal 11 (a segmentation fault). Running the program now gives:
  user@l1f01$ ./memerror 
   Calling segv ...

    Signal received: SIGSEGV - Segmentation violation

    Traceback:
      Offset 0x00000078 in procedure segv, near line 35 in file memerror.f90
      Offset 0x00000150 in procedure memerror, near line 17 in file memerror.f90
      --- End of call chain ---
which is the same information we got previously when we examined the core file. You should note, however, that these routines are specific to IBM so the code is non-portable.

C Programs

You will find that coretrace works equally well with serial or parallel cores from C programs. You can also dump core on encountering floating-point exceptions using the same -qflttrap options as for Fortran. However, it is not so easy to change which handler is called - we are currently looking at simple ways to control this.

General Notes

Performance

There is a performance impact when you trap floating-point errors as they have to be monitored continuously. This can be reduced by only looking for the specific error you are interested in. However, if performance still isn't acceptable then it is possible only to check for errors at the end of a routine (rather than for every statement) using the 'imprecise' option. For example, to check solely for division by zero and only at the end of a routine:
  xlf90_r -qsigtrap -qflttrap=zerodivide:imprecise:enable -qsuffix=f=f90 -g -q64 -o numerror numerror.f90
which produces:
  user@l1f01$ ./numerror 
   Calling divzero ...

    Signal received: SIGTRAP - Trace trap
      Signal generated for floating-point exception:
        FP division by zero

    Traceback:
      Offset 0x000000dc in procedure divzero, near line 36 in file numerror.f90
      Offset 0x00000108 in procedure numerror, near line 14 in file numerror.f90
      --- End of call chain ---
where line 36 is the return statement of the divzero subroutine and not the actual line where the error occurred.

Accuracy of information

You will always get more information if you compile with "-g". Without this, you will get the name of the routine where the error occurred but not the line number or file name. At higher levels of optimisation where code can be significantly rearranged and routines may be inlined, locating the exact source of an error can be more problematic. For example, enabling trap handling in numerror.f90 at optimisation level -O4 we get
  user@l1f01$ ./numerror 
   Calling divzero ...

    Signal received: SIGTRAP - Trace trap
      Signal generated for floating-point exception:
        FP division by zero

    Traceback:
      Offset 0x00000128 in procedure numerror, near line 32 in file numerror.f90
      --- End of call chain ---
The procedure name is now reported as the main program (presumably due to inlining) but the line number is still correct.

Signal numbers

If you want to find out the numbers of the various signals then you can list them with the "kill" command:
  user@l1f01$ kill -l
   1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
   5) SIGTRAP      6) SIGABRT      7) SIGEMT       8) SIGFPE
   9) SIGKILL     10) SIGBUS      11) SIGSEGV     12) SIGSYS
  13) SIGPIPE     14) SIGALRM     15) SIGTERM     16) SIGURG
  17) SIGSTOP     18) SIGTSTP     19) SIGCONT     20) SIGCHLD
  21) SIGTTIN     22) SIGTTOU     23) SIGIO       24) SIGXCPU
  25) SIGXFSZ     27) SIGMSG      28) SIGWINCH    29) SIGPWR
  30) SIGUSR1     31) SIGUSR2     32) SIGPROF     33) SIGDANGER
  34) SIGVTALRM   35) SIGMIGRATE  36) SIGPRE      37) SIGVIRT
  38) SIGALRM1    39) SIGWAITING  60) SIGKAP      61) SIGRETRACT
  62) SIGSOUND    63) SIGSAK
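
To check a single number, such as the 11 passed to signal() earlier, the POSIX form "kill -l N" prints just the name of that signal. A minimal sketch follows; the wrapper name signame is hypothetical, and the number-to-name mapping is system-specific, so run it on the machine where the program will execute.

```shell
#!/bin/sh
# signame N: print the name of signal number N on this system, using the
# POSIX form "kill -l N". The name is printed without the SIG prefix.
signame() {
    kill -l "$1"
}
```

For example, "signame 11" should report SEGV on a system where signal 11 is the segmentation violation, as in the table above.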
http://www.hpcx.ac.uk/support/FAQ/crash/ contact email - www@hpcx.ac.uk © UoE HPCX Ltd