[Gate-users] Segmentation errors when using multiprocessing on a CPU cluster

David Boersma david.boersma at physics.uu.se
Mon Feb 27 17:33:17 CET 2017


Hi Nico,

There can be many causes for such errors. Many of these causes are 
boring. So my apologies: I am going to ask you a series of boring 
questions. :)

* bad core / CPU

Maybe one of the cores/nodes in your cluster is bad (or has bad RAM, or 
somesuch). Maybe you can add some lines to your script so that you know, 
for each Gate process, which core/node it is running on. If the crash 
always happens on the same one, that's a strong hint of a hardware problem.
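
A minimal sketch of such logging, assuming the Python worker that launches a Gate process calls it just before starting Gate. The function names and the log file name are made up for illustration. The CPU list is only available on Linux, and it shows which cores the process is allowed to run on (a child Gate process inherits this), not the exact core in use at a given moment.

```python
import os
import socket


def describe_placement():
    """Return where the current process runs: host name, PID, allowed CPUs."""
    info = {"host": socket.gethostname(), "pid": os.getpid()}
    # sched_getaffinity is Linux-only; fall back to None elsewhere.
    if hasattr(os, "sched_getaffinity"):
        info["cpus"] = sorted(os.sched_getaffinity(0))
    else:
        info["cpus"] = None
    return info


def log_placement(run_id, logfile="placement.log"):
    """Append one line per Gate job so crashes can be matched to a node."""
    info = describe_placement()
    line = "run_id={} host={} pid={} cpus={}\n".format(
        run_id, info["host"], info["pid"], info["cpus"])
    with open(logfile, "a") as f:
        f.write(line)
    return line
```

Comparing the run_ids of the crashed projections against this log should show quickly whether they cluster on one host or one set of cores.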

* enough RAM?

How much RAM do you have available per core, and how much are your 
processes using? How much swap space is available? If your cluster is 
tight on RAM then I'd expect a different error than a segfault, but it's 
not impossible.
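
One rough way to answer the "how much are your processes using" question from within a Python driver is sketched below. It assumes each worker launches its command with subprocess and waits for it; the function name is hypothetical. Note that the value is the largest peak among all children the calling process has waited for so far, so it is most meaningful when a worker runs one job at a time.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd, return (returncode, peak RSS over waited-for children)."""
    result = subprocess.run(cmd)
    # ru_maxrss is the largest peak among all children this process has
    # waited for so far; on Linux the unit is kilobytes.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return result.returncode, peak


if __name__ == "__main__":
    # Stand-in command; replace with the real Gate argument list.
    code, peak = run_and_report_peak_rss([sys.executable, "-c", "pass"])
    print("exit code", code, "peak RSS (kB on Linux)", peak)
```

Multiplying the peak by the number of parallel processes and comparing with the node's physical RAM gives a first impression of how tight memory is.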

* multiprocessing interference?

How exactly is your "python multiprocessing script" invoking Gate? Does 
Gate get called from python through e.g. os.system("Gate -a etc"), or 
does your python script e.g. generate small shell scripts that you 
submit separately (not through python)? I have had a weird case myself 
where this seemed to make a big difference, even if in principle it 
totally shouldn't.
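
For comparison, here is a sketch of one way to start Gate from Python without a shell in between, and to see whether a process was killed by a signal. This is only an illustration, not necessarily how the script in question works; run_gate is a made-up name. On POSIX systems subprocess reports death by signal N as return code -N, so a segfault shows up as -11 (SIGSEGV).

```python
import signal
import subprocess


def run_gate(args):
    """Run one command from an argument list, without a shell in between.

    Returns (returncode, signal_name); signal_name is None unless the
    process was killed by a signal (POSIX reports that as a negative code).
    """
    result = subprocess.run(args)
    signal_name = None
    if result.returncode < 0:
        signal_name = signal.Signals(-result.returncode).name
    return result.returncode, signal_name
```

A call could look like run_gate(["Gate", "-a", "[rot_angle,8.9925062448][run_id,30]", "mac/ffdCT.mac"]). Passing a list avoids any shell interpretation of the square brackets in the alias string.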

* diagnostics

If the above don't help:

- When do the crashes happen? Immediately during startup of a Gate 
process, during termination, or during regular running?

- Can you provide some version details: which OS, which version of GATE, 
which Geant4, which ROOT, which gcc?

- How often did you actually observe this? Did you try several times 
with 64 processes in parallel and observe the crashes, or did it happen 
just once?

- How many processes (out of 1201) were crashing?

- If the desperation level rises high enough, you could recompile 
GATE with CMAKE_BUILD_TYPE=Debug and then run all Gate processes under 
"gdb", using small crude shell scripts like this:

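# i is your process counter (run_id?)
# function_to_compute_angle is a function that outputs the i'th angle
a=$(function_to_compute_angle $i)
# The here-document is unquoted, so the shell fills in $a and $i before
# gdb reads the commands. The single quotes keep the alias string together
# as one argument; -a and the macro file stay separate arguments.
gdb Gate << GDBEND
   set args -a '[rot_angle,$a][run_id,$i]' mac/ffdCT.mac
   run
   bt
GDBEND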
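
If the jobs are launched from Python anyway, the same idea can be sketched with gdb's batch mode (-batch, -ex and --args are standard gdb options). The function names and the log file handling are assumptions for illustration. If Gate exits normally, the "bt" step simply reports that there is no stack.

```python
import subprocess


def gdb_backtrace_command(gate_args, gate_binary="Gate"):
    """Build a gdb command line that runs Gate and prints a backtrace."""
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt",
            "--args", gate_binary] + list(gate_args)


def run_under_gdb(gate_args, logfile):
    """Run one job under gdb, writing gdb's output to logfile."""
    with open(logfile, "w") as log:
        return subprocess.run(gdb_backtrace_command(gate_args),
                              stdout=log, stderr=subprocess.STDOUT).returncode
```

Writing one log file per run_id keeps the backtraces of parallel jobs from getting mixed up.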

The backtraces could potentially point at a culprit in the Gate code. 
But you write that the segfaults do not happen when you re-run the 
failed processes. That makes it less likely that it is due to some specific 
bug in the code. But it's not impossible, if it's a piece of code that 
is rarely reached, depending on random processes. So maybe another way 
to investigate is to always explicitly set the random number seed for 
each process; then you can try rerunning the crashed processes to see if 
they fail again.
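
A sketch of how reproducible seeds could be passed per process. It assumes that the macro reads the seed from an alias, for example through a line like "/gate/random/setEngineSeed {seed}" (please check the documentation of your GATE version for the exact command). The alias name "seed", the base seed value and the even angle spacing of 360/1201 degrees are all assumptions; the spacing is inferred from the example command in the original message.

```python
def gate_arguments(run_id, n_projections=1201, base_seed=12345):
    """Build the Gate argument list for one projection with a fixed seed."""
    angle = run_id * 360.0 / n_projections   # assumed angle spacing
    seed = base_seed + run_id                # distinct, reproducible seed
    aliases = "[rot_angle,{:.10f}][run_id,{}][seed,{}]".format(
        angle, run_id, seed)
    return ["Gate", "-a", aliases, "mac/ffdCT.mac"]
```

Since the seed depends only on run_id, rerunning a crashed projection repeats the same random sequence, which makes a rarely reached code path reproducible.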



As you can see, I have had waaaayyyy too much fun in my life with 
chasing down segfaults.... :)

Good luck,
David

On 27/02/2017 at 15:17, Triltsch, Nicolas wrote:
> Hello everyone,
>
> I use a python multiprocessing script to run 1201 CT projections on a 64
> core CPU cluster. I run 64 processes (CT projections) on the cluster in
> parallel. One process could look like this:
>
> *"Gate -a [rot_angle,8.9925062448][run_id,30] mac/ffdCT.mac"*
>
> But for some projections the simulation crashes and I obtain a
> segmentation fault error. I restarted the failed projections with only
> 32 parallel processes (still on the 64 core cluster) and they ran
> through without errors.
>
> Can anyone of you explain this strange behavior?
>
> Details of my simulation:
> - Fixed forced detection actor
> - 10**5 photons
> - histogram spectrum
> - Nested parameterization
> - cone beam setup
> - 64x64 resolution for the detector
> - 64x64x64 volume
>
> I am grateful for any help!
> Nico
>
> --
> B.Sc. Nicolas Triltsch
> Master's student
>
> Technische Universität München
> Physik-Department
> Lehrstuhl für Biomedizinische Physik E17
>
> James-Franck-Straße 1
> 85748 Garching b. München
>
> Tel: +49 89 289 12591
>
> nicolas.triltsch at tum.de
> www.e17.ph.tum.de
>

