[Gate-users] Segmentation errors when using multiprocessing on a CPU cluster
David Boersma
david.boersma at physics.uu.se
Mon Feb 27 17:33:17 CET 2017
Hi Nico,
There can be many causes for such errors. Many of these causes are
boring. So my apologies: I am going to ask you a series of boring
questions. :)
* bad core / CPU
Maybe one of the cores/nodes in your cluster is bad (or has bad RAM, or
somesuch). Maybe you can add a few lines to your script so that you know,
for each Gate process, on which core/node it is running. If the crash
always happens on the same one, that's a strong hint of a hardware problem.
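For example, the worker function in your multiprocessing script could
print the hostname and PID before launching each Gate run. A minimal
sketch (the function and variable names are mine, not from your script):

import os
import socket
import subprocess

def run_projection(i, angle):
    # log which node/PID runs this projection, so a crash can be traced
    # back to a specific machine
    print("run_id=%d angle=%s node=%s pid=%d"
          % (i, angle, socket.gethostname(), os.getpid()), flush=True)
    cmd = "Gate -a [rot_angle,%s][run_id,%d] mac/ffdCT.mac" % (angle, i)
    return subprocess.call(cmd, shell=True)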
* enough RAM?
How much RAM do you have available per core, and how much are your
processes using? How much swap space is available? If your cluster is
tight on RAM then I'd expect a different error than a segfault, but it's
not impossible.
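One way to get actual numbers is to wrap each Gate call with GNU time; a
sketch, assuming /usr/bin/time is the GNU version (as it usually is on
Linux clusters) and that the log file naming is up to you:

import subprocess

def run_projection_with_memlog(i, angle):
    # "-v" makes GNU time report "Maximum resident set size", i.e. the
    # peak RAM of that Gate run; compare this against your RAM per core
    logfile = "time_run_%04d.log" % i
    cmd = ["/usr/bin/time", "-v", "-o", logfile,
           "Gate", "-a", "[rot_angle,%s][run_id,%d]" % (angle, i),
           "mac/ffdCT.mac"]
    return subprocess.call(cmd)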
* multiprocessing interference?
How exactly is your "python multiprocessing script" invoking Gate? Does
Gate get called from python through e.g. os.system("Gate -a etc"), or
does your python script e.g. generate small shell scripts that you then
submit separately (not through python)? I have had a weird case myself
where this seemed to make a big difference, even though in principle it
totally shouldn't.
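To make the distinction concrete, here is roughly what I mean by the two
styles (just a sketch, the names are made up):

import os

def run_via_os_system(i, angle):
    # style 1: Gate runs as a direct child of the python worker process
    return os.system("Gate -a [rot_angle,%s][run_id,%d] mac/ffdCT.mac"
                     % (angle, i))

def write_job_script(i, angle):
    # style 2: python only writes a small shell script, which you then
    # submit to the cluster outside of python
    script = "job_%04d.sh" % i
    with open(script, "w") as f:
        f.write("#!/bin/sh\n")
        f.write("Gate -a '[rot_angle,%s][run_id,%d]' mac/ffdCT.mac\n"
                % (angle, i))
    os.chmod(script, 0o755)
    return script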
* diagnostics
If the above don't help:
- When do the crashes happen? Immediately during startup of a Gate
process, during termination, or during regular running?
- Can you provide some version specs: which OS, which version of GATE,
which Geant4, which ROOT, which gcc?
- How often did you actually observe this? Did you try several times
with 64 processes in parallel and observe the crashes, or did it happen
just once?
- How many processes (out of 1201) were crashing?
- If the desperation level rises high enough, you could recompile
GATE with CMAKE_BUILD_TYPE=Debug and then run all Gate processes under
"gdb", using small, crude shell scripts like this:
# i is your process counter (run_id?)
# function_to_compute_angle stands for whatever computes the i'th angle
a=$(function_to_compute_angle $i)
gdb Gate << GDBEND
set args -a "[rot_angle,$a][run_id,$i]" mac/ffdCT.mac
run
bt
quit
GDBEND
The backtraces could potentially point at a culprit in the Gate code.
But you write that the segfaults do not happen when you re-run the
failed processes. That makes it less likely that this is due to some
specific bug in the code. But it's not impossible, if it's a piece of
code that is rarely reached, depending on the random number sequence. So
maybe another way to investigate is to always explicitly set the random
number seed for each process; then you can try rerunning the crashed
processes to see if they fail again.
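A sketch of what I mean: pass a seed alias on the command line (the
alias name "seed" is my invention) and use it in mac/ffdCT.mac to set
the random engine seed (/gate/random/setEngineSeed {seed}, if I remember
the macro command correctly):

BASE_SEED = 123456789  # any fixed number, so the seeds are reproducible

def gate_command(i, angle):
    # every projection gets its own, known seed; if run_id 30 crashes you
    # can later rerun it with exactly the same random sequence
    seed = BASE_SEED + i
    return ("Gate -a [rot_angle,%s][run_id,%d][seed,%d] mac/ffdCT.mac"
            % (angle, i, seed))

# e.g. os.system(gate_command(30, 8.9925062448))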
As you can see, I have had waaaayyyy too much fun in my life with
chasing down segfaults.... :)
Good luck,
David
On 27/02/2017 at 15:17, Triltsch, Nicolas wrote:
> Hello everyone,
>
> I use a python multiprocessing script to run 1201 CT projections on a 64
> core CPU cluster. I run 64 processes (CT projections) on the cluster in
> parallel. One process could look like this:
>
> *"Gate -a [rot_angle,8.9925062448][run_id,30] mac/ffdCT.mac"*
>
> But for some projections the simulation crashes and I obtain a
> segmentation fault error. I restarted the failed projections with only
> 32 parallel processes (still on the 64 core cluster) and they ran
> through without errors.
>
> Can any of you explain this strange behavior?
>
> Details of my simulation:
> - Fixed forced detection actor
> - 10**5 photons
> - histogram spectrum
> - Nested parameterization
> - cone beam setup
> - 64x64 resolution for the detector
> - 64x64x64 volume
>
> I am grateful for any help!
> Nico
>
> --
> B.Sc. Nicolas Triltsch
> Master's student
>
> Technische Universität München
> Physik-Department
> Lehrstuhl für Biomedizinische Physik E17
>
> James-Franck-Straße 1
> 85748 Garching b. München
>
> Tel: +49 89 289 12591
>
> nicolas.triltsch at tum.de
> www.e17.ph.tum.de