[Gate-users] Fwd: Parallel processes sleeping instead of running

David Boersma david.boersma at igp.uu.se
Fri Jun 2 16:50:06 CEST 2017


Hi Clemens,

Preamble:

For parallel processing (even on a single multicore machine) it could be 
good to install a job management system. For instance Condor:

sudo apt-get install htcondor htcondor-doc
http://research.cs.wisc.edu/htcondor/

To answer your question:

We need more information. I see that you are redirecting all output to 
/dev/null; it would be more useful to see where the jobs actually get 
stuck. Maybe a few processes are running while the others are waiting. I 
hacked your shell script a bit to collect all log files (attached). If 
you use a "verbose.mac" file (like most Gate examples) then you can crank 
up the verbosity a little here or there. If the logs are not enlightening 
to you then we could have a look: make a compressed tarball of the logs 
directory, and if it is big, share it through a Dropbox folder or a file 
transfer service like "wetransfer.com". In that case it would be good if 
you could share your "main.mac" file with us as well.
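
For illustration, collecting one log file per job could look roughly like 
the sketch below. This is not the attached script; the function name 
launch_jobs, the log file naming and the "Gate main.mac" call in the usage 
note are assumptions, and any per-job arguments your own script passes are 
left out.

```shell
#!/usr/bin/env bash
# Sketch only: start N background jobs, each writing to its own log file.
# The function name and the log layout are assumptions for illustration.

# launch_jobs N LOGDIR CMD [ARGS...]
# Starts N copies of CMD in the background. stdout and stderr of job i go
# to LOGDIR/job_i.log. Waits for all jobs; returns non-zero if any failed.
launch_jobs() {
    local n=$1 logdir=$2
    shift 2
    mkdir -p "$logdir" || return 1

    local pids=() i pid failed=0
    for ((i = 1; i <= n; i++)); do
        # Keep the output instead of sending it to /dev/null.
        "$@" > "$logdir/job_$i.log" 2>&1 &
        pids+=("$!")
    done

    # Wait for every job and count the ones that exited with an error.
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=$((failed + 1))
    done
    [ "$failed" -eq 0 ]
}
```

Something like "launch_jobs 10 logs Gate main.mac" followed by 
"tar czf logs.tar.gz logs" would then give a single tarball to share.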

Some random guesses, ideas:

* If your input files are on some network disk (as opposed to a local, 
internal disk on your big machine) then your processes may slow each 
other down because the same data is being pulled 10-100 times through 
the same network connection. If other users are using that same network 
disk (from another computer, possibly) then this would also slow you down.

* If the output of your jobs has to go to a nonlocal disk then it's more 
efficient to write the output to local disk first and copy it to its 
final destination only after the job has finished. (Copying a file as a 
whole is a much more efficient operation than having a program write the 
data piecewise.)

* You could have a look at the output of 'ulimit -a' to check if you are 
actually allowed to run so many processes. If your script opens many 
files then any limit on file handles may also slow your jobs down.

* You should maybe not rely too much on htop to count available cores. 
My own machine has 8 real cores but htop reports 16, due to 
hyperthreading. There is enough RAM to run 16 jobs in parallel. But each 
individual Gate process runs more than 2x faster if I make sure that 
only 8 processes are running simultaneously (to enforce that in condor I 
set 'COUNT_HYPERTHREAD_CPUS=False' in the condor config file), so that 
is actually more efficient. I am not a CPU guru but it seems to me that 
hyperthreading is actually counterproductive for number crunching jobs 
like Gate simulations. Maybe your machine is very heavily 
hyperthreading, like 2x, 4x, 8x?

* Is there any visualization stuff in your macro? For batch processing 
like this it should be switched off.
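
A few of the checks above can be sketched in bash as follows. The function 
names and the scratch directory prefix are made up for illustration, and 
the command passed to run_staged is whatever you use to start one job. 
Counting cores this way assumes lscpu from util-linux is installed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the scratch prefix are assumptions.

# Count physical cores (unique core/socket pairs), ignoring hyperthreads.
physical_cores() {
    lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
}

# Print the per-user limits that matter when starting many jobs.
show_limits() {
    echo "max user processes: $(ulimit -u)"
    echo "max open files:     $(ulimit -n)"
}

# run_staged DEST CMD [ARGS...]
# Run CMD inside a fresh scratch directory on local disk, then copy
# everything it wrote there to DEST in one go after it has finished.
# CMD runs with the scratch directory as its working directory, so any
# input files must be given with absolute paths.
run_staged() {
    local dest=$1
    shift
    local scratch status=0
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/gatejob.XXXXXX") || return 1

    (cd "$scratch" && "$@") || status=$?

    # Copy the results as a whole; keep the scratch copy if that fails.
    if mkdir -p "$dest" && cp -r "$scratch"/. "$dest"/; then
        rm -rf "$scratch"
    else
        echo "copy failed, output left in $scratch" >&2
        status=1
    fi
    return "$status"
}
```

If physical_cores prints fewer cores than htop shows, it may be worth 
limiting the number of simultaneous jobs to that smaller number and 
comparing the run time per job.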

Good luck,
David B.

On 02/06/2017 at 14:48, Clemens S. wrote:
> Dear fellow Gate users,
> 
> I am running multiple (10-100) simulations in parallel on a 64-thread 
> server (Ubuntu 16.04). They operate from the same input data, but write 
> to independent output files. I am using Gate version 7.2.
> The bash script with which I start the simulations is attached to this 
> email.
> 
> My problem is that the processes take a very long time to switch from 
> "sleeping" to "running", and that there are at most ~10 processes 
> running (as opposed to sleeping) at any given time (see attached 
> screenshot of htop). It seems to me that the server is not working to 
> full capacity because of this.
> 
> Is this normal behavior, or am I doing something wrong?
> 
> Thank you very much for your help.
> 
> Kind regards
> Clemens Schmid
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run_with_logs.sh
Type: application/x-sh
Size: 225 bytes
Desc: not available
URL: <http://lists.opengatecollaboration.org/mailman/private/gate-users/attachments/20170602/c9c340bb/attachment.sh>

