[Gate-users] Fwd: Parallel processes sleeping instead of running
David Boersma
david.boersma at igp.uu.se
Fri Jun 2 16:50:06 CEST 2017
Hi Clemens,
Preamble:
For parallel processing (even on a single multicore machine) it could be
good to install a job management system. For instance Condor:
sudo apt-get install htcondor htcondor-doc
http://research.cs.wisc.edu/htcondor/
To answer your question:
We need more information. I see that you are redirecting all output to
/dev/null; it would be more helpful to see where the jobs actually get
stuck. Maybe a few processes are running while the others are waiting.
I hacked your shell script a bit to collect all log files (see the
attached run_with_logs.sh). If you use a "verbose.mac" file (like most
Gate examples) then you can crank up the verbosity a little here and
there. If the logs are not enlightening to you then we could have a
look: make a compressed tarball of the logs directory, and if it is big
you can share it through a Dropbox folder or a service like
"wetransfer.com". In that case please share your "main.mac" file as
well.
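
A minimal sketch of the log-collecting idea (this is not the attached
script itself): every job writes stdout and stderr to its own log file
instead of /dev/null. The function name run_jobs, the "logs" directory
and the "Gate main.mac" invocation are assumptions for illustration and
should be adapted to the real start script.

```shell
#!/usr/bin/env bash
# Sketch: start N background jobs, one log file per job.
# Usage: run_jobs N LOGDIR CMD [ARGS...]
run_jobs() {
    local njobs=$1 logdir=$2
    shift 2
    mkdir -p "$logdir"
    local i
    for i in $(seq 1 "$njobs"); do
        # stdout and stderr both go to a per-job log instead of /dev/null
        "$@" > "$logdir/job_$i.log" 2>&1 &
    done
    wait    # block until all background jobs have finished
}

# Assumed invocation, adapt to your own setup:
#   run_jobs 10 logs Gate main.mac
#   tar czf logs.tar.gz logs
```

Afterwards "tail logs/job_*.log" shows how far each job got, and the
tarball is what you would share.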
Some random guesses, ideas:
* If your input files are on some network disk (as opposed to a local,
internal disk on your big machine) then your processes may slow each
other down because the same data is being pulled 10-100 times through
the same network connection. If other users are using that same network
disk (from another computer, possibly) then this would also slow you down.
* If the output of your jobs has to go to a nonlocal disk then it is
more efficient to write the output to a local disk first and copy it to
its final destination only after the job has finished. (Copying a file
as a whole is a much more efficient operation than having a program
write the data piecewise.)
* You could have a look at the output of 'ulimit -a' to check if you are
actually allowed to run so many processes. If your script opens many
files then any limit on file handles may also slow your jobs down.
* You should maybe not rely too much on htop to count available cores.
My own machine has 8 real cores but htop reports 16, due to
hyperthreading. There is enough RAM to run 16 jobs in parallel. But each
individual Gate process runs more than 2x faster if I make sure that
only 8 processes are running simultaneously (to enforce that in condor I
set 'COUNT_HYPERTHREAD_CPUS=False' in the condor config file), so that
is actually more efficient. I am not a CPU guru but it seems to me that
hyperthreading is actually counterproductive for number crunching jobs
like Gate simulations. Maybe your machine is hyperthreading very
heavily, like 2x, 4x, 8x?
* Is there any visualization stuff in your macro? For batch processing
like this it should be switched off.
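
Some of the checks above can be sketched in bash as follows. This is a
rough sketch under assumptions, not a tested recipe for your server:
the helper names (count_physical_cores, show_limits, run_throttled) are
made up for illustration, lscpu and xargs are assumed to be installed
(they are on a standard Ubuntu), and "Gate main.mac" is an assumed
invocation.

```shell
#!/usr/bin/env bash
# Sketch only; helper names are hypothetical.

# Count physical cores from `lscpu -p=Core,Socket` output on stdin.
# Comment lines start with '#'; hyperthread siblings share the same
# (core, socket) pair, so unique pairs = physical cores.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l
}

# Print the per-user limits that matter when starting many processes.
show_limits() {
    echo "max user processes: $(ulimit -u)"
    echo "max open files:     $(ulimit -n)"
}

# Usage: run_throttled MAXPAR NJOBS LOGDIR CMD [ARGS...]
# Run NJOBS copies of CMD, at most MAXPAR at a time, one log per job.
run_throttled() {
    local maxpar=$1 njobs=$2 logdir=$3
    shift 3
    mkdir -p "$logdir"
    # xargs -P caps the number of simultaneous processes; the job
    # number is passed to the inner shell as $1, the command follows.
    seq 1 "$njobs" | LOGDIR=$logdir xargs -P "$maxpar" -I{} \
        sh -c 'job=$1; shift; "$@" > "$LOGDIR/job_$job.log" 2>&1' sh {} "$@"
}

# Assumed usage, adapt to your own setup:
#   show_limits
#   ncores=$(lscpu -p=Core,Socket | count_physical_cores)
#   run_throttled "$ncores" 100 logs Gate main.mac
```

Comparing the result of count_physical_cores with what htop shows tells
you how much of the thread count is hyperthreading.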
Good luck,
David B.
Den 02/06/2017 kl. 14:48, skrev Clemens S.:
> Dear fellow Gate users,
>
> I am running multiple (10-100) simulations in parallel on a 64-thread
> server (Ubuntu 16.04). They operate from the same input data, but write
> to independent output files. I am using Gate version 7.2.
> The bash script with which I start the simulations is attached to this
> email.
>
> My problem is that the processes take a very long time to switch from
> "sleeping" to "running", and that there are at most ~10 processes
> running (as opposed to sleeping) at any given time (see attached
> screenshot of htop). It seems to me that the server is not working to
> full capacity because of this.
>
> Is this normal behavior, or am I doing something wrong?
>
> Thank you very much for your help.
>
> Kind regards
> Clemens Schmid
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run_with_logs.sh
Type: application/x-sh
Size: 225 bytes
Desc: not available
URL: <http://lists.opengatecollaboration.org/mailman/private/gate-users/attachments/20170602/c9c340bb/attachment.sh>