[Gate-users] checkpointing and restarting a job in SGE cluster

Ashok Tiwari tiwarias at yahoo.com
Fri Oct 23 23:33:08 CEST 2020


Dear users, 
I'm running a large simulation on our university HPC cluster (which uses the SGE scheduler), where we do not have a dedicated queue. Even without a dedicated queue, I can split my simulation and submit it to the freely available queue. The problem is that jobs in this queue are likely to be killed if they run for a long time, and sometimes even smaller jobs get killed when the other queues have consumed all the resources. So, I want to use the checkpointing and restart functionality of SGE to resume a killed job based on its job ID (or some other parameter), so that I can complete a simulation without investing in a node purchase. Does anyone have experience with this and would be willing to share it?
Thank you for your help in advance.
Regards, Ashok


---------------------
Ashok Tiwari
Ph.D. Candidate
Department of Radiology & Physics
University of Iowa Hospitals and Clinics
Office: P158, MRF
200 Hawkins Drive
Iowa City, IA 52242
ashok-tiwari at uiowa.edu



