[Gate-users] GATE performances and ROOT output saving optimisation

Tue Jan 21 17:33:29 CET 2020

Dear GATE users,

Once again, I am looking for your experience regarding the
optimization of simulations in GATE. I am also sharing my results in the
hope that my findings will be useful to current and future users.

I recently installed GATE on a powerful workstation. Despite going from
4 physical cores (MacBook Pro) to 36 (workstation), I could only get worse
overall simulation times. I therefore performed several tests, and while
they allowed me to find the reason behind the slow simulations (spoiler:
it is writing the ROOT output to the HDD), they also showed other
"features" of GATE.

To keep my thought process clear, I will first present the specifications
of both machines, then the simulation method, followed by the results, and
end with my remarks and questions.

Machines Specifications
*MacBook Pro*: Intel i7 2,8GHz, *4 physical cores*, 8 logical cores, *SSD*
*Workstation*: Intel Xeon Gold 3,00GHz, *36 physical cores* (2 sockets, 18
physical cores each), 72 logical cores, *HDD*

Simulation method
*The same 4 tests were performed on both machines.* Each consists of a PET
camera, a digitizer and a source. In order to *split the simulation across
several cores and run the parts simultaneously*, I simply generate the
corresponding number of main files by splitting the simulation time, and run
one instance of GATE for each main file. *With ROOT output, Hits, Singles
and Coincidences are saved.* All reported data sizes are those of the
.root files generated by the simulation.

*The tests used 4 and 8 cores, with and without ROOT output.*
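
The splitting step can be sketched roughly as follows. This is only a minimal illustration, not my exact scripts: the template content, the file names (common_setup.mac, main_N.mac, output_N) and the helper functions are hypothetical, and the time slice setting is left to the shared macro. It assumes the Gate executable is on the PATH and takes the main macro as its argument.

```python
import subprocess
from pathlib import Path

# Hypothetical main-macro template. common_setup.mac stands for whatever macro
# holds the shared geometry, physics, digitizer and source definitions.
TEMPLATE = """\
/control/execute common_setup.mac
/gate/output/root/setFileName output_{job}
/gate/application/setTimeStart {start} s
/gate/application/setTimeStop {stop} s
/gate/application/startDAQ
"""


def split_time_windows(total_seconds, n_jobs):
    """Return n_jobs contiguous (start, stop) windows covering the total time."""
    return [
        (total_seconds * i / n_jobs, total_seconds * (i + 1) / n_jobs)
        for i in range(n_jobs)
    ]


def render_main_macro(start, stop, job):
    """Fill the template for one time window."""
    return TEMPLATE.format(start=start, stop=stop, job=job)


def write_main_macros(total_seconds, n_jobs, out_dir):
    """Write one main file per time window and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for job, (start, stop) in enumerate(split_time_windows(total_seconds, n_jobs)):
        path = out_dir / f"main_{job}.mac"
        path.write_text(render_main_macro(start, stop, job))
        paths.append(path)
    return paths


def run_all(paths):
    """Start one Gate process per main file and wait for all of them."""
    procs = [subprocess.Popen(["Gate", str(p)]) for p in paths]
    return [p.wait() for p in procs]
```

For example, run_all(write_main_macros(16, 4, "jobs")) would start 4 instances covering a hypothetical total of 16 s of simulated time.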

I) MacBook Pro results

   1. 4 cores, no output: 8m38s
   2. 8 cores, no output: 8m30s
   3. 4 cores, ROOT output: 21m37s, 3690 MB --> 2,8MB/s
   4. 8 cores, ROOT output: 14m10s, 3690 MB --> 4,3MB/s
   5. 4 cores, writing time only (test 3. - test 1. ): 12m59s for 3690 MB
   --> 4,7MB/s
   6. 8 cores, writing time only (test 4. - test 2. ):  5m40s for 3690 MB
   --> 10,9MB/s

II) Workstation results

   1. 4 cores, no output: 6m33s
   2. 8 cores, no output: 3m20s
   3. 4 cores, ROOT output: 47m35s, 3690 MB --> 1,3MB/s
   4. 8 cores, ROOT output: 28m58s, 3690 MB --> 2,1MB/s
   5. 4 cores, writing time only (test 3. - test 1. ): 41m02s for 3690 MB
   --> 1,5MB/s
   6. 8 cores, writing time only (test 4. - test 2. ):  25m38s for 3690 MB
   --> 2,4MB/s
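
For reference, the "writing time only" figures in I) and II) come from a simple subtraction. The short sketch below redoes that arithmetic from the wall times listed above. It assumes that all the extra wall time of a run with ROOT output is spent writing, which is a simplification; the function names are only illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a wall time given as minutes and seconds to seconds."""
    return 60 * minutes + seconds


def write_rate(with_output_s, without_output_s, size_mb):
    """Return (write time in s, MB/s), attributing the extra time to writing."""
    write_time = with_output_s - without_output_s
    return write_time, size_mb / write_time


SIZE_MB = 3690  # total size of the .root files

# (time with ROOT output, time without output), taken from the lists above
runs = {
    "MacBook Pro, 4 cores": (to_seconds(21, 37), to_seconds(8, 38)),
    "MacBook Pro, 8 cores": (to_seconds(14, 10), to_seconds(8, 30)),
    "Workstation, 4 cores": (to_seconds(47, 35), to_seconds(6, 33)),
    "Workstation, 8 cores": (to_seconds(28, 58), to_seconds(3, 20)),
}

for label, (with_out, without_out) in runs.items():
    seconds, rate = write_rate(with_out, without_out, SIZE_MB)
    print(f"{label}: {seconds // 60}m{seconds % 60:02d}s, {rate:.1f} MB/s")
```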

III) Extra results
 My initial simulation does not have enough slices to be split into 72
processes, so I ran a new one without output (4,5 times longer than the
original 16-slice one) to check the effect of Hyperthreading on the
workstation:

   1. 36 cores, no output: 4m50s
   2. 72 cores, no output: 3m14s

Remarks

   - The workstation's cores are indeed better for simulation (compare
   *I/1.* & *II/1.*)
   - The bottleneck on the workstation is the HDD (compare *I/3.* & *II/3.*):
   there is no way around it I suppose, an SSD is required
   - If there are any free physical cores when a simulation starts, the GATE
   instances are assigned to them first (general note, but can be seen on
   both machines using *1.* & *2.*)
   - As stated in this section of the GATE documentation
   <https://opengate.readthedocs.io/en/latest/how_to_use_gate_on_a_cluster.html#id3>,
   only physical cores are beneficial to the simulation of particle
   interaction/transport when using a standard CPU architecture (compare
   *I/1.* & *I/2.*).
   - However, *Hyperthreaded cores can still be used to manage the I/O*
   (compare *I/3.* & *I/4.*). I did not see this noted anywhere in the
   documentation, and it might still be useful information, as it divided my
   output saving time by about 2 compared to the original time on the
   MacBook Pro (compare *I/5.* & *I/6.*)
   - *The gain in writing speed is not linear when using more physical
   cores* (compare *II/5.* & *II/6.*). Maybe this is caused by how ROOT
   manages core usage in order to compress/save data?
   - Keeping in mind all the previous remarks, we can see in *III/1.* &
   *III/2.* that *on the workstation, ALL the cores (physical and
   hyperthreaded) are beneficial to the simulation of particle
   interaction/transport* (linear speed increase all the way from 8 cores
   to 72). Is it due to the (server-like) architecture of the processor?
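
To put numbers on the comparisons above, the speed-up from doubling the number of instances can be computed directly from the "no output" times. This is only a rough sketch: the function names are illustrative, a speed-up of 2 (efficiency of 1) would correspond to ideal linear scaling for a doubling, and only runs of the same simulation are compared with each other.

```python
def speedup(t_fewer_instances_s, t_more_instances_s):
    """Ratio of wall times: how many times faster the larger run is."""
    return t_fewer_instances_s / t_more_instances_s


def efficiency(speedup_value, instance_ratio):
    """Fraction of the ideal (linear) speed-up that was obtained."""
    return speedup_value / instance_ratio


# (label, time with N instances in s, time with 2N instances in s)
doublings = [
    ("MacBook Pro, 4 -> 8 (I/1., I/2.)", 8 * 60 + 38, 8 * 60 + 30),
    ("Workstation, 4 -> 8 (II/1., II/2.)", 6 * 60 + 33, 3 * 60 + 20),
    ("Workstation, 36 -> 72 (III/1., III/2.)", 4 * 60 + 50, 3 * 60 + 14),
]

for label, t_n, t_2n in doublings:
    s = speedup(t_n, t_2n)
    print(f"{label}: speed-up {s:.2f}, efficiency {efficiency(s, 2):.2f}")
```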

In the end, my main questions are:

   - Has anyone tried to optimise the output writing speed (by any means
   other than getting a high-bandwidth motherboard/SSD)?
   - Excluding HDD bandwidth, what is the bottleneck in writing the data
   (for example the number of processors assigned, or how ROOT works)?
   - Is it possible to delay the output saving and keep the to-be-saved
   data in RAM for a while, and then write the output data in batches
   (assuming you manage your RAM so as to avoid overflow problems)?

Any suggestion that might improve the overall simulation time is welcome,
and failed attempts at optimising output saving would be equally
appreciated (as they would save me from exploring those directions).
If I made any error in my method or deductions, please let me know. I am
still quite new to GATE and could have made some obvious conceptual or
practical mistakes.

Best regards,
Antoine