<div dir="ltr">Dear GATE users,<div><br></div><div>Once again, I am looking for your experience regarding the optimization of simulations in GATE. I am also sharing my results in the hope that my findings will be useful to current and future users.</div><div><br></div><div>I recently installed GATE on a powerful workstation. Despite the speed-up I expected from going from 4 physical cores (MacBook Pro) to 36 (workstation), the overall simulation times got worse. I therefore performed several tests, and while they allowed me to find the reason behind the slow simulations (spoiler: it is writing the ROOT output to the HDD), they also showed other "features" of GATE. </div><div><br></div><div>To keep my thought process clear, I will first present the specs of both machines, then the method for the simulations, followed by the results, and end with my remarks/questions.</div><div><br></div><div><font size="4">Machine Specifications</font></div><div><b>MacBook</b> <b>Pro</b>: Intel i7 2,8 GHz, <b>4 physical cores</b>, 8 logical cores, <b>SSD</b></div><div><b>Workstation</b>: Intel Xeon Gold 3,00 GHz, <b>36 physical cores</b> (2 sockets, 18 physical each), 72 logical cores, <b>HDD</b></div><div><br></div><div><font size="4">Simulation method</font></div><div><b>The same 4 tests were performed on both machines.</b> They consist of a PET camera, a digitizer and a source. In order to <b>split the simulation over several cores and run the parts</b> <b>simultaneously</b>, I simply generate the corresponding number of main files by splitting the simulation time, and run one instance of GATE for each main file. In the case of <b>ROOT output, Hits, Singles and Coincidences are saved. </b> All reported data sizes are those of the .root files generated by the simulation. 
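<br><br>A minimal sketch of this splitting step, under stated assumptions: the shared setup lives in a hypothetical <code>common.mac</code>, the total acquisition is taken to be 16 s purely for illustration, and the file names <code>main_i.mac</code> and <code>out_i</code> are made up. The time window is set with the usual <code>/gate/application/setTimeStart</code> and <code>/gate/application/setTimeStop</code> macro commands; whether the actual main files look like this is an assumption.
<pre>
```python
def split_time(t_start, t_stop, n_jobs):
    """Return n_jobs contiguous (start, stop) windows covering [t_start, t_stop]."""
    width = (t_stop - t_start) / n_jobs
    edges = [t_start + i * width for i in range(n_jobs)] + [t_stop]
    return list(zip(edges[:-1], edges[1:]))


def render_main(job_id, start, stop, body_macro="common.mac"):
    """Build the text of one main macro (the layout is an assumption)."""
    lines = [
        f"/control/execute {body_macro}",                 # hypothetical shared setup
        f"/gate/output/root/setFileName out_{job_id}",    # one ROOT file per instance
        f"/gate/application/setTimeStart {start:g} s",
        f"/gate/application/setTimeStop {stop:g} s",
        "/gate/application/startDAQ",
    ]
    return "\n".join(lines) + "\n"


# Illustrative values only: a 16 s acquisition split over 4 GATE instances.
windows = split_time(0.0, 16.0, 4)
macros = {
    f"main_{i}.mac": render_main(i, start, stop)
    for i, (start, stop) in enumerate(windows)
}

for name, text in macros.items():
    print(name)
    print(text)
```
</pre>
Each generated text would be written to its own file and passed to a separate GATE instance; the number of instances cannot usefully exceed the number of time slices in the simulation.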
</div><div><b><br></b></div><div><b>The tests were run with 4 and 8 cores, with and without ROOT output</b></div><div><font size="4">I) MacBook Pro results</font></div><div><ol><li>4 cores, no output: 8m38s</li><li>8 cores, no output: 8m30s</li><li>4 cores, ROOT output: 21m37s, 3690 MB --> 2,8 MB/s</li><li>8 cores, ROOT output: 14m10s, 3690 MB --> 4,3 MB/s</li><li>4 cores, writing time only (test 3. - test 1.): 12m59s for 3690 MB --> 4,7 MB/s</li><li>8 cores, writing time only (test 4. - test 2.): 5m40s for 3690 MB --> 10,9 MB/s</li></ol></div><div><font size="4">II) Workstation results</font></div><div><ol><li>4 cores, no output: 6m33s</li><li>8 cores, no output: 3m20s</li><li>4 cores, ROOT output: 47m35s, 3690 MB --> 1,3 MB/s</li><li>8 cores, ROOT output: 28m58s, 3690 MB --> 2,1 MB/s</li><li>4 cores, writing time only (test 3. - test 1.): 41m02s for 3690 MB --> 1,5 MB/s</li><li>8 cores, writing time only (test 4. - test 2.): 25m38s for 3690 MB --> 2,4 MB/s</li></ol><font size="4">III) Extra results</font></div><div> My initial simulation does not have enough slices to be split into 72 processes, so I ran a new one (4,5 times longer than the original 16-slice one) without output to check the effect of Hyper-Threading on the workstation:</div><div><ol><li>36 cores, no output: 4m50s</li><li>72 cores, no output: 3m14s </li></ol><div><font size="4">Remarks</font></div></div><div><ul><li>The workstation's cores are indeed faster for simulation (compare<b> I/1. </b>& <b>II/1. </b>)</li><li>The bottleneck on the workstation is the HDD (compare<b> I/3. </b>& <b>II/3. </b>): there is no way around it I suppose, an SSD is required</li><li>If there are any free physical cores when a simulation starts, the GATE instances are assigned to them first (general note, but can be seen on both machines using <b>1. 
& 2.</b>)</li><li>As stated in <a href="https://opengate.readthedocs.io/en/latest/how_to_use_gate_on_a_cluster.html#id3" target="_blank">this section of the GATE documentation</a>, only physical cores are beneficial to the simulation of particle interaction/transport when using a standard CPU architecture (compare<b> I/1. </b>& <b>I/2. </b>).</li><li>However, <b>hyperthreaded cores can still be used to handle the I/O </b>(compare<b> I/3. </b>& <b>I/4. </b>). I did not see this noted anywhere in the documentation, and it might be useful information, as it divided my output saving time by about 2 on the MacBook Pro (compare<b> I/5. </b>& <b>I/6. </b>)</li><li><b>The gain in writing speed is not linear when using more physical cores</b> (compare<b> II/5. </b>& <b>II/6. </b>). Could this be caused by how ROOT manages core usage when compressing/saving data? </li><li>Keeping in mind all the previous remarks, we can see in <b>III/1. & III/2. </b>that <b>on the workstation, ALL the cores (physical and hyperthreaded) are beneficial to the simulation of particle interaction/transport </b>(linear speed increase all the way from 8 cores to 72). Is it due to the (server-like) architecture of the processor?</li></ul><div>In the end, my main questions are: </div><div><ul><li>Has anyone tried to optimise the output writing speed (by any means other than getting a high-bandwidth motherboard/SSD)? </li><li>Excluding HDD bandwidth, what is the bottleneck in writing the data (for example the number of processors assigned, or the way ROOT works)? 
</li><li>Is it possible to delay the output saving by keeping the to-be-saved data in RAM for a while, and then write the output data in batches (assuming the RAM is managed so that it does not overflow)?</li></ul><div><br></div>Any suggestion that might improve the overall simulation time is welcome, and failed attempts at optimising the output saving would be equally appreciated (as they would save me the time of looking in those directions). If I made any error in my method/deductions, please let me know. I am still quite new to GATE, and could have made some obvious conceptual/practical mistakes.<br></div></div><div><font size="4"><br></font></div><div>Best regards,</div><div>Antoine</div></div>
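<div><br></div><div>For reference, the derived figures in the result lists (writing time as the difference between the run with ROOT output and the run without, and throughput as file size divided by that time) can be recomputed with a short sketch. The helper names <code>to_seconds</code> and <code>fmt</code> are made up for illustration; the timings and the 3690 MB size are the ones reported in sections I and II.</div>
<pre>
```python
def to_seconds(text):
    """Convert a duration such as '21m37s' to a number of seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def fmt(seconds):
    """Format a number of seconds back into the 'XmYYs' notation."""
    return f"{seconds // 60}m{seconds % 60:02d}s"


SIZE_MB = 3690  # total size of the .root files

# (time without output, time with ROOT output), as reported in the lists
runs = {
    "MacBook Pro, 4 cores": ("8m38s", "21m37s"),
    "MacBook Pro, 8 cores": ("8m30s", "14m10s"),
    "Workstation, 4 cores": ("6m33s", "47m35s"),
    "Workstation, 8 cores": ("3m20s", "28m58s"),
}

results = {}
for label, (no_output, with_output) in runs.items():
    write_s = to_seconds(with_output) - to_seconds(no_output)
    results[label] = (write_s, SIZE_MB / write_s)
    print(f"{label}: writing time {fmt(write_s)}, {SIZE_MB / write_s:.1f} MB/s")
```
</pre>
<div>This treats the writing time as purely additive to the simulation time, which is the same simplification used in the lists above.</div>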