==== Performance of the LOTAAS v.1 pipeline on Cartesius ====

**Time taken by the individual pipeline components per beam (24-core node)**\\
fits2fil: 6 min\\
rfifind: 15 min\\
mpiprepsubband (253 trials): 3 min\\
single-pulse search: 1 min\\
realfft: 10 sec\\
rednoise: 10 sec\\
accelsearch (-zmax=0 -numharm=16): 1 min 20 sec\\
accelsearch (-zmax=50 -numharm=16): 12 min\\
accelsearch (-zmax=50 -numharm=8): 5 min\\
accelsearch (-zmax=200 -numharm=8): 26 min\\
plots: 20 sec\\
python sifting and folding: 21 min\\
pfd scrunching: 5 sec\\
data copying: a few secs\\
candidate scoring: a few secs\\

**Total time spent on the first large set of DM trials (0-4000)**\\
mpiprepsubband: 40 min\\
sp: 16 min\\
realfft: 3.5 min\\
rednoise: 3.5 min\\
accelsearch (zmax=0; numharm=16): 21 min\\
accelsearch (zmax=50; numharm=16): 192 min\\
accelsearch (zmax=50; numharm=8): 80 min\\
accelsearch (zmax=200; numharm=8): 416 min\\

**Total time spent on the second large set of DM trials (4000-10000)**\\
mpiprepsubband: 24 min\\
sp: 8 min\\
realfft: 2 min\\
rednoise: 2 min\\
accelsearch (zmax=0; numharm=16): 11 min\\
accelsearch (zmax=50; numharm=16): 96 min\\
accelsearch (zmax=50; numharm=8): 40 min\\
accelsearch (zmax=200; numharm=8): 208 min\\
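As a cross-check, the per-stage minutes listed above can be combined into per-strategy totals and percentage allocations. This is a sketch using the numbers from the lists above; the "data copying/etc" figure (plots, scrunching, copying, scoring — all "a few secs" to tens of seconds) is a rough guess of 2 minutes, not a measured value.

```python
# Per-beam minutes for stages shared by all strategies; the two DM-trial
# sets (0-4000 and 4000-10000) are summed where both contribute.
common = {
    "fil conversion": 6,
    "rfifind": 15,
    "dedispersion": 40 + 24,   # mpiprepsubband, both DM-trial sets
    "sp search": 16 + 8,
    "realfft": 3.5 + 2,
    "rednoise": 3.5 + 2,
    "folding": 21,             # python sifting and folding
    "data copying/etc": 2,     # rough guess for the "a few secs" items
}

# Combined accelsearch minutes (both DM-trial sets) per search strategy.
accelsearch = {
    "zmax=0;numharm=16": 21 + 11,
    "zmax=50;numharm=16": 192 + 96,
    "zmax=50;numharm=8": 80 + 40,
    "zmax=200;numharm=8": 416 + 208,
}

for strategy, acc in accelsearch.items():
    total = sum(common.values()) + acc
    print(f"{strategy}: total {total / 60:.1f} h, "
          f"accelsearch {100 * acc / total:.0f}% of runtime")
```

Under these assumptions the accelsearch shares come out at 18%, 67%, 46% and 81%, matching the allocation table, and the totals land close to the quoted per-beam hours.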
^ % time alloc. ^ zmax=0;numharm=16 ^ zmax=50;numharm=16 ^ zmax=50;numharm=8 ^ zmax=200;numharm=8 ^
^ fil conversion | 3 | 1 | 2 | <1 |
^ rfifind | 9 | 3 | 6 | 2 |
^ dedispersion | 37 | 16 | 25 | 8 |
^ sp search | 14 | 5 | 9 | 3 |
^ realfft | 3 | 1 | 2 | <1 |
^ rednoise | 3 | 1 | 2 | <1 |
^ accelsearch | 18 | 67 | 46 | 81 |
^ folding | 12 | 5 | 8 | 3 |
^ data copying/etc | 1 | 1 | 1 | <1 |

Total processing time per beam (zmax=0; numharm=16): ~3 hours\\
Total processing time per beam (zmax=50; numharm=16): ~7 hours\\
Total processing time per beam (zmax=50; numharm=8): ~5 hours\\
Total processing time per beam (zmax=200; numharm=8): ~13 h 40 min\\

==== Performance of the LOTAAS v.1 GPU pipeline on Cartesius ====

mpiprepsubband (253 trials): 38 sec

==== Data transferring (CEP2/LTA) ====

32-bit to 8-bit downsampling on CEP2 (per observation): 6-8 hours\\
Transferring from CEP2 to the LTA (per observation): 2-3 hours\\
Observation downloading on Cartesius (1 core): ~8 hours\\
Observation downloading on Cartesius (home area, 8 jobs in parallel): <2 hours\\

==== Benchmarks for filterbank creation with psrfits2fil ====

psrfits2fil was executed with different numbers of parallel processes. The following plot shows the time needed to create the fil files for various numbers of parallel psrfits2fil instances.\\
Using the same disk, the following cases were tried: 1, 3, 4, 5, 8, 12 and 16 instances; anything above 16 is an extrapolation.\\
For 2 disks: 1, 4, 8, 12, 16, 20, 24, 28 and 32 instances.

{{dragnet:benchmarks:psrfits2fil1a.png?400}}

Using multithreading with 2 disks gives smooth linear scaling up to 24 cores, beyond which performance degrades slightly, probably due to I/O. Using the above results, I extrapolated the time needed with each work strategy to compute 32 filterbanks.

{{dragnet:benchmarks:psrfits2fil1b.png?400}}

When using the same disk, the fastest execution time is achieved with 4 psrfits2fil instances running in parallel.
Above that, disk I/O probably levels out the results, and performance decreases gradually, likely because of the increased number of I/O calls once the throughput is saturated.\\
Using 2 disks, performance is significantly better, and the best results are achieved with 24 psrfits2fil instances in parallel, although the difference remains small.

==== rfifind benchmarks ====

I ran the same tests twice. I created RFI masks running rfifind in parallel on 4, 8, 12, 16, 20, 24, 28 and 32 cores (>16 hyperthreaded).\\
The following plots show the number of parallel rfifind instances (x-axis) against the time taken for all of them to complete (y-axis).

{{dragnet:benchmarks:rfifind1a.png?400}} {{dragnet:benchmarks:rfifind2a.png?400}}

In the following plots, I extrapolated the above results to find the optimal number of parallel jobs for computing 32 RFI masks.

{{dragnet:benchmarks:rfifind1b.png?400}} {{dragnet:benchmarks:rfifind2b.png?400}}

From the above we can conclude that using 1 or 2 disks makes little difference. Hyperthreading also works smoothly, and the best strategy is indeed to run the maximum possible number of rfifind instances in parallel.

==== Cartesius Benchmarks ====

Processing 1 full pointing on Cartesius using either /dev/shm or HDDs:

{{dragnet:benchmarks:cartesius_bm1.png?400}}
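The extrapolations behind the "32 filterbanks" and "32 RFI masks" plots above follow a simple batching argument: if one batch of n parallel instances takes t_batch(n) minutes, then 32 beams need ceil(32/n) sequential batches. A minimal sketch of that arithmetic, using a hypothetical timing model (15 min base plus an I/O-contention penalty growing with n) rather than the measured data:

```python
import math

def total_time(n_parallel, t_batch, n_beams=32):
    """Wall time to process n_beams when n_parallel instances run per batch."""
    return math.ceil(n_beams / n_parallel) * t_batch(n_parallel)

# Hypothetical per-batch timing model, NOT measured: base runtime plus a
# contention term that grows with the number of concurrent instances.
t_batch = lambda n: 15 + 0.5 * n

for n in (4, 8, 16, 32):
    print(f"{n:2d} parallel instances -> {total_time(n, t_batch):.0f} min for 32 beams")
```

As long as the contention penalty grows slowly, the total falls monotonically with n, which is why maximum parallelism wins for rfifind; for psrfits2fil on a single disk the measured t_batch(n) rises steeply past 4 instances, shifting the optimum there.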