Performance
To compare 32-bit, 64-bit and -fPIC performance, I ran the DMI/test/ncperf -a benchmark on the birch (32-bit) and cedar (64-bit) machines. Here are the results (units are ops per CPU second). For reference, I have included the lofar10 numbers as well (Athlon X2 4400+, 32-bit mode):
| benchmark | 32 bit | 64 bit | 64 bit w/-fPIC | lofar10 |
|---|---|---|---|---|
| HIID2S | 435117 | 747000 | 573913 | 482609 |
| S2HIID | 199333 | 337000 | 304333 | 231000 |
| RSINIT | 1673333 | 2563333 | 2536667 | 2023333 |
| RSSCAN | 10106667 | 9946667 | 9666667 | 11006667 |
| RBINIT | 506536 | 653595 | 629139 | 664452 |
| RBSCAN | 4515050 | 3650000 | 3604651 | 5733333 |
| R3INIT | 70304 | 104412 | 94554 | 87880 |
| R3SCAN | 109850 | 175760 | 157136 | 127615 |
| R3AFIX | 343667 | 588963 | 490970 | 429000 |
| R3RFIX | 356667 | 600000 | 519000 | 442809 |
| R1SCAN | 302326 | 467320 | 404636 | 359211 |
| R1AFIX | 749833 | 1288294 | 988333 | 961204 |
| R1RFIX | 830333 | 1356333 | 1148667 | 1022667 |
| ADSM1M | 983 | 663 | 987 | 1090 |
| ADSA1M | 81 | 120 | 119 | 91 |
| ADSI1M | 997 | 1003 | 997 | 1096 |
| ADAM1M | 47 | 73 | 74 | 81 |
| ADAA1M | 29.5082 | 50 | 50 | 54 |
| ADAI1M | 47 | 75 | 75 | 81 |
| ADAA25M | 1.2012 | 2.14067 | 2.14724 | 2.18750 |
| ADAI25M | 1.88679 | 3.03951 | 2.99003 | 3.19489 |
The benchmark descriptions are as follows:
HIID2S: convert HIIDs to strings
S2HIID: convert strings to HIIDs
RSINIT: init 3-element record
RSSCAN: access 3-element record
RBINIT: init 10000-element record
RBSCAN: access 10000-element record
R3INIT: init nested record of 26x26x26 scalar integer fields (fps)
R3SCAN: sequential scan of 26x26x26 record (fps)
R3AFIX: assigning to fixed field of 3-nested record (fps)
R3RFIX: reading fixed field of 3-nested record (fps)
R1SCAN: sequential scan of 26-field record (fps)
R1AFIX: assigning to fixed field of record (fps)
R1RFIX: reading fixed field of record (fps)
ADSM1M: sum of 1000x1000 doubles, via Matrix(i,j) (ops)
ADSA1M: sum of 1000x1000 doubles, via sum() (ops)
ADSI1M: sum of 1000x1000 doubles, via NCIter (ops)
ADSH1M: sum of 1000x1000 doubles, via hooks (ops)
ADAM1M: addition of 1000x1000 doubles, via Matrix(i,j) (ops)
ADAA1M: addition of 1000x1000 doubles, via array math (ops)
ADAI1M: addition of 1000x1000 doubles, via NCIters (ops)
ADAH1M: addition of 1000x1000 doubles, via hooks (ops)
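The rates above are quoted in ops per CPU second. As an illustration of that unit only, here is a minimal Python sketch of how such a rate can be measured. It is not how ncperf is implemented; the names `ops_per_cpu_second` and `init_small_record` are hypothetical, and the dictionary is just a stand-in for a small record operation such as RSINIT.

```python
import time

def ops_per_cpu_second(op, min_cpu_time=0.2):
    """Call op() repeatedly and return the number of calls per CPU second."""
    count = 0
    start = time.process_time()  # CPU time of this process, not wall-clock time
    elapsed = 0.0
    while elapsed < min_cpu_time:
        # Run a batch between clock reads so the timing overhead stays small.
        for _ in range(1000):
            op()
        count += 1000
        elapsed = time.process_time() - start
    return count / elapsed

def init_small_record():
    # Hypothetical stand-in for a benchmark such as RSINIT (init 3-element record).
    return {"a": 1, "b": 2.0, "c": "x"}

rate = ops_per_cpu_second(init_small_record)
print(f"{rate:.0f} ops per CPU second")
```

Counting CPU time rather than wall-clock time makes the figure less sensitive to other load on the machine, which matters when comparing runs on different hosts.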
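Conclusion: in the table above, -fPIC costs between 1% and 23% on the HIID and record benchmarks (the largest losses are for HIID2S and R1AFIX) and has almost no effect on the array benchmarks, apart from ADSM1M, where the -fPIC build is faster. 64-bit code is roughly 30% to 80% faster than 32-bit code on most DMI benchmarks, with a few exceptions (RSSCAN, RBSCAN, ADSM1M and ADSI1M).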
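The relative differences behind this conclusion can be recomputed directly from the table. The sketch below does so for a handful of rows; the figures are copied from the DMI table above, and the helper name `percent_change` is purely illustrative.

```python
# (32 bit, 64 bit, 64 bit w/-fPIC) rates copied from the DMI table above.
dmi = {
    "HIID2S": (435117, 747000, 573913),
    "RSSCAN": (10106667, 9946667, 9666667),
    "R3SCAN": (109850, 175760, 157136),
    "R1AFIX": (749833, 1288294, 988333),
    "ADSM1M": (983, 663, 987),
}

def percent_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return 100.0 * (new - old) / old

for name, (r32, r64, r64pic) in dmi.items():
    print(f"{name}: 64 bit vs 32 bit {percent_change(r64, r32):+6.1f}%, "
          f"-fPIC vs 64 bit {percent_change(r64pic, r64):+6.1f}%")
```

Adding the remaining rows to the `dmi` dictionary extends the comparison to the full table.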
Here are some Vells math benchmarks from MEQ/test/vellsperf:
| benchmark | 32 bit | 64 bit | 64 bit w/-fPIC | lofar10 |
|---|---|---|---|---|
| SEXPSC | 7132107 | 5205000 | 5256187 | 7787000 |
| SEXPSV | 340199 | 448667 | 491333 | 390000 |
| SEXPVV | 6266667 | 6033333 | 5960265 | 6900000 |
| SEXPV1 | 6266667 | 5980066 | 5933333 | 6854305 |
| SEXPVM | 4901961 | 4576659 | 4504505 | 5952381 |
| SUM_V1 | 93400000 | 117466667 | 116500000 | 102374582 |
| SUM_1V | 89300000 | 113966667 | 114347826 | 98903654 |
| SUMV1V | 22727273 | 26229508 | 25641026 | 33003300 |
| SUM_MM | 18348624 | 25316456 | 24615385 | 29032258 |
SEXPSC: sum of exp(complex scalar)
SEXPSV: sum of exp(complex scalar vells)
SEXPVV: sum of exp(vector vells)
SEXPV1: sum of exp(1xN vector vells)
SEXPVM: sum of exp(matrix vells)
SUM_V1: sum of vector vells
SUM_1V: sum of 1xN vector vells
SUMV1V: sum of vector and 1xN vector vells
SUM_MM: sum of matrix vells
Conclusion: the impact of -fPIC on Vells math is minimal. On 64 bits, most of the exp benchmarks are slower (about 4% to 7% for the vector and matrix cases, about 27% for SEXPSC; SEXPSV is the exception), while all the sum benchmarks are 15% to 38% faster. We could consider building everything with -fPIC. It's hard to predict the impact on an actual tree; we'll need to build a full system to measure this.
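One way to condense the -fPIC comparison into a single figure is the geometric mean of the per-benchmark ratios. This summary is an addition for illustration, not something the benchmark programs report. The rates below are copied from the Vells table above; the variable names are arbitrary.

```python
import math

# (64 bit, 64 bit w/-fPIC) rates copied from the Vells table above.
vells = {
    "SEXPSC": (5205000, 5256187),
    "SEXPSV": (448667, 491333),
    "SEXPVV": (6033333, 5960265),
    "SEXPV1": (5980066, 5933333),
    "SEXPVM": (4576659, 4504505),
    "SUM_V1": (117466667, 116500000),
    "SUM_1V": (113966667, 114347826),
    "SUMV1V": (26229508, 25641026),
    "SUM_MM": (25316456, 24615385),
}

# Ratio above 1 means the -fPIC build was faster on that benchmark.
ratios = [pic / plain for plain, pic in vells.values()]

# The geometric mean is the usual way to average ratios.
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"-fPIC / 64-bit geometric mean over {len(ratios)} benchmarks: {geo_mean:.3f}")
```

A value close to 1 supports the reading that -fPIC makes little difference on these microbenchmarks. It says nothing about a full tree, which still has to be measured on a complete build.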
