====== Cluster Benchmark ======

  
[[dragnet:hardware_specs|Cluster Specifications]]


====== CPU computing and memory ======

Computing and memory numbers N/A


====== GPU computing, memory, PCIe ======

Computing numbers N/A


===== Memory and PCIe Bandwidth =====

For PCIe bandwidth, there is a substantial difference between the 2 GPUs local to the CPU running the test and the 2 GPUs local to the other CPU in the same node.
NVIDIA's ''bandwidthTest'' is not meant for performance measurements (as its own output notes, results may vary when GPU Boost is enabled), but here are the numbers anyway:

  [amesfoort@drg23 bandwidthTest]$ ./bandwidthTest --device=0
  [CUDA Bandwidth Test] - Starting...
  Running on...
  
   Device 0: GeForce GTX TITAN X
   Quick Mode
  
   Host to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                6325.2
  
   Device to Host Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                4279.2
  
   Device to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                250201.6
  
  Result = PASS
  
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

  [amesfoort@drg23 bandwidthTest]$ ./bandwidthTest --device=1
  [CUDA Bandwidth Test] - Starting...
  Running on...
  
   Device 1: GeForce GTX TITAN X
   Quick Mode
  
   Host to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                6178.8
  
   Device to Host Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                4412.2
  
   Device to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                249804.8
  
  Result = PASS
  
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

  [amesfoort@drg23 bandwidthTest]$ ./bandwidthTest --device=2
  [CUDA Bandwidth Test] - Starting...
  Running on...
  
   Device 2: GeForce GTX TITAN X
   Quick Mode
  
   Host to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                10547.1
  
   Device to Host Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                10892.2
  
   Device to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                249503.5
  
  Result = PASS
  
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

  [amesfoort@drg23 bandwidthTest]$ ./bandwidthTest --device=3
  [CUDA Bandwidth Test] - Starting...
  Running on...
  
   Device 3: GeForce GTX TITAN X
   Quick Mode
  
   Host to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                10626.5
  
   Device to Host Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                10903.4
  
   Device to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432                249342.6
  
  Result = PASS
  
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
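
To see which GPUs share a PCIe root complex / NUMA node with which CPU, and to repeat the measurement with the benchmark pinned to the matching CPU, something like the following can be used (the NUMA node number below is illustrative):

  # Show the GPU <-> CPU/NUMA topology of the node
  nvidia-smi topo -m
  # Re-run the bandwidth test pinned to one CPU and its memory (node number illustrative)
  numactl --cpunodebind=1 --membind=1 ./bandwidthTest --device=0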
  
  
===== Infiniband =====
  
Each ''drgXX'' node has an FDR (54.545 Gbit/s) HCA (Host Channel Adapter) local to the second CPU (i.e. CPU id 1). The 36 port cluster switch is connected to the COBALT switch with 5 aggregated lines (272.727 Gbit/s). See below what can be achieved under ideal circumstances.
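
To verify the HCA link rate and which NUMA node it is attached to, something like this can be used (the device name ''mlx4_0'' is an assumption):

  # Link state and rate of the HCA (ibstat comes with infiniband-diags)
  ibstat
  # NUMA node the HCA is attached to (device name is an assumption)
  cat /sys/class/infiniband/mlx4_0/device/numa_node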
  
  
==== IPoIB: TCP and UDP ====
  
An application that uses the Infiniband (''ib'') network normally uses IPoIB (IP-over-Infiniband) to transfer data via TCP or UDP. DRAGNET IPoIB settings have been optimized for TCP at the cost of UDP performance (IPoIB connected mode is enabled). We (mostly) use TCP and will not receive UDP data from LOFAR stations directly.
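
Whether connected mode (and thus a large MTU) is active can be checked per IPoIB interface; the interface name ''ib0'' is an assumption:

  # IPoIB transport mode: 'connected' or 'datagram' (interface name is an assumption)
  cat /sys/class/net/ib0/mode
  # In connected mode the MTU can be up to 65520 bytes
  ip link show ib0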
  
We used the ''iperf3'' benchmark and got the following bandwidth numbers between two ''drgXX'' nodes:
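
A minimal way to run such a test between two nodes (hostname, stream count and duration are illustrative):

  # On the receiving node:
  iperf3 -s
  # On the sending node, towards the receiver's IPoIB address (hostname illustrative):
  iperf3 -c drg24-ib -P 4 -t 30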
==== RDMA ====

RDMA (Remote Direct Memory Access) allows an application to directly access memory on another node. Although some initial administration is set up via the OS kernel, the actual transfer commands and completion handling do not go via the kernel. This also saves data copies on sender and receiver, as well as CPU usage.
  
Typical applications that may use RDMA are applications that use MPI (Message Passing Interface), such as COBALT, or (hopefully) the LUSTRE client. NFS can also be set up to use RDMA. You can program directly against the ''verbs'' and ''rdma-cm'' C APIs and link to those libraries, but be aware that extending existing code to do this is not a one-hour task... (Undoubtedly, there is also a Python module that wraps these APIs or even makes life easier.)
  
We used the ''qperf'' benchmark and got the following bandwidth and latency numbers between two ''drgXX'' nodes (TCP/UDP/SCTP over IP also included, but not as fast as mentioned above):
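
A minimal way to run such a test (the hostname and the selected tests are illustrative):

  # On one node, start the qperf server (no arguments needed):
  qperf
  # On the other node, measure RDMA (RC) and TCP bandwidth and latency:
  qperf drg24-ib rc_bw rc_lat tcp_bw tcp_lat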
===== 10 Gbit/s Ethernet =====
  
The ''drgXX'' and ''dragproc'' nodes have a 10 Gbit/s ethernet adapter local to the first CPU (i.e. CPU id 0). A 48 port Ethernet switch is connected to a LOFAR core switch with 6 aggregated lines (60 Gbit/s). See below what can be achieved under ideal circumstances.
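
As with the Infiniband HCA, the NUMA node of the 10 Gbit/s adapter can be checked and a network-heavy process can be pinned to that CPU; the interface name and node number below are assumptions:

  # NUMA node the 10 Gbit/s adapter is attached to (interface name is an assumption)
  cat /sys/class/net/eth2/device/numa_node
  # Pin a network-heavy program to that CPU and its local memory
  numactl --cpunodebind=0 --membind=0 <some_program>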
  
  
  8               49.52 (4 full bw, 3 2-3G, 1 near <1G)
  10              48.87 (3 full bw, the rest 2-3 G)

From 16 CEP2 nodes equally spread over the 4 CEP2 switches to 16 DRAGNET nodes (Sep 2015), iperf gave us the following in 3 test runs:
  Total iperf (synthetic/benchmark) bandwidth for each test after initial ramp-up (~8 s):
  48.02 Gbit/s
  41.97 Gbit/s
  44.84 Gbit/s
  If you transferred large, equal-sized files, some files in a set of 16 would finish way earlier than others, since some individual streams reached only 1 Gbit/s, while others reached 5 Gbit/s.
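
A sketch of how such a many-to-many test can be launched; the node names, stream count and duration are illustrative (the actual test used 16 CEP2 nodes spread over the 4 switches):

  # On every receiving DRAGNET node, run an iperf3 server: iperf3 -s
  # Then, from a CEP2 node, start one client per node pair (hostnames illustrative):
  for i in $(seq -w 1 16); do
    ssh locus0${i} "iperf3 -c drg${i}-10g -t 30" &
  done
  wait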

Transferring 14 LOTAAS ''.raw'' data files with ''scp'' from CEP2 -> DRAGNET is not going to be blazingly fast:
  time \
  scp locus001:/data/L370522/L370522_SAP000_B000_S0_P000_bf.raw drg01-10g:/data1/ & \
  scp locus004:/data/L370522/L370522_SAP000_B002_S0_P000_bf.raw drg02-10g:/data1/ & \
  scp locus026:/data/L370522/L370522_SAP000_B021_S0_P000_bf.raw drg03-10g:/data1/ & \
  scp locus027:/data/L370522/L370522_SAP000_B022_S0_P000_bf.raw drg04-10g:/data1/ & \
  scp locus028:/data/L370522/L370522_SAP000_B023_S0_P000_bf.raw drg05-10g:/data1/ & \
  scp locus029:/data/L370522/L370522_SAP000_B024_S0_P000_bf.raw drg06-10g:/data1/ & \
  scp locus051:/data/L370522/L370522_SAP000_B045_S0_P000_bf.raw drg07-10g:/data1/ & \
  scp locus052:/data/L370522/L370522_SAP000_B046_S0_P000_bf.raw drg08-10g:/data1/ & \
  scp locus053:/data/L370522/L370522_SAP000_B047_S0_P000_bf.raw drg09-10g:/data1/ & \
  scp locus054:/data/L370522/L370522_SAP000_B048_S0_P000_bf.raw drg10-10g:/data1/ & \
  scp locus076:/data/L370522/L370522_SAP000_B068_S0_P000_bf.raw drg11-10g:/data1/ & \
  scp locus077:/data/L370522/L370522_SAP000_B069_S0_P000_bf.raw drg12-10g:/data1/ & \
  scp locus078:/data/L370522/L370522_SAP000_B070_S0_P000_bf.raw drg13-10g:/data1/ & \
  scp locus079:/data/L370522/L370522_SAP000_B071_S0_P000_bf.raw drg14-10g:/data1/
  real    4m34.894s  ( 274.894 s )
  user    0m0.000s
  sys     0m0.000s
  
  Total size: 14x 18982895616 bytes = 265.8 GB (247.5 GiB)
  => ~7.7 Gbit/s (14 scp streams (idle sys), no dynamic load-balancing)
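
As a sanity check, the aggregate rate follows directly from the total number of bytes and the wall-clock time of the run above:

  # Aggregate scp throughput: total bits / elapsed seconds (values from the run above)
  echo "14 * 18982895616 * 8 / 274.894 / 10^9" | bc -l    # ~7.73 (Gbit/s)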


====== Storage ======

Only rough write tests have been done with a sequential dd(1). Disk I/O bandwidth varies across the platters (outer tracks are faster than inner tracks). Actual file I/O performance also depends on how the filesystem lays out the data.
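
A rough sequential write test of this kind can be done as follows; the target path, block size and file size are illustrative, not the exact parameters used:

  # Sequential write test with dd; oflag=direct bypasses the page cache so the
  # result reflects disk bandwidth rather than memory bandwidth.
  dd if=/dev/zero of=/data1/ddtest.tmp bs=64k count=262144 oflag=direct   # writes 16 GiB
  rm /data1/ddtest.tmp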


===== drgXX nodes =====

Scratch space on ''drgXX'' is at ''/data1'' and ''/data2'' (two separate filesystems). A transfer size of 4k vs 64k does not appear to matter. We reach up to 288 MiB/s. Copying a large file using cp(1) reaches 225 - 279 MiB/s.

Another cp test, on an 85+% full target filesystem:

  [amesfoort@drg23 data1]$ time (cp /data2/L412984/L412984_SAP000_B045_S0_P000_bf.raw /data2/L412984/L412984_SAP000_B046_S0_P000_bf.raw Ltmp && sync)
  
  real 9m56.566s
  user 0m0.799s
  sys 4m20.007s

With two files of 75931582464 bytes each, that's a read/write speed of 242.8 MiB/s.
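
As a check, the copy rate is simply the number of bytes copied divided by the elapsed time:

  # Bytes copied / elapsed seconds, expressed in MiB/s (values from the run above)
  echo "2 * 75931582464 / 596.566 / 1024^2" | bc -l    # ~242.8 (MiB/s)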
===== dragproc node =====

On ''dragproc'' at ''/data'', a transfer size of 64k may perform somewhat better than 4k, but not consistently. We reach 490 - 530 MiB/s.
  