[[dragnet:

====== CPU computing and memory ======

CPU computing and memory benchmark numbers: N/A.


====== GPU computing, memory, PCIe ======

GPU computing benchmark numbers: N/A.


===== Memory and PCIe Bandwidth =====

For PCIe bandwidth, there is a substantial difference between the 2 GPUs local to the CPU running the test and the 2 GPUs attached to the other CPU in the same node.
NVIDIA's ''bandwidthTest'' CUDA sample gives an indication:

  [amesfoort@drg23 bandwidthTest]$ ./bandwidthTest
  [CUDA Bandwidth Test] - Starting...
  Running on...
  ...
  Quick Mode
  
  Host to Device Bandwidth, 1 Device(s)
  ...
  Result = PASS
  
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
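To see the local vs. remote CPU effect mentioned above, the test can be repeated with explicit CPU and memory affinity. A minimal sketch, assuming ''numactl'' is available and that GPU 0 is attached to the CPU in NUMA node 0 (check the actual device/node mapping with ''nvidia-smi topo -m''):

  # Host-to-device bandwidth with the host process and pinned memory on the local socket ...
  numactl --cpunodebind=0 --membind=0 ./bandwidthTest --device=0 --memory=pinned
  # ... and the same GPU driven from the other socket (expect a lower result)
  numactl --cpunodebind=1 --membind=1 ./bandwidthTest --device=0 --memory=pinned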
===== Infiniband =====

Each ''
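The InfiniBand link state and rate of a node can be verified with the standard diagnostics (a generic sketch, no cluster-specific options assumed):

  # Show HCA, port state and link rate (the "Rate:" line, e.g. 56 for FDR)
  ibstat
  # More detail on device and port attributes
  ibv_devinfo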
==== IPoIB: TCP and UDP ====

An application that uses the Infiniband (''
We used the ''
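As an illustration (a sketch only; the exact invocation and the IPoIB address below are assumptions), a TCP and a UDP ''iperf'' run over the IPoIB interface could look like:

  # On the receiving node: TCP server, then a UDP server
  iperf -s
  iperf -s -u
  # On the sending node, target the receiver's address on the ib0 interface
  # (10.144.0.23 is a made-up example address)
  iperf -c 10.144.0.23 -t 30
  iperf -c 10.144.0.23 -u -b 20000M -t 30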
==== RDMA ====

RDMA (Remote Direct Memory Access) allows an application to directly access memory on another node. Although some initial administration is set up via the OS kernel, the actual transfer commands and completion handling do not go through the kernel. This also avoids data copies on sender and receiver and reduces CPU usage.

Typical applications that may use RDMA are applications that use MPI (Message Passing Interface) (such as COBALT), or (hopefully) the LUSTRE client. NFS can also be set up to use RDMA. You can program directly against the ''
We used the ''
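As an illustration (a sketch with the ''perftest'' tools; not necessarily the tool used for the results here, and ''drg22'' is just an example peer), RDMA write bandwidth and latency between two nodes can be measured with:

  # On the server node (one test at a time; each listens on a TCP port for setup):
  ib_write_bw
  # On the client node, pointing at the server's hostname:
  ib_write_bw drg22
  # Latency instead of bandwidth:
  ib_write_lat          # server
  ib_write_lat drg22    # client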
===== 10 Gbit/s Ethernet =====

The ''
One should be able to achieve (near) 10 Gbit/s using a TCP socket stream. The ''

The 6x 10G trunk is apparently not able to utilize all 6 links simultaneously (some information on why is available, but that is off-topic here). The following is the maximum that applications can expect between DRAGNET and COBALT across 10G (an ''iperf'' command sketch follows the result listings below):

  DRAGNET -> COBALT, 8x iperf TCP: 38.8 Gbit/s. Second run: 39.7 Gbit/s. (Aug 2015)
  #streams  total (Gbit/s)
   8        49.52   (4 at full bw, 3 at 2-3 G, 1 near/below 1 G)
  10        48.87   (3 at full bw, the rest at 2-3 G)
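A sketch of how such an aggregate test can be run (hostname and options are illustrative, not the exact invocations used for the numbers above):

  # On the receiving node:
  iperf -s
  # On the sending node: 8 parallel TCP streams for 30 s, report the aggregate
  iperf -c cbt001 -P 8 -t 30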

From 16 CEP2 nodes equally spread over the 4 CEP2 switches to 16 DRAGNET nodes (Sep 2015), ''iperf'' gave the following totals in 3 test runs:

  Total iperf (synthetic/
  48.02 Gbit/s
  41.97 Gbit/s
  44.84 Gbit/s

If you were to transfer large, equally sized files, some files in a set of 16 would finish far earlier than others, since some individual streams reached only 1 Gbit/s while others reached 5 Gbit/s.

Doing this with 14 LOTAAS ''.raw'' files using ''scp'' (one stream per ''locus'' source node):

  time \
  scp locus001:/
  scp locus004:/
  scp locus026:/
  scp locus027:/
  scp locus028:/
  scp locus029:/
  scp locus051:/
  scp locus052:/
  scp locus053:/
  scp locus054:/
  scp locus076:/
  scp locus077:/
  scp locus078:/
  scp locus079:/
  
  real    4m34.894s
  user    0m0.000s
  sys
  
  Total size: 14x 18982895616 bytes = 247.5 GiB (265.8 GB)
  => ~7.2 Gibit/s = 7.7 Gbit/s (14 scp streams (idle system), no dynamic load-balancing)
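One way to script such a parallel copy and time it as a whole, as a sketch (hostnames and paths below are placeholders; the actual test may have been launched differently):

  # Start one scp stream per source node in the background, then wait for all,
  # so 'time' reports the wall-clock time of the slowest stream.
  time (
    scp locus001:/data/LXXXXXX/file_SAP000.raw /data1/ &
    scp locus004:/data/LXXXXXX/file_SAP001.raw /data1/ &
    scp locus026:/data/LXXXXXX/file_SAP002.raw /data1/ &
    wait
  )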


====== Storage ======

Only rough write tests have been done with a sequential ''dd(1)''. Disk I/O bandwidth varies across the platters, and actual file I/O also depends on how the filesystem lays out the data.


===== drgXX nodes =====

Scratch space on ''
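A sketch of such a rough sequential write test (mount point, file size and block size are illustrative; ''oflag=direct'' bypasses the page cache so the number reflects the disks rather than RAM):

  # Write 64 GiB sequentially with direct I/O; dd prints the achieved rate when done
  dd if=/dev/zero of=/data1/ddtest.bin bs=1M count=65536 oflag=direct
  rm /data1/ddtest.bin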

Another ''cp'' test on an 85+% full target filesystem:

  [amesfoort@drg23 data1]$ time (cp /
  
  real    9m56.566s
  user    0m0.799s
  sys     4m20.007s

With a file size of 2 * 75931582464 bytes copied in 596.6 s, that is a read/write speed of about 254.6 MB/s (242.8 MiB/s).
===== dragproc node =====

On ''