====== DRAGNET Cluster Benchmark Numbers ======
Parts of the cluster and interconnects have been stress tested to optimize configuration and to find upper performance bounds that can be useful for application optimization and rough capacity estimates.
The tests described here cover the *maximum* achievable performance on a synthetic, ideal workload. It is very likely that your (real) application will never reach these numbers: its workload is non-ideal, and reaching peak performance can take a lot of effort. But these numbers can serve as an upper reference.
[[dragnet:


====== CPU computing and memory ======

Computing and memory numbers: N/A

====== GPU computing, memory, PCIe ======

Computing numbers: N/A


===== Memory and PCIe Bandwidth =====

For PCIe bandwidth, there is a substantial difference between the 2 GPUs local to the CPU and the 2 GPUs local to the other CPU in the same node.
NVIDIA's ''bandwidthTest'' utility from the CUDA samples gives an indication:

  [amesfoort@drg23 bandwidthTest]$ ./bandwidthTest
  [CUDA Bandwidth Test] - Starting...
  Running on...
  ...
   Quick Mode
  
   Host to Device Bandwidth, 1 Device(s)
   ...
  
  Result = PASS
  
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
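
To see which GPUs are attached to which CPU socket (and thus when a transfer crosses the inter-socket link), the PCIe/NUMA topology can be inspected and the benchmark pinned to a NUMA node. This is only a sketch, assuming a driver recent enough for ''nvidia-smi topo'' and that ''numactl'' is installed; device and node numbers are example values:

  # show the PCIe/NUMA connection matrix between GPUs and CPU sockets
  [amesfoort@drg23 ~]$ nvidia-smi topo -m
  
  # run the bandwidth test on GPU 0 while pinned to CPU and memory of NUMA node 0 (the "local" case)
  [amesfoort@drg23 bandwidthTest]$ numactl --cpunodebind=0 --membind=0 ./bandwidthTest --device=0
  # same GPU, but pinned to the other socket (the "remote" case)
  [amesfoort@drg23 bandwidthTest]$ numactl --cpunodebind=1 --membind=1 ./bandwidthTest --device=0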

===== Infiniband =====

Each ''

==== IPoIB: TCP and UDP ====

An application that uses the Infiniband (''
We used the ''
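
As an illustration of such an IPoIB measurement (not necessarily the exact invocation used here; hostnames and the IPoIB address are placeholders, assuming iperf 2.x is installed):

  # on the receiving node: start a TCP server (add -u for UDP)
  [amesfoort@drg02 ~]$ iperf -s
  
  # on the sending node: connect to the receiver's IPoIB address
  [amesfoort@drg01 ~]$ iperf -c <drg02-ib-addr> -t 30                # single TCP stream, 30 s
  [amesfoort@drg01 ~]$ iperf -c <drg02-ib-addr> -u -b 10000M -t 30   # UDP at a 10 Gbit/s offered rate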

A real application (not a synthetic benchmark) likely does more than just data transfer and may have trouble reaching these numbers, because CPU load is a limiting factor and the clock frequency boost of 1 core on ''

==== RDMA ====

RDMA (Remote Direct Memory Access) allows an application to directly access memory on another node. Although some initial administration is set up via the OS kernel, the actual transfer commands and completion handling do not go through the kernel. This also avoids data copies on sender and receiver and reduces CPU usage.
Typical applications that may use RDMA are applications that use MPI (Message Passing Interface), such as COBALT, or (hopefully) the LUSTRE client. NFS can also be set up to use RDMA. You can also program directly against the ''
We used the ''
  ver_rc_fetch_add:
      msg_rate
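
The test names above (''ver_rc_fetch_add'', ''msg_rate'') match ''qperf'' output. A minimal sketch of such a verbs-level run, assuming ''qperf'' is installed on both nodes (hostnames and the test selection are examples, not necessarily the exact invocation used):

  # on the server node: start qperf in listen mode
  [amesfoort@drg22 ~]$ qperf
  
  # on the client node: run a few RDMA (verbs) benchmarks against it
  [amesfoort@drg23 ~]$ qperf drg22 rc_rdma_write_bw rc_rdma_read_bw rc_lat ver_rc_fetch_add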

===== 10 Gbit/s Ethernet =====

The ''

One should be able to achieve (near) 10 Gbit/s using a TCP socket stream. The ''

The 6x 10G trunk is apparently not able to utilize all 6 links simultaneously with up to 10 streams (some info on why is available, but off-topic here). The following is the maximum that can be expected for applications between DRAGNET and COBALT across 10G:

  DRAGNET -> COBALT, 8x iperf TCP: 38.8 Gbit/s. Second run: 39.7 Gbit/s. (Aug 2015)

A week later, we achieved 10 Gbit/s higher throughput using 10 COBALT and 10 DRAGNET nodes. Commands: ''
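
A sketch of a per-node-pair test of this kind (hostnames are placeholders; not necessarily the exact commands used). The per-node streams run concurrently and the totals in the tables below are the sums of the per-stream reports:

  # on each receiving COBALT node: start a TCP server
  [amesfoort@cbt001 ~]$ iperf -s
  
  # on each sending DRAGNET node: one TCP stream to "its" COBALT node
  [amesfoort@drg01 ~]$ iperf -c cbt001 -t 30
  [amesfoort@drg02 ~]$ iperf -c cbt002 -t 30
  # ... and so on, one stream per node pair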

  DRAGNET -> COBALT (1 iperf TCP stream per node) (6x10G) (otherwise idle clusters and network) (Sep 2015)
  #nodes  total (Gbit/s)  (per-stream distribution)
   1       9.91           (1 full bw)
   2      19.82           (2 full bw)
   3      29.73           (3 full bw)
   4      29.76           (2 full bw, 2 ~half bw)
   5      29.68           (1 full bw, 4 ~half bw)
   6      39.54           (2 full bw, 4 ~half bw)
   7      49.49           (3 full bw, 4 ~half bw) (best result, 47-49.5)
   8      46.19           (1 full bw, 7(?) ~half bw) (best result after 6 runs; half of the runs didn't do 8 streams)
   9      49.28           (1 full bw, 8 ~half bw)
  10      49.13; 50.52    (0 full bw, 10 ~half bw)

  COBALT -> DRAGNET (1 iperf TCP stream per node) (6x10G) (otherwise idle clusters and network) (Sep 2015)
  #nodes  total (Gbit/s)  (per-stream distribution)
   1       9.91           (1 full bw)
   2      19.82           (2 full bw)
   3
   4      29.72           (2 full bw, 2 ~half)
   5      39.64           (3 full bw, 2 ~half) (better than dragnet->
   8      49.52           (4 full bw, 3 at 2-3 G, 1 at <1 G)
  10      48.87           (3 full bw, the rest 2-3 G)

From 16 CEP2 nodes, equally spread over the 4 CEP2 switches, to 16 DRAGNET nodes (Sep 2015), iperf got us in 3 test runs:
  Total iperf (synthetic/
  48.02 Gbit/s
  41.97 Gbit/s
  44.84 Gbit/s
If you were to transfer large, equally sized files, some files in a set of 16 would finish much earlier than others, since some individual streams reached only 1 Gbit/s while others reached 5 Gbit/s.

Doing this with 14 LOTAAS .raw files using ''
  time \
    scp locus001:/
    scp locus004:/
    scp locus026:/
    scp locus027:/
    scp locus028:/
    scp locus029:/
    scp locus051:/
    scp locus052:/
    scp locus053:/
    scp locus054:/
    scp locus076:/
    scp locus077:/
    scp locus078:/
    scp locus079:/

  real    4m34.894s
  user    0m0.000s
  sys

  Total size: 14 x 18982895616 bytes = 247.5 GiB (265.8 GB)
  => ~7.7 Gbit/s (265.8 GB in 274.9 s; 14 scp streams (idle systems), no dynamic load-balancing)
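
The general pattern of such a test is to start all copies in the background and time until the last one finishes. A sketch with placeholder hosts and a placeholder source path (not the actual LOTAAS file names), plus the aggregate throughput arithmetic:

  # one scp per source node, all in the background; time until the last one finishes
  # (SRC is a placeholder path, not the real LOTAAS file name)
  SRC=/data/L434567/example.raw
  time (
    for h in locus001 locus004 locus026 locus027; do
      scp "$h:$SRC" /data1/ &
    done
    wait
  )
  
  # aggregate throughput in Gbit/s = total bytes * 8 / wall-clock seconds / 1e9
  awk 'BEGIN { print 14 * 18982895616 * 8 / 274.894 / 1e9 }'    # ~7.7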


====== Storage ======

Only rough write tests have been done, using a sequential ''dd''(1). Disk I/O bandwidth varies across the platters, and actual file I/O also depends on how the filesystem lays out the data.
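
A sketch of such a sequential write test (file name, block size and total size are example values; ''conv=fsync'' makes ''dd'' include the final flush in the reported time and rate):

  # write 64 GiB sequentially to the scratch filesystem; dd reports MB/s at the end
  [amesfoort@drg23 ~]$ dd if=/dev/zero of=/data1/ddtest.bin bs=1M count=65536 conv=fsync
  # remove the test file afterwards
  [amesfoort@drg23 ~]$ rm /data1/ddtest.bin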


===== drgXX nodes =====

Scratch space on ''

Another ''cp'' test, on an 85+% full target filesystem:

  [amesfoort@drg23 data1]$ time (cp /

  real    9m56.566s
  user    0m0.799s
  sys     4m20.007s

With a file size of 2 * 75931582464 bytes, that's a read/write speed of 242.8 MiB/s.

===== dragproc node =====

On ''