The following sections describe how to verify that the interconnect is set up correctly, i.e. that every Cluster Node can communicate with all other Cluster Nodes via the PCI Express interconnect by sending low-level control packets and performing remote memory access.
Without the required drivers and services running on a Cluster Node, the node cannot communicate with the other nodes. On each Cluster Node, the kernel services need to be running: the interconnect resources driver (IRM) and the upper-level hardware services.
Because the drivers also appear as services, you can query their status with the usual tools of the installed operating system distribution.
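For example, on a systemd-based Linux distribution the check might look as follows; the service and module names used here (dis_irm, dis_sisci) are assumptions and may differ in your installation:
$ systemctl status dis_irm     # status of the interconnect resources driver service (assumed name)
$ systemctl status dis_sisci   # status of the upper-level hardware services (assumed name)
$ lsmod | grep dis_            # kernel modules loaded by the driver stack (assumed name prefix)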
If any of the required services is not running, you will find more information about the problem in the system log facilities. Inspect the kernel messages (e.g. with dmesg) and check the system log for related messages.
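A minimal sketch of such a log inspection; the search keywords are assumptions and should be adapted to the messages your installation actually produces:
$ dmesg | grep -i -e dolphin -e dis_   # kernel messages from the driver stack
$ journalctl -b | grep -i dolphin      # system journal on systemd-based distributions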
The static interconnect test ensures that all PCIe communication hardware is working correctly by performing a self-test, and determines whether the setup and the PCIe routing are correct (i.e. match the actual hardware topology). It also checks all PCIe connections, but this has already been done in the PCIe Connection Test. The tool to perform this test is dis_diag (default location /opt/DIS/sbin/dis_diag).
Running dis_diag on a Cluster Node performs a self-test on the local adapter(s) and lists all remote adapters that this adapter can see via the PCI Express interconnect. To perform the static interconnect test on a full cluster, you therefore need to run dis_diag on each Cluster Node and check that no problems with the adapters are reported, and that the adapters in each Cluster Node can see all remote adapters installed in the other Cluster Nodes.
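A minimal sketch of running the test on all Cluster Nodes from a single machine, assuming passwordless SSH access and that the nodes are reachable under the hypothetical hostnames node-a and node-b (adjust the hostnames and the installation path to your setup):
$ for host in node-a node-b; do echo "=== $host ==="; ssh "$host" /opt/DIS/sbin/dis_diag; done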
Normally you should invoke dis_diag without arguments; it then runs a general test and only shows the most relevant information. Advanced users may want to enable full verbose mode with the -V 9 command line option:
$ dis_diag -V 9
The -V 9 option generates a lot of information; some of it requires knowledge about the PCIe chipset and the PCIe specification in general. The diagnostic module collects various usage and error information over time. This can be cleared with the -clear command line option:
$ dis_diag -clear
An example output of dis_diag for a Cluster Node that is part of a 2-node cluster using one PXH830 adapter per Cluster Node looks like this:
[root@Hetty ~]# /opt/DIS/sbin/dis_diag
================================================================================
 Dolphin diagnostic tool -- dis_diag version 5.5.0 ( Thu Jan 10th 16:23:13 CET 2018 )
================================================================================
dis_diag compiled in 64 bit mode
Driver : Dolphin IRM (GX) 5.5.0.0 Jan 10th 2018 (rev unknown)
Date   : Wed Feb 1 14:34:26 CET 2018
System : Linux Hetty 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Number of configured local adapters found: 1

Adapter 0 > Type                     : PXH830
            NodeId                   : 4
            Serial number            : PXH830-BC-000116
            PXH chip family          : PLX_DRACO_2
            PXH chip vendorId        : 0x10b5
            PXH chip device          : 0x8733
            PXH chip revision        : 0xCA
            EEPROM version NTB mode  : 05
            EEPROM vendor info       : 0x0000
            Firmware version         : 05.01
            Card revision            : BC
            Topology type            : Direct 2 nodes
            Topology Autodetect      : No
            Number of enabled links  : 1
            Max payload size (MPS)   : 256
            Multicast group size     : 2 MB
            Prefetchable memory size : 32768 MB (BAR2)
            Non-prefetchable size    : 64 MB (BAR4)
            Clock mode slot          : Port
            Clock mode link          : Global
            PCIe slot state          : x16, Gen3 (8 GT/s)
            PCIe slot capabilities   : x16, Gen3 (8 GT/s)

************************* PXH ADAPTER 0 LINK 0 STATE *************************

Link 0 uptime         : 91371 seconds
Link 0 state          : ENABLED
Link 0 state          : x16, Gen3 (8 GT/s)
Link 0 required       : x16, Gen3 (8 GT/s)
Link 0 capabilities   : x16, Gen3 (8 GT/s)
Link 0 cable inserted : 1
Link 0 active         : 1
Link 0 configuration  : NTB

**************************** PXH ADAPTER 0 STATUS ****************************

Chip temperature  : 87 C
Board temperature : 50 C

*************** PXH ADAPTER 0, PARTNER INFORMATION FOR LINK 0 ***************

Partner adapter type    : PXH830
Partner serial number   : PXH830-000117
Partner link no         : 0
Partner number of ports : 1

***************************** TEST OF ADAPTER 0 *****************************

OK: PXH chip alive in adapter 0.
OK: Link alive in adapter 0.

==> Local adapter 0 ok.

************************ TOPOLOGY SEEN FROM ADAPTER 0 ************************

Adapters found: 2

----- List of all nodes found:
Nodes detected: 0004 0008

*********************** SESSION STATUS FROM ADAPTER 0 ***********************

Node 4: Session valid
Node 8: Session valid

----------------------------------
dis_diag discovered 0 note(s).
dis_diag discovered 0 warning(s).
dis_diag discovered 0 error(s).

TEST RESULT: *PASSED*
The static interconnect test passes if dis_diag delivers TEST RESULT: *PASSED* and reports the same topology (remote adapters) on all Cluster Nodes.
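A quick way to compare the relevant parts of the output across nodes is to filter for the result and topology lines; the grep patterns below are assumptions based on the example output above:
$ /opt/DIS/sbin/dis_diag | grep -e 'TEST RESULT' -e 'Nodes detected'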
Once the correct installation and setup and the basic functionality of the interconnect have been verified, a set of low-level benchmarks can be run to determine the baseline performance of the interconnect without any additional software layers. The relevant tests are scibench2 (streaming remote memory PIO access performance), scipp (request-response remote memory PIO write performance), dma_bench (streaming remote memory DMA access performance) and intr_bench (remote interrupt performance).
All these tests need to run on two Cluster Nodes (A and B) and are started in the same manner:
Determine the NodeId of both Cluster Nodes using the query command; the NodeId is reported as "Local node-id".
On node A, start the server-side benchmark with the options -server and -rn <NodeId of B>, like:
$ scibench2 -server -rn 8
On Cluster Node B, start the client-side benchmark with the options -client and -rn <NodeId of A>, like:
$ scibench2 -client -rn 4
The test results are reported by the client.
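Since all four benchmarks are started in the same manner, the other tests can be run analogously. Assuming dma_bench accepts the same -server/-client and -rn options, running it between the same two nodes (NodeIds 4 and 8 as in the examples above) would look like this:
# on node A (NodeId 4):
$ dma_bench -server -rn 8
# on Cluster Node B (NodeId 8):
$ dma_bench -client -rn 4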
scibench2 measures the streaming bandwidth using CPU-based PIO transfers (memcpy).
The following results were measured using the PXH810 card (Gen3, x8):
---------------------------------------------------------------
 Segment Size:     Average Send Latency:     Throughput:
---------------------------------------------------------------
        4                 0.07 us              58.31 MBytes/s
        8                 0.07 us             117.14 MBytes/s
       16                 0.07 us             231.06 MBytes/s
       32                 0.07 us             445.08 MBytes/s
       64                 0.08 us             838.84 MBytes/s
      128                 0.09 us            1483.27 MBytes/s
      256                 0.11 us            2408.40 MBytes/s
      512                 0.15 us            3497.44 MBytes/s
     1024                 0.23 us            4530.20 MBytes/s
     2048                 0.39 us            5294.99 MBytes/s
     4096                 0.77 us            5308.03 MBytes/s
     8192                 1.54 us            5306.65 MBytes/s
    16384                 3.10 us            5291.49 MBytes/s
    32768                 6.19 us            5294.48 MBytes/s
    65536                12.39 us            5289.90 MBytes/s
Average Send Latency is the wall-clock time to write one message of the given segment size to remote memory.
Throughput is the streaming performance using PIO writes to remote memory.
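The two columns are consistent with each other: the throughput is essentially the segment size divided by the average send latency. For example, for the 65536-byte row, 65536 bytes / 12.39 us ≈ 5290 MBytes/s, which matches the reported throughput.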
dma_bench measures the streaming DMA bandwidth available through the SISCI API.
The following results were measured using the PXH830 card (Gen3, x16):
-------------------------------------------------------------------------------
 Message      Total      Vector    Transfer     Latency          Bandwidth
 size         size       length    time         per message
-------------------------------------------------------------------------------
      64      16384       256      35.76 us      0.14 us       458.18 MBytes/s
     128      32768       256      36.81 us      0.14 us       890.24 MBytes/s
     256      65536       256      37.16 us      0.15 us      1763.43 MBytes/s
     512     131072       256      39.36 us      0.15 us      3329.83 MBytes/s
    1024     262144       256      41.34 us      0.16 us      6340.40 MBytes/s
    2048     524288       256      54.75 us      0.21 us      9576.21 MBytes/s
    4096     524288       128      51.83 us      0.40 us     10116.51 MBytes/s
    8192     524288        64      50.46 us      0.79 us     10390.38 MBytes/s
   16384     524288        32      49.69 us      1.55 us     10551.60 MBytes/s
   32768     524288        16      49.30 us      3.08 us     10634.86 MBytes/s
   65536     524288         8      49.07 us      6.13 us     10684.71 MBytes/s
  131072     524288         4      48.90 us     12.23 us     10721.20 MBytes/s
  262144     524288         2      48.89 us     24.44 us     10724.27 MBytes/s
  524288     524288         1      48.98 us     48.98 us     10704.78 MBytes/s
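The columns relate as follows (as can be verified from the numbers): Total size is the Message size multiplied by the Vector length, Latency per message is the Transfer time divided by the Vector length, and Bandwidth is Total size divided by Transfer time; for example, 16384 bytes / 35.76 us ≈ 458 MBytes/s in the first row.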
The scipp SISCI benchmark sends a message of the specified size to the remote system. The remote system polls for incoming data and sends a similar message back to the first node.
The minimal round-trip latency for writing to remote memory is extremely low on PCI Express networks. The following results are typical for a PCI Express Gen3 x8 link:
Ping Pong data transfer:
   size    retries    latency (usec)    latency/2 (usec)
      0       2486         1.079             0.539
      4       2406         1.078             0.539
      8       2442         1.090             0.545
     16       2454         1.098             0.549
     32       2482         1.117             0.558
     64       2562         1.151             0.575
    128       2608         1.176             0.588
    256       2667         1.247             0.624
    512       2866         1.331             0.666
   1024       3064         1.492             0.746
   2048       3773         1.880             0.940
   4096       4850         2.659             1.330
   8192       7364         4.247             2.123
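In this table, latency is the measured round-trip time for the given message size and latency/2 is half of that, an estimate of the one-way latency; for example, a 4-byte ping-pong completes the round trip in about 1.08 us, i.e. roughly 0.54 us one way.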
intr_bench measures the latency of triggering a remote interrupt. The interrupt latency is affected by the operating system and can therefore vary.
Average unidirectional interrupt time : 2.515 us.
Average round trip interrupt time     : 5.030 us.