When installing the eXpressWare software stack (which includes SISCI, SuperSockets and the IPoPCIe TCP/IP driver) via the SIA, the basic functionality and performance are verified at the end of the installation process by some of the same tests that are described in the following sections. This means that if the tests performed by the SIA did not report any errors, it is very likely that both the software and the hardware work correctly.
The following sections describe how to verify that the interconnect is set up correctly, which means that all Cluster Nodes can communicate with all other Cluster Nodes via the PCI Express interconnect by sending low-level control packets and performing remote memory access.
The Cluster Management Node functionality is optional for SISCI-based applications but currently mandatory for SuperSockets operation. The Cluster Management Node automatically distributes configuration files and simplifies diagnostics of the cluster. On the Cluster Management Node, only the user-space service dis_networkmgr (the central Network Manager) needs to be running.
Without the required drivers and services running on a Cluster Node, the node will fail to communicate with other nodes. On the Cluster Nodes, the kernel module dis_kosif (providing operating system dependent functionality to all other kernel modules) and the kernel services dis_irm (interconnect resources driver), dis_sisci (upper level hardware services) and dis_ssocks need to be running. In addition to these kernel drivers, the user-space service dis_nodemgr (the node manager, which talks to the central Network Manager) needs to be active for configuration and monitoring.
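To double-check that the kernel modules listed above are actually loaded on a Cluster Node, you can filter the module list; this is a minimal sketch, assuming the modules keep the dis_ prefix shown above:

# lsmod | grep ^dis_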
Because the drivers also appear as services, you can query their status with the usual tools of the installed operating system distribution. For example, on Red Hat-based Linux distributions, you can do
# service dis_irm status
Dolphin IRM 5.5.0 ( January 10th 2018 ) is running.
Dolphin provides a script dis_services that performs this task for all Dolphin services installed on a machine. It is used in the same way as the individual service command provided by the distribution:
# dis_services status
Dolphin KOSIF 5.5.0 is running
Dolphin IX 5.5.0 ( January 10th 2018 ) is running.
Dolphin IRM 5.5.0 ( January 10th 2018 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 5.5.0 ( January 10th 2018 ) is running.
Dolphin SuperSockets 5.5.0 "Express Train", January 10th 2018 (built January 10th 2018) running.
If any of the required services is not running, you will find more information on the problem in the system log facilities. Call dmesg to inspect the kernel messages, and check /var/log/messages for related messages.
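For example, to narrow down the output to driver-related messages (a sketch, assuming the messages carry the dis_ module prefix or the vendor name):

# dmesg | grep -i dis_
# grep -i dolphin /var/log/messages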
To ensure that the cluster is cabled correctly, please perform the PCIe connection test as described in Chapter 4, Initial Installation, Section 3.8.4, “PCIe connection Test”.
The static interconnect test makes sure that all PCIe communication hardware is working correctly by performing a self-test, and determines whether the setup and the PCIe routing are correct (match the actual hardware topology). It also checks all PCIe connections, but this has already been done in the PCIe Connection Test. The tool to perform this test is dis_diag (default location /opt/DIS/sbin/dis_diag).
Running dis_diag on a Cluster Node performs a self-test on the local adapter(s) and lists all remote adapters that this adapter can see via the PCI Express interconnect. This means that to perform the static interconnect test on a full cluster, you basically need to run dis_diag on each Cluster Node and check whether any problems with the adapter are reported, and whether the adapters in each Cluster Node can see all remote adapters installed in the other Cluster Nodes.
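On a larger cluster this is conveniently done from a single machine with a small loop over all Cluster Nodes. The following is only a sketch: it assumes password-less SSH access and uses hypothetical hostnames, so adjust the node list and the dis_diag path to your installation:

for node in node-01 node-02 node-03 node-04; do
    echo "===== $node ====="
    ssh $node /opt/DIS/sbin/dis_diag
done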
Normally you should invoke dis_diag with no arguments, and it will do a general test and only show the most interesting information. Advanced users may want to enable the full verbose mode by using the -V 9 command line option:
dis_diag -V 9
The -V 9 option generates a lot of information, and some of it requires knowledge about the PCIe chipset and the PCIe specification in general. The diagnostic module collects various usage and error information over time. This information can be cleared by using the -clear command line option:
dis_diag -clear
An example output of dis_diag for a Cluster Node that is part of a 19-node cluster using one adapter per Cluster Node looks like this:
[root@Dellix-01 ~]# dis_diag
================================================================================
 Dolphin diagnostic tool -- dis_diag version 5.5.0 ( Thu Jan 10 13:47:42 CET 2018 )
================================================================================
dis_diag compiled in 64 bit mode

Driver : Dolphin IRM (GX) 5.5.0.0 Jan 10th 2018 (rev 38956)
Date   : Wed Feb 1 15:23:15 CET 2018
System : Linux Dellix-01 3.10.0-327.36.1.el7.x86_64 #1 SMP Sun Sep 18 13:04:29 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Number of configured local adapters found: 1

Adapter 0 > Type                     : IXH610
            NodeId                   : 4
            Serial number            : IXH610-DE-410785
            IXH chip family          : IDT_LOGAN
            IXH chip vendorId        : 0x111d
            IXH chip device          : 0x8091
            IXH chip revision        : 0x2 (ZC)
            EEPROM version NTB mode  : 27
            EEPROM vendor info       : 0x0000
            EEPROM swmode[3:0]       : 1100
            EEPROM images            : 01
            Card revision            : DE
            Topology type            : Switch
            Topology Autodetect      : No
            Number of enabled links  : 1
            Upstream Cable DIP-sw    : ON
            Upstream Edge DIP-sw     : OFF
            EEPROM-select DIP-sw     : OFF
            Link LED yellow          : OFF
            Link LED green           : ON
            Cable link state         : UP
            Max payload size (MPS)   : 128
            Multicast group size     : 2 MB
            Prefetchable memory size : 512 MB (BAR2)
            Non-prefetchable size    : 64 MB (BAR4)
            Clock mode slot          : Port
            Clock mode link          : Global
            PCIe slot state          : x8, Gen2 (5 GT/s)
            PCIe slot capabilities   : x8, Gen2 (5 GT/s)

************************* IXH ADAPTER 0 LINK 0 STATE *************************
Link 0 uptime         : 522116 seconds
Link 0 state          : ENABLED
Link 0 state          : x8, Gen2 (5 GT/s)
Link 0 required       : x8, Gen2 (5 GT/s)
Link 0 capabilities   : x8, Gen2 (5 GT/s)
Link 0 cable inserted : 1
Link 0 active         : 1
Link 0 configuration  : NTB

**************************** IXH ADAPTER 0 STATUS ****************************
Chip temperature : 79 C

*************** IXH ADAPTER 0, PARTNER INFORMATION FOR LINK 0 ***************
Partner board type      : IXS600
Partner switch no       : SUB5, Port 6
Partner number of ports : 8

***************************** TEST OF ADAPTER 0 *****************************
OK: IXH chip alive in adapter 0.
OK: Link alive in adapter 0.

==> Local adapter 0 ok.

************************ TOPOLOGY SEEN FROM ADAPTER 0 ************************
Adapters found: 19

----- List of all nodes found:
Nodes detected: 0004 0008 0012 0016 0020 0024 0028 0032 0036 0040 0044
Nodes detected: 0048 0052 0056 0060 0064 0068 0072 0076

*********************** SESSION STATUS FROM ADAPTER 0 ***********************
Node  4: Session valid
Node  8: Session valid
Node 12: Session valid
Node 16: Session valid
Node 20: Session valid
Node 24: Session valid
Node 28: Session valid
Node 32: Session valid
Node 36: Session valid
Node 40: Session valid
Node 44: Session valid
Node 48: Session valid
Node 52: Session valid
Node 56: Session valid
Node 60: Session valid
Node 64: Session valid
Node 68: Session valid
Node 72: Session valid
Node 76: Session valid

----------------------------------
dis_diag discovered 0 note(s).
dis_diag discovered 0 warning(s).
dis_diag discovered 0 error(s).

TEST RESULT: *PASSED*
[root@Dellix-01 ~]#
The static interconnect test passes if dis_diag delivers TEST RESULT: *PASSED* and reports the same topology (remote adapters) on all Cluster Nodes.
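To compare the results of all nodes quickly, you can filter the output on each node down to the topology summary and the test verdict; a sketch, reusing the hypothetical node list from above:

for node in node-01 node-02 node-03 node-04; do
    echo "===== $node ====="
    ssh $node /opt/DIS/sbin/dis_diag | grep -E "Adapters found|Nodes detected|TEST RESULT"
done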
While the static interconnect test sends only a few packets over the links to probe remote nodes, the Interconnect Load Test puts significant stress on the interconnect and observes whether any data transmissions have to be retried due to link errors. This can happen if cables are not connected correctly, e.g. plugged in without the connector latches locking properly. Before running this test, make sure your cluster is connected and configured correctly by running the tests described in the previous sections.
This test can be performed from within the Dolphin dis_admin GUI tool. Please refer to Appendix B, dis_admin Reference for details.
Once the correct installation and setup and the basic functionality of the interconnect have been verified, it is possible to perform a set of low-level benchmarks to determine the baseline performance of the interconnect without any additional software layers. The relevant tests are scibench2 (streaming remote memory PIO access performance), scipp (request-response remote memory PIO write performance), dma_bench (streaming remote memory DMA access performance) and intr_bench (remote interrupt performance).
All these tests need to run on two Cluster Nodes (A and B) and are started in the same manner:
Determine the NodeId of both Cluster Nodes using the query command (default path /opt/DIS/bin/query). The NodeId is reported as "Local node-id".
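For example, to show just that line (a sketch; the exact output format of query may differ between versions):

$ /opt/DIS/bin/query | grep "Local node-id"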
On node A, start the server-side benchmark with the options -server and -rn <NodeId of B>, like:
$ scibench2 -server -rn 8
On Cluster Node B, start the client-side benchmark with the options -client and -rn <NodeId of A>, like:
$ scibench2 -client -rn 4
The test results are reported by the client.
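The other benchmarks are started with the same options; a sketch, reusing the example NodeIds 4 (node A) and 8 (node B) from above:

On node A (NodeId 4):
$ scipp -server -rn 8
$ dma_bench -server -rn 8
$ intr_bench -server -rn 8

On node B (NodeId 8):
$ scipp -client -rn 4
$ dma_bench -client -rn 4
$ intr_bench -client -rn 4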
dma_bench measures the streaming bandwidth using DMA.
The following results were measured using the IXH610 card (Gen2, x8):
-------------------------------------------------------------------------------
  Message      Total    Vector   Transfer      Latency          Bandwidth
  size         size     length   time          per message
-------------------------------------------------------------------------------
       64      16384       256    156.02 us      0.61 us     105.01 MBytes/s
      128      32768       256    164.63 us      0.64 us     199.04 MBytes/s
      256      65536       256    175.28 us      0.68 us     373.90 MBytes/s
      512     131072       256    196.96 us      0.77 us     665.47 MBytes/s
     1024     262144       256    242.32 us      0.95 us    1081.80 MBytes/s
     2048     524288       256    336.01 us      1.31 us    1560.34 MBytes/s
     4096     524288       128    260.19 us      2.03 us    2015.01 MBytes/s
     8192     524288        64    223.41 us      3.49 us    2346.70 MBytes/s
    16384     524288        32    205.15 us      6.41 us    2555.64 MBytes/s
    32768     524288        16    195.78 us     12.24 us    2677.93 MBytes/s
    65536     524288         8    191.15 us     23.89 us    2742.81 MBytes/s
   131072     524288         4    188.83 us     47.21 us    2776.46 MBytes/s
   262144     524288         2    187.60 us     93.80 us    2794.73 MBytes/s
   524288     524288         1    187.06 us    187.06 us    2802.70 MBytes/s
The scipp SISCI benchmark sends a message of the specified size to the remote system. The remote system polls for incoming data and sends a similar message back to the first node.
The minimal round-trip latency for writing to remote memory is extremely low on PCI Express networks. The following results are typical for a PCI Express Gen2 x8 link (IXH610); the latency column shows the measured round-trip time, and latency/2 is the derived one-way latency:
Ping Pong data transfer:
    size    retries   latency (usec)   latency/2 (usec)
       0       3201            1.425              0.713
       4       3154            1.442              0.721
       8       3175            1.445              0.722
      16       3179            1.447              0.724
      32       3213            1.464              0.732
      64       3340            1.535              0.768
     128       3448            1.573              0.787
     256       3558            1.672              0.836
     512       3994            1.841              0.920
    1024       4468            2.162              1.081
    2048       5579            2.889              1.444
    4096       7260            4.323              2.162
    8192      11148            7.122              3.561
The interrupt latency is affected by the operating system and can therefore vary.
Average unidirectional interrupt time : 2.515 us.
Average round trip interrupt time     : 5.030 us.
To gather all relevant low-level performance data in one go, the script sisci_benchmarks.sh can be called in the same way. It will run all of the described tests.
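A sketch, again reusing the example NodeIds 4 and 8 and assuming the script has been installed alongside the other benchmark binaries:

On node A (NodeId 4):
$ sisci_benchmarks.sh -server -rn 8

On node B (NodeId 8):
$ sisci_benchmarks.sh -client -rn 4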