Network Performance on Nehalem Servers

Introduction

Today Sun announced several Intel Nehalem-based servers. I had early access to a 2-socket Lynx 2 (Sun Fire X4270 server) as well as two Virgo (Sun Blade X6270 server modules) to study network performance. For a quick introduction, Lynx 2 is a 2-socket 2U server; Virgo is a 2-socket server blade module that can be used in the Sun Blade 6000 and 6048 chassis. I tested Virgo in a Sun Blade 6000. Both Lynx 2 and Virgo support a maximum of 144 GB of memory. Each socket is a quad-core Nehalem. With Hyper-Threading turned on in the BIOS, the operating system sees 16 virtual CPUs. I had Hyper-Threading turned on for all my tests.
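
One quick way to confirm what the operating system sees (just a sanity check, not part of any tuning) is psrinfo:

# count virtual CPUs visible to Solaris; expect 16 with Hyper-Threading on
psrinfo | wc -l
# show physical sockets and the virtual CPUs behind each of them
psrinfo -pv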

Lynx 2 has 4 on-board Gigabit Ethernet ports (code name ‘Zoar’) and 6 PCI Express slots that accept PCIe 1.1 or 2.0 cards. The Solaris driver for the on-board Gigabit Ethernet is called igb. I also tested the Sun 10GbE card with the Intel 82598EB 10 Gigabit Ethernet Controller (code name ‘Oplin’; part number X1106A-z for the single-port version, X1107A-z for the dual-port version) for 10 Gigabit Ethernet performance. Oplin is PCIe 1.1 compliant. Its Solaris driver is called ixgbe, available since Solaris 10 Update 6 and OpenSolaris 2008.11. The latest Linux and Windows drivers for Oplin can be downloaded from Intel.
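
Once the cards are in and the drivers attached, the ports show up as igb and ixgbe datalinks. A minimal check, assuming default instance numbering:

# list the datalinks Solaris knows about; the on-board ports attach as
# igb0-igb3 and the Oplin ports as ixgbe0, ixgbe1, ...
dladm show-link
# negotiated speed/duplex on the first 10GbE port (Solaris 10);
# on Nevada builds, dladm show-phys gives the same information
dladm show-dev ixgbe0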

Virgo has 2 on-board Gigabit Ethernet ports, and can either take up to 2 PCI ExpressModules (EMs) for dedicated I/O or up to 2 Network Express Modules (NEMs) for shared I/O. I tested the Sun Dual 10GbE ExpressModule with the Intel 82598EB 10GbE controller (code name ‘Oplin EM’; part number X1108A-z) for dedicated 10 Gigabit Ethernet performance, and the Sun Blade 6000 Virtualized Multi-Fabric 10GbE NEM (code name ‘NEMHydra’) for shared 10 Gigabit Ethernet performance.

Placement of 10 Gigabit Ethernet Cards or ExpressModules

Although the Nehalem servers support PCIe 2.0, Oplin is an x8 PCIe 1.1 card, so the rated bandwidth is 16 Gbit/s per card (across its 2 ports) in each direction. After protocol overhead, the measured bandwidth is ~12 Gbit/s in each direction per card. In comparison, two 10GbE ports on separate cards can achieve line rate. So for maximum throughput over multiple 10GbE ports, use 2N single-port Oplin cards rather than N dual-port Oplin cards. The same applies to Oplin EM: maximum two-port throughput is achieved with two Oplin EMs, one port per EM, not with the two ports of a single Oplin EM.
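
The 16 Gbit/s figure follows directly from the PCIe 1.1 signaling rate of 2.5 GT/s per lane with 8b/10b encoding:

2.5 GT/s x 8/10 encoding = 2 Gbit/s usable per lane
2 Gbit/s x 8 lanes = 16 Gbit/s per direction per x8 card

which is why a dual-port card tops out well short of 2 x 10 Gbit/s, while two single-port cards do not share that ceiling.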

If you use a PCIe 2.0 10GbE card on Nehalem servers, the usable bandwidth is 4 Gbit/s per lane instead of 2, so it is possible to achieve line rate with dual-port cards.

Lynx 2 has PCIe slots on both its active and passive risers. You can place 10GbE cards in any slot and will not see a material performance difference.

Tuning ixgbe on Solaris

The ixgbe driver supports both Oplin and the Oplin ExpressModule. Hardware LSO is enabled by default. For Solaris Nevada build 110 or later, only the bcopy threshold and the number of MSI-X interrupts per port need to be tuned:

/etc/system
set ddi_msix_alloc_limit=4
/kernel/drv/ixgbe.conf
tx_copy_threshold=1024;
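
The /etc/system change takes effect after a reboot; one way to double-check the value the running kernel is using is mdb:

# print the current value of ddi_msix_alloc_limit in decimal
echo "ddi_msix_alloc_limit/D" | mdb -k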

Multiple transmit DMA channels are supported in Solaris Nevada but not yet in Solaris 10, so there is more contention for the transmit DMA channel on S10 with a large number of connections. For Solaris 10 Update 7, more tunables are required to target interrupts at different CPUs (apic_intr_policy) and to allow more MSI-X interrupts per port:

/etc/system
set ddi_msix_alloc_limit=4
set pcplusmp:apic_multi_msi_max=4
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_intr_policy=1
/kernel/drv/ixgbe.conf
tx_copy_threshold=1024;
rx_queue_number=4;
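
After rebooting with these settings, it is worth confirming that the ixgbe interrupts really are spread across CPUs rather than piling onto one; intrstat shows per-device interrupt activity broken down by CPU:

# sample interrupt activity every 5 seconds and watch the ixgbe rows
intrstat 5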

There was a bug in ixgbe for Solaris 10 Update 6 that prevents multiple receive DMA channels from working correctly, so the only recommended tuning on S10U6 is:

/kernel/drv/ixgbe.conf
tx_copy_threshold=1024;
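
Because the behavior differs between driver builds, it is worth checking which ixgbe module is actually loaded before tuning:

# show the loaded ixgbe kernel module and its version string
modinfo | grep ixgbe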

Oplin Performance on Lynx 2

I tested a 2.66 GHz Lynx 2 with 24 GB of main memory on Solaris Nevada build 106. Oplin can transmit or receive 64 KByte messages at line rate over one socket connection. Scaling is examined in two ways: with the number of ports and with the number of connections. The maximum throughput using 4 Oplin cards, one port per card, is 36.3 Gbit/s TCP transmit and 32.1 Gbit/s TCP receive, so throughput scales very well to 4 ports.

The data below shows that throughput scales well with the number of connections for TCP transmit, and less so for TCP receive. (A sketch of the kind of uperf profile behind these runs appears after the tables.)

TCP TX tests using uperf with msg size = 8192 bytes

#conn wndsz Mbps cpu% (usr/sys)
8 256k 31981.64 41 (1/40)
32 256k 35972.53 59 (2/57)
100 256k 36380.51 67 (2/65)
1000 32k 34716.86 97 (3/94)
4000 32k 31180.33 94 (3/91)

TCP RX tests using uperf with msg size = 8192 bytes

#conn wndsz Mbps cpu% (usr/sys)
8 256k 23947.91 61 (2/58)
32 256k 32127.24 100 (3/96)
100 256k 22793.04 100 (1/99)
1000 32k 22844.36 100 (3/97)
4000 32k 22079.65 100 (3/97)
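
The TCP throughput numbers above come from uperf driving many connections in parallel. A minimal sketch of that style of streaming profile, written as a shell snippet; the peer address 192.168.1.2, the file name, and the 100-connection / 256k-window choice are placeholders matching one row of the table above:

# on the receiver, start the uperf slave
uperf -s &

# on the transmitter, generate a bulk TCP transmit profile and run it
cat > tcp_tx.xml <<'EOF'
<?xml version="1.0"?>
<profile name="TCP_TX">
  <group nthreads="100">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=192.168.1.2 protocol=tcp wndsz=256k"/>
    </transaction>
    <transaction duration="120s">
      <flowop type="write" options="size=8192"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
EOF
uperf -m tcp_tx.xml

Reversing the roles (or using a read flowop) gives the receive case, and uperf can drive UDP the same way with protocol=udp.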

For UDP traffic, 4 ports can transmit an impressive 4.3 million 64-byte-payload packets per second, or 1460-byte datagrams at 25 Gbit/s. TCP throughput is higher than UDP because Solaris supports Large Segment Offload (LSO) for TCP, but not for UDP at this time.

To measure latency, I connected Lynx 2 to Virgo back-to-back and ran NetPIPE. Single-thread round-trip latency for small TCP packets was measured at 25 microseconds.
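
NetPIPE's TCP test is a simple two-ended run; a sketch of the invocation, with the receiver hostname as a placeholder:

# on the receiving side
NPtcp
# on the transmitting side; NetPIPE sweeps message sizes and reports
# round-trip time for the smallest ones
NPtcp -h receiver-host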

Oplin Express Module Performance on Virgo

I tested a 2.8 GHz Virgo with 48 GB of memory on Solaris Nevada build 111. Oplin EM can transmit or receive 64 KByte messages at line rate over one socket connection. Throughput remains at line rate as the number of connections increases for 1 port. Two-port scaling with the number of connections is shown below.

TCP TX tests using uperf with msg size = 8192 bytes

#conn wndsz Mbps cpu% (usr/sys)
8 256k 16533.43 25 (0/25)
32 256k 18266.85 34 (1/34)
100 256k 18354.36 37 (1/36)
1000 32k 18478.93 51 (2/49)
4000 32k 18296.30 64 (2/63)

TCP RX tests using uperf with msg size = 8192 bytes

#conn wndsz Mbps cpu% (usr/sys)
8 256k 18034.17 52 (2/50)
32 256k 18369.19 69 (3/66)
100 256k 18324.59 81 (3/78)
1000 32k 18425.31 91 (4/87)
4000 32k 17670.74 100 (5/95)

For UDP traffic, 2 ports can transmit 3.9 million 64-byte packets per second, a little higher than 1 port at 3.5 million packets per second. UDP transmit throughput with 1460-byte datagrams is 18.7 Gbit/s.

Summary

Single-thread throughput can reach 9+ Gbit/s on Nehalem servers. Lynx 2 can scale throughput to four 10GbE ports, and Virgo can scale to two 10GbE ports near line rate on Solaris. They provide excellent network performance for web and HPC applications.
