Out-of-box Network Performance on Nehalem servers

With the upcoming OpenSolaris 2009.06 release, how does networking perform on Intel Nehalem servers versus Linux? A question customers often ask, after characterizing their workload and finding that networking may be the limiting factor, is "Can a system achieve X Gbit/s?" or "Can a system send/receive Y packets/sec?" I'll take a look at out-of-box performance using the micro-benchmark tool uperf, which shows the server's capabilities.

Benchmark Considerations

Intel Nehalem servers have sufficient CPU cycles and I/O bandwidth to drive one 10GbE port at line rate for large packet sizes. To see a meaningful difference between Solaris and Linux, I chose a four-port configuration that can show differences in throughput. When PCIe gen. 2 10 Gigabit Ethernet is widely available, the four ports can come from two NICs and take only two PCIe slots. Since I use PCIe gen. 1 10 Gigabit Ethernet, the four ports come from four NICs, to keep PCIe bandwidth from becoming the bottleneck.

Which Linux?

SUSE Linux Enterprise Server 11 was chosen because SUSE is widely used and it is based on a more recent Linux kernel than the latest Red Hat distribution. Compared to SLES 10 or RHEL 5.3, SLES 11 includes TX multiqueue support, which helps TX scale better with more connections. For simplicity, no tuning is done for either Linux or OpenSolaris. Tuning Linux would improve some results.

Test Results

Nehalem TX throughput

Nehalem RX throughput

Engineering Notes

Solaris 10 Update 7 throughput drops significantly with more connections because of a high rate of interrupts. Tuning to spread interrupts across more CPUs:

/etc/system
set ddi_msix_alloc_limit=4
set pcplusmp:apic_intr_policy=1
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_multi_msi_max=4
/kernel/drv/ixgbe.conf
rx_queue_number=8;

brings performance more in line with OpenSolaris: TCP RX throughput more than doubles to 20.6 Gbit/s, and TCP TX throughput reaches 26.6 Gbit/s at 1000 connections.
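
To confirm that the interrupts are indeed spread across CPUs after this tuning, intrstat (the same tool used in the UltraSPARC T2 post below) gives a quick per-CPU view of interrupt load; this is only a sanity check, not part of the benchmark itself:

intrstat 5

Each ixgbe port should show its MSI-X interrupts landing on several CPUs rather than concentrating on one.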

Uperf Profiles

uperf uses XML to describe the workload. The profile that describes TCP TX throughput is here. The profile that describes TCP RX throughput is here. The profiles are written for four clients; if a different number of clients is used, modify the number of groups to match the number of clients.
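
Since the linked profiles are not reproduced in this post, here is a rough sketch of what one client group of the TCP TX profile looks like (the thread counts, window sizes, and durations below are placeholders rather than the exact values used in these tests; $h is the remote host, substituted from the environment). The RX profile is analogous, with a read flowop in place of the write:

<?xml version="1.0"?>
<profile name="tcp-tx">
  <group nthreads="25">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$h protocol=tcp wndsz=256k"/>
    </transaction>
    <transaction duration="120s">
      <flowop type="write" options="size=8k"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>

Run uperf -s on each client and uperf -m tcp-tx.xml on the SUT to drive the workload.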

Test Set-up

  • SUT: Sun Fire X4270, dual-socket Intel Nehalem @ 2.66 GHz. Hyper-threading is on (i.e. changed from the BIOS default). 24 GB of memory.
  • OS: OpenSolaris 2009.06; Solaris 10 5/09 (Update 7); SUSE Linux Enterprise 11 (2.6.27.19-5 kernel)
  • Four X1106A-Z Sun 10 GbE cards with the Intel 82598EB 10 GbE controller installed on the SUT. ixgbe driver version 1.3.56.17-NAPI compiled on SUSE 11.
  • Clients: Four Sun Fire X4150, dual-socket Intel Xeon @ 3.16 GHz, 8 GB main memory. Three of the X4150s run Solaris Nevada build 109 with the Sun Multithreaded 10 GbE Networking Card. One X4150 runs SUSE 10 SP2 with an X1106A-Z Sun 10 GbE card (Intel 82598EB 10 GbE controller). Each 10 GbE port is connected back-to-back with one 10GbE port on the SUT.
    • Tuning for Solaris clients:
      /etc/system
      set ddi_msix_alloc_limit=8
      set pcplusmp:apic_multi_msi_max=8
      set pcplusmp:apic_msix_max=8
      set pcplusmp:apic_intr_policy=1
      set nxge:nxge_msi_enable=2
      /kernel/drv/nxge.conf
      soft-lso-enable = 1;
Posted in Sun | 2 Comments

Network Performance on Nehalem servers

Introduction

Today Sun announced several Intel Nehalem-based servers. I had early access to a (2-socket) Lynx 2 (Sun Fire X4270 server) as well as two Virgo (Sun Blade X6270 server modules) to study network performance. For a quick introduction, Lynx 2 is a 2-socket 2U server. Virgo is a 2-socket server blade module that can be used in the Sun Blade 6000 and 6048 chassis; I tested Virgo in a Sun Blade 6000. Both Lynx 2 and Virgo support a maximum of 144 GB of memory. Each socket is a quad-core Nehalem. With Hyper-Threading turned on in the BIOS, the operating system sees 16 virtual CPUs. I had Hyper-Threading turned on for all my tests.

Lynx 2 has 4 Gigabit Ethernet ports (code name ‘Zoar’) on-board, and 6 PCI Express slots that accept PCIe 1.1 or 2.0 cards. The Solaris driver for the on-board Gigabit Ethernet is called igb. I also tested the Sun 10 GbE card with the Intel 82598EB 10 Gigabit Ethernet controller (code name ‘Oplin'; part number X1106A-z for the single-port version, X1107A-z for the dual-port version) for 10 Gigabit Ethernet performance. Oplin is PCIe 1.1 compliant. Its Solaris driver is called ixgbe, which has been available since Solaris 10 Update 6 and OpenSolaris 2008.11. The latest Linux and Windows drivers for Oplin can be downloaded from Intel.

Virgo has 2 Gigabit Ethernet ports on-board, and can either plug in up to 2 PCI ExpressModules (EMs) to expand dedicated I/O, or add up to 2 Network Express Modules (NEMs) to expand common I/O. I tested the Sun Dual 10GbE ExpressModule with the Intel 82598EB 10GbE controller (code name ‘Oplin EM'; part number X1108A-z) for dedicated 10 Gigabit Ethernet performance, and the Sun Blade 6000 Virtualized Multi-Fabric 10GbE NEM (code name ‘NEMHydra’) for shared 10 Gigabit Ethernet performance.

Placement of 10 Gigabit Ethernet Cards or ExpressModules

Although the Nehalem servers support PCIe 2.0, Oplin is an x8 PCIe 1.1 card, so the rated bandwidth is 16 Gbit/s per card (shared by its 2 ports) in each direction. After protocol overhead, the measured bandwidth is ~12 Gbit/s in each direction per card. In comparison, two 10GbE ports on separate cards can achieve line rate. So for maximum throughput over multiple 10GbE ports, use 2N single-port Oplin cards rather than N dual-port Oplin cards. This also applies to Oplin EM: maximum two-port throughput is achieved with two Oplin EMs, one port per EM, not with the dual ports of one Oplin EM.

If you use a PCIe 2.0 10GbE card on Nehalem servers, the usable bandwidth is 4 Gbit/s per lane instead of 2, so it is possible to achieve line rate with dual-port cards.
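
For reference, the back-of-the-envelope arithmetic behind these numbers (the encoding figures are standard PCI Express facts, not measurements from this test):

PCIe 1.1: 2.5 GT/s per lane x 8b/10b encoding = 2 Gbit/s per lane; x8 lanes = 16 Gbit/s raw per direction
PCIe 2.0: 5.0 GT/s per lane x 8b/10b encoding = 4 Gbit/s per lane; x8 lanes = 32 Gbit/s raw per direction

After packet/protocol overhead, roughly 12 Gbit/s per direction is usable on an x8 gen. 1 card, which is why two 10GbE ports on one gen. 1 card cannot both run at line rate, while an x8 gen. 2 card has headroom for both.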

Lynx 2 has PCIe slots on both the active riser and the passive riser. You can place 10GbE cards in any slot and won’t see a material performance difference.

Tuning ixgbe on Solaris

The ixgbe driver supports both Oplin and the Oplin ExpressModule. Hardware LSO is enabled by default. For Solaris Nevada 110 or later, only the bcopy threshold and the number of MSI-X interrupts per port need to be tuned.

/etc/system
set ddi_msix_alloc_limit=4
/kernel/drv/ixgbe.conf
tx_copy_threshold=1024;

Multiple transmit DMA channels are supported in Solaris Nevada but not in Solaris 10 yet, so there is more contention for the transmit DMA channel on S10 with a large number of connections. For Solaris 10 Update 7, more tunables are required to target interrupts at different CPUs (apic_intr_policy) and to allow more MSI-X interrupts per port.

/etc/system
set ddi_msix_alloc_limit=4
set pcplusmp:apic_multi_msi_max=4
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_intr_policy=1
/kernel/drv/ixgbe.conf
tx_copy_threshold=1024;
rx_queue_number=4;

There was a bug in ixgbe for Solaris 10 Update 6 that prevented multiple receive DMA channels from working correctly, so the only recommended tuning on S10U6 is:

/kernel/drv/ixgbe.conf
tx_copy_threshold=1024;

Oplin Performance on Lynx 2

I tested a 2.66 GHz Lynx 2 with 24 GB of main memory on Solaris Nevada 106. Oplin can transmit or receive 64 KByte messages at line rate over one socket connection. Scaling is examined in two ways: with the number of ports and with the number of connections. The maximum throughput using 4 Oplin cards, one port per card, is 36.3 Gbit/second TCP transmit and 32.1 Gbit/second TCP receive on Solaris Nevada build 106, so throughput scales very well to 4 ports.

The data below show that throughput scales well with the number of connections for TCP transmit, and less so for TCP receive.

TCP TX tests using uperf with msg. size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 31981.64 41 (1/40)
32 256k 35972.53 59 (2/57)
100 256k 36380.51 67 (2/65)
1000 32k 34716.86 97 (3/94)
4000 32k 31180.33 94 (3/91)

TCP RX tests using uperf with msg. size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 23947.91 61 (2/58)
32 256k 32127.24 100 (3/96)
100 256k 22793.04 100 (1/99)
1000 32k 22844.36 100 (3/97)
4000 32k 22079.65 100 (3/97)

For UDP traffic, 4 ports can transmit an impressive 4.3 million 64 byte payload packets/second, or 1460 byte datagrams at 25 Gbit/second. TCP throughput is higher than UDP because Solaris supports Large Segment Offload for TCP, but not for UDP at this time.
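
For readers who want to relate packet rate and throughput, the conversion is plain arithmetic on the numbers above:

25 Gbit/s / 8 = 3.125 GBytes/s; 3.125e9 / 1460 bytes per datagram ~ 2.1 million datagrams/sec
4.3 million packets/sec x 64 bytes x 8 ~ 2.2 Gbit/s of payload

Small packets are limited by per-packet cost rather than by wire bandwidth, which is why the 64 byte rate is the more interesting number.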

To measure latency, I connected Lynx 2 to Virgo back-to-back and ran netpipe. Single-thread round-trip latency for small TCP packets is measured at 25 microseconds.

Oplin Express Module Performance on Virgo

I tested a 2.8 GHz Virgo with 48 GB of memory on Solaris Nevada 111. Oplin EM can transmit or receive 64 KByte messages at line rate over one socket connection. Throughput remains at line rate as the number of connections increases for 1 port. Two-port scaling with the number of connections is shown below.

TCP TX tests using uperf with msg size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 16533.43 25 (0/25)
32 256k 18266.85 34 (1/34)
100 256k 18354.36 37 (1/36)
1000 32k 18478.93 51 (2/49)
4000 32k 18296.30 64 (2/63)

TCP RX tests using uperf with msg size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 18034.17 52 (2/50)
32 256k 18369.19 69 (3/66)
100 256k 18324.59 81 (3/78)
1000 32k 18425.31 91 (4/87)
4000 32k 17670.74 100 (5/95)

For UDP traffic, 2 ports can transmit 3.9 million 64 byte packets/second, a little higher than 1 port at 3.5 million packets/second. UDP transmit throughput with 1460 byte datagrams is 18.7 Gbit/second.

Summary

Single-thread throughput can reach 9+ Gbit/second on Nehalem servers. Lynx 2 can scale throughput to four 10GbE ports, and Virgo to two 10GbE ports, at near line rate on Solaris. They provide excellent network performance for web and HPC applications.

Posted in Sun | Leave a comment

Monitoring packet distribution on Sun Multithreaded 10 Gigabit Ethernet

A 10 Gigabit Ethernet NIC relies on multiple transmit and receive DMA channels to achieve good performance. One performance problem I have encountered in the past is traffic imbalance on the receive side, leading to some CPUs becoming interrupt-bound. I have written a Perl script for the Sun Multithreaded 10GbE that displays the packet distribution across TX and RX DMA channels and helps spot the problem easily. The script uses the kstat counters provided by nxge. Usage:

nxge-kstat-channel.pl -i <instance#> -I <interval> 

For example, to see packet distribution on nxge0,

nxge-kstat-channel.pl -i 0

The script nxge-kstat-channel.pl is shown below:

: # *-*-perl-*-*
    eval 'exec perl -S $0 "$@"'
    if $running_under_some_shell;
use Getopt::Std;
getopts("i:I:");
sub usage
{
        # command usage list
        printf("Usage: nxge-kstat-channel.pl -i <instance#> -I <interval>\n");
        exit(-1);
}
if (!$opt_I) { $opt_I = 1;}
$cmd_tx = "kstat nxge:".$opt_i." | grep tdc_pack";
# print $cmd_tx;
$cmd_rx = "kstat nxge:$opt_i | grep rdc_pack";
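# take an initial snapshot of the per-channel packet counters so that the
# loop below can print rates (deltas per interval) instead of raw totals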
@lines = `$cmd_tx`;
for ($i=0; $i < scalar(@lines); ++$i) {
        @tmp = split /[ ]+/,$lines[$i];
        chop $tmp[1];
        $tx_pkts[$i] = $tmp[1];
}
@lines = `$cmd_rx`;
for ($i=0; $i < scalar(@lines); ++$i) {
        @tmp = split /[ ]+/,$lines[$i];
        chop $tmp[1];
        $rx_pkts[$i] = $tmp[1];
}
print "Packets/sec distribution on nxge".$opt_i."\n";
while(1) {
        @tx_lines = `$cmd_tx`;
        @rx_lines = `$cmd_rx`;
        print "TX: ";
        $nl=0;
        for ($i=0; $i < scalar(@tx_lines); ++$i) {
                @tmp = split /[ ]+/,$tx_lines[$i];
                chop $tmp[1];
                printf("Ring %2d: %-8d", $i, ($tmp[1] - $tx_pkts[$i]) / $opt_I);
                if (($i % 4) == 3) { print "\n    "; $nl=1;}
                $tx_pkts[$i] = $tmp[1];
        }
        if (!$nl) { print "\n"; }
        print "\r";
        print "RX: ";
        for ($i=0; $i < scalar(@rx_lines); ++$i) {
                @tmp = split /[ ]+/,$rx_lines[$i];
                chop $tmp[1];
                printf("Ring %2d: %-8d", $i, ($tmp[1] - $rx_pkts[$i]) / $opt_I);
                if (($i % 4) == 3) { print "\n    "; }
                $rx_pkts[$i] = $tmp[1];
        }
        print "\n";
        # sleep 1;
        # print "\nSleep $opt_i seconds ...\n";
        sleep $opt_I;
}

The script displays packet rates for all nxge interfaces if no arguments are given.

If the packet distribution across receive channels is not balanced, you can tune the receive load-balancing policy with ndd:

ndd -set /dev/nxge1 class_opt_ipv4_tcp <LB policy>
0x0010:         use MAC Port (for flow key)
0x0020:         use L2DA (for flow key)
0x0040:         use VLAN (for flow key)
0x0080:         use proto (for flow key)
0x0100:         use IP src addr (for flow key)
0x0200:         use IP dest addr (for flow key)
0x0400:         use Src Port (for flow key)
0x0800:         use Dest Port (for flow key)
0x0F80:         use IP 5-tuple (for flow key)
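
For example, to hash receive flows on the full IP 5-tuple (the value is just the combination of the flag bits listed above; substitute the nxge instance you are balancing):

ndd -set /dev/nxge1 class_opt_ipv4_tcp 0x0f80

A 5-tuple hash usually spreads many-connection workloads evenly across the receive DMA channels.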
Posted in Sun | 2 Comments

Network Performance on Open Storage Appliance

Sun Storage 7000 Unified Storage Systems launched this Monday. This is the industry’s first open storage appliance and will deliver breakthrough performance, speed and simplicity. I will share an inside view of how network performance is optimized for it. Most of the optimizations are generic and can be applied to other storage systems running Solaris.

1 GBytes/sec and Beyond

Early on there was the question of whether networking performance work should focus on 10GbE or GigE. If the system can sustain 1 GByte/sec of throughput or more, then it makes sense to focus on 10GbE performance first, and multiple-GigE performance will be easy. 1 GByte/sec is a nice round number to remember. Can it be done? Using uperf as a micro-benchmark tool to model the NFS workload, throughput on a 2-socket quad-core AMD Barcelona system can reach 900 MBytes/s, and a 4-socket quad-core Barcelona can reach 10GbE line rate, so the answer is ‘yes’. Results from uperf were later validated with NFS cached-read throughput that exceeds 1 GByte/sec. Later work focused on reducing CPU utilization and latency while not missing the 1 GByte/sec mark.

A lot of hard work goes into reaching 1 GByte/sec on one NAS node. The key metrics from a network performance perspective are throughput, CPU efficiency, and latency. Packet rate is not a key metric because IOPS for storage are much lower than packet rates for networking. I’ll explain how the tuning decisions were made.

CPU Efficiency

Networking is CPU intensive, and leaving enough CPU cycles to do other work is important for a NAS system. CPU efficiency measures the amount of work done divided by the amount of CPU used to do it. If the same amount of work is done using less CPU, it is more CPU efficient and more desirable. CPU efficiency for networking can be measured in bytes/cycle, i.e. the number of bytes transferred divided by the CPU cycles consumed. Roch has written a DTrace script that measures the inverse of bytes/cycle, cycles/byte, and it is used to help make tuning decisions.
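
As a back-of-the-envelope illustration of the metric (the numbers are made up purely to show the arithmetic, not measurements from this work):

throughput   = 8 Gbit/s = 1e9 bytes/sec
CPU consumed = 50% of 16 CPUs at 2.3 GHz = 0.5 x 16 x 2.3e9 = 1.84e10 cycles/sec
cycles/byte  = 1.84e10 / 1e9 = 18.4

A tuning change that holds throughput constant while lowering cycles/byte is a win, since it leaves more CPU for the filesystem and protocol work the appliance has to do.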

Large Segment Offload

Large Segment Offload works by sending large buffers (buffer size can be up to 64 KBytes in Solaris) down the protocol stack and letting a lower layer fragment them into packets, instead of fragmenting at the IP layer. Hardware LSO relies on the network interface card to split the buffer into separate packets, while software LSO relies on the network driver to do the fragmenting. The 10GbE driver for Sun Storage 7000 Unified Storage Systems (called nxge) supports software LSO, which helps CPU efficiency and throughput on transmit. LSO is enabled by default for nxge, and may be incorporated later for GigE (e1000g).
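
On the appliance software LSO is already on by default, as noted above; on a general Solaris system with nxge, the switch lives in the driver configuration file (the same setting that appears in the UltraSPARC T2 Plus post below), for example:

/kernel/drv/nxge.conf
soft-lso-enable = 1;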

Jumbo Frames 

If your network infrastructure allows jumbo frames, enabling them helps CPU efficiency. Jumbo frames are enabled by default, as MTU negotiation between client and server should just work.

Interrupts vs. Soft Rings 

At 10GbE speed a NIC can generate more interrupts than a single CPU can handle. Solaris provides two mechanisms to offload the interrupt CPU so that it does not become a performance bottleneck: interrupting multiple CPUs using MSI/MSI-X (Message Signaled Interrupts and their successor, MSI-X), and soft rings. Interrupting multiple CPUs relies on the multiple DMA channels of a NIC to generate interrupts; the interrupts from each DMA channel can be directed to a different CPU, so packet receive processing is done in interrupt context on multiple CPUs. Soft rings don’t rely on the NIC having multiple DMA channels; instead, the interrupt service routine fans packets out to kernel threads for receive processing. Three combinations were compared for the NFS workload: MSI-X only, soft rings only, and MSI-X + soft rings. The results are as expected: MSI-X only provides the best CPU efficiency (~8% better) at similar throughput.
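
For reference, the two mechanisms correspond to tunables that appear elsewhere in these posts; they are shown together here only to make the comparison concrete, and the values are illustrative rather than the exact ones used in this study:

/etc/system
* spread receive interrupts over multiple CPUs via MSI-X
set ddi_msix_alloc_limit=8
* fan received packets out to soft ring kernel threads
set ip:ip_soft_rings_cnt=16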

Number of Transmit DMA Channels

One anomaly that stood out in our performance study is that CPU efficiency for nxge drops significantly as the number of clients increases. mpstat shows a significant increase in context switches, and CPU performance counters show HyperTransport utilization above the recommended 60%. Although multiple transmit DMA channels (aka TX rings, as they are ring buffers) are needed for maximum packet rate, reducing the number of TX rings reduces context switches and thread migration. Experiments show that 1 TX ring limits throughput, while 2 TX rings reach maximum throughput and improve CPU efficiency by up to 50%.

Configuration   conn   Gbit/s   cycles/byte
all TX rings    12     4.9      11.1
all TX rings    40     5.8      15.7
2 TX rings      12     6.3      9.0
2 TX rings      40     8.2      10.8

Latency 

Latency matters for storage clients because it can become a limiting factor for client throughput. One important component of network latency is interrupt blanking: instead of interrupting the CPU immediately, a NIC can delay generating an interrupt so that if more packets arrive during that interval, they can be processed together, which is more CPU efficient. Reducing the interrupt interval to the minimum value for nxge (~25us) improves throughput by up to 30% for 8 or fewer connections; CPU cost (cycles/byte) can increase by up to 10%.

Posted in Sun | Leave a comment

10 Gigabit Ethernet on UltraSPARC T2 Plus

10 Gigabit Ethernet on UltraSPARC T2 Plus Systems

Overview 

Sun has launched the industry’s first dual-socket chip multi-threading technology (CMT) platforms: a new family of virtualized servers based on the UltraSPARC T2 Plus (Victoria Falls) processor. The Sun SPARC Enterprise T5140 and T5240 servers, code-named "Maramba", are Sun’s third-generation CMT systems, extending the product family beyond the previously announced Sun SPARC Enterprise T5120 and T5220 servers, with which they share numerous features.

The UltraSPARC T2 Plus based systems have the Neptune chip integrated on-board for flexible GbE/10GbE deployment. 4 Gigabit Ethernet ports are ready to use out of the box. If one optional XAUI card is added, there will be one 10 Gigabit Ethernet port and 3 of the 4 Gigabit Ethernet ports available. When two XAUI cards are added, there will be two 10 Gigabit Ethernet ports and 2 Gigabit Ethernet ports available.

The two XAUI slots connect to the same x8 PCI Express switch, so aggregate throughput from both XAUI cards is limited by PCI Express bandwidth. According to the PCI Express 1.1 specification, bandwidth for payload on x8 lanes is 16 Gbit/s per direction. In reality, single x8 PCI Express switch bandwidth is about 12 Gbit/s transmit, 13 Gbit/s receive, and 18.2 Gbit/s bi-directional, measured on a T5240 under Solaris 10 Update 5. Since T5140/T5240 systems have two PCI Express switches, to get maximum aggregate throughput with two 10GbE ports the recommended configuration is as follows: one XAUI card, plus one Atlas (Sun Dual 10GbE PCI Express Card) on a different PCI Express switch, which would be one of PCI Express slots 1, 2, or 3 (T5240 only).

Software Large Segment Offload

Large segment offload (LSO) is an offloading technique to increase transmit throughput by reducing CPU overhead. The technique is also called TCP segmentation offload (TSO) when it applies to TCP. It works by sending large buffers (buffer size can be up to 64 KBytes in Solaris) down the protocol stack and letting the lower layer fragment it into packets, instead of fragmenting at the IP layer. Hardware LSO relies on network interface card to split the buffer into separate packets, while software LSO relies on the network driver to fragment into packets.

The benefits of LSO are:

  • better single thread transmit throughput (+30% on T5240 for 64KB TCP message.)
  • better transmit throughput (+20% on multi-port T5240)
  • lower CPU utilization. (-10% on T5220 NIU)

Software LSO is a new feature in the nxge driver. It has been available in Solaris Nevada since build 82, is available as patch 138048 for Solaris 10 Update 5, and is part of Solaris 10 Update 6.

Tuning

The 10GbE ports require minimal tuning for throughput in Solaris 10 Update 7 or earlier. In /etc/system

set ip:ip_soft_rings_cnt=16

uses 16 soft rings per 10GbE port. If you don’t want to reboot the system, you can use ndd to change the number of soft rings

ndd -set /dev/ip ip_soft_rings_cnt 16

and plumb nxge for the tuning to take effect. The default for Solaris 10 Update 8 or later is 16 soft rings for 10GbE, so the tuning is no longer needed. You can read more about soft rings in my blog about 10GbE on UltraSPARC T2.

Software LSO is disabled by default. To maximize throughput and/or reduce CPU utilization, it should be enabled. To enable it, edit /platform/sun4v/kernel/drv/nxge.conf and uncomment the line

soft-lso-enable = 1;

Performance

The system used is a 1.4 GHz, 8-core T5240 with 128 GB of memory and 4 Sun Dual 10GbE PCI Express Cards, running Solaris 10 Update 4 with patches and with software LSO enabled. Performance is measured with uPerf 0.2.6.

T5240 10GbE throughput (single thread)

        TX (MTU 1500)   RX (MTU 1500)   TX (jumbo frame)   RX (jumbo frame)
Gbit/s  1.40            1.50            2.40               2.01

Maximum network throughput is achieved with multiple threads (i.e. multiple socket connections):

T5240 10GbE throughput (multiple threads)

# ports   metric          TX      RX
1         Gbit/s          9.30    9.46
          cpu util. (%)   24      25
2         Gbit/s          18.47   17.44
          cpu util. (%)   46      49
4         Gbit/s          21.78   20.44
          cpu util. (%)   77      83

There is no material difference between the performance of the on-board XAUI card and the Sun Multi-threaded 10GbE PCI Express Card.

4 GB DIMMs give higher receive performance and memory bandwidth

Different TCP receive throughput was observed on two different T5240s, and it was due to different memory bandwidth with different DIMM sizes. The 2 GB DIMMs currently shipping with T5240 have 4 banks, while 4 GB DIMMs have 8 banks, and more banks give more memory bandwidth. Comparing STREAM benchmark results between 16 x 2 GB and 32 x 4 GB memory configurations on the same T5240, the 4 GB DIMM configuration gives 18-25% higher memory bandwidth. The increased memory bandwidth from 4 GB DIMMs gives the TCP receive throughput above. Using 2 GB DIMMs, even with the same or higher memory interleave factor (16 x 2 GB or 32 x 2 GB vs. 16 x 4 GB), leads to a 30% throughput drop, to 14.2 Gbit/s for four 10GbE ports. So 4 GB DIMMs are recommended for maximum TCP receive throughput.

The size of DIMM and memory bandwidth have little impact on TCP transmit throughput: less than 2%.

Links

Index of blogs about UltraSPARC T2 Plus

T5140 architecture whitepaper

10 Gigabit Ethernet on UltraSPARC T2

Sun Multi-threaded 10GbE Tuning Guide

Neptune and NIU Hardware Specification

Posted in Sun | 6 Comments

10 Gigabit Ethernet on UltraSPARC T2

Tuning on-chip 10GbE (NIU) on T5120/T5220

The UltraSPARC T2 has a dual 10GbE Network Interface Unit (NIU) integrated on-chip. It requires minimal tuning for throughput. In /etc/system

set ip:ip_soft_rings_cnt=16

uses 16 soft rings per 10GbE port. If you don’t want to reboot the system, you can use ndd to change the number of soft rings

ndd -set /dev/ip ip_soft_rings_cnt 16

and plumb nxge for the tuning to take effect. Solaris 10 Update 8 or later uses 16 soft rings for 10GbE by default, so the tuning is no longer needed.

Soft rings are kernel threads that offload the processing of received packets from the interrupt CPU, preventing the interrupt CPU from becoming the bottleneck. The trade-off is increased latency from switching from the interrupt thread to a soft ring thread. If your workload is latency sensitive, you may want to see whether turning off soft rings helps meet your latency needs while still delivering the required packet rate or throughput.
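
To try the latency-sensitive case without soft rings, the same tunable can be dialed down to 0 before re-plumbing the port (treat this as a knob to experiment with and verify the latency and packet-rate effect on your own workload):

ndd -set /dev/ip ip_soft_rings_cnt 0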

Soft rings can be used with any GLDv3 network driver, such as nxge, e1000g or bge. A little-known trick is that soft rings can be configured on a per-port basis, so you can, for example, configure the NIU with 16 soft rings and the on-board 1GbE with 2 soft rings. To continue the example:

ndd -set /dev/ip ip_soft_rings_cnt 16
ifconfig nxge0 plumb
ndd -set /dev/ip ip_soft_rings_cnt 2
ifconfig e1000g1 plumb

Tuning Sun Multi-threaded 10GbE PCI-E NIC on T5120/T5220

If your T5120/T5220 has a Sun Multithreaded 10GbE PCI Express NIC plugged in, then two tunables are recommended for throughput in /etc/system:

set ip:ip_soft_rings_cnt=16
set ddi_msix_alloc_limit=8

ddi_msix_alloc_limit is a system-wide limit on how many MSI (Message Signaled Interrupts) or MSI-X interrupts a PCI device can allocate. The default is to allow a maximum of 2 MSIs per device. Since each 10GbE port of the Sun Multithreaded 10GbE PCI-E NIC has 8 receive DMA channels and each channel can generate one interrupt, a port can generate up to 8 interrupts. To avoid the interrupt CPU becoming the performance bottleneck, it is recommended to set ddi_msix_alloc_limit to 8 so that network receive interrupts can target 8 different CPUs. This tunable will become unnecessary with a patch to Solaris 10 Update 4.

Which CPU are taking interrupts?

If your application threads are pinned too often by interrupts and it becomes a problem, you can create a processor set to dedicate those CPUs to interrupt processing. To find out which CPUs are taking interrupts, you can use intrstat. In the intrstat output below, you can see CPUs 27-34 are interrupted by the NIU (niumx is the name of the nexus driver for the NIU):

...
device |     cpu24 %tim     cpu25 %tim     cpu26 %tim     cpu27 %tim
-------------+------------------------------------------------------------
niumx#0 |         0  0.0         0  0.0         0  0.0      2086 80.9
device |     cpu28 %tim     cpu29 %tim     cpu30 %tim     cpu31 %tim
-------------+------------------------------------------------------------
niumx#0 |      1885 82.4      2057 80.1      2189 77.2      2019 79.6
device |     cpu32 %tim     cpu33 %tim     cpu34 %tim     cpu35 %tim
-------------+------------------------------------------------------------
niumx#0 |      1993 81.8      2073 79.7      1948 81.7         0  0.0
...
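
As a sketch of the processor set idea mentioned above (the CPU IDs are simply the ones shown in this intrstat output; adjust them to your system), you could keep application threads off the interrupted CPUs with:

psrset -c 27 28 29 30 31 32 33 34

Unbound application threads then stay out of that set, while the NIU interrupts continue to be serviced on those CPUs.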

You can see interrupts from the Sun Multi-threaded 10GbE PCI-E NIC using the mdb command as well:

echo ::interrupts | mdb -k

but interrupts from NIU are not included at this time.

NIU or PCI-E 10GbE NIC?

UltraSPARC T2 comes with an on-chip dual 10GbE Network Interface Unit (NIU), but you can also plug a 10GbE NIC into a PCI-E slot. Here is a summary of performance features:

Features                         NIU        PCI-E NIC
# 10GbE ports                    2          2
# transmit DMA channels/port     8          12
# receive DMA channels/port      8          8
integrated on-chip?              US-T2      no
bus interface?                   no         8 lane PCI Express
bus bandwidth limit?             no         16 Gbits/s each direction*
transmit packet classification   software   software
receive packet classification    hardware   hardware
* Note: PCI Express Specification 1.1 specifies 2 Gbit/s payload per lane full-duplex, so 8 lanes can reach 8 x 2 = 16 Gbit/s. Actual bus bandwidth on T5120/S10U4 is measured at 12 Gbits/s from host to device.

The same driver, nxge, is used for both NIU and Sun Multi-threaded 10GbE PCI-E NIC. If you have both installed, you can tell which instance is NIU by looking into /etc/path_to_inst: the instance with /niu prefix is NIU.

$ grep nxge /etc/path_to_inst
"/pci@0/pci@0/pci@8/pci@0/pci@8/network@0" 2 "nxge"
"/pci@0/pci@0/pci@8/pci@0/pci@8/network@0,1" 3 "nxge"
"/niu@80/network@0" 0 "nxge"
"/niu@80/network@1" 1 "nxge"

In the example above, nxge0 and nxge1 are NIU, and nxge2 and nxge3 are PCI-E NIC.

You may wonder: is there a performance difference between NIU and the PCI-E NIC? The short answer: NIU wins in all micro-benchmarks except one, a single 10GbE port transmitting small UDP packets.

The Sun Multi-threaded 10GbE PCI-E NIC can transmit an impressive 2.1 million 64 byte UDP packets per second out of one port, 50% more than NIU (1.4 million packets/sec), due to the fact that it has 50% more transmit DMA channels than NIU (12 vs. 8 per port). So if your workload mostly sends small UDP packets, the Sun Multi-threaded 10GbE PCI-E NIC may deliver a higher packet rate than NIU on UltraSPARC T2 systems.

In all other scenarios, NIU gives higher throughput, or lower CPU utilization at similar throughput. In the two-port 10GbE throughput test, the two NIU ports can achieve an impressive 14.6 Gbit/s on TCP transmit, or 18.2 Gbit/s on TCP receive, using 8 KByte messages and 145 connections on a 1.4 GHz T5120. For TCP transmit, the CPU efficiency (measured in Gbps/GHz) of NIU is 23% higher than the PCI-E NIC at maximum throughput for 2 ports, and 46% higher at maximum throughput for 1 port. This clearly demonstrates the performance advantage of integrating 10GbE on-chip.

Links

T5120, T5220, T6320 System and blades Launch blogs

Sun Multi-threaded 10GbE Tuning Guide

Neptune and NIU Hardware Specification

Posted in Sun | 4 Comments