In this blogpost, I'll try to make you understand the engineering challenge behind processing 10Gbit/s wirespeed, at the smallest Ethernet packet size.

The peak packet rate is 14.88 Mpps (million packets per sec) uni-directional on 10Gbit/s with the smallest frame size.

Details: What is the smalles Ethernet frame
Ethernet frame overhead:

  • Ethernet specific (20 bytes)
  • Ethernet frame (64 bytes)
    • 14 bytes = MAC header
    • 46 bytes = Minimum payload size
    • 4 bytes = Ethernet CRC

Thus, the minimim size Ethernet frame is: 84 bytes (20 + 64)

Max 1500 bytes MTU Ethernetframe size is: 1538 bytes (calc: (12+8) + (14) + 1500 + (4) = 1538 bytes)

Packet rate calculations

Peak packet rate calculated as:  (10*10^9) bits/sec / (84 bytes * 8) = 14,880,952 pps
1500 MTU packet rate calculated as: (10*10^9) bits/sec / (1538 bytes * 8) = 812,744 pps

Time budget
This is the important part to wrap-your-head around.

With 14.88 Mpps the time budget for processing a single packet is:

  • 67.2 ns (nanosecond) (calc as: 1/14880952*10^9 ns)

This corrospond to approx: 201 CPU cycles on a 3GHz CPU (assuming only one instruction per cycle, disregarding superscalar/pipelined CPUs). Only having 201 clock-cycles processing time per packet is very little.

Relate these numbers to something
This 67.2ns number is hard to use for anything, if we cannot relate this to some other time measurements.

cache-misses
A single cache-miss takes: 32 ns (measured on a E5-2650 CPU). Thus, with just two cache-misses (2x32=64ns), almost the total 67.2 ns budget is gone. The Linux skb (sk_buff) is 4 cache-lines (on 64-bit), and the kernel e.g. insists on writing zeros to these cache-lines, during allocation of an skb.

cache-references
We might not "suffer" a full cache-miss, sometimes the memory is available in L2 or L3 cache.  Thus, it is useful to know these time measurements.  Measured on my E5-2630 CPU (with lmbench command "lat_mem_rd 1024 128"), L2 access costs 4.3ns, and L3 access costs 7.9ns.

The "LOCK" operation
Assembler instructions can be prefixed with a "LOCK" operation, which means that they perform an atomic operation. This is uses every time e.g. a spinlock is locked or unlocked, cmpxchg and atomic_inc (some operations are even implicitly LOCK prefixed, like xchg).

I've measured the cost of this atomic "LOCK" operation to be 8.25ns on my CPU (with this program). Even for the completely optimal situation of a spinlock only being touch by one CPU, we have two LOCK calls which costs 16.5ns.

System call overhead
A FreeBSD case study of sendto(), in Luigi Rizzo netmap paper, shows that the cost of only the system call is 96ns, which is above the 67.2 ns budget.  The total overhead of sendto() were 950 ns.  These 950ns corrospond to 1,052,631 pps (calc as 1/(950/10^9)).
On Linux I measured the system call getuid(2), to take 87.77 ns and 201 CPU-cycles (TSC measurement) (the CPU efficiency were 1.42 insns per cycle, measured with perf stat). Thus, the syscall itself eats up the entire budget.

  • Update: Most of the syscall overhead comes from kernel option CONFIG_AUDITSYSCALL, without it, the syscall overhead drops to 41.85 ns.


How to overcome this syscall problem?  We can amortize the cost, by sending several packets in a single syscall.  It is not very well known, but we actually already have a syscall to send several packets with a single syscall, called "sendmmsg(2)". Notice the extra "m" (and the corresponding receive version "recvmmsg(2)"). Not many examples exists on the Internet for using these syscalls. Thus, I've provided some example code here for sendmmsg and recvmmsg.

RAW socket speeds
Daniel Borkmann and I recently optimized AF_PACKET, to scale to several CPUs (trafgen, kernel qdisc bypass and trafgen use qdisc bypass). But let us look at the performance numbers for only a single CPU:

  • Qdisc path = 1,226,776 pps => 815 ns per packet (calc: 1/pps*10^9)
  • Qdisc bypass = 1,382,075 pps => 723 ns per packet (calc: 1/pps*10^9)

This is also interesting, because this show us the cost of the qdisc code path, which costs 92 ns.  In this 10Gbit/s context it is fairly large, e.g. corresponding to almost 3 cache-line misses (92/32=2.9).

View comments

    Loading