Thursday, December 30, 2010

Buffer Bloat: The calculations

The buffer-bloat blog posts by Jim Gettys[1][2] are very interesting, and relevant for everybody with a broadband connection (especially asymmetric links). They are well written, but also rather long and spread over many posts.

In this blog post, I'll explain what is going on, by showing how you can calculate your own latency issues.

The issue raised by Gettys is that buffer-bloat (too big buffers on the network path) has fundamentally broken Internet broadband connections[2].
Buffer-bloat can introduce enough latency to cripple your line, basically killing the possibility of interactive and realtime services being delivered (e.g. by companies) over your broadband connection.

I'm very happy to see that Gettys is bringing this issue up again.
Back in 2005, I discovered the same issues as Gettys. I wrote my master's thesis[3] about the issue, and even created an Open Source "mitigation" solution, the ADSL-optimizer[4]. It seems my solution has not gained wider use.

The major contribution from my side is to take the ADSL overhead into account when doing QoS packet shaping. (Everything is in mainline; just use/add the TC options "linklayer adsl" and "overhead" if you already have a Linux box doing QoS on your line.)
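To give a feel for it, here is a minimal sketch of such a shaping setup (the device eth0, the 454 Kbit/s rate and the 40 byte overhead are example values of my choosing; the correct overhead depends on your ADSL encapsulation):

    # Hypothetical example: shape upstream traffic with HTB, letting tc
    # account for ATM cell alignment ("linklayer adsl") and the per-packet
    # encapsulation overhead ("overhead 40").
    tc qdisc add dev eth0 root handle 1: htb default 10
    tc class add dev eth0 parent 1: classid 1:10 htb \
        rate 454kbit linklayer adsl overhead 40

The point of the two options is that the shaper's idea of the link rate matches what the ADSL line actually transmits, so the queue builds in the Linux box, where we control it, instead of in the modem.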

The issue is that:
A single TCP upload causes a delay of 1.2 seconds (on a 512 Kbit/s ADSL line)
(see thesis[3] page 21).

Let's calculate what is happening, without going into the details of why TCP/IP misbehaves and causes queues to build (the details are in my thesis[3]).

Before starting the calculations, here is a beautiful quote from Jim Gettys[1]:
"Large network buffers can be thought of as 'dark buffers', analogous to 'dark matter' in the universe; they are undetectable under many/most circumstances, and you can detect them only by indirect means. Buffers do not cause problems when they are empty. But when they fill they introduce additional latency (and create other problems, possibly very severe) to other traffic sharing the link."

Given the line speed and the delay, we can calculate the buffer size
(this is the bandwidth-delay product). Due to the ADSL overhead, the
effective bandwidth is actually only 454 Kbit/s on the 512 Kbit/s line,
and the measured delay was 1138 ms.
454 Kbit/s * 1138 ms = 64581 bytes
This corresponds to the TCP window size; thus, it is not the maximum buffer size of the modem. (Use several TCP connections, or UDP, to find your maximum ping RTT, and calculate your buffer size from that.)
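As a quick sanity check, here is the same calculation in shell arithmetic (all numbers taken from the measurement above):

    # buffer occupancy = bandwidth * delay, converted to bytes
    echo $(( 454000 * 1138 / 1000 / 8 ))   # -> 64581 bytes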

Where does the delay come from?!

The delay consists of different components; the important one in our case is the transmission delay (combined with the packets already in the queue).

The transmission delay of a 1500 byte (MTU) packet is:
1500 bytes / 454 Kbit/s = 26.43 ms

Thus, the experienced delay is the time it takes to empty the packets in the queue, which is greatly dependent on the line speed.
E.g. 64000 bytes / 454 Kbit/s = 1127 ms.
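Verified in the shell (bc handles the fractional result):

    # transmission delay of one 1500 byte (MTU) packet
    echo "scale=2; 1500 * 8 / 454" | bc    # -> 26.43 ms
    # time to drain a 64000 byte queue
    echo "64000 * 8 / 454" | bc            # -> 1127 ms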

With an RTT delay of 150 ms, your interactive SSH connection will already feel sluggish.
(A side note on ADSL: the processing delay in the ADSL modem can get as large as 60 ms, and is caused by the interleaving depth, but it is "fortunately" a constant, fixed delay on the path.)

Increasing the bandwidth will reduce the latency, but it is not the
solution. Besides, ADSL technology is often limited to a 1024 Kbit/s
upstream link.

The Point:
"ISPs SHOULD configure the buffer size based upon the link bandwidth"

I have a feeling that ISPs just configure a default queue size, and tune that queue size based upon the max throughput of their largest product.

On the line where I did my measurements, I saw delays of 3.3 seconds, thus a buffer-bloat of roughly 187 Kbytes (454 Kbit/s * 3300 ms), or around 125 packets at 1500 bytes (MTU). Simply crazy!
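The same back-of-the-envelope math for that case:

    # 3.3 seconds of queue on a 454 Kbit/s uplink
    echo $(( 454000 * 3300 / 1000 / 8 ))          # -> 187275 bytes
    echo $(( 454000 * 3300 / 1000 / 8 / 1500 ))   # -> 124, i.e. ~125 MTU packets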

For more details on the formulas and calculations, see the thesis[3], pages 19 to 27.

Links:
[1] Jim Gettys: introducing-the-criminal-mastermind-bufferbloat
[2] Jim Gettys: whose-house-is-of-glasse-must-not-throw-stones-at-another
[3] http://www.adsl-optimizer.dk/thesis/
[4] http://www.adsl-optimizer.dk/
[5] http://en.wikipedia.org/wiki/Jim_Gettys

8 comments:

  1. Aren't we oversimplifying things by referring to just "a buffer"?

    A sophisticated QoS policy (difficult to do on the Internet, because applications can be tough to recognize) should define lots of different traffic classes, and distribute them over several different buffers.

    The buffers should be sized individually according to the expected traffic, and different AQM thresholds should be distributed among the traffic classes.

    If traffic were marked according to RFC 4594, queues sized appropriately, AQM (RED) applied with different thresholds by drop precedence, and per-queue service priority (bandwidth) configured, would we still have these problems?

    I guess I'm trying to say that I don't think buffer /bloat/ is the problem, rather it's buffer /management/.

    Is the high latency that we've been talking about a problem for the stream that's caused the problem? Similarly, do I care if there's a couple of seconds worth of netflix streaming video "in flight" (queued in packet shufflers)?

    I don't think so. Those are elastic flows. The only downsides that jump out at me are:
    - Extra buffering is required to compensate for the large bandwidth*delay, but memory is cheap.
    - When the flow ultimately overruns the available bandwidth (which will happen regardless of bufferbloat), the stack can't respond (clamp cwin) as quickly. OTOH, this may be a benefit, because cwin will get clamped less often.

    I only care that *other* applications - interactive sessions, setup of new flows and the like - don't suffer the same latency.

    I work in enterprise environments on network equipment with per-interface buffers measured in hundreds of MB. Realtime database replication, offsite backups, voice calls, and interactive video all coexist on the same links with no problems. ...But getting there took a lot of work.

    I concede that handling these issues (traffic marking and AQM) in a consumer environment sounds like a nightmare. ...But I'm not sure that smaller buffers is the best answer, either.

  2. Chris, I totally agree with you!

    I have oversimplified the concept of buffers here, and just setting the buffers to a minimum is not the best answer. Different traffic has different needs.

    However, I do claim that we have buffer bloat at the end-user, which is basically misconfigured and SHOULD BE LOWERED, but I don't agree that we have buffer bloat in the core network (like Gettys does).

    Like you argue, I also believe, that we should have separate bandwidth queues, for different types of traffic, especially in the core network, but also at the end-user. This is actually what the ADSL-optimizer QoS setup is trying to implement.

    Like you mention, the most challenging part in a consumer environment is to perform automatic traffic classification and marking. I once had a plan to implement a bulk detector (as a netfilter module) to solve the classification problem. (But then I moved from the dormitory where 300 people shared an 8Mbit/512Kbit ADSL line, and lost the itch-to-scratch.)

    I'm wondering how big an effect it would have to combine AQM (Active Queue Management) like RED with ECN marking in the DSLAM/edge-net (the ADSL modem might not have RED config options).

    --Jesper Dangaard Brouer

  3. Yes, we definitely do have a bufferbloat problem at the end user.

    I see it when pushing same-subnet traffic through the linux bridge embedded in my own cheap wifi/router/switch.

    Within the home, small buffers are probably the way to go: WiFi and LAN bandwidth is high relative to the internet connection, and any big intra-home flows are likely to be one-at-a-time (highly serialized), which should minimize the need to buffer bursts of traffic originating from different stations simultaneously.

    You don't think there's too much buffering "out there"? jg makes a convincing case. 1200ms of latency at 10+Mb/s is a lot. But we don't know where it is. OTOH, if you saw my comment on his "throw stones" post, you'll see I think there's a big discrepancy in his data. Ping times and TCP latency don't line up.

    Your bulk-detection classifier sounds good: Watch each flow, keep stats on what it's doing, mark appropriately. Right?

    Cisco 4500 based gear (Sup V and later anyway) has a scheme like this, but for queueing, rather than marking. Dynamic Buffer Limiting (DBL), I think they call it. They trend flows and decide whether to drop or queue based on how well the flow responds to drops / how aggressive the flow is.

  4. The name of Jim Gettys is spelled with an 's'.

    Sorry about the misspelling, I have corrected Jim Gettys' name. And I actually think that his real name is James Gettys.

    The problem with priorities is that everyone wants to have the highest priority.

    Mr. Jim Gettys' ideas really help a lot when it comes to understanding buffer-bloat, and I bet your master's thesis also did the same for other people. It could serve as thesis inspiration for others in the same field, helping them better understand what buffer-bloat is all about.

    Some existing router firmware has a good working QoS system that effectively establishes the bottleneck at the router, and then places traffic in separate queues, thus avoiding the problem.
