Discussion:
[Bloat] when does the CoDel part of fq_codel help in the real world?
Pete Heist
2018-11-26 19:08:49 UTC
Permalink
In Toke’s thesis defense, there was an interesting exchange with examination committee member Michael (apologies for not catching the last name) regarding how the CoDel part of fq_codel helps in the real world:

http://youtu.be/upvx6rpSLSw

My attempt at a transcript is at the end of this message. (I probably won’t attempt a full defense transcript, but if someone wants more of a particular section I can try. :)

So I just thought to continue the discussion: when does the CoDel part of fq_codel actually help in the real world? I’ll speculate with a few possibilities:

1) Multiplexed HTTP/2.0 requests containing both a saturating stream and interactive traffic. For example, a game that uses HTTP/2.0 to download new map data while position updates or chat happen at the same time. Standalone programs could use HTTP/2.0 this way, or for web apps, the browser may multiplex concurrent uses of XHR over a single TCP connection. I don’t know of any examples.

2) SSH with port forwarding while using an interactive terminal together with a bulk transfer?

3) Does CoDel help the TCP protocol itself somehow? For example, does it speed up the round-trip time when acknowledging data segments, improving behavior on lossy links? Similarly, does it speed up the TCP close sequence for saturating flows?

Pete

---

M: In fq_codel what is really the point of CoDel?
T: Yeah, uh, a bit better intra-flow latency...
M: Right, who cares about that?
T: Apparently some people do.
M: No I mean specifically, what types of flows care about that?
T: Yeah, so, um, flows that are TCP based or have some kind of- like, elastic flows that still want low latency.
M: Elastic flows that are TCP based that want low latency...
T: Things where you want to discover the- like, you want to utilize the full link and sort of probe the bandwidth, but you still want low latency.
M: Can you be more concrete what kind of application is that?
T: I, yeah, I

M: Give me any application example that’s gonna benefit from the CoDel part- CoDel bits in fq_codel? Because I have problems with this.
T: I, I do too... So like, you can implement things this way but equivalently if you have something like fq_codel you could, like, if you have a video streaming application that interleaves control

M: <inaudible> that runs on UDP often.
T: Yeah, but I, Netflix

M: Ok that’s a long way
 <inaudible>
T: No, I tend to agree with you that, um

M: Because the biggest issue in my opinion is, is web traffic- for web traffic, just giving it a huge queue makes the chance bigger that uh, <inaudible, ed: because of the slow start> so you may end up with a (higher) faster completion time by buffering a lot. Uh, you’re not benefitting at all by keeping the queue very small, you are simply <inaudible> Right, you’re benefitting altogether by just <inaudible> which is what the queue does with this nice sparse flow, uh
 <inaudible>
T: You have the infinite buffers in the <inaudible> for that to work, right. One benefit you get from CoDel is that - you screw with things like - you have to drop eventually.
M: You should at some point. The chances are bigger that the small flow succeeds (if given a huge queue). And, in web surfing, why does that, uh(?)
T: Yeah, mmm...
M: Because that would be an example of something where I care about latency but I care about low completion. Other things where I care about latency they often don’t send very much. <inaudible...> bursts, you have to accommodate them basically. Or you have interactive traffic which is UDP and tries to, often react from queueing delay <inaudible>. I’m beginning to suspect that fq minus CoDel is really the best <inaudible> out there.
T: But if, yeah, if you have enough buffer.
M: Well, the more the better.
T: Yeah, well.
M: Haha, I got you to say yes. [laughter :] That goes in history. I said the more the better and you said yeah.
T: No but like, it goes back to good-queue bad-queue, like, buffering in itself has value, you just need to manage it.
M: Ok.
T: Which is also the reason why just having a small queue doesn’t help in itself.
M: Right yeah. Uh, I have a silly question about fq_codel, a very silly one and there may be something I missed in the papers, probably I did, but I'm I was just wondering I mean first of all this is also a bit silly in that <inaudible> it’s a security thing, and I think that’s kind of a package by itself silly because fq_codel often probably <inaudible> just in principle, is that something I could easily attack by creating new flows for every packet?
T: No because, they, you will

M: With the sparse flows, and it’s gonna

T: Yeah, but at some point you’re going to go over the threshold, I, you could, there there’s this thing where the flow goes in, it’s sparse, it empties out and then you put it on the normal round robin implementation before you queue <inaudible> And if you don’t do that then you can have, you could time packets so that they get priority just at the right time and you could have lockout.
M: Yes.
T: But now you will just fall back to fq.
M: Ok, it was just a curiosity, it’s probably in the paper. <inaudible>
T: I think we added that in the RFC, um, you really need to, like, this part is important.
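The sparse-flow mechanism discussed at the end of the transcript can be sketched in a few lines. This is a simplified model of the DRR++ scheme (quantum, new/old flow lists) as I understand it, not the actual Linux implementation; a flow that exhausts its quantum, or that empties out while sparse, drops back to the normal round-robin list, which is why a new-flow-per-packet attack degrades to plain fq rather than achieving lockout:

```python
from collections import deque

QUANTUM = 1514  # bytes; roughly one MTU-sized packet

class Flow:
    def __init__(self):
        self.queue = deque()   # packet lengths, in bytes
        self.deficit = 0

class FqSketch:
    """Simplified DRR++ scheduler: new flows start on the sparse
    (priority) list; once they use up their deficit, or empty out,
    they move to the old-flows list and compete in round-robin."""
    def __init__(self):
        self.flows = {}
        self.new_flows = deque()   # sparse flows, served first
        self.old_flows = deque()   # bulk flows, round-robin

    def enqueue(self, flow_id, pkt_len):
        f = self.flows.get(flow_id)
        if f is None:
            f = self.flows[flow_id] = Flow()
        if not f.queue and f not in self.new_flows and f not in self.old_flows:
            f.deficit = QUANTUM
            self.new_flows.append(f)   # a fresh flow gets priority
        f.queue.append(pkt_len)

    def dequeue(self):
        while True:
            if self.new_flows:
                lst, f = self.new_flows, self.new_flows[0]
            elif self.old_flows:
                lst, f = self.old_flows, self.old_flows[0]
            else:
                return None
            if f.deficit <= 0:
                # quantum used up: demote to the back of the old list
                f.deficit += QUANTUM
                lst.popleft()
                self.old_flows.append(f)
                continue
            if not f.queue:
                # an empty sparse flow is put on the old list before
                # removal, preventing the timed-packet lockout attack
                lst.popleft()
                if lst is self.new_flows:
                    self.old_flows.append(f)
                continue
            pkt = f.queue.popleft()
            f.deficit -= pkt
            return pkt
```

A sparse packet arriving behind a started bulk flow is still served ahead of the bulk flow's remaining backlog, but only until it exceeds the quantum.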
Neal Cardwell
2018-11-26 19:28:07 UTC
Permalink
I believe Dave Taht has pointed out, essentially, that the "codel" part of
fq_codel can be useful in cases where the definition of "flow" is not
visible to fq_codel, so that "fq" part is inactive. For example, if there
is VPN traffic, where the individual flows are not separable by fq_codel,
then it can help to have "codel" AQM for the aggregate of encrypted VPN
traffic. I imagine this rationale could apply where there is any kind of
encapsulation or encryption that makes the notion of "flow" opaque to
fq_codel.
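A toy illustration of that point: fq_codel buckets packets by hashing the header fields it can see. With a VPN, the outer tuple is identical for every inner flow, so everything collapses into one bucket and only the CoDel instance attached to that bucket manages its delay. (Sketch only: real fq_codel uses a seeded jenkins hash over dissected headers, not SHA-256, and the addresses below are made up.)

```python
import hashlib

def flow_bucket(src, dst, proto, sport, dport, n_buckets=1024):
    """Pick an fq bucket from the visible 5-tuple (illustrative hash)."""
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_buckets

# Two plain TCP flows differ in source port, so they almost certainly
# land in different buckets and the "fq" part isolates them:
b1 = flow_bucket("10.0.0.1", "93.184.216.34", "tcp", 50000, 443)
b2 = flow_bucket("10.0.0.1", "93.184.216.34", "tcp", 50001, 443)

# The same two flows inside a VPN tunnel present only the outer header;
# both hash to the same bucket, so "fq" is inactive for them and the
# per-bucket "codel" AQM is all that controls their queueing delay:
v1 = flow_bucket("10.0.0.1", "198.51.100.7", "udp", 51820, 51820)
v2 = flow_bucket("10.0.0.1", "198.51.100.7", "udp", 51820, 51820)
assert v1 == v2
```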

neal
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
Dave Taht
2018-11-27 20:42:26 UTC
Permalink
Yep. This should end up as a bullet point somewhere. The BSD
implementation of fq_codel does not do anywhere near the DPI that
Linux does to extract the relevant fields. It's a truly amazing
amount of sub-protocol hashing; sometimes I sit back and have to both
admire and shudder at this:
https://elixir.bootlin.com/linux/v4.18.6/source/net/core/flow_dissector.c#L578

and it's even more frightful if you look at the list of ethernet types
and protocols that are not dissected.

Codeling opaque flows is useful. We will always have opaque flows.
Some protocol types cannot be fq'd by design: babel, for example,
mixes up hellos with route transfers.

I've done things like measure induced latency on wireguard streams of
late and codel keeps it sane. still, wireguard internally is optimized
for single flow "dragster" performance, and I'd like it to gain the
same fq_codel optimization that did such nice things for multiple
flows terminating on the router for ipsec. The before/after on that
was marvelous.

Another problem people perpetually point out is that hashing is
expensive; my usual answer is that it's "free" with hardware support,
pointing at examples like most modern 10GigE ethernet cards doing it
on-board, hardware assists for SHA, and the FPGA literature, such as:
https://zistvan.github.io/doc/trets15-istvan.pdf

Still... in software, hashing is expensive.

Recently I went on a quest for a better hash than the jenkins hash
used throughout the kernel today; cityhash and murmur, for example.
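For reference, Jenkins' one-at-a-time hash, a simpler relative of the lookup3-style jhash the kernel actually uses for flow dissection, is short enough to sketch. This is an illustration of the kind of per-packet mixing involved, not the kernel's implementation:

```python
def jenkins_one_at_a_time(data: bytes) -> int:
    """Jenkins' one-at-a-time hash, constrained to 32 bits as in C."""
    h = 0
    for b in data:
        h = (h + b) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    # finalization: avalanche the remaining state
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h

# Hashing a (made-up) 5-tuple into one of 1024 fq buckets:
tuple_key = b"10.0.0.1:50000->93.184.216.34:443/tcp"
bucket = jenkins_one_at_a_time(tuple_key) % 1024
```

Every packet pays for this loop over the key bytes, which is where the software cost comes from.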
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Toke Høiland-Jørgensen
2018-11-27 20:54:55 UTC
Permalink
Post by Dave Taht
I've done things like measure induced latency on wireguard streams of
late and codel keeps it sane. still, wireguard internally is optimized
for single flow "dragster" performance, and I'd like it to gain the
same fq_codel optimization that did such nice things for multiple
flows terminating on the router for ipsec. The before/after on that
was marvelous.
This is on my list, FWIW :)

-Toke
Dave Taht
2018-11-27 21:00:50 UTC
Permalink
Post by Toke Høiland-Jørgensen
This is on my list, FWIW :)
Your queue overfloweth. And as good as you are at FQ-ing yourself, I
Toke Høiland-Jørgensen
2018-11-27 21:05:31 UTC
Permalink
Post by Dave Taht
Your queue overfloweth. And as good as you are at FQ-ing yourself, I
Well, 4.20 is already in RC, so that is not going to happen ;)

-Toke
Jonathan Morton
2018-11-26 21:29:53 UTC
Permalink
Post by Pete Heist
So I just thought to continue the discussion- when does the CoDel part of fq_codel actually help in the real world?
Fundamentally, without Codel the only limits on the congestion window would be when the sender or receiver hit configured or calculated rwnd and cwnd limits (the rwnd is visible on the wire and usually chosen to be large enough to be a non-factor), or when the queue overflows. Large windows require buffer memory in both sender and receiver, increasing costs on the sender in particular (who typically has many flows to manage per machine).

Queue overflow tends to result in burst loss and head-of-line blocking in the receiver, which is visible to the user as a pause and subsequent jump in the progress of their download, accompanied by a major fluctuation in the estimated time to completion. The lost packets also consume capacity upstream of the bottleneck which does not contribute to application throughput. These effects are independent of whether overflow dropping occurs at the head or tail of the bottleneck queue, though recovery occurs more quickly (and fewer packets might be lost) if dropping occurs from the head of the queue.

From a pure throughput-efficiency standpoint, Codel allows using ECN for congestion signalling instead of packet loss, potentially eliminating packet loss and associated head-of-line blocking entirely. Even without ECN, the actual cwnd is kept near the minimum necessary to satisfy the BDP of the path, reducing memory requirements and significantly shortening the recovery time of each loss cycle, to the point where the end-user may not notice that delivery is not perfectly smooth, and implementing accurate completion time estimators is considerably simplified.
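To put numbers on the cwnd-near-BDP point (the link speed and RTTs below are hypothetical, chosen only for illustration):

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: the minimum amount of data in flight
    needed to keep the link full (integer math for exactness)."""
    return bandwidth_bps // 8 * rtt_ms // 1000

# 50 Mbit/s link, 20 ms path RTT: Codel keeps cwnd near this value.
print(bdp_bytes(50_000_000, 20))         # 125000 (~125 KB)

# The same link behind a bloated 1-second standing queue: the sender
# ends up holding fifty times as much in flight, and every loss cycle
# takes correspondingly longer to recover.
print(bdp_bytes(50_000_000, 20 + 1000))  # 6375000 (~6.4 MB)
```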

An important use-case is where two sequential bottlenecks exist on the path, the upstream one being only slightly higher capacity but lacking any queue management at all. This is presently common in cases where home CPE implements inbound shaping on a generic ISP last-mile link. In that case, without Codel running on the second bottleneck, traffic would collect in the first bottleneck's queue as well, greatly reducing the beneficial effects of FQ implemented on the second bottleneck. In this topology, the overall effect is inter-flow as well as intra-flow.

The combination of Codel with FQ is done in such a way that a separate instance of Codel is implemented for each flow. This means that congestion signals are only sent to flows that require them, and non-saturating flows are unmolested. This makes the combination synergistic, where each component offers an improvement to the behaviour of the other.

- Jonathan Morton
Luca Muscariello
2018-11-27 09:24:06 UTC
Permalink
I think this is a very good contribution to the discussion at the
defense about the comparison between SFQ with longest-queue drop and
FQ_Codel.

A congestion-controlled protocol such as TCP, or others including
QUIC, LEDBAT and so on, needs at least the BDP in the transmission
queue to get full link efficiency, i.e. so that the queue never
empties out. This gives a rule of thumb for sizing buffers which is
also very practical and, thanks to flow isolation, becomes very
accurate.

Which is:

1) Find a way to keep the number of backlogged flows at a reasonable
value. This largely depends on the minimum fair rate an application
may need in the long term. We discussed some of the available
mechanisms to achieve that in the literature.

2) Fix the largest RTT you want to serve at full utilization and size
the buffer using BDP * N_backlogged. Or the other way round: check how
much memory you can use in the router/line card/device and, for a
fixed N, compute the largest RTT you can serve at full utilization.

3) There is still some memory to dimension for sparse flows in
addition to that, but this is not based on BDP. It is enough to
compute the total utilization of sparse flows and use the same simple
model Toke has used to compute the (de)prioritization probability.
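The sizing in step 2 can be sketched both ways round (the link speed, RTT and N below are hypothetical; the sparse-flow memory of step 3 would be added on top):

```python
def buffer_bytes(bandwidth_bps: int, max_rtt_ms: int, n_backlogged: int) -> int:
    """Step 2: buffer = BDP(largest RTT served at full utilization)
    * N_backlogged, using integer math for exactness."""
    bdp = bandwidth_bps // 8 * max_rtt_ms // 1000
    return bdp * n_backlogged

def max_rtt_ms(memory_bytes: int, bandwidth_bps: int, n_backlogged: int) -> int:
    """Step 2 the other way round: a fixed memory budget and fixed N
    give the largest RTT servable at full utilization."""
    return memory_bytes * 1000 // n_backlogged // (bandwidth_bps // 8)

# 100 Mbit/s link, RTTs up to 100 ms, 10 backlogged flows:
print(buffer_bytes(100_000_000, 100, 10))       # 12500000 (~12.5 MB)
# Inverse: 12.5 MB of memory and N=10 cover RTTs up to 100 ms.
print(max_rtt_ms(12_500_000, 100_000_000, 10))  # 100
```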

This procedure would allow sizing not only FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing.
It would also be interesting to compare another mechanism that we
mentioned during the defense: AFD plus a sparse-flow queue, which is,
BTW, already available in Cisco Nexus switches for data centres.

I think that the CoDel part would still provide the ECN feature that
all the others cannot have. However the others, the last one
especially, can be implemented in silicon at reasonable cost.
Bless, Roland (TM)
2018-11-27 10:26:17 UTC
Permalink
Hi Luca,
Post by Luca Muscariello
A congestion controlled protocol such as TCP or others, including QUIC,
LEDBAT and so on
need at least the BDP in the transmission queue to get full link
efficiency, i.e. the queue never empties out.
This is not true. There are congestion control algorithms
(e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck
link capacity without filling the buffer to its maximum capacity. The
BDP rule of thumb basically stems from the older loss-based congestion
control variants, which profit from the standing queue they have built
up by the time they detect a loss: while they back off and stop
sending, the queue keeps the bottleneck output busy and you will not
see underutilization of the link. Moreover, once you get good loss
desynchronization, the buffer size requirement for multiple
long-lived flows decreases.
Post by Luca Muscariello
This gives rule of thumbs to size buffers which is also very practical
and thanks to flow isolation becomes very accurate.
The positive effect of buffers is merely their role in absorbing
short-term bursts (i.e., mismatches in arrival and departure rates)
instead of dropping packets. One does not need a big buffer to
fully utilize a link (with perfect knowledge you could keep the link
saturated even without a single packet waiting in the buffer).
Furthermore, large buffers (e.g., sized by the BDP rule of thumb)
are not useful/practical anymore at very high speeds such as
100 Gbit/s: memory is also quite costly at such speeds...
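The point that full utilization does not require a standing queue can be illustrated with a toy steady-state model of a single flow at a bottleneck. This is an editorial sketch under idealized assumptions, not TCP LoLa or BBR themselves:

```python
def steady_state(cwnd_pkts: int, bdp_pkts: int):
    """Of cwnd packets in flight, up to one BDP fits 'in the pipe';
    any excess stands in the bottleneck buffer. Utilization is the
    in-flight data relative to the BDP, capped at 1."""
    standing_queue = max(0, cwnd_pkts - bdp_pkts)
    utilization = min(1.0, cwnd_pkts / bdp_pkts)
    return utilization, standing_queue

# cwnd == BDP: full utilization with no standing queue at all.
print(steady_state(100, 100))  # (1.0, 0)
# cwnd == 2*BDP (the loss-based rule of thumb): no extra throughput,
# just a BDP-sized standing queue, i.e. a doubled RTT.
print(steady_state(200, 100))  # (1.0, 100)
```

The caveat in the surrounding text applies: a real sender without perfect knowledge needs some burst absorption, so "zero queue" is the limit case, not an operating point.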

Regards,
Roland

[1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
TCP LoLa: Congestion Control for Low Latencies and High Throughput.
Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
215-218, Singapore, Singapore, October 2017
http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf
Luca Muscariello
2018-11-27 10:29:27 UTC
Permalink
I have never said that you need to fill the buffer to the max size to get
full capacity, which is an absurdity.

I said you need at least the BDP so that the queue never empties out.
The link is fully utilized IFF the queue is never emptied.
Post by Bless, Roland (TM)
Hi Luca,
Post by Luca Muscariello
A congestion controlled protocol such as TCP or others, including QUIC,
LEDBAT and so on
need at least the BDP in the transmission queue to get full link
efficiency, i.e. the queue never empties out.
This is not true. There are congestion control algorithms
(e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck link
capacity without filling the buffer to its maximum capacity. The BDP
rule of thumb basically stems from the older loss-based congestion
control variants that profit from the standing queue that they built
while they back-off and stop sending, the queue keeps the bottleneck
output busy and you'll not see underutilization of the link. Moreover,
once you get good loss de-synchronization, the buffer size requirement
for multiple long-lived flows decreases.
Post by Luca Muscariello
This gives rule of thumbs to size buffers which is also very practical
and thanks to flow isolation becomes very accurate.
The positive effect of buffers is merely their role to absorb
short-term bursts (i.e., mismatch in arrival and departure rates)
instead of dropping packets. One does not need a big buffer to
fully utilize a link (with perfect knowledge you can keep the link
saturated even without a single packet waiting in the buffer).
Furthermore, large buffers (e.g., using the BDP rule of thumb)
memory is also quite costly at such high speeds...
Regards,
Roland
[1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
TCP LoLa: Congestion Control for Low Latencies and High Throughput.
Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
215-218, Singapore, Singapore, October 2017
http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf
Post by Luca Muscariello
1) find a way to keep the number of backlogged flows at a reasonable
value.
Post by Luca Muscariello
This largely depends on the minimum fair rate an application may need in
the long term.
We discussed a little bit of available mechanisms to achieve that in the
literature.
2) fix the largest RTT you want to serve at full utilization and size
the buffer using BDP * N_backlogged.
Or the other way round: check how much memory you can use
in the router/line card/device and for a fixed N, compute the largest
RTT you can serve at full utilization.
3) there is still some memory to dimension for sparse flows in addition
to that, but this is not based on BDP.
It is just enough to compute the total utilization of sparse flows and
use the same simple model Toke has used
to compute the (de)prioritization probability.
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing.
It would also be interesting to compare another mechanism that we have
mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already available in
Cisco nexus switches for data centres.
I think that the CoDel part would still provide the ECN feature,
that all the others cannot have.
However the others, the last one especially can be implemented in
silicon with reasonable cost.
Bless, Roland (TM)
2018-11-27 10:35:00 UTC
Permalink
Hi,
Post by Luca Muscariello
I have never said that you need to fill the buffer to the max size to
get full capacity, which is an absurdity.
Yes, it's absurd, but that's what today's loss-based CC algorithms do.
Post by Luca Muscariello
I said you need at least the BDP so that the queue never empties out.
The link is fully utilized IFF the queue is never emptied.
I was also a bit imprecise: you'll need a BDP in flight, but
you don't need to fill the buffer at all. The latter sentence
is valid only in the direction: queue not empty -> link fully utilized.

Regards,
Roland
Post by Luca Muscariello
Hi Luca,
Post by Luca Muscariello
A congestion controlled protocol such as TCP or others, including
QUIC,
Post by Luca Muscariello
LEDBAT and so on
need at least the BDP in the transmission queue to get full link
efficiency, i.e. the queue never empties out.
This is not true. There are congestion control algorithms
(e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck link
capacity without filling the buffer to its maximum capacity. The BDP
rule of thumb basically stems from the older loss-based congestion
control variants that profit from the standing queue that they built
while they back-off and stop sending, the queue keeps the bottleneck
output busy and you'll not see underutilization of the link. Moreover,
once you get good loss de-synchronization, the buffer size requirement
for multiple long-lived flows decreases.
Post by Luca Muscariello
This gives rule of thumbs to size buffers which is also very practical
and thanks to flow isolation becomes very accurate.
The positive effect of buffers is merely their role to absorb
short-term bursts (i.e., mismatch in arrival and departure rates)
instead of dropping packets. One does not need a big buffer to
fully utilize a link (with perfect knowledge you can keep the link
saturated even without a single packet waiting in the buffer).
Furthermore, large buffers (e.g., using the BDP rule of thumb)
memory is also quite costly at such high speeds...
Regards,
 Roland
[1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
TCP LoLa: Congestion Control for Low Latencies and High Throughput.
Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
215-218, Singapore, Singapore, October 2017
http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf
Post by Luca Muscariello
Which is: 
1) find a way to keep the number of backlogged flows at a
reasonable value. 
Post by Luca Muscariello
This largely depends on the minimum fair rate an application may
need in
Post by Luca Muscariello
the long term.
We discussed a little bit of available mechanisms to achieve that
in the
Post by Luca Muscariello
literature.
2) fix the largest RTT you want to serve at full utilization and size
the buffer using BDP * N_backlogged.  
Or the other way round: check how much memory you can use 
in the router/line card/device and for a fixed N, compute the largest
RTT you can serve at full utilization. 
3) there is still some memory to dimension for sparse flows in
addition
Post by Luca Muscariello
to that, but this is not based on BDP. 
It is just enough to compute the total utilization of sparse flows and
use the same simple model Toke has used 
to compute the (de)prioritization probability.
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing. 
It would also be interesting to compare another mechanism that we have
mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already
available in
Post by Luca Muscariello
Cisco nexus switches for data centres.
I think that the CoDel part would still provide the ECN
that all the others cannot have.
However the others, the last one especially can be implemented in
silicon with reasonable cost.
Luca Muscariello
2018-11-27 10:40:39 UTC
Permalink
OK. We agree.
That's correct, you need *at least* the BDP in flight so that the
bottleneck queue never empties out.

This can be easily proven using fluid models for any congestion controlled
source no matter if it is
loss-based, delay-based, rate-based, formula-based etc.

A highly paced source gives you the ability to get as close to
BDP+epsilon as theoretically possible.

Link fully utilized is defined as Q>0, unless you don't include the packet
currently being transmitted. I do,
so the transmitter is never idle. But that's a detail.
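The "at least a BDP in flight" claim is easy to check numerically. Below is a toy ack-clocked simulation of my own (an illustrative sketch, not from the fluid-model argument): the bottleneck serves one packet per tick, so the BDP in packets equals the RTT in ticks, and utilization comes out as min(1, W/BDP).

```python
from collections import deque

def utilization(window, bdp_ticks, duration=100_000):
    """Fraction of ticks the bottleneck transmitter is busy.

    The bottleneck serves 1 packet per tick; an ACK returns one RTT
    (= bdp_ticks) after service and clocks out one new packet."""
    queue = window        # the whole window starts queued at the bottleneck
    acks = deque()        # ticks at which ACKs reach the sender
    busy = 0
    for t in range(duration):
        while acks and acks[0] <= t:
            acks.popleft()
            queue += 1    # each returning ACK releases one new packet
        if queue:
            busy += 1
            queue -= 1
            acks.append(t + bdp_ticks)  # ACK comes back one RTT later
    return busy / duration
```

With window >= BDP the queue never empties and utilization is 1; with window = BDP/2 the link sits idle half the time, matching W/BDP.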
Post by Bless, Roland (TM)
Hi,
Post by Luca Muscariello
I have never said that you need to fill the buffer to the max size to
get full capacity, which is an absurdity.
Yes, it's absurd, but that's what today's loss-based CC algorithms do.
Post by Luca Muscariello
I said you need at least the BDP so that the queue never empties out.
The link is fully utilized IFF the queue is never emptied.
I was also a bit imprecise: you'll need a BDP in flight, but
you don't need to fill the buffer at all. The latter sentence
is valid only in the direction: queue not empty -> link fully utilized.
Regards,
Roland
Post by Luca Muscariello
Hi Luca,
Post by Luca Muscariello
A congestion controlled protocol such as TCP or others, including
QUIC,
Post by Luca Muscariello
LEDBAT and so on
need at least the BDP in the transmission queue to get full link
efficiency, i.e. the queue never empties out.
This is not true. There are congestion control algorithms
(e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck
link
Post by Luca Muscariello
capacity without filling the buffer to its maximum capacity. The BDP
rule of thumb basically stems from the older loss-based congestion
control variants that profit from the standing queue that they built
while they back-off and stop sending, the queue keeps the bottleneck
output busy and you'll not see underutilization of the link.
Moreover,
Post by Luca Muscariello
once you get good loss de-synchronization, the buffer size
requirement
Post by Luca Muscariello
for multiple long-lived flows decreases.
Post by Luca Muscariello
This gives rule of thumbs to size buffers which is also very
practical
Post by Luca Muscariello
Post by Luca Muscariello
and thanks to flow isolation becomes very accurate.
The positive effect of buffers is merely their role to absorb
short-term bursts (i.e., mismatch in arrival and departure rates)
instead of dropping packets. One does not need a big buffer to
fully utilize a link (with perfect knowledge you can keep the link
saturated even without a single packet waiting in the buffer).
Furthermore, large buffers (e.g., using the BDP rule of thumb)
are not useful/practical anymore at very high speed such as 100
memory is also quite costly at such high speeds...
Regards,
Roland
[1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
TCP LoLa: Congestion Control for Low Latencies and High Throughput.
Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
215-218, Singapore, Singapore, October 2017
http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf
Post by Luca Muscariello
1) find a way to keep the number of backlogged flows at a
reasonable value.
Post by Luca Muscariello
This largely depends on the minimum fair rate an application may
need in
Post by Luca Muscariello
the long term.
We discussed a little bit of available mechanisms to achieve that
in the
Post by Luca Muscariello
literature.
2) fix the largest RTT you want to serve at full utilization and
size
Post by Luca Muscariello
Post by Luca Muscariello
the buffer using BDP * N_backlogged.
Or the other way round: check how much memory you can use
in the router/line card/device and for a fixed N, compute the
largest
Post by Luca Muscariello
Post by Luca Muscariello
RTT you can serve at full utilization.
3) there is still some memory to dimension for sparse flows in
addition
Post by Luca Muscariello
to that, but this is not based on BDP.
It is just enough to compute the total utilization of sparse flows
and
Post by Luca Muscariello
Post by Luca Muscariello
use the same simple model Toke has used
to compute the (de)prioritization probability.
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer
sizing.
Post by Luca Muscariello
Post by Luca Muscariello
It would also be interesting to compare another mechanism that we
have
Post by Luca Muscariello
Post by Luca Muscariello
mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already
available in
Post by Luca Muscariello
Cisco nexus switches for data centres.
I think that the CoDel part would still provide the ECN
feature,
Post by Luca Muscariello
Post by Luca Muscariello
that all the others cannot have.
However the others, the last one especially can be implemented in
silicon with reasonable cost.
Mikael Abrahamsson
2018-11-27 10:50:24 UTC
Permalink
Post by Luca Muscariello
link fully utilized is defined as Q>0 unless you don't include the
packet currently being transmitted. I do, so the transmitter is never idle.
But that's a detail.
As someone who works with moving packets, it's perplexing to me to
interact with transport peeps who seem enormously focused on "goodput". My
personal opinion is that most people would be better off with 80% of their
available bandwidth being in use without any noticeable buffer-induced
delay, as opposed to the transport protocol doing its damndest to fill up
the link to 100% and sometimes failing and inducing delay instead.

Could someone perhaps comment on the thinking in the transport protocol
design "crowd" when it comes to this?
--
Mikael Abrahamsson email: ***@swm.pp.se
Luca Muscariello
2018-11-27 11:01:09 UTC
Permalink
A BDP is not a large buffer. I'm not unveiling a secret.
And it is just a rule of thumb to have an idea at which working point the
protocol is working.
In practice the protocol is usually working below or above that value.
This is where AQM and ECN also help. So most of the time the protocol is
working way below 100% efficiency.

My point was that FQ_codel helps to get very close to the optimum w/o
adding useless queueing and latency.
With a single queue that's almost impossible. No, sorry. Just impossible.
Post by Mikael Abrahamsson
Post by Luca Muscariello
link fully utilized is defined as Q>0 unless you don't include the
packet currently being transmitted. I do, so the transmitter is never idle.
But that's a detail.
As someone who works with moving packets, it's perplexing to me to
interact with transport peeps who seem enormously focused on "goodput". My
personal opinion is that most people would be better off with 80% of their
available bandwidth being in use without any noticeable buffer-induced
delay, as opposed to the transport protocol doing its damndest to fill up
the link to 100% and sometimes failing and inducing delay instead.
Could someone perhaps comment on the thinking in the transport protocol
design "crowd" when it comes to this?
--
Mikael Abrahamsson
2018-11-27 11:21:48 UTC
Permalink
Post by Luca Muscariello
A BDP is not a large buffer. I'm not unveiling a secret.
It's complicated. I've had people throw in my face that I need 2xBDP in
buffer size to smooth things out. Personally I don't want more than 10ms
buffer (max), and I don't see why I should need more than that even if
transfers are running over hundreds of ms of light-speed-in-medium induced
delay between the communicating systems.

I have routers that are perfectly capable at buffering packets for
hundreds of ms even at hundreds of megabits/s of access speed. I choose
not to use them though, and configure them to drop packets much earlier.
Post by Luca Muscariello
My point was that FQ_codel helps to get very close to the optimum w/o
adding useless queueing and latency. With a single queue that's almost
impossible. No, sorry. Just impossible.
Right, I realise I wasn't clear I wasn't actually commenting on your
specific text directly, my question was more generic.
--
Mikael Abrahamsson email: ***@swm.pp.se
Jonathan Morton
2018-11-27 12:17:48 UTC
Permalink
It's complicated. I've had people throw in my face that I need 2xBDP in buffer size to smooth things out. Personally I don't want more than 10ms buffer (max), and I don't see why I should need more than that even if transfers are running over hundreds of ms of light-speed-in-medium induced delay between the communicating systems.
I think we can agree that the ideal CC algo would pace packets out smoothly at exactly the path capacity, neither building a queue at the bottleneck nor leaving capacity on the table.

Actually achieving that in practice turns out to be difficult, because there's no general way to discover the path capacity in advance. AQMs like Codel, in combination with ECN, get us a step closer by explicitly informing each flow when it is exceeding that capacity while the queue is still reasonably short. FQ also helps, by preventing flows from inadvertently interfering with each other by imperfectly managing their congestion windows.

So with the presently deployed state of the art, we have cwnds oscillating around reasonably short queue lengths, backing off sharply in response to occasional signals, then probing back upwards when that signal goes away for a while. It's a big improvement over dumb drop-tail FIFOs, but it's still some distance from the ideal. That's because the information injected by the bottleneck AQM is a crude binary state.
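That crude binary state comes from CoDel's drop/mark decision, which is worth sketching for readers who haven't seen it. This is a much-simplified toy (target and interval values are the defaults from the CoDel work; real implementations decide at dequeue time, decay `count` between dropping episodes, and have more exit conditions than shown here):

```python
import math

TARGET = 0.005     # 5 ms: acceptable standing-queue delay
INTERVAL = 0.100   # 100 ms: how long delay must persist before acting

class CoDelSketch:
    """Simplified CoDel drop decision, driven by per-packet sojourn time."""

    def __init__(self):
        self.first_above = None   # deadline set when sojourn first exceeds TARGET
        self.dropping = False     # in the dropping state?
        self.count = 0            # drops so far in this dropping state
        self.drop_next = 0.0      # time of the next scheduled drop

    def should_drop(self, sojourn, now):
        if sojourn < TARGET:
            # Queue delay is acceptable: leave the dropping state.
            self.first_above = None
            self.dropping = False
            return False
        if self.first_above is None:
            # Delay just went above TARGET: wait one INTERVAL before acting.
            self.first_above = now + INTERVAL
            return False
        if not self.dropping and now >= self.first_above:
            # Delay stayed above TARGET for a whole INTERVAL: start dropping.
            self.dropping = True
            self.count = 1
            self.drop_next = now + INTERVAL / math.sqrt(self.count)
            return True
        if self.dropping and now >= self.drop_next:
            # Control law: drop again at intervals shrinking as 1/sqrt(count).
            self.count += 1
            self.drop_next = now + INTERVAL / math.sqrt(self.count)
            return True
        return False
```

With ECN, "drop" becomes "mark CE", but either way each flow receives only a yes/no signal per packet.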

I do not include DCTCP in the deployed state of the art, because it is not deployable in the RFC-compliant Internet; it is effectively incompatible with Codel in particular, because it wrongly interprets CE marks and is thus noncompliant with the ECN RFC.

However, I agree with DCTCP's goal of achieving finer-grained control of the cwnd, through AQMs providing more nuanced information about the state of the path capacity and/or bottleneck queue. An implementation that made use of ECT(1) instead of changing the meaning of CE marks would remain RFC-compliant, and could get "sufficiently close" to the ideal described above.

- Jonathan Morton
Luca Muscariello
2018-11-27 13:37:29 UTC
Permalink
Another bit to this.
A router queue is supposed to serve packets no matter what is running at
the controlled end-point: BBR, Cubic or anything else.
So, delay-based congestion controllers still get hurt in today's Internet
unless they can get their portion of buffer at the line card.
FQ creates incentives for end-points to send traffic in a smoother way
because the reward to the application is immediate and
measurable. But the end-point does not know in advance if FQ is there or
not.

So going back to sizing the link buffer, the rule I mentioned applies, and
it allows one to get the best completion times for a wider range of RTTs.
If you, Mikael don't want more than 10ms buffer, how do you achieve that?
You change the behaviour of the source and hope flow isolation is available.
If you just cut the buffer down to 10ms and do nothing else, the only thing
you get is a short queue, and you may throw away half of your link capacity.
Post by Mikael Abrahamsson
Post by Mikael Abrahamsson
It's complicated. I've had people throw in my face that I need 2xBDP in
buffer size to smoothe things out. Personally I don't want more than 10ms
buffer (max), and I don't see why I should need more than that even if
transfers are running over hundreds of ms of light-speed-in-medium induced
delay between the communicating systems.
I think we can agree that the ideal CC algo would pace packets out
smoothly at exactly the path capacity, neither building a queue at the
bottleneck nor leaving capacity on the table.
Actually achieving that in practice turns out to be difficult, because
there's no general way to discover the path capacity in advance. AQMs like
Codel, in combination with ECN, get us a step closer by explicitly
informing each flow when it is exceeding that capacity while the queue is
still reasonably short. FQ also helps, by preventing flows from
inadvertently interfering with each other by imperfectly managing their
congestion windows.
So with the presently deployed state of the art, we have cwnds oscillating
around reasonably short queue lengths, backing off sharply in response to
occasional signals, then probing back upwards when that signal goes away
for a while. It's a big improvement over dumb drop-tail FIFOs, but it's
still some distance from the ideal. That's because the information
injected by the bottleneck AQM is a crude binary state.
I do not include DCTCP in the deployed state of the art, because it is not
deployable in the RFC-compliant Internet; it is effectively incompatible
with Codel in particular, because it wrongly interprets CE marks and is
thus noncompliant with the ECN RFC.
However, I agree with DCTCP's goal of achieving finer-grained control of
the cwnd, through AQMs providing more nuanced information about the state
of the path capacity and/or bottleneck queue. An implementation that made
use of ECT(1) instead of changing the meaning of CE marks would remain
RFC-compliant, and could get "sufficiently close" to the ideal described
above.
- Jonathan Morton
Mikael Abrahamsson
2018-11-27 13:49:53 UTC
Permalink
Post by Luca Muscariello
If you, Mikael don't want more than 10ms buffer, how do you achieve that?
class class-default
random-detect 10 ms 2000 ms

That's the only thing available to me on the platforms I have. If you
would like this improved, please reach out to the Cisco ASR9k BU and tell
them to implement ECN and PIE (or something even better). They won't do it
because I say so, it seems. WRED is all they give me.
Post by Luca Muscariello
You change the behaviour of the source and hope flow isolation is available.
Sorry, I only transport the packets, I don't create them.
Post by Luca Muscariello
If you just cut the buffer down to 10ms and do nothing else, the only thing
you get is a short queue and may throw away half of your link capacity.
If i have lots of queue I might instead get customer complaints about high
latency for their interactive applications.
--
Mikael Abrahamsson email: ***@swm.pp.se
Luca Muscariello
2018-11-27 14:07:46 UTC
Permalink
Post by Mikael Abrahamsson
Post by Luca Muscariello
If you, Mikael don't want more than 10ms buffer, how do you achieve that?
class class-default
random-detect 10 ms 2000 ms
That's the only thing available to me on the platforms I have. If you
would like this improved, please reach out to the Cisco ASR9k BU and tell
them to implement ECN and PIE (or something even better). They won't do it
because I say so, it seems. WRED is all they give me.
This is a whole different discussion, but if you want to have a per-user
context at the BNG level + TM + FQ, I'm not sure that kind of beast will
ever exist. Unless you have a very small user fan-out, the hardware clocks
could loop over several thousand contexts.
You should expect those kinds of features to be in the CMTS or OLT.
Post by Mikael Abrahamsson
Post by Luca Muscariello
You change the behaviour of the source and hope flow isolation is
available.
Sorry, I only transport the packets, I don't create them.
I'm sure you create a lot of packets. Don't be humble.
Post by Mikael Abrahamsson
Post by Luca Muscariello
If you just cut the buffer down to 10ms and do nothing else, the only
thing
Post by Luca Muscariello
you get is a short queue and may throw away half of your link capacity.
If i have lots of queue I might instead get customer complaints about high
latency for their interactive applications.
--
Mikael Abrahamsson
2018-11-27 14:18:37 UTC
Permalink
Post by Luca Muscariello
This is a whole different discussion but if you want to have a per-user
context at the BNG level + TM + FQ I'm not sure that kind of beast will
ever exist. Unless you have a very small user fan-out the hardware
clocks could loop over several thousands of contexts. You should expect
those kind of features to be in the CMTS or OLT.
This is per queue per customer access port (250 customers per 10GE port,
so 250 queues). It's on an "service edge" linecard that I imagine people
use for BNG purposes. I tend to not use words like that because to me a
router is a router.

I do not do coax. I do not do PON. I do point to point ethernet using
routers and switches, like god^WIEEE intended.
--
Mikael Abrahamsson email: ***@swm.pp.se
Kathleen Nichols
2018-11-27 18:44:00 UTC
Permalink
I have been kind of blown away by this discussion. Jim Gettys kind of
kicked off the current wave of dealing with full queues, dubbing it
"bufferbloat". He wanted to write up how it happened so that people
could start on a solution and I was enlisted to get an article written.
We tried to draw on the accumulated knowledge of decades and use a
context of What Jim Saw. I think the article offers some insight on
queues (perhaps I'm biased as a co-author, but I'm not claiming any
original insights just putting it together)
https://queue.acm.org/detail.cfm?id=2071893

Further, in our first writing about CoDel, Van insisted on getting a
good explanation of queues and how things go wrong. I think the figures
and the explanation of how buffers are meant to be shock absorbers are
very useful (yes, bias again, but I'm not saying you have to agree about
CoDel's efficacy, just about how queues happen and why we need some
buffer). https://queue.acm.org/detail.cfm?id=2209336

It's just kind of weird since Jim's evangelism is at the root of this
list (and Dave's picking up the torch of course). Reading is a lost art.

Kathie
Dave Taht
2018-11-27 19:25:28 UTC
Permalink
Post by Kathleen Nichols
I have been kind of blown away by this discussion. Jim Gettys kind of
kicked off the current wave of dealing with full queues, dubbing it
"bufferbloat". He wanted to write up how it happened so that people
could start on a solution and I was enlisted to get an article written.
We tried to draw on the accumulated knowledge of decades and use a
context of What Jim Saw. I think the article offers some insight on
queues (perhaps I'm biased as a co-author, but I'm not claiming any
original insights just putting it together)
https://queue.acm.org/detail.cfm?id=2071893
Further, in our first writing about CoDel, Van insisted on getting a
good explanation of queues and how things go wrong. I think the figures
and the explanation of how buffers are meant to be shock absorbers are
very useful (yes, bias again, but I'm not saying you have to agree about
CoDel's efficacy, just about how queues happen and why we need some
buffer). https://queue.acm.org/detail.cfm?id=2209336
It's just kind of weird since Jim's evangelism is at the root of this
list (and Dave's picking up the torch of course). Reading is a lost art.
The working title of my PHD thesis is "Sanely shedding load", which, while
our work to date has made a dent on it, is a broader question, that also applies
to people and how we manage load...

... alas, for example, I'm at least 20 messages behind on these lists
this week so far,
and I figure that thesis will never complete until I learn how to
write in passive voice.

I think the two pieces above, this list, evangelism, and working
results have made an
enormous dent in how computer scientists and engineers think about
things over the
last 7 years, and it will continue to percolate through academia and
elsewhere to good effect.

And your and Jim's paper has 389 cites now:

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=bufferbloat&btnG=&oq=

I know, sadly, though, that my opinion about how much we've changed
the world, has more than a bit of confirmation bias in it.

I loved what uber did (https://eng.uber.com/qalm/)

I'd certainly love to see kleinrock taught not just in comp sci and
engineering, but in business and government,
as I just spent 3 hours at the DMV that I wish I didn't need to spend,
and a day where I could hit dice.com for
"queue theory" and come up with hundreds of hits across all
professions... rather than 0.
Post by Kathleen Nichols
Kathie
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Roland Bless
2018-11-27 21:57:53 UTC
Permalink
Hi Kathie,

[long time, no see :-)]
I'm well aware of the CoDel paper and it really does a nice job
of explaining the good queue and bad queue properties. What we
found is that loss-based TCP CCs systematically build standing
queues. Their positive function is to keep up the link utilization,
their drawback is the huge queuing delay. So everyone
not aware of both papers should read them. However, if you think
something that I wrote is NOT in accordance with your findings,
please let me know.

Regards,
Roland

On 27.11.18 at 19:44, Kathleen Nichols wrote:
Post by Kathleen Nichols
I have been kind of blown away by this discussion. Jim Gettys kind of
kicked off the current wave of dealing with full queues, dubbing it
"bufferbloat". He wanted to write up how it happened so that people
could start on a solution and I was enlisted to get an article written.
We tried to draw on the accumulated knowledge of decades and use a
context of What Jim Saw. I think the article offers some insight on
queues (perhaps I'm biased as a co-author, but I'm not claiming any
original insights just putting it together)
https://queue.acm.org/detail.cfm?id=2071893
Further, in our first writing about CoDel, Van insisted on getting a
good explanation of queues and how things go wrong. I think the figures
and the explanation of how buffers are meant to be shock absorbers are
very useful (yes, bias again, but I'm not saying you have to agree about
CoDel's efficacy, just about how queues happen and why we need some
buffer). https://queue.acm.org/detail.cfm?id=2209336
It's just kind of weird since Jim's evangelism is at the root of this
list (and Dave's picking up the torch of course). Reading is a lost art.
Bless, Roland (TM)
2018-11-27 11:53:10 UTC
Permalink
Hi Luca,
Post by Luca Muscariello
A BDP is not a large buffer. I'm not unveiling a secret.
That depends on speed and RTT (note that typically there are
several flows with different RTTs sharing the same buffer).
The essential point is not how much buffer capacity is available,
but how much is actually used, because that adds queueing delay.
Post by Luca Muscariello
And it is just a rule of thumb to have an idea at which working point
the protocol is working.
No, one can actually prove that this is the best size for
loss-based CC with backoff factor of 0.5 (assuming a single flow).
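The standard argument behind that single-flow claim, sketched: with multiplicative decrease factor 1/2, loss occurs when the window has grown to the pipe plus the buffer, and after backoff the window must still cover the pipe so the queue does not drain to empty:

```latex
W_{\max} = BDP + B, \qquad
\frac{W_{\max}}{2} = \frac{BDP + B}{2} \ge BDP
\iff B \ge BDP
```

So B = BDP is the smallest buffer that keeps the link busy across a single flow's sawtooth with a backoff factor of 0.5.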
Post by Luca Muscariello
In practice the protocol is usually working below or above that value.
That depends on the protocol.
Post by Luca Muscariello
This is where AQM and ECN help also. So most of the time the protocol is
working at way 
below 100% efficiency.
My point was that FQ_codel helps to get very close to the optimum w/o
adding useless queueing and latency.
With a single queue that's almost impossible. No, sorry. Just impossible.
No, it's possible. Please read the TCP LoLa paper.

Regards,
Roland
Luca Muscariello
2018-11-27 11:58:22 UTC
Permalink
A buffer in a router is sized once. RTT varies.
So BDP varies. That’s as simple as that.
So you just cannot be always at optimum because you don’t know what RTT you
have at any time.

LoLa is not solving that. No protocol could, BTW.
BTW I don’t see any formal proof about queue occupancy in the paper.
Post by Bless, Roland (TM)
Hi Luca,
Post by Luca Muscariello
A BDP is not a large buffer. I'm not unveiling a secret.
That depends on speed and RTT (note that typically there are
several flows with different RTTs sharing the same buffer).
The essential point is not how much buffer capacity is available,
but how much is actually used, because that adds queueing delay.
Post by Luca Muscariello
And it is just a rule of thumb to have an idea at which working point
the protocol is working.
No, one can actually prove that this is the best size for
loss-based CC with backoff factor of 0.5 (assuming a single flow).
Post by Luca Muscariello
In practice the protocol is usually working below or above that value.
That depends on the protocol.
Post by Luca Muscariello
This is where AQM and ECN help also. So most of the time the protocol is
working at way
below 100% efficiency.
My point was that FQ_codel helps to get very close to the optimum w/o
adding useless queueing and latency.
With a single queue that's almost impossible. No, sorry. Just impossible.
No, it's possible. Please read the TCP LoLa paper.
Regards,
Roland
Bless, Roland (TM)
2018-11-27 12:22:38 UTC
Permalink
Hi,
Post by Luca Muscariello
A buffer in a router is sized once. RTT varies.
So BDP varies. That’s as simple as that.
So you just cannot be always at optimum because you don’t know what RTT
you have at any time.
The endpoints can measure the RTT. Yes, it's probably a bit noisy and
there are several practical problems such as congestion on the reverse
path and multiple bottlenecks, but in general it's not impossible.
Post by Luca Muscariello
LoLa is not solving that. No protocol could BTW.
LoLa is exactly solving that. It measures RTTmin and effective RTT
(and there are lots of other delay-based CC proposals doing that)
and tries to control the overall queuing delay, even achieving
RTT-independent flow rate fairness.
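The measurement building block referred to here is simple to sketch (illustrative only; LoLa's actual state machine is considerably more involved): track the minimum observed RTT as a proxy for the propagation delay, and treat any excess as queuing delay.

```python
class DelayEstimator:
    """Delay-based CC building block (as in LoLa and similar proposals):
    track RTT_min and infer queuing delay from each RTT sample."""

    def __init__(self):
        self.rtt_min = float("inf")   # best estimate of propagation delay

    def queuing_delay(self, rtt_sample):
        # Any RTT above the observed minimum is attributed to queuing.
        self.rtt_min = min(self.rtt_min, rtt_sample)
        return rtt_sample - self.rtt_min
```

The controller then tries to keep this inferred queuing delay near a small target, which is what enables the RTT-independent fairness mentioned above.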
Post by Luca Muscariello
BTW I don’t see  any formal proof about queue occupancy in the paper.
It's not in the LoLa paper, it was in a different paper, but reviewers
thought it was already common knowledge.

Regards,
Roland
Post by Luca Muscariello
Hi Luca,
Post by Luca Muscariello
A BDP is not a large buffer. I'm not unveiling a secret.
That depends on speed and RTT (note that typically there are
several flows with different RTTs sharing the same buffer).
The essential point is not how much buffer capacity is available,
but how much is actually used, because that adds queueing delay.
Post by Luca Muscariello
And it is just a rule of thumb to have an idea at which working point
the protocol is working.
Jonathan Morton
2018-11-27 11:06:06 UTC
Permalink
Could someone perhaps comment on the thinking in the transport protocol design "crowd" when it comes to this?
BBR purports to aim for the optimum of maximum throughput at minimum latency; there is a sharp knee in the throughput-latency graph, at least in an idealised scenario. In practice it's more complicated, hence the gradual evolution of BBR.

Previously, there has been a dichotomy between loss-based TCPs which aim for maximum throughput regardless of latency, and delay-based TCPs which aim for minimum latency with little regard for throughput. Pretty much nobody uses the latter in the real world, because they get outcompeted by loss-based traffic when they meekly back down at the first sign of a queuing delay. Everyone uses loss-based TCPs, generally NewReno, CUBIC, or Compound. CUBIC is specifically designed to spend most of its time near queue saturation, by growing more slowly when near the cwnd at which loss was last experienced than when distant from it.
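The CUBIC growth behaviour described above can be sketched from the standard window function (per RFC 8312, using the RFC's default constants); this ignores the TCP-friendly region and fast convergence:

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC cwnd (in MSS) t seconds after a loss event.

    W(t) = C*(t - K)^3 + W_max, with K chosen so that W(K) = W_max.
    Growth is slow near w_max (the cwnd where loss was last seen)
    and faster far from it.
    """
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max
```

Right after backoff (t = 0) the window is beta * w_max; it plateaus near w_max around t = K, then probes beyond it.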

Compound is actually an interesting special case. It's a very straightforward combination of a loss-based TCP and a delay-based one, such that it spends a lot of time at or near the optimum operating point - at least in theory. However, it will always eventually transition into a NewReno-like mode, and fill the queue until loss is experienced.

LEDBAT is a delay-based algorithm that can be applied to protocols other than TCP. It's often used in BitTorrent clients as part of µTP. However, the sheer weight of flows employed by BT clients tends to overwhelm the algorithm, as remote senders often collectively flood the queue with near-simultaneous bursts in response to changes in collective swarm state. BT client authors seem to be ill-equipped to address this problem adequately.

- Jonathan Morton
Michael Welzl
2018-11-27 11:07:35 UTC
Permalink
Well, I'm concerned about the delay experienced by people when they surf the web... flow completion time, which relates not only to the delay of packets as they are sent from A to B, but also the utilization.

Cheers,
Michael
--
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
Dave Taht
2018-11-29 07:35:53 UTC
Permalink
Post by Mikael Abrahamsson
Post by Luca Muscariello
link fully utilized is defined as Q>0 unless you don't include the
packet currently being transmitted. I do, so the transmitter is never
idle. But that's a detail.
As someone who works with moving packets, it's perplexing to me to
interact with transport peeps who seem enormously focused on
"goodput". My personal opinion is that most people would be better off
with 80% of their available bandwidth being in use without any
noticable buffer induced delay, as opposed to the transport protocol
doing its damndest to fill up the link to 100% and sometimes failing
and inducing delay instead.
+1

I came up with a new analogy today.

Some really like to build dragsters - that go fast but might explode at
the end of the strip - or even during the race!

I like to build churches - that will stand for a thousand years.

You can reason about stable, deterministic systems, and build other
beautiful structures on top of them. I have faith in churches, not
dragsters.
Post by Mikael Abrahamsson
Could someone perhaps comment on the thinking in the transport
protocol design "crowd" when it comes to this?
Michael Welzl
2018-11-27 11:04:16 UTC
Permalink
Folks,

I'm lost in this conversation: I thought it started with a statement saying that the queue length must be at least a BDP such that full utilization is attained because the queue never drains.
To this, I'd want to add that, in addition to the links from Roland, the point of ABE is to address exactly that: https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn-12
(in the RFC Editor queue)

But now I think you're discussing a BDP worth of data *in flight*, which is something else.

Cheers,
Michael
Post by Luca Muscariello
OK. We agree.
That's correct, you need *at least* the BDP in flight so that the bottleneck queue never empties out.
This can be easily proven using fluid models for any congestion controlled source no matter if it is
loss-based, delay-based, rate-based, formula-based etc.
A highly paced source gives you the ability to get as close as theoretically possible to the BDP+epsilon
as possible.
link fully utilized is defined as Q>0 unless you don't include the packet currently being transmitted. I do,
so the transmitter is never idle. But that's a detail.
Hi,
Post by Luca Muscariello
I have never said that you need to fill the buffer to the max size to
get full capacity, which is an absurdity.
Yes, it's absurd, but that's what today's loss-based CC algorithms do.
Post by Luca Muscariello
I said you need at least the BDP so that the queue never empties out.
The link is fully utilized IFF the queue is never emptied.
I was also a bit imprecise: you'll need a BDP in flight, but
you don't need to fill the buffer at all. The latter sentence
is valid only in the direction: queue not empty -> link fully utilized.
Regards,
Roland
Post by Luca Muscariello
Hi Luca,
Post by Luca Muscariello
A congestion controlled protocol such as TCP or others, including
QUIC,
Post by Luca Muscariello
LEDBAT and so on
need at least the BDP in the transmission queue to get full link
efficiency, i.e. the queue never empties out.
This is not true. There are congestion control algorithms
(e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck link
capacity without filling the buffer to its maximum capacity. The BDP
rule of thumb basically stems from the older loss-based congestion
control variants that profit from the standing queue that they built
while they back-off and stop sending, the queue keeps the bottleneck
output busy and you'll not see underutilization of the link. Moreover,
once you get good loss de-synchronization, the buffer size requirement
for multiple long-lived flows decreases.
Post by Luca Muscariello
This gives rule of thumbs to size buffers which is also very practical
and thanks to flow isolation becomes very accurate.
The positive effect of buffers is merely their role to absorb
short-term bursts (i.e., mismatch in arrival and departure rates)
instead of dropping packets. One does not need a big buffer to
fully utilize a link (with perfect knowledge you can keep the link
saturated even without a single packet waiting in the buffer).
Furthermore, large buffers (e.g., using the BDP rule of thumb)
are not useful/practical anymore at very high speeds, and
memory is also quite costly at such high speeds...
Regards,
Roland
[1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
TCP LoLa: Congestion Control for Low Latencies and High Throughput.
Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
215-218, Singapore, Singapore, October 2017
http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf
Post by Luca Muscariello
1) find a way to keep the number of backlogged flows at a
reasonable value.
Post by Luca Muscariello
This largely depends on the minimum fair rate an application may
need in
Post by Luca Muscariello
the long term.
We discussed a little bit of available mechanisms to achieve that
in the
Post by Luca Muscariello
literature.
2) fix the largest RTT you want to serve at full utilization and size
the buffer using BDP * N_backlogged.
Or the other way round: check how much memory you can use
in the router/line card/device and for a fixed N, compute the largest
RTT you can serve at full utilization.
3) there is still some memory to dimension for sparse flows in
addition
Post by Luca Muscariello
to that, but this is not based on BDP.
It is just enough to compute the total utilization of sparse flows and
use the same simple model Toke has used
to compute the (de)prioritization probability.
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing.
It would also be interesting to compare another mechanism that we have
mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already
available in
Post by Luca Muscariello
Cisco nexus switches for data centres.
I think that the CoDel part would still provide the ECN feature,
that all the others cannot have.
However the others, the last one especially can be implemented in
silicon with reasonable cost.
Bless, Roland (TM)
2018-11-27 12:48:14 UTC
Permalink
Hi Michael,
Post by Michael Welzl
I'm lost in this conversation: I thought it started with a statement saying that the queue length must be at least a BDP such that full utilization is attained because the queue never drains.
I think it helps to distinguish between bottleneck buffer size (i.e.,
its capacity) and bottleneck buffer occupancy (i.e., queue length).
My point was that full bottleneck utilization doesn't require any buffer
occupancy at all. I think one should also distinguish precisely between
different viewpoints: at the sender or at the bottleneck (aggregate of
many flows from different sources with different RTTs).
Post by Michael Welzl
To this, I'd want to add that, in addition to the links from Roland, the point of ABE is to address exactly that: https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn-12
(in the RFC Editor queue)
Yep, because then the backoff is less drastic so the utilization is kept
at a higher level even if the queue is much smaller than a BDP (that is
concluded from the fact that when ECN is present an AQM will try to keep
the queue much smaller). Our complementary approach was the AQM Steering
approach that lets the AQM adapt.
Post by Michael Welzl
But now I think you're discussing a BDP worth of data *in flight*, which is something else.
Yes, maybe the two got confused, but they're related anyhow:
having a BDP in flight will allow you to fully utilize the link capacity,
having BDP+x in flight will lead to having x queued up in the bottleneck
buffer. So having 2BDP inflight will lead to 1 BDP on the wire and 1 BDP
in the buffer. That's what loss based CC variants usually have and what
BBRv1 set as limit.

Regards,
Roland
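Roland's inflight-vs-queue relation is easy to sketch numerically (the rate and RTT values below are illustrative assumptions):

```python
def standing_queue(inflight_bytes, rtt_min_s, bottleneck_bps):
    """Bytes queued at the bottleneck for a given amount of inflight data.

    A BDP's worth of data fits "on the wire"; anything beyond it
    sits in the bottleneck buffer.
    """
    bdp = rtt_min_s * bottleneck_bps / 8.0  # bytes on the wire
    return max(0.0, inflight_bytes - bdp)

bdp = 0.020 * 100e6 / 8.0  # 20 ms RTTmin at 100 Mbit/s -> 250 kB
print(standing_queue(bdp, 0.020, 100e6))      # BDP inflight: empty queue
print(standing_queue(2 * bdp, 0.020, 100e6))  # 2*BDP inflight: 1 BDP queued
```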
Bless, Roland (TM)
2018-11-27 11:40:45 UTC
Permalink
Hi Luca,
Post by Luca Muscariello
OK. We agree.
That's correct, you need *at least* the BDP in flight so that the
bottleneck queue never empties out.
No, that's not what I meant, but it's quite simple.
You need: data min_inflight=2 * RTTmin * bottleneck_rate to fully
utilize the bottleneck link.
If this is true, the bottleneck queue will be empty. If your amount
of inflight data is larger, the bottleneck queue buffer will store
the excess packets. With just min_inflight there will be no
bottleneck queue, the packets are "on the wire".
Post by Luca Muscariello
This can be easily proven using fluid models for any congestion
controlled source no matter if it is 
loss-based, delay-based, rate-based, formula-based etc.
A highly paced source gives you the ability to get as close as
theoretically possible to the BDP+epsilon
as possible.
Yep, but that BDP is "on the wire" and epsilon will be in the bottleneck
buffer.
Post by Luca Muscariello
link fully utilized is defined as Q>0 unless you don't include the
packet currently being transmitted. I do,
so the transmitter is never idle. But that's a detail.
I wouldn't define link fully utilized as Q>0, but if Q>0 then
the link is fully utilized (that's what I meant by the direction
of implication).

Regards,
Roland
Bless, Roland (TM)
2018-11-27 11:43:49 UTC
Permalink
Hi,
Post by Bless, Roland (TM)
Hi Luca,
Post by Luca Muscariello
OK. We agree.
That's correct, you need *at least* the BDP in flight so that the
bottleneck queue never empties out.
No, that's not what I meant, but it's quite simple.
You need: data min_inflight=2 * RTTmin * bottleneck_rate to fully
Sorry, it's meant to be: min_inflight= RTTmin * bottleneck_rate

Regards,
Roland
Dave Taht
2018-11-29 07:39:56 UTC
Permalink
Post by Bless, Roland (TM)
Hi Luca,
Post by Luca Muscariello
OK. We agree.
That's correct, you need *at least* the BDP in flight so that the
bottleneck queue never empties out.
No, that's not what I meant, but it's quite simple.
You need: data min_inflight=2 * RTTmin * bottleneck_rate to fully
utilize the bottleneck link.
If this is true, the bottleneck queue will be empty. If your amount
of inflight data is larger, the bottleneck queue buffer will store
the excess packets. With just min_inflight there will be no
bottleneck queue, the packets are "on the wire".
Post by Luca Muscariello
This can be easily proven using fluid models for any congestion
controlled source no matter if it is 
loss-based, delay-based, rate-based, formula-based etc.
A highly paced source gives you the ability to get as close as
theoretically possible to the BDP+epsilon
as possible.
Yep, but that BDP is "on the wire" and epsilon will be in the bottleneck
buffer.
I'm hoping I made my point effectively earlier, that

" data min_inflight=2 * RTTmin * bottleneck_rate "

when it is nearly certain that more than one flow exists, means aiming
for the BDP in a single flow is generally foolish. I liked the Stanford
result; I think it's pretty general. I see hundreds of flows active
every minute. There was another paper that looked into some magic
200-ish number of simultaneously active flows, normally
Jonathan Morton
2018-11-29 07:45:44 UTC
Permalink
…when it is nearly certain that more than one flow exists, means aiming
for the BDP in a single flow is generally foolish.
It might be more accurate to say that the BDP of the fair-share of the path is the cwnd to aim for. Plus epsilon for probing.

- Jonathan Morton
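Jonathan's target can be written down directly; the one-segment epsilon and the example numbers are assumptions for illustration:

```python
def target_cwnd_bytes(bottleneck_bps, rtt_s, n_flows, mss=1448):
    """cwnd to aim for: BDP of the flow's fair share, plus epsilon."""
    fair_rate_bps = bottleneck_bps / n_flows
    fair_bdp = fair_rate_bps * rtt_s / 8.0  # bytes
    return fair_bdp + mss                   # one MSS of probing headroom

# Four flows sharing 100 Mbit/s at 20 ms RTT:
# fair-share BDP = 62.5 kB, so aim for roughly 64 kB per flow.
print(target_cwnd_bytes(100e6, 0.020, 4))
```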
Dave Taht
2018-11-29 07:54:55 UTC
Permalink
Post by Jonathan Morton
…when it is nearly certain that more than one flow exists, means aiming
for the BDP in a single flow is generally foolish.
It might be more accurate to say that the BDP of the fair-share of the path is the cwnd to aim for. Plus epsilon for probing.
OK, much better, thanks.
Post by Jonathan Morton
- Jonathan Morton
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Luca Muscariello
2018-11-29 08:09:02 UTC
Permalink
If you have multiple flows, the BDP will change as measured at the
endpoints. Also, the queue occupancy has to accommodate the overshoot.
If you have a BDP in flight plus epsilon, you should not size based on
the long-term value but on the overshoot. If you don't have space for
it, the long-term value may be even larger.
Post by Jonathan Morton
Post by Jonathan Morton

when it is nearly certain that more than one flow exists, means aiming
for the BDP in a single flow is generally foolish.
It might be more accurate to say that the BDP of the fair-share of the
path is the cwnd to aim for. Plus epsilon for probing.
OK, much better, thanks.
Post by Jonathan Morton
- Jonathan Morton
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Bless, Roland (TM)
2018-11-29 13:49:41 UTC
Permalink
Hi Jonathan,
Post by Jonathan Morton
…when it is nearly certain that more than one flow exists, means aiming
for the BDP in a single flow is generally foolish.
It might be more accurate to say that the BDP of the fair-share of the path is the cwnd to aim for. Plus epsilon for probing.
+1

Right, my statement wasn't on buffer sizing, but on the amount of
inflight data (see other mail). Interestingly enough, it seems hard to
find out the current share without any queue, where the flows indirectly
interact with each other...

Regards,
Roland

Bless, Roland (TM)
2018-11-29 08:41:28 UTC
Permalink
Hi Dave,
Post by Dave Taht
Post by Bless, Roland (TM)
Hi Luca,
Post by Luca Muscariello
OK. We agree.
That's correct, you need *at least* the BDP in flight so that the
bottleneck queue never empties out.
No, that's not what I meant, but it's quite simple.
You need: data min_inflight=2 * RTTmin * bottleneck_rate to fully
utilize the bottleneck link.
If this is true, the bottleneck queue will be empty. If your amount
of inflight data is larger, the bottleneck queue buffer will store
the excess packets. With just min_inflight there will be no
bottleneck queue, the packets are "on the wire".
Post by Luca Muscariello
This can be easily proven using fluid models for any congestion
controlled source no matter if it is 
loss-based, delay-based, rate-based, formula-based etc.
A highly paced source gives you the ability to get as close as
theoretically possible to BDP+epsilon.
Yep, but that BDP is "on the wire" and epsilon will be in the bottleneck
buffer.
I'm hoping I made my point effectively earlier, that
" data min_inflight=2 * RTTmin * bottleneck_rate "
That factor of 2 was a mistake in my first mail (sorry for that...).
I corrected that three minutes after. I should have written:
data min_inflight=RTTmin * bottleneck_rate
Post by Dave Taht
when it is nearly certain that more than one flow exists, means aiming
for the BDP in a single flow is generally foolish. Liked the stanford
I think one should not confuse the buffer sizing rule with the
calculation for inflight data...
Post by Dave Taht
result, I think it's pretty general. I see hundreds of flows active
every minute. There was another paper that looked into some magic
200-ish number as simultaneous flows active, normally
So for buffer sizing, the BDP dependent rule is foolish in general,
because it is optimized for older loss-based TCP congestion controls
so that they can keep the utilization high. It's correct that in
presence of multiple flows and good loss desynchronization, you
still get high utilization with a smaller buffer (Appenzeller et. al,
SIGCOMM 2004).

However, when it comes to CWnd sizing, that inflight rule would convert
to:
data min_inflight=RTTmin * bottleneck_rate_share
because other flows are present at the bottleneck.

Interestingly enough: flows with a different RTT_min should
use different CWnds, but their amount of queued data at the bottleneck
should be nearly equal if you want to have flow rate fairness.

Regards
Roland
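Roland's corrected rule, and his observation about flows with different RTTmin, can be sketched numerically (illustrative Python; the names are mine and the numbers hypothetical):

```python
def min_inflight_bytes(rtt_min_s, rate_share_bps):
    """Roland's corrected rule: the inflight data needed to fully
    utilize a flow's bottleneck rate share while the queue stays
    empty (all of it is "on the wire")."""
    return rtt_min_s * rate_share_bps / 8  # bits -> bytes

def queued_bytes(inflight_bytes, rtt_min_s, rate_share_bps):
    """Anything in flight beyond min_inflight sits in the bottleneck buffer."""
    return max(0.0, inflight_bytes - min_inflight_bytes(rtt_min_s, rate_share_bps))

# Two flows with an equal 50 Mbit/s share but different RTTmin need
# different cwnds (62.5 kB vs 250 kB) for the same empty-queue
# operating point -- the point about flow rate fairness above.
cwnd_a = min_inflight_bytes(0.010, 50e6)
cwnd_b = min_inflight_bytes(0.040, 50e6)
```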
Dave Taht
2018-11-29 07:33:06 UTC
Permalink
Post by Bless, Roland (TM)
Hi Luca,
Post by Luca Muscariello
A congestion controlled protocol such as TCP or others, including QUIC,
LEDBAT and so on
need at least the BDP in the transmission queue to get full link
efficiency, i.e. the queue never empties out.
This is not true. There are congestion control algorithms
(e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck link
capacity without filling the buffer to its maximum capacity. The BDP
Just to stay cynical, I would rather like the BBR and Lola folk to look
closely at asymmetric networks, ack path delay, and lower rates than
1Gbit. And what the heck... wifi. :)

BBRv1, for example, is hard coded to reduce cwnd to 4, not lower - because
that works in the data center. Lola, so far as I know, achieves its
tested results at 1-10Gbits. My world and much of the rest of the world,
barely gets to a gbit, on a good day, with a tail-wind.

If either of these TCPs could be tuned to work well and not saturate
5Mbit links I would be a happier person. RRUL benchmarks anyone?

I did, honestly, want to run LoLa (the codebase was broken), and I am
patiently waiting for BBRv2 to escape (while hoping that the googlers
actually run some flent tests at edge bandwidths before I tear into it)

Personally, I'd settle for SFQ on the CMTSes, fq_codel on the home
routers, and then let the tcp-ers decide how much delay and loss they
can tolerate.

Another thought... I mean... can't we all just agree to make cubic
more gentle and go fix that, and not have a flag day? "From linux 5.0
forward cubic shall:

Stop increasing its window at 250ms of delay greater than
the initial RTT?

Have it occasionally rtt probe a bit, more like BBR?
Post by Bless, Roland (TM)
rule of thumb basically stems from the older loss-based congestion
control variants that profit from the standing queue that they built
while they back-off and stop sending, the queue keeps the bottleneck
output busy and you'll not see underutilization of the link. Moreover,
once you get good loss de-synchronization, the buffer size requirement
for multiple long-lived flows decreases.
Post by Luca Muscariello
This gives rule of thumbs to size buffers which is also very practical
and thanks to flow isolation becomes very accurate.
The positive effect of buffers is merely their role to absorb
short-term bursts (i.e., mismatch in arrival and departure rates)
instead of dropping packets. One does not need a big buffer to
fully utilize a link (with perfect knowledge you can keep the link
saturated even without a single packet waiting in the buffer).
Furthermore, for large buffers (e.g., using the BDP rule of thumb),
memory is also quite costly at such high speeds...
Regards,
Roland
[1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
TCP LoLa: Congestion Control for Low Latencies and High Throughput.
Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
215-218, Singapore, Singapore, October 2017
http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf
This whole thread, although diversive... well, I'd really like everybody
to get together and try to write a joint paper on the best stuff to do,
worldwide, to make bufferbloat go away.
Post by Bless, Roland (TM)
Post by Luca Muscariello
Which is: 
1) find a way to keep the number of backlogged flows at a reasonable value. 
This largely depends on the minimum fair rate an application may need in
the long term.
We discussed a little bit of available mechanisms to achieve that in the
literature.
2) fix the largest RTT you want to serve at full utilization and size
the buffer using BDP * N_backlogged.  
Or the other way round: check how much memory you can use 
in the router/line card/device and for a fixed N, compute the largest
RTT you can serve at full utilization. 
3) there is still some memory to dimension for sparse flows in addition
to that, but this is not based on BDP. 
It is just enough to compute the total utilization of sparse flows and
use the same simple model Toke has used 
to compute the (de)prioritization probability.
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing. 
It would also be interesting to compare another mechanism that we have
mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already available in
Cisco nexus switches for data centres.
I think that the CoDel part would still provide the ECN feature,
that all the others cannot have.
However the others, the last one especially, can be implemented in
silicon with reasonable cost.
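Step 2 of that procedure, and its inversion, can be written down in a short illustrative sketch (Python; the function names are mine, the numbers hypothetical):

```python
def buffer_bytes(rtt_max_s, rate_bps, n_backlogged):
    """Size the buffer as BDP * N_backlogged for the largest RTT
    you want to serve at full utilization."""
    return (rtt_max_s * rate_bps / 8) * n_backlogged

def max_rtt_served_s(mem_bytes, rate_bps, n_backlogged):
    """The other way round: given a memory budget and a fixed N,
    the largest RTT served at full utilization."""
    return mem_bytes * 8 / (rate_bps * n_backlogged)

# Hypothetical: 1 Gbit/s line card, 100 ms max RTT, 4 backlogged flows
# -> 50 MB of buffer; conversely, a 50 MB budget serves that 100 ms.
buf = buffer_bytes(0.100, 1e9, 4)
```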
Bless, Roland (TM)
2018-11-29 08:13:43 UTC
Permalink
Hi Dave,
Post by Dave Taht
Post by Bless, Roland (TM)
Hi Luca,
Post by Luca Muscariello
A congestion controlled protocol such as TCP or others, including QUIC,
LEDBAT and so on
need at least the BDP in the transmission queue to get full link
efficiency, i.e. the queue never empties out.
This is not true. There are congestion control algorithms
(e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck link
capacity without filling the buffer to its maximum capacity. The BDP
Just to stay cynical, I would rather like the BBR and Lola folk to look
closely at asymmetric networks, ack path delay, and lower rates than
1Gbit. And what the heck... wifi. :)
Yes, absolutely right from a practical point of view.
The thing is that we have to prioritize our research work
at the moment. LoLa is meant to be a conceptual study rather
than a real-world full blown, rock solid congestion control.
It came out of a research project that focuses on high speed networks,
thus we were experimenting with that. Scaling a CC across several
orders of magnitude w.r.t. speed is a challenge. I think Mario
also used 100Mbit/s for experiments (but they aren't in that paper)
and it still works fine. However, experimenting with LoLa in real
world environments will always be a problem if flows with
loss-based CC are actually present at the same bottleneck, because LoLa
will back-off (it will not sacrifice its low latency goal for getting
more bandwidth). However, LoLa shows that you can actually get very
close to the goal of limiting queuing delay while achieving high
utilization _and_ fairness at the same time. BTW, there is an ns-3
implementation of LoLa available...
Post by Dave Taht
BBRv1, for example, is hard coded to reduce cwnd to 4, not lower - because
that works in the data center. Lola, so far as I know, achieves its
tested results at 1-10Gbits. My world and much of the rest of the world,
barely gets to a gbit, on a good day, with a tail-wind.
If either of these TCPs could be tuned to work well and not saturate
5Mbit links I would be a happier person. RRUL benchmarks anyone?
I think we need some students to do this...
Post by Dave Taht
I did, honestly, want to run LoLa (the codebase was broken), and I am
patiently waiting for BBRv2 to escape (while hoping that the googlers
actually run some flent tests at edge bandwidths before I tear into it)
LoLa code is currently revised by Felix and I think it will converge
to a more stable state within the next few weeks.
Post by Dave Taht
Personally, I'd settle for SFQ on the CMTSes, fq_codel on the home
routers, and then let the tcp-ers decide how much delay and loss they
can tolerate.
Another thought... I mean... can't we all just agree to make cubic
more gentle and go fix that, and not have a flag day? "From linux 5.0
Stop increasing its window at 250ms of delay greater than
the initial RTT?
Have it occasionally rtt probe a bit, more like BBR?
RTT probing is fine, but in order to measure RTTmin you have
to make sure that the bottleneck queue is empty. This isn't that
trivial, because all flows need to synchronize a bit in order to
achieve that. But both, BBR and LoLa, have such mechanisms.
Post by Dave Taht
Post by Bless, Roland (TM)
rule of thumb basically stems from the older loss-based congestion
control variants that profit from the standing queue that they built
while they back-off and stop sending, the queue keeps the bottleneck
output busy and you'll not see underutilization of the link. Moreover,
once you get good loss de-synchronization, the buffer size requirement
for multiple long-lived flows decreases.
Post by Luca Muscariello
This gives rule of thumbs to size buffers which is also very practical
and thanks to flow isolation becomes very accurate.
The positive effect of buffers is merely their role to absorb
short-term bursts (i.e., mismatch in arrival and departure rates)
instead of dropping packets. One does not need a big buffer to
fully utilize a link (with perfect knowledge you can keep the link
saturated even without a single packet waiting in the buffer).
Furthermore, for large buffers (e.g., using the BDP rule of thumb),
memory is also quite costly at such high speeds...
Regards,
Roland
[1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
TCP LoLa: Congestion Control for Low Latencies and High Throughput.
Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
215-218, Singapore, Singapore, October 2017
http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf
This whole thread, although diversive... well, I'd really like everybody
to get together and try to write a joint paper on the best stuff to do,
worldwide, to make bufferbloat go away.
Yeah, at least if everyone used LoLa you could eliminate
bufferbloat, but a flag day is impossible and loss-based CC
will not go away so soon. However, self-inflicted queueing
delay from loss-based CCs hurts nowadays and now we know how to do
better...
Post by Dave Taht
Post by Bless, Roland (TM)
Post by Luca Muscariello
Which is: 
1) find a way to keep the number of backlogged flows at a reasonable value. 
This largely depends on the minimum fair rate an application may need in
the long term.
We discussed a little bit of available mechanisms to achieve that in the
literature.
2) fix the largest RTT you want to serve at full utilization and size
the buffer using BDP * N_backlogged.  
Or the other way round: check how much memory you can use 
in the router/line card/device and for a fixed N, compute the largest
RTT you can serve at full utilization. 
3) there is still some memory to dimension for sparse flows in addition
to that, but this is not based on BDP. 
It is just enough to compute the total utilization of sparse flows and
use the same simple model Toke has used 
to compute the (de)prioritization probability.
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing. 
It would also be interesting to compare another mechanism that we have
mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already available in
Cisco nexus switches for data centres.
I think that the CoDel part would still provide the ECN feature,
that all the others cannot have.
However the others, the last one especially, can be implemented in
silicon with reasonable cost.
Regards,
Roland
Pete Heist
2018-11-29 10:00:15 UTC
Permalink
Post by Dave Taht
This whole thread, although diversive... well, I'd really like everybody
to get together and try to write a joint paper on the best stuff to do,
worldwide, to make bufferbloat go away.
+1

I don’t think it’s an accident that a discussion around CoDel evolved into a discussion around TCP.

If newer TCP CC algorithms can eliminate self-induced bloat, it should still be possible for queue management to handle older TCP implementations and extreme cases while not damaging newer TCPs. Beyond that, there may be areas where queue management can actually enhance the performance of newer TCPs. For starters, there’s what happens within an RTT, which I suppose can’t be dealt with in the TCP stack, and referring back to one of Jon’s messages from 11/27, the possibility for improved signaling from AQM back to TCP on the state of the queue. Global coordination could make this work better.

p.s.- Apologies for it taking me longer than an RTT to re-read the original CoDel papers and think through some implications. My original question might have been smarter.
Toke Høiland-Jørgensen
2018-11-27 11:52:05 UTC
Permalink
Post by Luca Muscariello
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing.
It would also be interesting to compare another mechanism that we have
mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already available in
Cisco nexus switches for data centres.
One thing I wondered about afterwards was whether or not it would be
feasible (as in, easy to add, or maybe even supported in current
versions) to tie in an AQM to an AFD-type virtual fairness queueing
system? You could keep the AQM state variables along with the per-flow
state and react appropriately. Any reason why this wouldn't work?

-Toke
Dave Taht
2018-11-28 03:37:39 UTC
Permalink
Post by Toke Høiland-Jørgensen
Post by Luca Muscariello
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing.
It would also be interesting to compare another mechanism that we have
mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already available in
Cisco nexus switches for data centres.
One thing I wondered about afterwards was whether or not it would be
feasible (as in, easy to add, or maybe even supported in current
versions) to tie in an AQM to an AFD-type virtual fairness queueing
system? You could keep the AQM state variables along with the per-flow
state and react appropriately. Any reason why this wouldn't work?
Just bookmarking this thought for now.
Dave Taht
2018-11-27 20:58:18 UTC
Permalink
OK, wow, this conversation got long, and I'm still 20 messages behind.

Two points, and I'm going to go back to work, and maybe I'll try to
summarize a table
of the competing viewpoints, as there's far more than BDP of
discussion here, and what
we need is sqrt(bdp) to deal with all the different conversational flows. :)

On Tue, Nov 27, 2018 at 1:24 AM Luca Muscariello
I think that this is a very good comment to the discussion at the defense about the comparison between
SFQ with longest queue drop and FQ_Codel.
A congestion controlled protocol such as TCP or others, including QUIC, LEDBAT and so on
need at least the BDP in the transmission queue to get full link efficiency, i.e. the queue never empties out.
no, I think it needs a BDP in flight.

I think some of the confusion here is that your TCP stack needs to
keep around a BDP in order to deal with
retransmits, but that lives in another set of buffers entirely.
This gives rule of thumbs to size buffers which is also very practical and thanks to flow isolation becomes very accurate.
1) find a way to keep the number of backlogged flows at a reasonable value.
This largely depends on the minimum fair rate an application may need in the long term.
We discussed a little bit of available mechanisms to achieve that in the literature.
2) fix the largest RTT you want to serve at full utilization and size the buffer using BDP * N_backlogged.
Or the other way round: check how much memory you can use
in the router/line card/device and for a fixed N, compute the largest RTT you can serve at full utilization.
My own take on the whole BDP argument is that *so long as the flows in
that BDP are thoroughly mixed* you win.
3) there is still some memory to dimension for sparse flows in addition to that, but this is not based on BDP.
It is just enough to compute the total utilization of sparse flows and use the same simple model Toke has used
to compute the (de)prioritization probability.
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing.
It would also be interesting to compare another mechanism that we have mentioned during the defense
which is AFD + a sparse flow queue. Which is, BTW, already available in Cisco nexus switches for data centres.
I think that the CoDel part would still provide the ECN feature, that all the others cannot have.
However the others, the last one especially, can be implemented in silicon with reasonable cost.
Post by Jonathan Morton
Post by Pete Heist
So I just thought to continue the discussion- when does the CoDel part of fq_codel actually help in the real world?
Fundamentally, without Codel the only limits on the congestion window would be when the sender or receiver hit configured or calculated rwnd and cwnd limits (the rwnd is visible on the wire and usually chosen to be large enough to be a non-factor), or when the queue overflows. Large windows require buffer memory in both sender and receiver, increasing costs on the sender in particular (who typically has many flows to manage per machine).
Queue overflow tends to result in burst loss and head-of-line blocking in the receiver, which is visible to the user as a pause and subsequent jump in the progress of their download, accompanied by a major fluctuation in the estimated time to completion. The lost packets also consume capacity upstream of the bottleneck which does not contribute to application throughput. These effects are independent of whether overflow dropping occurs at the head or tail of the bottleneck queue, though recovery occurs more quickly (and fewer packets might be lost) if dropping occurs from the head of the queue.
From a pure throughput-efficiency standpoint, Codel allows using ECN for congestion signalling instead of packet loss, potentially eliminating packet loss and associated head-of-line blocking entirely. Even without ECN, the actual cwnd is kept near the minimum necessary to satisfy the BDP of the path, reducing memory requirements and significantly shortening the recovery time of each loss cycle, to the point where the end-user may not notice that delivery is not perfectly smooth, and implementing accurate completion time estimators is considerably simplified.
An important use-case is where two sequential bottlenecks exist on the path, the upstream one being only slightly higher capacity but lacking any queue management at all. This is presently common in cases where home CPE implements inbound shaping on a generic ISP last-mile link. In that case, without Codel running on the second bottleneck, traffic would collect in the first bottleneck's queue as well, greatly reducing the beneficial effects of FQ implemented on the second bottleneck. In this topology, the overall effect is inter-flow as well as intra-flow.
The combination of Codel with FQ is done in such a way that a separate instance of Codel is implemented for each flow. This means that congestion signals are only sent to flows that require them, and non-saturating flows are unmolested. This makes the combination synergistic, where each component offers an improvement to the behaviour of the other.
- Jonathan Morton
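For readers following along, the "CoDel part" being discussed fits in a few lines. This is a heavily simplified sketch of the per-flow decision logic (it omits the real state machine's dropping-state re-entry and ECN handling; the 5 ms / 100 ms constants are CoDel's published defaults, everything else is illustrative):

```python
import math

TARGET = 0.005    # 5 ms sojourn-time target
INTERVAL = 0.100  # 100 ms initial interval (worst-case expected RTT)

class CoDelSketch:
    """Simplified CoDel decision logic: signal (drop or ECN-mark) once
    packet sojourn time has stayed above TARGET for a full INTERVAL,
    then tighten the next deadline by 1/sqrt(count) (the control law)."""
    def __init__(self):
        self.deadline = None  # when we may next signal, once armed
        self.count = 0        # signals sent in the current episode

    def should_signal(self, now, sojourn):
        if sojourn < TARGET:
            # Queue drained below target: stand down entirely.
            self.deadline = None
            self.count = 0
            return False
        if self.deadline is None:
            # First excursion above target: arm a deadline one INTERVAL out.
            self.deadline = now + INTERVAL
            return False
        if now >= self.deadline:
            # Persistent queue: signal, and schedule the next one sooner.
            self.count += 1
            self.deadline = now + INTERVAL / math.sqrt(self.count)
            return True
        return False
```

In fq_codel each flow gets its own instance of this state, which is why congestion signals reach only the flows that are actually building a standing queue.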
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Luca Muscariello
2018-11-27 22:19:15 UTC
Permalink
I suggest re-reading this

https://queue.acm.org/detail.cfm?id=3022184
Post by Dave Taht
OK, wow, this conversation got long, and I'm still 20 messages behind.
Two points, and I'm going to go back to work, and maybe I'll try to
summarize a table
of the competing viewpoints, as there's far more than BDP of
discussion here, and what
we need is sqrt(bdp) to deal with all the different conversational flows. :)
On Tue, Nov 27, 2018 at 1:24 AM Luca Muscariello
Post by Luca Muscariello
I think that this is a very good comment to the discussion at the
defense about the comparison between
Post by Luca Muscariello
SFQ with longest queue drop and FQ_Codel.
A congestion controlled protocol such as TCP or others, including QUIC,
LEDBAT and so on
Post by Luca Muscariello
need at least the BDP in the transmission queue to get full link
efficiency, i.e. the queue never empties out.
no, I think it needs a BDP in flight.
I think some of the confusion here is that your TCP stack needs to
keep around a BDP in order to deal with
retransmits, but that lives in another set of buffers entirely.
Post by Luca Muscariello
This gives rule of thumbs to size buffers which is also very practical
and thanks to flow isolation becomes very accurate.
Post by Luca Muscariello
1) find a way to keep the number of backlogged flows at a reasonable
value.
Post by Luca Muscariello
This largely depends on the minimum fair rate an application may need in
the long term.
Post by Luca Muscariello
We discussed a little bit of available mechanisms to achieve that in the
literature.
Post by Luca Muscariello
2) fix the largest RTT you want to serve at full utilization and size
the buffer using BDP * N_backlogged.
Post by Luca Muscariello
Or the other way round: check how much memory you can use
in the router/line card/device and for a fixed N, compute the largest
RTT you can serve at full utilization.
My own take on the whole BDP argument is that *so long as the flows in
that BDP are thoroughly mixed* you win.
Post by Luca Muscariello
3) there is still some memory to dimension for sparse flows in addition
to that, but this is not based on BDP.
Post by Luca Muscariello
It is just enough to compute the total utilization of sparse flows and
use the same simple model Toke has used
Post by Luca Muscariello
to compute the (de)prioritization probability.
This procedure would allow to size FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing.
It would also be interesting to compare another mechanism that we have
mentioned during the defense
Post by Luca Muscariello
which is AFD + a sparse flow queue. Which is, BTW, already available in
Cisco nexus switches for data centres.
Post by Luca Muscariello
I think that the CoDel part would still provide the ECN feature,
that all the others cannot have.
Post by Luca Muscariello
However the others, the last one especially, can be implemented in
silicon with reasonable cost.
Post by Luca Muscariello
Post by Jonathan Morton
Post by Pete Heist
So I just thought to continue the discussion- when does the CoDel
part of fq_codel actually help in the real world?
Post by Luca Muscariello
Post by Jonathan Morton
Fundamentally, without Codel the only limits on the congestion window
would be when the sender or receiver hit configured or calculated rwnd and
cwnd limits (the rwnd is visible on the wire and usually chosen to be large
enough to be a non-factor), or when the queue overflows. Large windows
require buffer memory in both sender and receiver, increasing costs on the
sender in particular (who typically has many flows to manage per machine).
Post by Luca Muscariello
Post by Jonathan Morton
Queue overflow tends to result in burst loss and head-of-line blocking
in the receiver, which is visible to the user as a pause and subsequent
jump in the progress of their download, accompanied by a major fluctuation
in the estimated time to completion. The lost packets also consume
capacity upstream of the bottleneck which does not contribute to
application throughput. These effects are independent of whether overflow
dropping occurs at the head or tail of the bottleneck queue, though
recovery occurs more quickly (and fewer packets might be lost) if dropping
occurs from the head of the queue.
Post by Luca Muscariello
Post by Jonathan Morton
From a pure throughput-efficiency standpoint, Codel allows using ECN
for congestion signalling instead of packet loss, potentially eliminating
packet loss and associated head-of-line blocking entirely. Even without
ECN, the actual cwnd is kept near the minimum necessary to satisfy the BDP
of the path, reducing memory requirements and significantly shortening the
recovery time of each loss cycle, to the point where the end-user may not
notice that delivery is not perfectly smooth, and implementing accurate
completion time estimators is considerably simplified.
Post by Luca Muscariello
Post by Jonathan Morton
An important use-case is where two sequential bottlenecks exist on the
path, the upstream one being only slightly higher capacity but lacking any
queue management at all. This is presently common in cases where home CPE
implements inbound shaping on a generic ISP last-mile link. In that case,
without Codel running on the second bottleneck, traffic would collect in
the first bottleneck's queue as well, greatly reducing the beneficial
effects of FQ implemented on the second bottleneck. In this topology, the
overall effect is inter-flow as well as intra-flow.
Post by Luca Muscariello
Post by Jonathan Morton
The combination of Codel with FQ is done in such a way that a separate
instance of Codel is implemented for each flow. This means that congestion
signals are only sent to flows that require them, and non-saturating flows
are unmolested. This makes the combination synergistic, where each
component offers an improvement to the behaviour of the other.
Post by Luca Muscariello
Post by Jonathan Morton
- Jonathan Morton
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Roland Bless
2018-11-27 22:30:30 UTC
Permalink
Hi,
I suggest re-reading this 
https://queue.acm.org/detail.cfm?id=3022184
Probably not without this afterwards:
https://ieeexplore.ieee.org/document/8117540

(especially sections II and III).

Regards,
Roland
Dave Taht
2018-11-27 23:17:04 UTC
Permalink
Post by Roland Bless
Hi,
Post by Luca Muscariello
I suggest re-reading this
https://queue.acm.org/detail.cfm?id=3022184
https://ieeexplore.ieee.org/document/8117540
(especially sections II and III).
And this was awesome:

https://www.lk.cs.ucla.edu/data/files/Kleinrock/Internet%20congestion%20control%20using%20the%20power%20metric%20LK%20Mod%20aug%202%202018.pdf

and somewhere in that one are two points I'd like to make about BBR
with multiple flows, reverse congestion issues, and so on...

but now that we all have bedtime reading, I'm going to go back to
hacking on libcuckoo. :)
Post by Roland Bless
Regards,
Roland
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Kathleen Nichols
2018-11-28 03:47:08 UTC
Permalink
On 11/27/18 3:17 PM, Dave Taht wrote:
...
Post by Dave Taht
but now that we all have bedtime reading, I'm going to go back to
hacking on libcuckoo. :)
Geez, louise. As if everyone doesn't have enough to do! I apologize. I
did not mean for anyone to completely read the links I sent, just look
at the relevant pictures and skim the relevant text.
Luca Muscariello
2018-11-28 09:56:31 UTC
Permalink
Dave,

The single BDP inflight is a rule of thumb that does not account for
fluctuations of the RTT.
And I am not talking about random fluctuations and noise. I am talking
about fluctuations
from a control theoretic point of view to stabilise the system, e.g. the
trajectory of the system variable that
gets to the optimal point no matter the initial conditions (Lyapunov).
The ACM Queue paper on CoDel gives a fairly intuitive and
accessible explanation of that.

There is a less accessible literature talking about that, which dates back
to some time ago
that may be useful to re-read again

Damon Wischik and Nick McKeown. 2005.
Part I: buffer sizes for core routers.
SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 75-78. DOI=
http://dx.doi.org/10.1145/1070873.1070884
http://klamath.stanford.edu/~nickm/papers/BufferSizing.pdf.pdf

and

Gaurav Raina, Don Towsley, and Damon Wischik. 2005.
Part II: control theory for buffer sizing.
SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 79-82.
DOI=http://dx.doi.org/10.1145/1070873.1070885
http://www.statslab.cam.ac.uk/~gr224/PAPERS/Control_Theory_Buffers.pdf

One of the things that Frank Kelly has brought to the literature is
optimal control.
From a pure optimization point of view we know since Robert Gallager (and
Bertsekas, 1981) that
the optimal sending rate is a function of the shadow price at the
bottleneck.
This shadow price is nothing more than the Lagrange multiplier of the
capacity constraint
at the bottleneck. Some protocols such as XCP or RCP propose to carry
something
very close to a shadow price in the ECN but that's not that simple.
And currently we have a 0/1 "shadow price" which way insufficient.

Optimal control as developed by Frank Kelly since 1998 tells you that you
have
a stability region that is needed to get to the optimum.

Wischik's work, IMO, helps quite a lot in understanding tradeoffs while
designing AQM and CC. I feel like the people who wrote the Codel ACM
Queue paper are very much aware of this literature, because Codel's
design principles seem to take it into account.
And the BBR paper too.
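To make the shadow-price idea above concrete, here is a toy sketch of a Kelly-style primal rate-control loop, where each sender adjusts its rate against the bottleneck's congestion price (the Lagrange multiplier of the capacity constraint). The capacity, weights, gain, and price function are illustrative choices, not values from any of the cited papers:

```python
# Kelly-style primal rate control (illustrative sketch):
#   dx_i/dt = k * (w_i - x_i * p(sum_j x_j))
# where p() plays the role of the shadow price of the
# bottleneck capacity constraint.

CAPACITY = 10.0  # hypothetical bottleneck capacity (Mbit/s)

def price(total_rate):
    # Congestion price: zero below capacity, growing linearly above it.
    return max(0.0, total_rate - CAPACITY) / CAPACITY

def kelly_primal(weights, gain=0.05, steps=20000):
    rates = [1.0] * len(weights)  # arbitrary initial conditions
    for _ in range(steps):
        p = price(sum(rates))
        rates = [x + gain * (w - x * p) for x, w in zip(rates, weights)]
    return rates

# Two equally weighted flows converge to equal rates at which each
# flow's willingness-to-pay w_i equals its charge x_i * p, regardless
# of the initial conditions (the Lyapunov argument mentioned above).
print(kelly_primal([1.0, 1.0]))
```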
Post by Dave Taht
OK, wow, this conversation got long. and I'm still 20 messages behind.
Two points, and I'm going to go back to work, and maybe I'll try to
summarize a table of the competing viewpoints, as there's far more than
BDP of discussion here, and what we need is sqrt(bdp) to deal with all
the different conversational flows. :)
On Tue, Nov 27, 2018 at 1:24 AM Luca Muscariello
Post by Luca Muscariello
I think that this is a very good comment to the discussion at the
defense about the comparison between SFQ with longest queue drop and
FQ_Codel. A congestion controlled protocol such as TCP or others,
including QUIC, LEDBAT and so on need at least the BDP in the
transmission queue to get full link efficiency, i.e. the queue never
empties out.
Post by Dave Taht
no, I think it needs a BDP in flight.
I think some of the confusion here is that your TCP stack needs to
keep around a BDP in order to deal with retransmits, but that lives in
another set of buffers entirely.
Post by Luca Muscariello
This gives rules of thumb to size buffers, which is also very practical
and, thanks to flow isolation, becomes very accurate.
1) find a way to keep the number of backlogged flows at a reasonable
value. This largely depends on the minimum fair rate an application may
need in the long term. We discussed a little bit of available
mechanisms to achieve that in the literature.
2) fix the largest RTT you want to serve at full utilization and size
the buffer using BDP * N_backlogged. Or the other way round: check how
much memory you can use in the router/line card/device and, for a fixed
N, compute the largest RTT you can serve at full utilization.
Post by Dave Taht
My own take on the whole BDP argument is that *so long as the flows in
that BDP are thoroughly mixed* you win.
Post by Luca Muscariello
3) there is still some memory to dimension for sparse flows in addition
to that, but this is not based on BDP. It is just enough to compute the
total utilization of sparse flows and use the same simple model Toke
has used to compute the (de)prioritization probability.
This procedure would allow sizing FQ_codel but also SFQ.
It would be interesting to compare the two under this buffer sizing.
It would also be interesting to compare another mechanism that we have
mentioned during the defense, which is AFD + a sparse flow queue.
Which is, BTW, already available in Cisco Nexus switches for data
centres.
I think that the codel part would still provide the ECN feature, that
all the others cannot have. However the others, the last one
especially, can be implemented in silicon with reasonable cost.
Post by Jonathan Morton
Post by Pete Heist
So I just thought to continue the discussion- when does the CoDel
part of fq_codel actually help in the real world?
Post by Jonathan Morton
Fundamentally, without Codel the only limits on the congestion window
would be when the sender or receiver hit configured or calculated rwnd
and cwnd limits (the rwnd is visible on the wire and usually chosen to
be large enough to be a non-factor), or when the queue overflows.
Large windows require buffer memory in both sender and receiver,
increasing costs on the sender in particular (who typically has many
flows to manage per machine).
Queue overflow tends to result in burst loss and head-of-line blocking
in the receiver, which is visible to the user as a pause and subsequent
jump in the progress of their download, accompanied by a major
fluctuation in the estimated time to completion. The lost packets also
consume capacity upstream of the bottleneck which does not contribute
to application throughput. These effects are independent of whether
overflow dropping occurs at the head or tail of the bottleneck queue,
though recovery occurs more quickly (and fewer packets might be lost)
if dropping occurs from the head of the queue.
From a pure throughput-efficiency standpoint, Codel allows using ECN
for congestion signalling instead of packet loss, potentially
eliminating packet loss and associated head-of-line blocking entirely.
Even without ECN, the actual cwnd is kept near the minimum necessary
to satisfy the BDP of the path, reducing memory requirements and
significantly shortening the recovery time of each loss cycle, to the
point where the end-user may not notice that delivery is not perfectly
smooth, and implementing accurate completion time estimators is
considerably simplified.
An important use-case is where two sequential bottlenecks exist on the
path, the upstream one being only slightly higher capacity but lacking
any queue management at all. This is presently common in cases where
home CPE implements inbound shaping on a generic ISP last-mile link.
In that case, without Codel running on the second bottleneck, traffic
would collect in the first bottleneck's queue as well, greatly reducing
the beneficial effects of FQ implemented on the second bottleneck. In
this topology, the overall effect is inter-flow as well as intra-flow.
The combination of Codel with FQ is done in such a way that a separate
instance of Codel is implemented for each flow. This means that
congestion signals are only sent to flows that require them, and
non-saturating flows are unmolested. This makes the combination
synergistic, where each component offers an improvement to the
behaviour of the other.
- Jonathan Morton
Dave Taht
2018-11-28 10:40:23 UTC
Permalink
On Wed, Nov 28, 2018 at 1:56 AM Luca Muscariello
Dave,
The single BDP inflight is a rule of thumb that does not account for fluctuations of the RTT.
And I am not talking about random fluctuations and noise. I am talking about fluctuations
from a control theoretic point of view to stabilise the system, e.g. the trajectory of the system variable that
gets to the optimal point no matter the initial conditions (Lyapunov).
I have been trying all day to summon the gumption to make this argument:

IF you have a good idea of the actual RTT...

it is also nearly certain that there will be *at least* one other flow
you will be competing with...
therefore the fluctuations from every point of view are dominated by
the interaction between these flows, and
the goal, in general, is not to take up a full BDP for your single flow.

And BBR aims for some tiny percentage less than what it thinks it can
get, when, well, everybody's seen it battle it out with itself and
with cubic. I hand it FQ at the bottleneck link and it works well.

single flows exist only in the minds of theorists and labs.

There's a relevant passage worth citing in the kleinrock paper, I
thought (did he write two recently?) that talked about this problem...
I *swear* when I first read it it had a deeper discussion of the
second sentence below and had two paragraphs that went into the issues
with multiple flows:

"ch earlier and led to the Flow Deviation algorithm [28]. 17 The
reason that the early work of 40 years ago took so long to make its
current impact is because in [31] it was shown that the mechanism
presented in [2] and [3] could not be implemented in a decentralized
algorithm. This delayed the application of Power until the recent work
by the Google team in [1] demonstrated that the key elements of
response time and bandwidth could indeed be estimated using a
distributed control loop sliding window spanning approximately 10
round-trip times."

but I can't find it today.
The ACM queue paper talking about Codel makes a fairly intuitive and accessible explanation of that.
I haven't re-read the lola paper. I just wanted to make the assertion
above. And then duck. :)

Also, when I last looked at BBR, it made a false assumption that 200ms
was "long enough" to probe the actual RTT, when my comcast links and
others are measured at 680ms+ of buffering.

And I always liked the Stanford work, here, which tried to assert that
a link with n flows requires no more than B = (RTT × C) / √n.

http://yuba.stanford.edu/techreports/TR04-HPNG-060800.pdf
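For a concrete feel for the difference between the two rules of thumb (one full BDP versus the Stanford RTT × C / √n rule for many desynchronized flows), a small numeric sketch; the link parameters are made-up examples, not measurements:

```python
from math import sqrt

def bdp_bytes(rtt_s, capacity_bps):
    """Classic rule of thumb: buffer = one bandwidth-delay product."""
    return rtt_s * capacity_bps / 8.0  # bits -> bytes

def stanford_buffer_bytes(rtt_s, capacity_bps, n_flows):
    """Stanford buffer-sizing result: B = RTT * C / sqrt(n),
    which holds for large numbers of desynchronized long flows."""
    return bdp_bytes(rtt_s, capacity_bps) / sqrt(n_flows)

# Hypothetical core link: 100 ms RTT, 10 Gbit/s, 10,000 long flows.
full = bdp_bytes(0.100, 10e9)                        # one full BDP
small = stanford_buffer_bytes(0.100, 10e9, 10_000)   # sqrt(n) rule
print(f"1xBDP: {full/1e6:.2f} MB, BDP/sqrt(n): {small/1e6:.2f} MB")
```

The √n factor is what makes the rule attractive for core routers: with ten thousand flows the required buffer shrinks by two orders of magnitude.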

night!
--

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Luca Muscariello
2018-11-28 10:48:19 UTC
Permalink
Post by Dave Taht
There's a relevant passage worth citing in the kleinrock paper, I
thought (did he write two recently?) that talked about this problem...
but I can't find it today.
Here it is

https://www.lk.cs.ucla.edu/data/files/Kleinrock/Internet%20Congestion%20Control%20Using%20the%20Power%20Metric-Keep%20the%20Pipe%20Just%20Full%2C%20But%20No%20Fuller%20July%202018.pdf
Post by Dave Taht
Post by Luca Muscariello
The ACM queue paper talking about Codel makes a fairly intuitive and
accessible explanation of that.
I haven't re-read the lola paper. I just wanted to make the assertion
above. And then duck. :)
Also, when I last looked at BBR, it made a false assumption that 200ms
was "long enough" to probe the actual RTT, when my comcast links and
others are measured at 680ms+ of buffering.
This is essentially the same paper I cited, which is Part I.
Post by Dave Taht
And I always liked the Stanford work, here, which tried to assert that
a link with n flows requires no more than B = (RTT × C) / √n.
http://yuba.stanford.edu/techreports/TR04-HPNG-060800.pdf
That paper does not say that the rule ALWAYS applies. It does under
certain conditions.
But my point is about optimality.

It does NOT mean that the system HAS to work ALWAYS at that point,
because things change.

And for BBR, I would say that one thing is the design principles,
another is the implementations, and we'd better distinguish between
them. The key design principles are all valid.
Post by Dave Taht
night!
night ;)
Bless, Roland (TM)
2018-11-28 12:10:02 UTC
Permalink
Hi Luca,
Post by Luca Muscariello
And for BBR, I would say that one thing is the design principles another
is the implementations
and we better distinguish between them. The key design principles are
all valid.
While the goal is certainly right to operate around the optimal point
where the buffer is nearly empty, BBR's model is only valid from either
the viewpoint of the bottleneck or that of a single sender.

In BBR, one of the key design principles is to observe the
achieved delivery rate. One assumption in BBRv1 is that if the delivery
rate can still be increased, then the bottleneck isn't saturated. This
doesn't necessarily hold if you have multiple BBR flows present at the
bottleneck.
Every BBR flow can (nearly always) increase its delivery rate while
probing: it will simply decrease other flows' shares. This is not
an _implementation_ issue of BBRv1 and has been explained in section III
of our BBR evaluation paper.

This section also shows that BBRv1 will (by design) increase its amount
of inflight data to a maximum of 2 * estimated_BDP if multiple flows
are present. A BBR sender could also use packet loss or RTT increase as
indicators that it is probably operating away from the optimal
point, but this is not done in BBRv1.
BBRv2 will thus be an improvement over BBRv1 in several ways.
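As a back-of-the-envelope illustration of that 2 × estimated_BDP behaviour: if each of n BBR flows keeps cwnd_gain times its (idealized, fair-share) BDP estimate in flight, everything beyond one path BDP becomes a standing queue. This is a deliberate simplification of the dynamics analysed in the evaluation paper, with made-up numbers:

```python
def standing_queue_bytes(n_flows, path_bdp_bytes, cwnd_gain=2.0):
    # Idealized model: each flow estimates exactly its fair share of
    # the path BDP but keeps cwnd_gain times that amount in flight
    # (as BBRv1 can when multiple flows compete). The aggregate excess
    # over one path BDP accumulates in the bottleneck queue.
    per_flow_estimate = path_bdp_bytes / n_flows
    total_inflight = n_flows * cwnd_gain * per_flow_estimate
    return max(0.0, total_inflight - path_bdp_bytes)

# 100 flows sharing a path with a hypothetical 1 MB BDP: aggregate
# inflight is 2 MB, so roughly one full BDP sits queued.
print(standing_queue_bytes(100, 1_000_000))
```

With cwnd_gain of 1.0 the standing queue vanishes, which is exactly the "keep the pipe just full, but no fuller" operating point.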

Regards,
Roland
Dave Taht
2018-11-29 07:22:28 UTC
Permalink
Post by Bless, Roland (TM)
Hi Luca,
Post by Luca Muscariello
And for BBR, I would say that one thing is the design principles another
is the implementations
and we better distinguish between them. The key design principles are
all valid.
While the goal is certainly right to operate around the optimal point
where the buffer is nearly empty, BBR's model is only valid from either
the viewpoint of the bottleneck or that of a single sender.
I think I agree with this, from my own experimental data.
Post by Bless, Roland (TM)
In BBR, one of the key design principle is to observe the
achieved delivery rate. One assumption in BBRv1 is that if the delivery
rate can still be increased, then the bottleneck isn't saturated. This
doesn't necessarily hold if you have multiple BBR flows present at the
bottleneck.
Every BBR flow can (nearly always) increase its delivery rate while
probing: it will simply decrease other flows' shares. This is not
an _implementation_ issue of BBRv1 and has been explained in section III
of our BBR evaluation paper.
Haven't re-read it yet.
Post by Bless, Roland (TM)
This section shows also that BBRv1 will (by concept) increase its amount
of inflight data to the maximum of 2 * estimated_BDP if multiple flows
are present. A BBR sender could also use packet loss or RTT increase as
Carnage!
Post by Bless, Roland (TM)
indicators that it is probably operating right from the optimal
point, but this is not done in BBRv1.
BBRv2 will be thus an improvement over BBRv1 in several ways.
I really really really want a sane response to ecn in bbr.
Post by Bless, Roland (TM)
Regards,
Roland
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
Dave Taht
2018-11-29 07:20:53 UTC
Permalink
Post by Dave Taht
On Wed, Nov 28, 2018 at 1:56 AM Luca Muscariello
Post by Luca Muscariello
Dave,
The single BDP inflight is a rule of thumb that does not account
for fluctuations of the RTT.
Post by Luca Muscariello
And I am not talking about random fluctuations and noise. I am
talking about fluctuations
Post by Luca Muscariello
from a control theoretic point of view to stabilise the system,
e.g. the trajectory of the system variable that
Post by Luca Muscariello
gets to the optimal point no matter the initial conditions
(Lyapunov).
IF you have a good idea of the actual RTT...
it is also nearly certain that there will be *at least* one other flow
you will be competing with...
therefore the fluctuations from every point of view are dominated by
the interaction between these flows and
the goal, in general, is not to take up a full BDP for your single flow.
And BBR aims for some tiny percentage less than what it thinks it can
get, when, well, everybody's seen it battle it out with itself and
with cubic. I hand it FQ at the bottleneck link and it works well.
Single flows exist only in the minds of theorists and labs.
There's a relevant passage worth citing in the Kleinrock paper, I
thought (did he write two recently?), that talked about this problem...
I *swear* that when I first read it, it had a deeper discussion of the
second sentence below and two paragraphs that went into the issues
"ch earlier and led to the Flow Deviation algorithm [28]. 17 The
reason that the early work of 40 years ago took so long to make its
current impact is because in [31] it was shown that the mechanism
presented in [2] and [3] could not be implemented in a
decentralized
algorithm. This delayed the application of Power until the recent work
by the Google team in [1] demonstrated that the key elements of
response time and bandwidth could indeed be estimated using a
distributed control loop sliding window spanning approximately 10
round-trip times."
but I can't find it today.
Here it is
https://www.lk.cs.ucla.edu/data/files/Kleinrock/Internet%20Congestion%20Control%20Using%20the%20Power%20Metric-Keep%20the%20Pipe%20Just%20Full%2C%20But%20No%20Fuller%20July%202018.pdf
Thank you, that is more what I remember reading. That said, I still
remember a two-paragraph passage that went into footnote 17 and the
40+ years of history behind all this, one that clicked with me about
why we're still going wrong... and I can't remember what it was. I'll
go deeper into the past and read more refs off of this.
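For reference, Kleinrock's power metric (throughput divided by delay) peaks exactly when the pipe is "just full". A toy fluid model of a single bottleneck (my own sketch, not from the paper) shows the maximum landing at one BDP of inflight:

```python
def power(inflight, rate, base_rtt):
    """Throughput/delay for a fluid model of one bottleneck: below a
    BDP of inflight the link is underused, above it only delay grows."""
    bdp = rate * base_rtt
    throughput = min(rate, inflight / base_rtt)
    delay = base_rtt if inflight <= bdp else inflight / rate
    return throughput / delay

# 12.5e6 B/s (100 Mbit/s) bottleneck, 20 ms base RTT -> BDP = 250 kB.
best = max(range(50_000, 1_000_001, 50_000),
           key=lambda l: power(l, 12.5e6, 0.020))
# power is maximized with the pipe just full: best == 250000 bytes
```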
Dave Taht
2018-11-27 20:50:57 UTC
Permalink
Post by Jonathan Morton
Post by Pete Heist
So I just thought to continue the discussion- when does the CoDel part of fq_codel actually help in the real world?
Fundamentally, without Codel the only limits on the congestion window would be when the sender or receiver hit configured or calculated rwnd and cwnd limits (the rwnd is visible on the wire and usually chosen to be large enough to be a non-factor), or when the queue overflows. Large windows require buffer memory in both sender and receiver, increasing costs on the sender in particular (who typically has many flows to manage per machine).
You can run devices out of memory more easily with our current ecn
implementations. I am seeing folk cut memory per instance to 256k in,
for example, the gluon project.

We end up dropping (which is better than the device crashing), and in
the fq_codel case, *bulk* head dropping. I see a bifurcation in the
data when we do this, and
I have a one line patch to the fq_codel bulk dropper that appears to
make things better when extremely memory constrained like this, but
haven't got around to fully evaluating it:

https://github.com/dtaht/fq_codel_fast/commit/a524fc2e39dc291199b9b04fb890ea1548f17641

I would rather like more people to try the memory limits we are seeing
in the field. 32MB (fq_codel default) is waaaay too much. 4MB is way
too much even for gbit, I think. 256k? well, given the choice between
crashing or not...
Post by Jonathan Morton
Queue overflow tends to result in burst loss and head-of-line blocking in the receiver, which is visible to the user as a pause and subsequent jump in the progress of their download, accompanied by a major fluctuation in the estimated time to completion. The lost packets also consume capacity upstream of the bottleneck which does not contribute to application throughput. These effects are independent of whether overflow dropping occurs at the head or tail of the bottleneck queue, though recovery occurs more quickly (and fewer packets might be lost) if dropping occurs from the head of the queue.
From a pure throughput-efficiency standpoint, Codel allows using ECN for congestion signalling instead of packet loss, potentially eliminating packet loss and associated head-of-line blocking entirely. Even without ECN, the actual cwnd is kept near the minimum necessary to satisfy the BDP of the path, reducing memory requirements and significantly shortening the recovery time of each loss cycle, to the point where the end-user may not notice that delivery is not perfectly smooth, and implementing accurate completion time estimators is considerably simplified.
I wish we had fractional cwnd below 1 and/or that pacing did not rely
on cwnd at all. Too many flows, at any rate you choose, can end up
marking 100% of packets, still not run you out of memory, and cause
delay.
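For readers following along, the CoDel control law under discussion spaces successive drops (or ECN marks) at intervals shrinking with the square root of the drop count; a heavily condensed sketch (constants from RFC 8289; the real qdisc tracks per-packet sojourn times and a full state machine):

```python
import math

TARGET = 0.005    # 5 ms: acceptable standing queue delay (RFC 8289)
INTERVAL = 0.100  # 100 ms: initial drop/mark interval

# While sojourn time stays above TARGET, successive signals come at
# INTERVAL/sqrt(count): gentle at first, ramping up until the
# standing queue drains back below TARGET.
intervals = [INTERVAL / math.sqrt(count) for count in range(1, 5)]
# -> 100 ms, ~70.7 ms, ~57.7 ms, 50 ms between successive signals
```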
Post by Jonathan Morton
An important use-case is where two sequential bottlenecks exist on the path, the upstream one being only slightly higher capacity but lacking any queue management at all. This is presently common in cases where home CPE implements inbound shaping on a generic ISP last-mile link. In that case, without Codel running on the second bottleneck, traffic would collect in the first bottleneck's queue as well, greatly reducing the beneficial effects of FQ implemented on the second bottleneck. In this topology, the overall effect is inter-flow as well as intra-flow.
The combination of Codel with FQ is done in such a way that a separate instance of Codel is implemented for each flow. This means that congestion signals are only sent to flows that require them, and non-saturating flows are unmolested. This makes the combination synergistic, where each component offers an improvement to the behaviour of the other.
also a good bullet point somewhere!
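Structurally, the combination Jonathan describes boils down to hashing each packet to one of many sub-queues, each carrying its own CoDel state; a toy sketch (class and field names are mine, not the kernel's):

```python
import collections

class FlowQueue:
    """One sub-queue carrying its own CoDel state, as in fq_codel."""
    def __init__(self):
        self.packets = collections.deque()
        self.codel_count = 0        # per-flow drop/mark count
        self.codel_dropping = False

class FqCodelSketch:
    def __init__(self, num_queues=1024):
        self.queues = [FlowQueue() for _ in range(num_queues)]

    def enqueue(self, flow_tuple, packet):
        # Hash the flow 5-tuple to one of the sub-queues. A flow that
        # never builds a standing queue never trips its own CoDel
        # instance, so non-saturating flows go unsignalled.
        q = self.queues[hash(flow_tuple) % len(self.queues)]
        q.packets.append(packet)
```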
Post by Jonathan Morton
- Jonathan Morton
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Michael Welzl
2018-11-26 21:56:51 UTC
Permalink
Hi folks,

That “Michael” dude was me :)

About the stuff below, a few comments. First, an impressive effort to dig all of this up - I also thought that this was an interesting conversation to have!

However, I would like to point out that thesis defense conversations are meant to be provocative, by design - when I said that CoDel doesn’t usually help and long queues would be the right thing for all applications, I certainly didn’t REALLY REALLY mean that. The idea was just to be thought provoking - and indeed I found this interesting: e.g., if you think about a short HTTP/1 connection, a large buffer just gives it a greater chance to get all packets across, and the perceived latency from the reduced round-trips after not dropping anything may in fact be less than with a smaller (or CoDel’ed) buffer.

But corner cases aside, in fact I very much agree with the answers to my question Pete gives below, and also with the points others have made in answering this thread. Jonathan Morton even mentioned ECN - after Dave’s recent over-reaction to ECN I made a point of not bringing up ECN *yet* again, but… yes indeed, being able to use ECN to tell an application to back off instead of requiring to drop a packet is also one of the benefits.
(I think people easily miss the latency benefit of not dropping a packet, and thereby eliminating head-of-line blocking - packet drops require an extra RTT for retransmission, which can be quite a long time. This is about measuring latency at the right layer...)
BTW, Anna Brunstrom was also very quick to give me the HTTP/2.0 example in the break after the defense. Also, TCP will generally not work very well when queues get very long… the RTT estimate gets way off.

All in all, I think this is a fun thought to consider for a bit, but not really something worth spending people’s time on, IMO: big buffers are bad, period. All else are corner cases.

I’ll use the opportunity to tell folks that I was also pretty impressed with Toke’s thesis as well as his performance at the defense. Among the many cool things he’s developed (or contributed to), my personal favorite is the airtime fairness scheduler. But, there were many more. Really good stuff.

With that, I wish all the best to all you bloaters out there - thanks for reducing our queues!

Cheers,
Michael
Post by Pete Heist
http://youtu.be/upvx6rpSLSw
My attempt at a transcript is at the end of this message. (I probably won’t attempt a full defense transcript, but if someone wants more of a particular section I can try. :)
1) Multiplexed HTTP/2.0 requests containing both a saturating stream and interactive traffic. For example, a game that uses HTTP/2.0 to download new map data while position updates or chat happen at the same time. Standalone programs could use HTTP/2.0 this way, or for web apps, the browser may multiplex concurrent uses of XHR over a single TCP connection. I don’t know of any examples.
2) SSH with port forwarding while using an interactive terminal together with a bulk transfer?
3) Does CoDel help the TCP protocol itself somehow? For example, does it speed up the round-trip time when acknowledging data segments, improving behavior on lossy links? Similarly, does it speed up the TCP close sequence for saturating flows?
Pete
---
M: In fq_codel what is really the point of CoDel?
T: Yeah, uh, a bit better intra-flow latency...
M: Right, who cares about that?
T: Apparently some people do.
M: No I mean specifically, what types of flows care about that?
T: Yeah, so, um, flows that are TCP based or have some kind of- like, elastic flows that still want low latency.
M: Elastic flows that are TCP based that want low latency...
T: Things where you want to discover the- like, you want to utilize the full link and sort of probe the bandwidth, but you still want low latency.
M: Can you be more concrete what kind of application is that?
T: I, yeah, I…
M: Give me any application example that’s gonna benefit from the CoDel part- CoDel bits in fq_codel? Because I have problems with this.
T: I, I do too... So like, you can implement things this way but equivalently if you have something like fq_codel you could, like, if you have a video streaming application that interleaves control…
M: <inaudible> that runs on UDP often.
T: Yeah, but I, Netflix…
M: Ok that’s a long way… <inaudible>
T: No, I tend to agree with you that, um…
M: Because the biggest issue in my opinion is, is web traffic- for web traffic, just giving it a huge queue makes the chance bigger that uh, <inaudible, ed: because of the slow start> so you may end up with a (higher) faster completion time by buffering a lot. Uh, you’re not benefitting at all by keeping the queue very small, you are simply <inaudible> Right, you’re benefitting altogether by just <inaudible> which is what the queue does with this nice sparse flow, uh… <inaudible>
T: You have the infinite buffers in the <inaudible> for that to work, right. One benefit you get from CoDel is that - you screw with things like - you have to drop eventually.
M: You should at some point. The chances are bigger that the small flow succeeds (if given a huge queue). And, in web surfing, why does that, uh(?)
T: Yeah, mmm...
M: Because that would be an example of something where I care about latency but I care about low completion. Other things where I care about latency they often don’t send very much. <inaudible...> bursts, you have to accommodate them basically. Or you have interactive traffic which is UDP and tries to, often react from queueing delay <inaudible>. I’m beginning to suspect that fq minus CoDel is really the best <inaudible> out there.
T: But if, yeah, if you have enough buffer.
M: Well, the more the better.
T: Yeah, well.
M: Haha, I got you to say yes. [laughter :] That goes in history. I said the more the better and you said yeah.
T: No but like, it goes back to good-queue bad-queue, like, buffering in itself has value, you just need to manage it.
M: Ok.
T: Which is also the reason why just having a small queue doesn’t help in itself.
M: Right yeah. Uh, I have a silly question about fq_codel, a very silly one and there may be something I missed in the papers, probably I did, but I'm I was just wondering I mean first of all this is also a bit silly in that <inaudible> it’s a security thing, and I think that’s kind of a package by itself silly because fq_codel often probably <inaudible> just in principle, is that something I could easily attack by creating new flows for every packet?
T: No because, they, you will…
M: With the sparse flows, and it’s gonna…
T: Yeah, but at some point you’re going to go over the threshold, I, you could, there there’s this thing where the flow goes in, it’s sparse, it empties out and then you put it on the normal round robin implementation before you queue <inaudible> And if you don’t do that then you can have, you could time packets so that they get priority just at the right time and you could have lockout.
M: Yes.
T: But now you will just fall back to fq.
M: Ok, it was just a curiosity, it’s probably in the paper. <inaudible>
T: I think we added that in the RFC, um, you really need to, like, this part is important.
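The lockout protection Toke describes above (a sparse queue that empties, or spends its quantum, is demoted to the old-flows list rather than regaining new-flow priority) can be sketched as deficit round robin over two lists; a simplified model, not the actual implementation:

```python
from collections import deque

class Flow:
    def __init__(self, quantum=1514):
        self.packets = deque()
        self.deficit = quantum  # new flows start with one quantum of credit

def dequeue(new_flows, old_flows, quantum=1514):
    """Serve sparse (new) flows first; any flow leaving the new list is
    appended to old_flows rather than regaining priority. That demotion
    is what prevents timing packets to stay permanently 'sparse'."""
    while new_flows or old_flows:
        from_new = bool(new_flows)
        flow = new_flows.popleft() if from_new else old_flows.popleft()
        if flow.deficit <= 0:
            flow.deficit += quantum
            old_flows.append(flow)        # spent its quantum: demote
            continue
        if not flow.packets:
            if from_new:
                old_flows.append(flow)    # emptied while sparse: demote
            continue                      # emptied while old: forget it
        pkt = flow.packets.popleft()
        flow.deficit -= len(pkt)
        (new_flows if from_new else old_flows).appendleft(flow)
        return pkt
    return None
```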
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
Toke Høiland-Jørgensen
2018-11-26 22:13:00 UTC
Permalink
Post by Michael Welzl
However, I would like to point out that thesis defense conversations
are meant to be provocative, by design - when I said that CoDel
doesn’t usually help and long queues would be the right thing for all
applications, I certainly didn’t REALLY REALLY mean that.
Just as I don't REALLY REALLY mean that bigger buffers are always better
as you so sneakily tricked me into blurting out ;)
Post by Michael Welzl
The idea was just to be thought provoking - and indeed I found this
interesting: e.g., if you think about a short HTTP/1 connection, a
large buffer just gives it a greater chance to get all packets across,
and the perceived latency from the reduced round-trips after not
dropping anything may in fact be less than with a smaller (or
CoDel’ed) buffer.
Yeah, as a thought experiment I think it kinda works for the use case
you said: Just dump the entire piece of content into the network, and
let it be queued until the receiver catches up. It almost becomes a
CDN-in-the-network kind of thing (just needs multicast to serve multiple
receivers at once... ;)). Only trouble is that you need infinite queue
space to realise it...
Post by Michael Welzl
BTW, Anna Brunstrom was also very quick to also give me the HTTP/2.0
example in the break after the defense.
Yup, was thinking of HTTP/2 when I said "control data on the same
connection as the payload". Can see from Pete's transcript that it
didn't come across terribly clearly, though :P
Post by Michael Welzl
I’ll use the opportunity to tell folks that I was also pretty
impressed with Toke’s thesis as well as his performance at the
defense.
Thanks! It's been fun (both the writing and the defending) :)
Post by Michael Welzl
With that, I wish all the best to all you bloaters out there - thanks
for reducing our queues!
Yes, a huge thank you all from me as well; working with the community
here has been among my favourite aspects of my thesis work!

-Toke
Pete Heist
2018-11-27 08:54:59 UTC
Permalink
Thank you all for the responses!

I was asked a related question by my local WISP, who wanted to know if there would be any reason that fq_codel or Cake would be an improvement over sfq specifically for some "noisy links” (loose translation from Czech) in a backhaul that have some loss but also experience saturation. I conservatively answered no, but that may not be correct, in case the reduced TCP RTT could help with loss recovery or other behaviors, as Jon pointed out. I suspect more research would be needed to quantify this. Neal’s/Dave's point about “non-flow" traffic is well taken also.
Post by Toke Høiland-Jørgensen
Post by Michael Welzl
However, I would like to point out that thesis defense conversations
are meant to be provocative, by design - when I said that CoDel
doesn’t usually help and long queues would be the right thing for all
applications, I certainly didn’t REALLY REALLY mean that.
Just as I don't REALLY REALLY mean that bigger buffers are always better
as you so sneakily tricked me into blurting out ;)
I think most of us knew that “yeah” wasn’t a confirmation. “Yeah” can be used in a dozen different ways depending on context and intonation, but it did give some comic relief. :)
Post by Toke Høiland-Jørgensen
Post by Michael Welzl
BTW, Anna Brunstrom was also very quick to also give me the HTTP/2.0
example in the break after the defense.
Yup, was thinking of HTTP/2 when I said "control data on the same
connection as the payload". Can see from Pete's transcript that it
didn't come across terribly clearly, though :P
Ah, sorry for missing that part! I thought there was more there but didn’t want to write something unless I was sure I heard it.
Post by Toke Høiland-Jørgensen
Post by Michael Welzl
I’ll use the opportunity to tell folks that I was also pretty
impressed with Toke’s thesis as well as his performance at the
defense.
I’ll second that. I’m enjoying digesting the thesis, as well as the results of airtime fairness in a real world deployment. :)
Jonathan Morton
2018-11-27 09:31:33 UTC
Permalink
…any reason that fq_codel or Cake would be an improvement over sfq specifically for some "noisy links” (loose translation from Czech) in a backhaul that have some loss but also experience saturation.
If the random loss is low enough that saturation can be achieved, then adding AQM will be beneficial in the usual way. Also, it will not interfere in those cases where saturation is not achieved (for whatever reason).

- Jonathan Morton
Michael Richardson
2018-11-27 13:19:04 UTC
Permalink
Post by Pete Heist
I was asked a related question by my local WISP, who wanted to know if
there would be any reason that fq_codel or Cake would be an improvement
over sfq specifically for some "noisy links” (loose translation from
Czech) in a backhaul that have some loss but also experience
If the drops are due to noise, then I don't think it will help.
The congestion signals should already be getting made.
But, there are lots of cases where there are excessive buffers at the
edge of the backhaul, which are encouraging excessive traffic and thus
congestion.

--
Michael Richardson <mcr+***@sandelman.ca>, Sandelman Software Works
-= IPv6 IoT consulting =-
Jonathan Morton
2018-11-27 18:59:07 UTC
Permalink
Post by Michael Richardson
If the drops are due to noise, then I don't think it will help.
The congestion signals should already be getting made.
If they are drops due to noise, then they are not congestion signals at all, as they occur independently of whether the link is saturated. It's perfectly possible for these "random drops" to occur at a low enough rate as to still permit saturation, perhaps depending on which CC algo the sender uses. In that case, you still want AQM.

- Jonathan Morton
Dave Taht
2018-11-27 20:10:22 UTC
Permalink
Post by Michael Welzl
Hi folks,
That “Michael” dude was me :)
About the stuff below, a few comments. First, an impressive effort to dig all of this up - I also thought that this was an interesting conversation to have!
However, I would like to point out that thesis defense conversations are meant to be provocative, by design - when I said that CoDel doesn’t usually help and long queues would be the right thing for all applications, I certainly didn’t REALLY REALLY mean that. The idea was just to be thought provoking - and indeed I found this interesting: e.g., if you think about a short HTTP/1 connection, a large buffer just gives it a greater chance to get all packets across, and the perceived latency from the reduced round-trips after not dropping anything may in fact be less than with a smaller (or CoDel’ed) buffer.
I really did want Toke to have a hard time. Thanks for putting his
back against the wall!

And I'd rather this be a discussion of Toke's views... I do tend to
think he thinks FQ solves more than it does... and I wish we had a
sound analysis as to why 1024 queues work so much better for us than
64 or fewer on the workloads we have.
I tend to think in part it's because that acts as a 1000x1
rate-shifter - but should it scale up? Or down? Is what we did with
cake (1024 setassociative) useful? or excessive? I'm regularly seeing
64,000 queues on 10Gig and up hardware due to 64 hardware queues and
fq_codel on each, on that sort of gear. I think that's too much and
renders the aqm ineffective, but lack data...

but, to rant a bit...

I tend to believe FQ solves 97% of the problem, AQM 2.9%, and ECN 0.09%.

BUT: Amdahls law says once you reduce one part of the problem to 0,
everything else takes 100%. :)

it often seems that I, being the sole and very lonely FQ advocate
here in 2011, have reversed the situation (in this group!), and I'm
oft the AQM advocate *here* now.

It's sort of like all the people still quoting the e2e argument back
at me when Dave Reed (at least, and perhaps the other co-authors now)
has bought into this level of network interference between the
endpoints, and had no religion - or the "RED in a Different Light" paper
being rejected because it attempted to overturn other religion - and
I'll be damned if I'll let fq_codel, sch_fq, pie, l4s, scream, nada…

I admit to getting kind of crusty and set in my ways, but so long as
people put code in front of me along with the paper, I still think,
when the facts change, so do my opinions.

Pacing is *really impressive* and I'd like to see that enter
everything, not just in packet processing - I've been thinking hard
about the impact of cpu bursts (like resizing a hash table), and other
forms of work that we currently do on computers that have a
"dragster-like" peak performance, and a great average, but horrible
pathologies - and I think the world would be better off if we built
more

Anyway...

Once you have FQ and a sound outer limit on buffer size (100ms),
depredations like comcast's 680ms buffers no longer matter. There's
still plenty of room to innovate. BBR works brilliantly vs fq_codel
(and you can even turn ECN on which it doesn't respect and still get a
great result). LoLa would probably work well also 'cept that the git
tree was busted when I last tried it and it hasn't been tested much in
the 1Mbit-1Gbit range.
Post by Michael Welzl
But corner cases aside, in fact I very much agree with the answers to my question Pete gives below, and also with the points others have made in answering this thread. Jonathan Morton even mentioned ECN - after Dave’s recent over-reaction to ECN I made a point of not bringing up ECN *yet* again
Not going to go into it (much) today! We ended up starting another
project on ECN that operates under my core ground rule - "show me
the code" - and life over there and on that mailing list has been
pleasantly quiet. https://www.bufferbloat.net/projects/ecn-sane/wiki/

I did get back on the tsvwg mailing list recently because of some
ludicrously inaccurate misstatements about fq_codel. I also made a
strong appeal to the l4s people, to, in general, "stop thanking me" in
their documents. To me that reads as an endorsement, where all I did
was participate in the process until I gave up and hit my "show me the
code" moment - which was about 5 years ago and hasn't moved the needle
since, except in mutating standards documents.

The other document I didn't like was an arbitrary attempt to just set
the ecn backoff figure to .8 when the sanest thing, given the
deployment, and pacing... was to aim for the right number - anyway...
in that case I just wanted off the "thank you" list.

I like to think the more or less rfc3168 compliant deployment of ecn
is thus far going brilliantly, but I lack data. I would certainly like
a hostile reviewer's evaluation of cake's ecn method and, for that
matter, pie's, honestly - from real traffic! There's an RFC-compliant
version of PIE being pushed into the kernel after it gets through some
of Stephen's nits.

And I'd really prefer all future discussions of "ecn benefits" to come
with code and data and be discussed over on the ecn-sane mailing list,
or *not discussed here* if no code is available.
Post by Michael Welzl
, but… yes indeed, being able to use ECN to tell an application to back off instead of requiring to drop a packet is also one of the benefits.
One thus far mis-understood and under-analyzed aspect of our work is
the switch to head dropping.

To me the switch to head dropping essentially killed the tail loss RTO
problem, eliminated most of the need for ecn. Forward progress and
prompt signalling always happens. That otherwise wonderful piece
Stuart Cheshire did at Apple elided the actual dropping mode version
of fq_codel, which as best as I recall was about 12? 15ms? long and
totally invisible to the application.
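The difference Dave is pointing at can be shown in a few lines: with head drop the loss sits in front of later packets, whose dupacks/SACKs trigger fast retransmit, while a tail loss at the end of a window is often detectable only by RTO. A toy contrast (my own sketch, not the qdisc code):

```python
from collections import deque

def overflow_tail_drop(queue: deque, pkt):
    """Classic tail drop: the arriving packet is discarded. If it was
    the last of a window, the loss is detectable only by timeout."""
    return pkt  # dropped; queue untouched

def overflow_head_drop(queue: deque, pkt):
    """Head drop (as in CoDel/fq_codel overflow): the oldest packet is
    discarded and the new one queued, so the gap is always followed by
    later packets that reveal it promptly to the sender."""
    victim = queue.popleft()
    queue.append(pkt)
    return victim
```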
Post by Michael Welzl
(I think people easily miss the latency benefit of not dropping a packet, and thereby eliminating head-of-line blocking - packet drops require an extra RTT for retransmission, which can be quite a long time. This is about measuring latency at the right layer...)
see above. And yeah, perversely, I agree with your last statement. A
slashdot web page download takes 78 separate flows and 2.2 seconds to
complete. Worst-case completion time - if you had *tail* loss - would
be about 80ms longer than that, on a tiny fraction of loads. The rest
of it is absorbed into those 2.2 seconds.

EVEN with HTTP/2.0, I would be extremely surprised to learn that many
websites fit it all into one TCP transaction.

There are very few other examples of TCP traffic requiring a low
latency response. I happen to be very happy with the ecn support in
mosh btw, not that anybody's ever looked at it since we did it.

And I'd really prefer all future discussions of "ecn benefits" to come
with code and data and be discussed over on the ecn-sane mailing list,
or not discussed here if no code is available.
Post by Michael Welzl
BTW, Anna Brunstrom was also very quick to also give me the HTTP/2.0 example in the break after the defense. Also, TCP will generally not work very well when queues get very long… the RTT estimate gets way off.
I like to think that the syn/ack and ssl negotiation handshake under
fq_codel gives a much more accurate estimate of actual RTT than we
ever had before.
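Michael's earlier point that long queues throw off TCP's RTT estimate is visible directly in the standard RFC 6298 estimator: queueing delay feeds the smoothed RTT and variance, and the retransmission timeout tracks the bloat. A sketch of the steady-state update (first-sample initialisation omitted):

```python
def rto_update(srtt, rttvar, rtt_sample, alpha=0.125, beta=0.25):
    """One steady-state RFC 6298 update step. A standing queue
    inflates every rtt_sample, so srtt -- and with it the RTO --
    tracks the bloat rather than the path's real RTT."""
    rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
    srtt = (1 - alpha) * srtt + alpha * rtt_sample
    rto = max(1.0, srtt + 4 * rttvar)  # 1 second floor per the RFC
    return srtt, rttvar, rto

# A 20 ms path behind a ~600 ms standing queue: srtt climbs toward
# the bloated value and the RTO balloons with it.
srtt, rttvar = 0.020, 0.010
for _ in range(20):
    srtt, rttvar, rto = rto_update(srtt, rttvar, 0.620)
```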
Post by Michael Welzl
All in all, I think this is a fun thought to consider for a bit, but not really something worth spending people’s time on, IMO: big buffers are bad, period. All else are corner cases.
I've said it elsewhere, and perhaps we should resume, but an RFC
merely stating the obvious about maximal buffer limits, and getting
ISPs to do that, would be a boon.
Post by Michael Welzl
I’ll use the opportunity to tell folks that I was also pretty impressed with Toke’s thesis as well as his performance at the defense. Among the many cool things he’s developed (or contributed to), my personal favorite is the airtime fairness scheduler. But, there were many more. Really good stuff.
I so wish the world had about 1000 more Tokes in training. How can we
make that happen?
Post by Michael Welzl
With that, I wish all the best to all you bloaters out there - thanks for reducing our queues!
Cheers,
Michael
http://youtu.be/upvx6rpSLSw
My attempt at a transcript is at the end of this message. (I probably won’t attempt a full defense transcript, but if someone wants more of a particular section I can try. :)
1) Multiplexed HTTP/2.0 requests containing both a saturating stream and interactive traffic. For example, a game that uses HTTP/2.0 to download new map data while position updates or chat happen at the same time. Standalone programs could use HTTP/2.0 this way, or for web apps, the browser may multiplex concurrent uses of XHR over a single TCP connection. I don’t know of any examples.
2) SSH with port forwarding while using an interactive terminal together with a bulk transfer?
3) Does CoDel help the TCP protocol itself somehow? For example, does it speed up the round-trip time when acknowledging data segments, improving behavior on lossy links? Similarly, does it speed up the TCP close sequence for saturating flows?
Pete
---
M: In fq_codel what is really the point of CoDel?
T: Yeah, uh, a bit better intra-flow latency...
M: Right, who cares about that?
T: Apparently some people do.
M: No I mean specifically, what types of flows care about that?
T: Yeah, so, um, flows that are TCP based or have some kind of- like, elastic flows that still want low latency.
M: Elastic flows that are TCP based that want low latency...
T: Things where you want to discover the- like, you want to utilize the full link and sort of probe the bandwidth, but you still want low latency.
M: Can you be more concrete what kind of application is that?
T: I, yeah, I…
M: Give me any application example that’s gonna benefit from the CoDel part- CoDel bits in fq_codel? Because I have problems with this.
T: I, I do too... So like, you can implement things this way but equivalently if you have something like fq_codel you could, like, if you have a video streaming application that interleaves control…
M: <inaudible> that runs on UDP often.
T: Yeah, but I, Netflix…
M: Ok that’s a long way… <inaudible>
T: No, I tend to agree with you that, um…
M: Because the biggest issue in my opinion is, is web traffic- for web traffic, just giving it a huge queue makes the chance bigger that uh, <inaudible, ed: because of the slow start> so you may end up with a (higher) faster completion time by buffering a lot. Uh, you’re not benefitting at all by keeping the queue very small, you are simply <inaudible> Right, you’re benefitting altogether by just <inaudible> which is what the queue does with this nice sparse flow, uh… <inaudible>
T: You have the infinite buffers in the <inaudible> for that to work, right. One benefit you get from CoDel is that - you screw with things like - you have to drop eventually.
M: You should at some point. The chances are bigger that the small flow succeeds (if given a huge queue). And, in web surfing, why does that, uh(?)
T: Yeah, mmm...
M: Because that would be an example of something where I care about latency but I care about low completion. Other things where I care about latency they often don’t send very much. <inaudible...> bursts, you have to accommodate them basically. Or you have interactive traffic which is UDP and tries to, often react from queueing delay <inaudible>. I’m beginning to suspect that fq minus CoDel is really the best <inaudible> out there.
T: But if, yeah, if you have enough buffer.
M: Well, the more the better.
T: Yeah, well.
M: Haha, I got you to say yes. [laughter :] That goes in history. I said the more the better and you said yeah.
T: No but like, it goes back to good-queue bad-queue, like, buffering in itself has value, you just need to manage it.
M: Ok.
T: Which is also the reason why just having a small queue doesn’t help in itself.
M: Right yeah. Uh, I have a silly question about fq_codel, a very silly one and there may be something I missed in the papers, probably I did, but I was just wondering, I mean, first of all this is also a bit silly in that <inaudible> it’s a security thing, and I think that’s kind of a package by itself silly because fq_codel often probably <inaudible> just in principle, is that something I could easily attack by creating new flows for every packet?
T: No because, they, you will…
M: With the sparse flows, and it’s gonna…
T: Yeah, but at some point you’re going to go over the threshold, I, you could, there there’s this thing where the flow goes in, it’s sparse, it empties out and then you put it on the normal round robin implementation before you queue <inaudible> And if you don’t do that then you can have, you could time packets so that they get priority just at the right time and you could have lockout.
M: Yes.
T: But now you will just fall back to fq.
M: Ok, it was just a curiosity, it’s probably in the paper. <inaudible>
T: I think we added that in the RFC, um, you really need to, like, this part is important.
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Michael Welzl
2018-11-27 21:20:58 UTC
Permalink
Post by Dave Taht
To me the switch to head dropping essentially killed the tail loss RTO
problem, eliminated most of the need for ecn.
I doubt that: TCP will need to retransmit that packet at the head, and that takes an RTT - all the packets after it will need to wait in the receiver buffer before the application gets them.
But I don’t have measurements to prove my point, so I’m just hand-waving

I don’t doubt that this kills the tail loss RTO problem.
I doubt that it eliminates the need for ECN.

Cheers,
Michael
Dave Taht
2018-11-29 07:11:27 UTC
Permalink
Post by Dave Taht
To me the switch to head dropping essentially killed the tail loss RTO
problem, eliminated most of the need for ecn.
I doubt that: TCP will need to retransmit that packet at the head,
and that takes an RTT - all the packets after it will need to wait
in the receiver buffer before the application gets them.
But I don’t have measurements to prove my point, so I’m just
hand-waving…
I don’t doubt that this kills the tail loss RTO problem.
Yea! I wish we had more data on it though. We haven't really ever looked
at RTOs in our (enormous) data sets, it's just an assumption that we
don't see them. There's terabytes of captures....
Post by Dave Taht
I doubt that it eliminates the need for ECN.
A specific example that burned me was Stuart's demo showing screen
sharing "just working", with ecn, on what was about a 20ms path.

GREAT demo! Very real result from codel. Ship it! Audience applauded madly.
fq_codel went into OSX earlier this year.

Thing was, there was a 16ms frame rate (at best, probably closer to
64ms), at least a 32ms jitter buffer (probably in the 100s of ms
actually), an encoder that took at least a frame's worth of time...

and having the flow retransmit a lost packet vs ecn - within a 15ms rtt
- with a jitter buffer already there - was utterly invisible also to the
application and user.

Sooo....

see ecn-sane. Please try to write a position paper as to where and why
ecn is good and bad.

if one day we could merely establish a talmud of commentary
around this religion it would help.
Mikael Abrahamsson
2018-11-29 07:28:17 UTC
Permalink
Post by Dave Taht
see ecn-sane. Please try to write a position paper as to where and why
ecn is good and bad.
if one day we could merely establish a talmud of commentary
around this religion it would help.
From my viewpoint it seems to be all about incremental deployment. We have
30 years of "crud" that things need to work with, and the worst case must
not be a disaster for anything that wants to deploy.

This is one thing about L4S, ECT(1) is the last "codepoint" in the header
not used, that can statelessly identify something. If anyone sees a better
way to use it compared to "let's put it in a separate queue and CE-mark it
aggressively at very low queue depths and also do not care about
re-ordering so an ARQ L2 can re-order all it wants", then they need to
speak up, soon.

I actually think the "let's not care about re-ordering" would be a
brilliant thing, it'd help quite a lot of packet network types become less
costly and more efficient, while at the same time not doing blocking of
subsequent packets just because some earlier packet needed to be
retransmitted. Brilliant for QUIC for instance, that already handles this
(at least per-stream).
--
Mikael Abrahamsson email: ***@swm.pp.se
Jonathan Morton
2018-11-29 07:36:47 UTC
Permalink
This is one thing about L4S, ECT(1) is the last "codepoint" in the header not used, that can statelessly identify something. If anyone sees a better way to use it compared to "let's put it in a separate queue and CE-mark it aggressively at very low queue depths and also do not care about re-ordering so an ARQ L2 can re-order all it wants", then they need to speak up, soon.
You are essentially proposing using ECT(1) to take over an intended function of Diffserv. In my view, that is the wrong approach. Better to improve Diffserv to the point where it becomes useful in practice. Cake has taken steps in that direction, by implementing some reasonable interpretation of some Diffserv codepoints.

My alternative use of ECT(1) is more in keeping with the other codepoints represented by those two bits, to allow ECN to provide more fine-grained information about congestion than it presently does. The main challenge is communicating the relevant information back to the sender upon receipt, ideally without increasing overhead in the TCP/IP headers.

- Jonathan Morton
Mikael Abrahamsson
2018-11-29 07:46:34 UTC
Permalink
Post by Jonathan Morton
You are essentially proposing using ECT(1) to take over an intended function of Diffserv.
Well, I am not proposing anything. I am giving people a heads-up that the
L4S authors are proposing this.

But yes, you're right. Diffserv has shown itself to be really hard to
incrementally deploy across the Internet, so it's generally bleached
mid-path.
Post by Jonathan Morton
In my view, that is the wrong approach. Better to improve Diffserv to
the point where it becomes useful in practice.
I agree, but unfortunately nobody has made me king of the Internet yet so
I can't just decree it into existence.
Post by Jonathan Morton
Cake has taken steps in that direction, by implementing some reasonable
interpretation of some Diffserv codepoints.
Great. I don't know if I've asked this but is CAKE easily implementable in
hardware? From what I can tell it's still only Marvell that is trying to
put high performance enough CPUs into HGWs to do forwarding in CPU (which
can do CAKE), all others still rely on packet accelerators to achieve the
desired speeds.
Post by Jonathan Morton
My alternative use of ECT(1) is more in keeping with the other
codepoints represented by those two bits, to allow ECN to provide more
fine-grained information about congestion than it presently does. The
main challenge is communicating the relevant information back to the
sender upon receipt, ideally without increasing overhead in the TCP/IP
headers.
You need to go into the IETF process and voice this opinion then, because
if nobody opposes in the near time then ECT(1) might go to L4S
interpretation of what is going on. They do have ECN feedback mechanisms
in their proposal, have you read it? It's a whole suite of documents,
architecture, AQM proposal, transport proposal, the entire thing.

On the other hand, what you want to do and what L4S tries to do might be
closely related. It doesn't sound too far off.

Also, Bob Briscoe works for Cable Labs now, so he will now have silicon
behind him. This silicon might go into other things, not just DOCSIS
equipment, so if you have use-cases that L4S doesn't do but might do with
minor modification, it might be better to join him than to fight him.
--
Mikael Abrahamsson email: ***@swm.pp.se
Michael Welzl
2018-11-29 08:08:42 UTC
Permalink
Post by Jonathan Morton
You are essentially proposing using ECT(1) to take over an intended function of Diffserv.
Well, I am not proposing anything. I am giving people a heads-up that the L4S authors are proposing this.
But yes, you're right. Diffserv has shown itself to be really hard to incrementally deploy across the Internet, so it's generally bleached mid-path.
Rumours, rumours. Just like "SCTP can never work", all the Internet must run over HTTP, etc etc.

For the "DiffServ is generally bleached" stuff, there is pretty clear counter evidence.
One: https://itc-conference.org/_Resources/Persistent/780df4482d0fe80f6180f523ebb9482c6869e98b/Barik18ITC30.pdf
And another: http://tma.ifip.org/wp-content/uploads/sites/7/2017/06/mnm2017_paper13.pdf
Post by Jonathan Morton
In my view, that is the wrong approach. Better to improve Diffserv to the point where it becomes useful in practice.
I agree, but unfortunately nobody has made me king of the Internet yet so I can't just decree it into existence.
Well, for what you want (re-ordering tolerance), I would think that the LE codepoint is suitable. From:
https://tools.ietf.org/html/draft-ietf-tsvwg-le-phb-06
"there ought to be an expectation that packets of the LE PHB could be excessively delayed or dropped when any other traffic is present"

... I think it would be strange for an application to expect this, yet not expect it to happen for only a few individual packets from a stream.
Post by Jonathan Morton
Cake has taken steps in that direction, by implementing some reasonable interpretation of some Diffserv codepoints.
Great.
+1
I don't know if I've asked this but is CAKE easily implementable in hardware? From what I can tell it's still only Marvell that is trying to put high performance enough CPUs into HGWs to do forwarding in CPU (which can do CAKE), all others still rely on packet accelerators to achieve the desired speeds.
Post by Jonathan Morton
My alternative use of ECT(1) is more in keeping with the other codepoints represented by those two bits, to allow ECN to provide more fine-grained information about congestion than it presently does. The main challenge is communicating the relevant information back to the sender upon receipt, ideally without increasing overhead in the TCP/IP headers.
You need to go into the IETF process and voice this opinion then, because if nobody opposes in the near time then ECT(1) might go to L4S interpretation of what is going on. They do have ECN feedback mechanisms in their proposal, have you read it? It's a whole suite of documents, architecture, AQM proposal, transport proposal, the entire thing.
On the other hand, what you want to do and what L4S tries to do might be closely related. It doesn't sound too far off.
Indeed I think that the proposal of finer-grained feedback using 2 bits instead of one does not add anything to L4S, and is in fact strictly weaker, since in L4S the granularity is on the order of the number of packets you send per RTT, i.e. much higher.
Also, Bob Briscoe works for Cable Labs now, so he will now have silicon behind him. This silicon might go into other things, not just DOCSIS equipment, so if you have use-cases that L4S doesn't do but might do with minor modification, it might be better to join him than to fight him.
Yes...

Cheers,
Michael
Jonathan Morton
2018-11-29 10:30:10 UTC
Permalink
Post by Michael Welzl
Post by Jonathan Morton
My alternative use of ECT(1) is more in keeping with the other codepoints represented by those two bits, to allow ECN to provide more fine-grained information about congestion than it presently does. The main challenge is communicating the relevant information back to the sender upon receipt, ideally without increasing overhead in the TCP/IP headers.
You need to go into the IETF process and voice this opinion then, because if nobody opposes in the near time then ECT(1) might go to L4S interpretation of what is going on. They do have ECN feedback mechanisms in their proposal, have you read it? It's a whole suite of documents, architecture, AQM proposal, transport proposal, the entire thing.
On the other hand, what you want to do and what L4S tries to do might be closely related. It doesn't sound too far off.
Indeed I think that the proposal of finer-grain feedback using 2 bits instead of one is not adding anything to, but in fact strictly weaker than L4S, where the granularity is in the order of the number of packets that you sent per RTT, i.e. much higher.
An important facet you may be missing here is that we don't *only* have 2 bits to work with, but a whole sequence of packets carrying these 2-bit codepoints. We can convey fine-grained information by setting codepoints stochastically or in a pattern, rather than by merely choosing one of the three available (ignoring Not-ECT). The receiver can then observe the density of codepoints and report that to the sender.

Which is more-or-less the premise of DCTCP. However, DCTCP changes the meaning of CE, instead of making use of ECT(1), which I think is the big mistake that makes it undeployable.

So, from the middlebox perspective, very little changes. ECN-capable packets still carry ECT(0) or ECT(1). You still set CE on ECT packets, or drop Non-ECT packets, to signal when a serious level of persistent queue has developed, so that the sender needs to back off a lot. But if a less serious congestion condition exists, you can now signal *that* by changing some proportion of ECT(0) codepoints to ECT(1), with the intention that senders either reduce their cwnd growth rate, halt growth entirely, or enter a gradual decline. Those are three things that ECN cannot currently signal.

This change is invisible to existing, RFC-compliant, deployed middleboxes and endpoints, so should be completely backwards-compatible and incrementally deployable in the network. (The only thing it breaks is the optional ECN integrity RFC that, according to fairly recent measurements, literally nobody bothered implementing.)

Through TCP Timestamps, both sender and receiver can know fairly precisely when a round-trip has occurred. The receiver can use this information to calculate the ratio of ECT(0) and ECT(1) codepoints received in the most recent RTT. A new TCP Option could replace TCP Timestamps and the two bytes of padding that usually go with it, allowing reporting of this ratio without actually increasing the size of the TCP header. Large cwnds can be accommodated at the receiver by shifting both counters right until they both fit in a byte each; it is the ratio between them that is significant.

It is then incumbent on the sender to do something useful with that information. A reasonable idea would be to aim for a 1:1 ratio via an integrating control loop. Receipt of even one ECT(1) signal might be considered grounds for exiting slow-start, while exceeding 1:2 ratio should limit growth rate to "Reno linear" semantics (significant for CUBIC), and exceeding 2:1 ratio should trigger a "Reno linear" *decrease* of cwnd. Through all this, a single CE mark (reported in the usual way via ECE and CWR) still has the usual effect of a multiplicative decrease.
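A minimal sketch of these sender-side rules, with the thresholds taken from the paragraph above (function and action names are illustrative, not from any spec; the counter compression is the byte-shift trick just described):

```python
def elr_action(ect0: int, ect1: int, ce_seen: bool) -> str:
    """Map one RTT's worth of received codepoint counts to a cwnd action."""
    if ce_seen:
        return "multiplicative_decrease"  # CE keeps its classic ECN meaning
    if ect1 == 0:
        return "normal_growth"            # no fine-grained signal this RTT
    if ect1 > 2 * ect0:                   # beyond 2:1 -> "Reno linear" decline
        return "reno_linear_decrease"
    if 2 * ect1 > ect0:                   # beyond 1:2 -> cap growth at Reno linear
        return "reno_linear_growth"
    return "exit_slow_start"              # even one ECT(1) exits slow-start


def compress_counts(ect0: int, ect1: int) -> tuple:
    """Shift both counters right until each fits in a byte; the ratio,
    which is what the receiver reports, is approximately preserved."""
    while ect0 > 255 or ect1 > 255:
        ect0 >>= 1
        ect1 >>= 1
    return ect0, ect1
```

The integrating control loop aiming for the 1:1 ratio would sit above this, nudging cwnd each RTT according to the returned action.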

That's my proposal.

- Jonathan Morton
Michael Welzl
2018-11-29 12:06:03 UTC
Permalink
Post by Jonathan Morton
Post by Michael Welzl
Post by Jonathan Morton
My alternative use of ECT(1) is more in keeping with the other codepoints represented by those two bits, to allow ECN to provide more fine-grained information about congestion than it presently does. The main challenge is communicating the relevant information back to the sender upon receipt, ideally without increasing overhead in the TCP/IP headers.
You need to go into the IETF process and voice this opinion then, because if nobody opposes in the near time then ECT(1) might go to L4S interpretation of what is going on. They do have ECN feedback mechanisms in their proposal, have you read it? It's a whole suite of documents, architecture, AQM proposal, transport proposal, the entire thing.
On the other hand, what you want to do and what L4S tries to do might be closely related. It doesn't sound too far off.
Indeed I think that the proposal of finer-grain feedback using 2 bits instead of one is not adding anything to, but in fact strictly weaker than L4S, where the granularity is in the order of the number of packets that you sent per RTT, i.e. much higher.
An important facet you may be missing here is that we don't *only* have 2 bits to work with, but a whole sequence of packets carrying these 2-bit codepoints. We can convey fine-grained information by setting codepoints stochastically or in a pattern, rather than by merely choosing one of the three available (ignoring Not-ECT). The receiver can then observe the density of codepoints and report that to the sender.
Which is more-or-less the premise of DCTCP. However, DCTCP changes the meaning of CE, instead of making use of ECT(1), which I think is the big mistake that makes it undeployable.
So, from the middlebox perspective, very little changes. ECN-capable packets still carry ECT(0) or ECT(1). You still set CE on ECT packets, or drop Non-ECT packets, to signal when a serious level of persistent queue has developed, so that the sender needs to back off a lot. But if a less serious congestion condition exists, you can now signal *that* by changing some proportion of ECT(0) codepoints to ECT(1), with the intention that senders either reduce their cwnd growth rate, halt growth entirely, or enter a gradual decline. Those are three things that ECN cannot currently signal.
This change is invisible to existing, RFC-compliant, deployed middleboxes and endpoints, so should be completely backwards-compatible and incrementally deployable in the network. (The only thing it breaks is the optional ECN integrity RFC that, according to fairly recent measurements, literally nobody bothered implementing.)
Through TCP Timestamps, both sender and receiver can know fairly precisely when a round-trip has occurred. The receiver can use this information to calculate the ratio of ECT(0) and ECT(1) codepoints received in the most recent RTT. A new TCP Option could replace TCP Timestamps and the two bytes of padding that usually go with it, allowing reporting of this ratio without actually increasing the size of the TCP header. Large cwnds can be accommodated at the receiver by shifting both counters right until they both fit in a byte each; it is the ratio between them that is significant.
It is then incumbent on the sender to do something useful with that information. A reasonable idea would be to aim for a 1:1 ratio via an integrating control loop. Receipt of even one ECT(1) signal might be considered grounds for exiting slow-start, while exceeding 1:2 ratio should limit growth rate to "Reno linear" semantics (significant for CUBIC), and exceeding 2:1 ratio should trigger a "Reno linear" *decrease* of cwnd. Through all this, a single CE mark (reported in the usual way via ECE and CWR) still has the usual effect of a multiplicative decrease.
That's my proposal.
- and it's an interesting one. Indeed, I wasn't aware that you're thinking of a DCTCP-style signal from a string of packets.

Of course, this is hard to get right - there are many possible flavours to ideas like this ... but yes, interesting!

Cheers,
Michael
Jonathan Morton
2018-11-29 12:52:55 UTC
Permalink
Post by Michael Welzl
Post by Jonathan Morton
That's my proposal.
- and it's an interesting one. Indeed, I wasn't aware that you're thinking of a DCTCP-style signal from a string of packets.
Of course, this is hard to get right - there are many possible flavours to ideas like this ... but yes, interesting!
I'm glad you think so. Working title is ELR - Explicit Load Regulation.

As noted, this needs standardisation effort, which is a bit outside my realm of experience - Cake was a great success, but relied entirely on exploiting existing standards to their logical conclusions. I think I started writing some material to put in an I-D, but got distracted by something more urgent.

If there's an opportunity to coordinate with relevant people from similar efforts, so much the better. I wonder, for example, whether the DCTCP folks would be open to supporting a more deployable version of their idea, or whether that would be a political non-starter for them.

- Jonathan Morton
Michael Welzl
2018-11-29 12:12:19 UTC
Permalink
Post by Michael Welzl
Post by Jonathan Morton
In my view, that is the wrong approach. Better to improve Diffserv to the point where it becomes useful in practice.
I agree, but unfortunately nobody has made me king of the Internet yet so I can't just decree it into existence.
https://tools.ietf.org/html/draft-ietf-tsvwg-le-phb-06
"there ought to be an expectation that packets of the LE PHB could be excessively delayed or dropped when any other traffic is present"
... I think it would be strange for an application to expect this, yet not expect it to happen for only a few individual packets from a stream.
Actually, maybe this is a problem: the semantics of LE are way broader than "tolerant to re-ordering". What about applications that are reordering-tolerant, yet still latency critical?
E.g., if I use a protocol that can hand over messages out of order (e.g. SCTP, and imagine it running over UDP if that helps), then the benefit of this is typically to get messages delivered faster (without receiver-side HOL blocking).
But then, wouldn't it be good to have a way to tell the network "I don't care about ordering" ?

It seems to me that we'd need a new codepoint for that.
But, it also seems to me that this couldn't get standardised because that standard would embrace a layer violation (caring about a transport connection), even though that has been implemented for ages.
:-(

Cheers,
Michael
Jonathan Morton
2018-11-29 12:56:22 UTC
Permalink
Post by Michael Welzl
But then, wouldn't it be good to have a way to tell the network "I don't care about ordering" ?
I have to ask, why would the network care? What optimisations can be obtained by reordering packets *within* a flow, when it's usually just as easy to deliver them in order?

Of course, we already have FQ which reorders packets in *different* flows. The benefits are obvious in that case.

- Jonathan Morton
Mikael Abrahamsson
2018-11-29 13:30:17 UTC
Permalink
Post by Jonathan Morton
I have to ask, why would the network care? What optimisations can be
obtained by reordering packets *within* a flow, when it's usually just
as easy to deliver them in order?
Because most implementations aren't flow aware at all and might have 4
queues, saying "oh, this single queue is for transports that don't care
about ordering" means everything in that queue can just be sent as soon as
it can, ignoring HOL caused by ARQ.
Post by Jonathan Morton
Of course, we already have FQ which reorders packets in *different*
flows. The benefits are obvious in that case.
FQ is a fringe in real life (speaking as a packet moving monkey). It's
just on this mailing list that it's the norm.
--
Mikael Abrahamsson email: ***@swm.pp.se
Jonathan Morton
2018-11-29 08:09:12 UTC
Permalink
I don't know if I've asked this but is CAKE easily implementable in hardware?
I'd say the important bits are only slightly harder than doing the same with fq_codel. Some of the less important details might be significantly harder, and could reasonably be left out. The Diffserv bit should be nearly trivial to put in.

I believe much of Cake's perceived CPU overhead is actually down to inefficiencies in the Linux network stack. Using a CPU and some modest auxiliary hardware dedicated to moving packets, not tied up in handling general-purpose duties, then achieving greater efficiency with reasonable hardware costs could be quite easy, without losing the flexibility to change algorithms later.

- Jonathan Morton
Mikael Abrahamsson
2018-11-29 08:19:20 UTC
Permalink
Post by Jonathan Morton
I'd say the important bits are only slightly harder than doing the same with fq_codel.
Ok, FQ_CODEL is a long way off from being implemented in HW. I haven't
heard anyone even discussing it. Have you (or anyone else) heard differently?
Post by Jonathan Morton
I believe much of Cake's perceived CPU overhead is actually down to
inefficiencies in the Linux network stack. Using a CPU and some modest
auxiliary hardware dedicated to moving packets, not tied up in handling
general-purpose duties, then achieving greater efficiency with
reasonable hardware costs could be quite easy, without losing the
flexibility to change algorithms later.
I need to watch the MT7621 packet accelerator talk at the most recent
OpenWrt summit. I installed OpenWrt 18.06.1 on a Mikrotik RB750vGR3 and
just clicked my way around in LUCI and enabled flow offload and b00m, it
now did full gig NAT44 forwarding. It's implemented as a -j FLOWOFFLOAD
iptables rule. The good thing here might be that we could throw
unimportant high speed flows off to the accelerator and then just handle
the time sensitive flows in CPU, and just make sure the CPU has
preferential access to the media for its time-sensitive flow. That kind of
approach might make FQ_CODEL deployable even on slow CPU platforms with
accelerators because you would only run some flows through FQ_CODEL, where
the bulk high-speed flows would be handed off to acceleration (and we
guess they don't care about PDV and bufferbloat).
--
Mikael Abrahamsson email: ***@swm.pp.se
Jonathan Morton
2018-11-29 08:34:28 UTC
Permalink
Post by Jonathan Morton
I'd say the important bits are only slightly harder than doing the same with fq_codel.
Ok, FQ_CODEL is a long way off from being implemented in HW. I haven't heard anyone even discussing it. Have you (or anyone else) heard differently?
I haven't heard of anyone with a specific project to do so, no. But there are basically three components to implement:

1: Codel AQM. This shouldn't be too difficult.

2: Hashing flows into separate queues. I think this is doable if you accept simplified memory management (eg. assuming every packet is a full MTU for allocation purposes) and accept limited/no support for encapsulated protocols (which simplifies locating the elements of the 5-tuple for hashing).

3: Dequeuing packets from queues following DRR++ rules. I think this is also doable, since it basically means managing some linked lists.

It should be entirely feasible to prototype this at GigE speeds using existing FPGA hardware. Development can then continue from there. Overall, it's well within the capabilities of any competent HW vendor, so long as they're genuinely interested.
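Component 3 is indeed mostly linked-list bookkeeping. A toy sketch of DRR++'s sparse-flow handling, in the spirit of fq_codel's two-list scheme (all names illustrative; this is a simplification, not any real implementation):

```python
from collections import deque


class DRRPP:
    """Toy DRR++ scheduler: newly active ("sparse") flows are served from a
    priority list before the round-robin list of old flows, as in fq_codel."""

    def __init__(self, quantum=1514):
        self.quantum = quantum
        self.queues = {}          # flow_id -> deque of packet sizes (bytes)
        self.deficit = {}         # flow_id -> remaining byte credit
        self.new_flows = deque()  # sparse flows, served first
        self.old_flows = deque()  # bulk flows, plain round robin

    def enqueue(self, flow, size):
        q = self.queues.setdefault(flow, deque())
        if not q and flow not in self.new_flows and flow not in self.old_flows:
            self.new_flows.append(flow)   # newly active flow enters as sparse
            self.deficit[flow] = self.quantum
        q.append(size)

    def dequeue(self):
        while self.new_flows or self.old_flows:
            lst = self.new_flows if self.new_flows else self.old_flows
            flow = lst[0]
            q = self.queues[flow]
            if not q:
                # Anti-lockout rule: an emptied new flow is moved to the tail
                # of the old list before being forgotten, so it cannot keep
                # re-entering the priority list with perfect timing.
                lst.popleft()
                if lst is self.new_flows:
                    self.old_flows.append(flow)
                continue
            if self.deficit[flow] <= 0:
                self.deficit[flow] += self.quantum
                lst.popleft()
                self.old_flows.append(flow)   # used up its turn; demote
                continue
            size = q.popleft()
            self.deficit[flow] -= size
            return flow, size
        return None
```

A sparse flow's single packet is served ahead of a bulk flow's backlog, but once its quantum is spent it falls back to ordinary round robin, which is the property the thesis discussion above hinges on.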

- Jonathan Morton
Sebastian Moeller
2018-11-29 10:15:27 UTC
Permalink
Hi Mikael,
Post by Jonathan Morton
You are essentially proposing using ECT(1) to take over an intended function of Diffserv.
Well, I am not proposing anything. I am giving people a heads-up that the L4S authors are proposing this.
But yes, you're right. Diffserv has shown itself to be really hard to incrementally deploy across the Internet, so it's generally bleached mid-path.
Post by Jonathan Morton
In my view, that is the wrong approach. Better to improve Diffserv to the point where it becomes useful in practice.
I agree, but unfortunately nobody has made me king of the Internet yet so I can't just decree it into existence.
With your kind of clue, I would happily vote you as (temporary) king of the internet. ;)
Post by Jonathan Morton
Cake has taken steps in that direction, by implementing some reasonable interpretation of some Diffserv codepoints.
Great. I don't know if I've asked this but is CAKE easily implementable in hardware? From what I can tell it's still only Marvell that is trying to put high performance enough CPUs into HGWs to do forwarding in CPU (which can do CAKE), all others still rely on packet accelerators to achieve the desired speeds.
As far as I can tell Intel is pushing Atom/x86 cores into its DOCSIS SoCs (Puma 5/6/7) as well as into the high-end DSL SoCs (formerly Lantiq, https://www.intel.com/content/www/us/en/smart-home/anywan-grx750-home-gateway-brief.html?wapkw=grx750); I am quite confident that those also pack enough punch for CPU-based routing at Gbps rates. In DOCSIS modems these are already rolled out; I do not know of any DSL modem/router that uses the GRX750.
Post by Jonathan Morton
My alternative use of ECT(1) is more in keeping with the other codepoints represented by those two bits, to allow ECN to provide more fine-grained information about congestion than it presently does. The main challenge is communicating the relevant information back to the sender upon receipt, ideally without increasing overhead in the TCP/IP headers.
You need to go into the IETF process and voice this opinion then, because if nobody opposes in the near time then ECT(1) might go to L4S interpretation of what is going on. They do have ECN feedback mechanisms in their proposal, have you read it? It's a whole suite of documents, architecture, AQM proposal, transport proposal, the entire thing.
On the other hand, what you want to do and what L4S tries to do might be closely related. It doesn't sound too far off.
Also, Bob Briscoe works for Cable Labs now, so he will now have silicon behind him. This silicon might go into other things, not just DOCSIS equipment, so if you have use-cases that L4S doesn't do but might do with minor modification, it might be better to join him than to fight him.
Call me naive, but the solution to the impasse at getting a common definition of Diffserv agreed upon is replacing all TCP CC algorithms? This replaces changing all endpoints (and network nodes) to honor Diffserv with changing all endpoints to use a different TCP CC. At least I would call that ambitious... (unless L4S offers noticeable advantages for all participants without being terribly unfair to the non-participating legacy TCP users*).

Best Regards
Sebastian


*) Well, being unfair and out-competing the legacy users would be the best way to incentivize everybody to upgrade, but that would also be true for a better Diffserv scheme...
Mikael Abrahamsson
2018-11-29 10:53:30 UTC
Permalink
Post by Sebastian Moeller
As far as I can tell intel is pushing atom/x86 cores into its
docsis SoCs (puma5/6/7) as well as into the high-end dsl SoCs (formerly
lantiq,
https://www.intel.com/content/www/us/en/smart-home/anywan-grx750-home-gateway-brief.html?wapkw=grx750),
I am quite confident that those also pack enough punch for CPU based
routing at Gbps-rates. In docsis modems these are already rolled-out, I
do not know of any DSL modem/router that uses the GRX750
"10 Gbit/s packet processor".

Game over, again.
Post by Sebastian Moeller
Call me naive, but the solution to the impasse at getting a common
definition of diffserv agreed upon is replacing all TCP CC algorithms?
This is replacing changing all endpoints (and network nodes) to honor
Diffserv with changing all endpoints to use a different TCP CC. At
least I would call that ambitious.... (unless L4S offers noticeable
advantages for all participating without being terribly unfair to the
non-participating legacy TCP users*).
L4S proposes a separate queue for the L4S compatible traffic, and some
kind of fair split between L4S and non-L4S traffic. I guess it's kind of
along the lines of my earlier proposals about having some kind of fair
split with 3 queues for PHB LE, BE and the rest. It makes it deployable in
current HW without the worst kind of DDoS downsides imaginable.
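The "fair split with 3 queues" idea can be sketched as a weighted deficit round robin. This is purely illustrative: the class names and quanta below are invented for the example, not taken from any draft or shipping implementation.

```python
from collections import deque

# Illustrative weighted DRR over three classes (LE / BE / "the rest").
# Quanta are made-up weights, not values from any spec.
class ThreeQueueDRR:
    def __init__(self, quanta=None):
        self.quanta = quanta or {"LE": 500, "BE": 1500, "REST": 1000}
        self.queues = {k: deque() for k in self.quanta}
        self.deficit = {k: 0 for k in self.quanta}

    def enqueue(self, cls, pkt_len):
        self.queues[cls].append(pkt_len)

    def round(self):
        """One DRR round; returns the (class, pkt_len) pairs served."""
        served = []
        for cls, q in self.queues.items():
            if not q:
                self.deficit[cls] = 0  # empty queues don't hoard credit
                continue
            self.deficit[cls] += self.quanta[cls]
            while q and q[0] <= self.deficit[cls]:
                pkt = q.popleft()
                self.deficit[cls] -= pkt
                served.append((cls, pkt))
        return served
```

With quanta of 500/1500/1000 bytes, BE gets roughly three times the LE share under sustained load, while an idle class costs nothing, which is the deployability point: it degrades to plain FIFO behavior when only one class has traffic.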

The Internet is all about making things incrementally deployable. It's
very frustrating, but that's the way it is. Whatever we want to propose
needs to work so-so with what's already out there and it's ok if it takes
a while before it makes everything better.

I'd like diffserv to work better, but it would take a lot of work in the
operator community to bring it out to where it needs to be. It's not
hopeless though, and I think
https://tools.ietf.org/html/draft-ietf-tsvwg-le-phb-06 is one step in the
right direction. Just the fact that we might have two queues instead of
one in the simplest implementations might help. The first step is to get
ISPs to not bleach diffserv but at least allow 000xxx.
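The "allow 000xxx" step is simple to state in code: pass DSCP codepoints whose upper three bits are zero (the 0-7 range, which includes CS0 and the proposed LE codepoint 000001) and bleach everything else to best effort. A minimal sketch; the helper name is ours:

```python
# Pass DSCP values 0-7 (binary 000xxx), bleach everything else to 0.
# Matches the "at least allow 000xxx" suggestion; function name is ours.
def bleach_except_000xxx(dscp):
    return dscp if (dscp >> 3) == 0 else 0

assert bleach_except_000xxx(1) == 1    # LE codepoint 000001 survives
assert bleach_except_000xxx(46) == 0   # EF (101110) gets bleached
```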
--
Mikael Abrahamsson email: ***@swm.pp.se
Pete Heist
2018-11-28 02:04:03 UTC
Permalink
Post by Dave Taht
EVEN with HTTP/2.0, I would be extremely surprised to learn that many
websites fit it all into one tcp transaction.
There are very few other examples of TCP traffic requiring a low
latency response.
This is the crux of what I was looking for originally: some of these examples, along with what the impact is on TCP itself and what that actually means for people. I got that and then some. So for future readers, here’s an attempt to crudely summarize how CoDel (specifically as used in fq_codel) helps:

For some TCP CC algorithms (probably still most, as of this writing):
- reduces TCP RTT at saturation, improving interactivity for single flows with mixed bulk/interactive traffic like HTTP/2.0 or SSH
- smooths data delivery (thus meeting user’s expectations) by avoiding queue overflow (and tail-drop) in larger queues
- increases throughput efficiency with ECN (won’t tackle this further here!)
- reduces memory requirements for TCP buffers

Regardless of TCP CC algorithm:
- reduces latency for “non-flow” traffic, such as that for some encrypted VPNs
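For readers who want the mechanism behind that summary: CoDel's published control law is compact enough to sketch. Once the minimum sojourn time has stayed above the target for a full interval, drops are scheduled at spacings that shrink with the inverse square root of the drop count. The sketch below shows just that schedule, with the usual 5 ms / 100 ms defaults; all the queue plumbing is omitted.

```python
import math

# CoDel's drop-scheduling law: after entering dropping state, the
# next drop is scheduled interval/sqrt(count) after the previous one,
# so drops accelerate while the queue stays above target.
TARGET = 0.005    # seconds; acceptable standing queue delay
INTERVAL = 0.100  # seconds; window over which delay must persist

def next_drop_time(prev_drop_time, count):
    """Schedule the next drop, spacing shrinking as 1/sqrt(count)."""
    return prev_drop_time + INTERVAL / math.sqrt(count)
```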
Dave Taht
2018-11-28 03:52:29 UTC
Permalink
Post by Dave Taht
EVEN with HTTP/2.0, I would be extremely surprised to learn that many
websites fit it all into one tcp transaction.
There are very few other examples of TCP traffic requiring a low
latency response.
This is the crux of what I was looking for originally: some of these
examples along with what the impact is on TCP itself, and what that
One thing I USED to use a lot - and am trying to use again - is that
you can run emacs on multiple people's displays over X11 over TCP.

We used to have interactive apps like that for X, whiteboards, and the
like, and as best as I recall they worked *better* than doing it over
the web....

emacs over tcp works amazingly well coast to coast now, over wireguard
or ssh, with or without ecn enabled, over cubic. I've been meaning to
try bbr for a while now.

open-frame-other-display still works! if X could gain a mosh-like and/or
quic-like underlying transport it could make a comeback...
Post by Dave Taht
actually means for people. I got that and then some. So for future
readers, here’s an attempt to crudely summarize how CoDel
- reduces TCP RTT at saturation, improving interactivity for single
flows with mixed bulk/interactive traffic like HTTP/2.0 or SSH
- smooths data delivery (thus meeting user’s expectations) by avoiding
queue overflow (and tail-drop) in larger queues
- increases throughput efficiency with ECN (won’t tackle this further
here!)
- reduces memory requirements for TCP buffers
- reduces latency for “non-flow” traffic, such as that for some
encrypted VPNs
We ought to put together a document of the size and scope of

http://www2.rdrop.com/~paulmck/scalability/paper/sfq.2002.06.04.pdf

someday.

That paper went into crc as a suitable hash function, btw... I'd like to
find a better hash... someday.
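For context, the flow classification being hashed looks roughly like this: a 5-tuple bucketed into a fixed number of queues, with a perturbation value mixed in to frustrate collision attacks. CRC32 stands in here for whatever hash one would actually evaluate; the queue count and perturbation value are arbitrary illustration.

```python
import zlib

# Sketch of SFQ-style flow bucketing: hash the 5-tuple (plus a
# perturbation that SFQ would re-randomize periodically) into one of
# NUM_QUEUES buckets. CRC32 is the stand-in hash under evaluation.
NUM_QUEUES = 1024
PERTURBATION = 0x5EED  # arbitrary; re-keyed periodically in practice

def flow_bucket(src, dst, sport, dport, proto):
    key = f"{src}|{dst}|{sport}|{dport}|{proto}|{PERTURBATION}".encode()
    return zlib.crc32(key) % NUM_QUEUES
```

Swapping in a different hash is then a one-line change, which is what makes a comparative study of hash quality (collision rate vs cycles spent) straightforward to set up.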

fq_codel itself is still awaiting the definitive paper with all the
plusses and minuses and measurements and "future work". Toke's recent
work taking apart the drop probability is only a piece of that
elephant. There's the computational cost of QFQ vs fq_codel (to this
*day* I'd like a QFQ + codel version); there's the sparse flow
optimization (which could really use a few fewer syllables to
describe) vs straight DRR, under more normal traffic rather than
traffic that attempts to highlight its usefulness, with the common 300
byte quantum and with 1514. There's the P4 work and FPGA versions...
I'd really like to revisit our LPCC paper and the fractal self-similar
traffic paper. And the list goes on and on....
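Since the sparse flow optimization keeps coming up, here is a heavily simplified sketch of it as described in RFC 8290: a flow that arrives with an empty queue goes on a "new flows" list and is served ahead of the "old flows" backlog, for one quantum's worth of credit. CoDel's per-queue dropping and the rule that an emptied new flow moves to the old list are omitted, and all names are ours.

```python
from collections import deque

class FQ:
    """Simplified fq_codel-style scheduler: sparse flows jump the line."""
    def __init__(self, quantum=300):
        self.quantum = quantum
        self.flows = {}            # bucket -> deque of packet lengths
        self.credit = {}           # bucket -> remaining deficit
        self.new_flows = deque()   # sparse flows, served first
        self.old_flows = deque()   # backlogged flows

    def enqueue(self, bucket, pkt_len):
        if bucket not in self.flows or not self.flows[bucket]:
            self.flows.setdefault(bucket, deque())
            if bucket not in self.new_flows and bucket not in self.old_flows:
                # flow was empty: it becomes "new" with fresh credit
                self.new_flows.append(bucket)
                self.credit[bucket] = self.quantum
        self.flows[bucket].append(pkt_len)

    def dequeue(self):
        while self.new_flows or self.old_flows:
            lst = self.new_flows if self.new_flows else self.old_flows
            bucket = lst[0]
            if self.credit[bucket] <= 0:
                # out of credit: replenish and demote to the old list
                self.credit[bucket] += self.quantum
                lst.popleft()
                self.old_flows.append(bucket)
                continue
            q = self.flows[bucket]
            if not q:
                # emptied; simplified here (RFC 8290 demotes new flows
                # to the old list instead of dropping them outright)
                lst.popleft()
                continue
            pkt = q.popleft()
            self.credit[bucket] -= pkt
            return (bucket, pkt)
        return None
```

With the 300 byte quantum, a single small packet from a sparse flow is served after at most one packet of the bulk backlog, which is the whole effect the optimization is after.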

anybody else 'roun here need an MSc or PhD?