Discussion:
[Cerowrt-devel] cerowrt 3.3.8-17 is released
Dave Taht
2012-08-13 06:08:52 UTC
Permalink
I'm too tired to write up a full set of release notes, but I've been
testing it all day, and it looks better than -10 and certainly better
than -11, but I won't know until some more folk sit down and test it,
so here it is.

http://huchra.bufferbloat.net/~cero1/3.3/3.3.8-17/

Fresh merge with openwrt, fix to a bind CVE, fixes for 6in4 and quagga
routing problems, and a few tweaks to fq_codel setup that might make
voip better.

Go forth and break things!

In other news:

Van Jacobson gave a great talk about bufferbloat, BQL, codel, and fq_codel
at last week's ietf meeting. Well worth watching. At the end he outlines
the deployment problems in particular.

http://recordings.conf.meetecho.com/Recordings/watch.jsp?recording=IETF84_TSVAREA&chapter=part_3

Far more interesting than this email!
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
Maciej Soltysiak
2012-08-13 16:06:59 UTC
Permalink
Hi Dave,

Very nice! So far, the GUI didn't break after changing SSID which is
positive ;-)

For this release, what's the proper way to work with QoS?
Is it still editing simple_qos and running it?

Regards,
Maciej
_______________________________________________
Cerowrt-devel mailing list
https://lists.bufferbloat.net/listinfo/cerowrt-devel
Dave Taht
2012-08-13 16:20:53 UTC
Permalink
I eliminated the "aqm" tab at least temporarily in this release, so
the proper way at the moment is to use the "qos" tab or edit the
/etc/config/qos file directly.

While work is continuing on the aqm and simple qos implementations,
the bulk of the work (see the codel list) is presently at the
fq_codel/codel layer below and it seems somewhat futile to be
scripting more on top of that right now.

(I won't object if people hack on simple_qos however - I do expect to
ultimately do better than openwrt qos that way)
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-6 is out
with fq_codel!"
Sebastian Moeller
2012-08-15 17:23:10 UTC
Permalink
Hi Dave,

Great work, as always. I upgraded my production router to the latest and greatest (since I only have one router…), and it works quite well for normal usage.
Netalyzr reports around 2800 ms of uplink buffering, yet saturating the uplink does not noticeably affect ping times to a remote target, basically the same as for all the codel-ized cero versions I have tested so far.

Some notes and a question:
I noticed that, even with plenty of swap space (1 GB on a USB stick), using http://broadband.mpi-sws.org/residential/ to apply UDP stress (on the uplink, I assume) easily produces the following error (I run the test from a Mac OS X machine over 5 GHz wireless at a distance of 1.5 yards):
Aug 15 01:16:29 nacktmulle kern.err kernel: [175395.132812] ath: skbuff alloc of size 1926 failed
(and plenty of those…).
What then happens is that the OOM killer aims for bind (reasonable, since it is the largest single process) and kills it. When I try to restart bind with:
***@nacktmulle:~# /etc/rc.d/S47namedprep start
***@nacktmulle:~# /etc/rc.d/S48named restart
Stopping isc-bind
/etc/chroot/named//var/run/named/named.pid not found, trying brute force
killall: named: no process killed
Kicking isc-bind in xinetd
rndc: connect failed: 127.0.0.1#953: connection refused
Bind does not start again, and the router becomes less than useful. I assume I am doing something wrong, but what? If you have any idea how to solve this short of rebooting the router (my current method), I would be happy to learn.



best regards
sebastian
d***@reed.com
2012-08-15 22:53:42 UTC
Permalink
Just to clarify, the way Netalyzr attempts to measure "uplink buffering" may not actually measure queue length. It just spews UDP packets at the target, and measures sender-receiver packet delay at the maximum load it can generate. So it's making certain assumptions about the location and FIFO nature of the "bottleneck queue" when it calculates that.

I don't think this is good news that you are reporting.

Assuming codel is measuring "sojourn time" and controlling it properly, you should not see 2.8 *seconds* of UDP queueing delay on the uplink - packets should be being dropped to keep that delay down to under 10 milliseconds.
I have no idea how that jibes with low ping times, unless you are getting the ICMP packets spoofed.
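
To put the 2.8 seconds in perspective, a measured delay can be turned into an implied backlog with a little shell arithmetic. This is only a sketch: the 1 Mbit/s uplink rate is an assumed figure for illustration (the thread does not state the actual rate), and `backlog_bytes` is a hypothetical helper name.

```shell
# backlog_bytes RATE_BPS DELAY_MS
# Bytes that must be queued ahead of a packet for it to wait DELAY_MS
# on a link draining at RATE_BPS bits per second (integer arithmetic).
backlog_bytes() {
    echo $(( $1 / 8 * $2 / 1000 ))
}

# Assumed 1 Mbit/s uplink; the real rate is not given in the thread.
echo "2800 ms -> $(backlog_bytes 1000000 2800) bytes queued"
echo "  10 ms -> $(backlog_bytes 1000000 10) bytes queued"
```

Under that assumption, 2800 ms corresponds to roughly 350 kB of backlog, against about 1.25 kB for a 10 ms delay.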


William Katsak
2012-08-15 22:57:04 UTC
Permalink
I agree with this assessment as far as behavior goes. With my recent
experimentation on a Russian DSL line, I was seeing ~1200ms of uplink
buffer reported (Netalyzr) natively, but as soon as I got the AQM
running properly, that went away completely.

Bill

Sent from my iPhone

Sebastian Moeller
2012-08-16 04:54:01 UTC
Permalink
Hi William,
Post by William Katsak
I agree with this assessment as far as behavior goes. With my recent experimentation on a Russian DSL line, I was seeing ~1200ms of uplink buffer reported (Netalyzr) natively, but as soon as I got the AQM running properly, that went away completely.
QOS or simple_qos.sh? I might switch to simple_qos next to see the effects there.

best
Sebastian
William Katsak
2012-08-16 11:08:23 UTC
Permalink
This was -6, so I was using simple_qos.sh.

Bill

Sent from my iPhone
d***@reed.com
2012-08-16 17:02:02 UTC
Permalink
You know, I have been making the bufferbloat issue with my cable modem go away with a simple "token bucket" approach on the link into the cable modem itself (p5p1 in this test setup, with my connection offering a reasonably constant 2 Mb/sec rate):

tc qdisc add dev p5p1 root tbf rate 1.8mbit latency 10ms burst 9000
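
As a rough worked example of what those tbf parameters imply (a sketch, not part of the original setup): assuming tc derives the queue limit from `latency` as rate × latency + burst, which is worth checking against your tc version, the numbers can be worked out as follows. The helper names are hypothetical.

```shell
# tbf_limit_bytes RATE_BPS LATENCY_MS BURST_BYTES
# Assumed formula: limit = rate * latency + burst.
tbf_limit_bytes() {
    echo $(( $1 / 8 * $2 / 1000 + $3 ))
}

# bytes_to_ms BYTES RATE_BPS
# Time in milliseconds needed to send BYTES at RATE_BPS bits per second.
bytes_to_ms() {
    echo $(( $1 * 8 * 1000 / $2 ))
}

# Values from the tc command above: 1.8 Mbit/s, 10 ms, 9000 bytes.
echo "queue limit: $(tbf_limit_bytes 1800000 10 9000) bytes"
echo "burst drain: $(bytes_to_ms 9000 1800000) ms"
```

With these numbers the limit works out to 11250 bytes, and a full 9000-byte burst takes 40 ms to send at 1.8 Mbit/s.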

And at one point in time, I "servoed" the rate setting to a "higher level" script that polled the ping time to a server I control so if the 1.8 mbit is "too big" it is shrunk down.
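
The servo script itself is not shown in the thread. A minimal sketch of the idea, with every name, the 50 ms threshold, the 500 kbit floor, and the 10% step chosen purely for illustration, might look like this:

```shell
# next_rate_kbit RATE_KBIT RTT_MS RTT_LIMIT_MS FLOOR_KBIT
# If the measured RTT exceeds the limit, shrink the rate by 10%,
# but never below the floor. Otherwise leave the rate unchanged.
next_rate_kbit() {
    rate=$1
    if [ "$2" -gt "$3" ]; then
        rate=$(( rate * 9 / 10 ))
        if [ "$rate" -lt "$4" ]; then
            rate=$4
        fi
    fi
    echo "$rate"
}

# Hypothetical polling loop (commented out: it needs root, and
# measure_rtt_ms is a placeholder that is not implemented here):
# rate=1800
# while sleep 10; do
#     rtt=$(measure_rtt_ms)
#     rate=$(next_rate_kbit "$rate" "$rtt" 50 500)
#     tc qdisc replace dev p5p1 root tbf rate "${rate}kbit" latency 10ms burst 9000
# done

echo "high RTT: $(next_rate_kbit 1800 80 50 500) kbit"
echo "low RTT:  $(next_rate_kbit 1800 20 50 500) kbit"
```

This version only ever shrinks the rate, matching the description above; a real script would presumably also need a way to probe upward again.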

The "cable modem uplink" should not be so hard to get right.

Of course, the WLAN queues internal to the home network (or in the "mesh", if you run that) need codel and fixes to the driver-internal packet queues.

But the first thing to do if you see 2800 msec. uplink buffers is to *fix* that. Then tinker with other things.





And I have in the past

Sebastian Moeller
2012-08-20 18:17:16 UTC
Permalink
Hi,

Thanks for your input.
Post by d***@reed.com
You know I have been making the bufferbloat issue with my cable modem go away by
tc qdisc add dev p5p1 root tbf rate 1.8mbit latency 10ms burst 9000
And at one point in time, I "servoed" the rate setting to a "higher level" script that polled the ping time to a server I control so if the 1.8 mbit is "too big" it is shrunk down.
The "cable modem uplink" should not be so hard to get right.
Of course the WLAN queues internal to the home network or in the "mesh" if you run that need codel and fixes to the driver-internal packet queues.
But the first thing to do if you see 2800 msec. uplink buffers is to *fix* that. Then tinker with other things.
It looks like the UDP test's packet creation rate, combined with fq_codel's "slow" ramping up of the drop rate, simply allows packets to accumulate in the queue. This is supported by the observation that Netalyzr's reported buffering scales with fq_codel's limit parameter (up to a ceiling). Since other flows stay relatively responsive, my interpretation is that codel does its thing correctly; it is just that an inelastic UDP stream is not well suited to codel's gentle drop strategy, which seems tailored to TCP's behavior.
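
The slow ramp can be illustrated with CoDel's published control law, in which successive drops are spaced interval/sqrt(count) apart. The sketch below is a simplified model: it ignores fq_codel's per-flow queues and assumes the queue stays above target the whole time. The function name is hypothetical, and the 100 ms interval is CoDel's usual default rather than a value taken from this thread.

```shell
# codel_drops_within INTERVAL_MS WINDOW_MS
# Count the drops a simplified CoDel would make in WINDOW_MS when the
# queue is persistently above target: the first drop comes after one
# full interval, and each later drop follows interval/sqrt(count)
# after the previous one.
codel_drops_within() {
    awk -v interval="$1" -v window="$2" 'BEGIN {
        t = interval
        count = 0
        while (t <= window) {
            count++
            t += interval / sqrt(count)
        }
        print count
    }'
}

echo "drops in the first second: $(codel_drops_within 100 1000)"
```

In this model only 27 packets are dropped during the first second, which is negligible against a sender that emits packets at a fixed rate regardless of loss.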

Best & Thanks
Sebastian
Sebastian Moeller
2012-08-16 04:51:11 UTC
Permalink
Hi,
Post by d***@reed.com
Just to clarify, the way Netalyzr attempts to measure "uplink buffering" may not actually measure queue length. It just spews UDP packets at the target, and measures sender-receiver packet delay at the maximum load it can generate. So it's making certain assumptions about the location and FIFO nature of the "bottleneck queue" when it calculates that.
I see. That is pretty much what I expected in regard to methodology, even though I am not sure about the 'maximum load it can generate' part.
Post by d***@reed.com
I don't think this is good news that you are reporting.
My take on this was that the 3 seconds max buffer seemed wrong, even though the interactivity was/is right where it should be with codel. BTW, I was following Dave's recommendation and used OpenWrt's default QoS instead of simple_qos.sh, if that matters.
Post by d***@reed.com
Assuming codel is measuring "sojourn time" and controlling it properly, you should not see 2.8 *seconds* of UDP queueing delay on the uplink - packets should be being dropped to keep that delay down to under 10 milliseconds.
I have no idea how that jibes with low ping times, unless you are getting the ICMP packets spoofed.
My current pet theory (created entirely without looking at data) is that fq_codel somehow manages to confine the over-buffering to the UDP probe flow(s) and keeps my pings in unencumbered flows (that, or openwrt's hfsc packet scheduler saves the interactivity…). But as I stated, I have no data to back this up (nor am I likely to get any soon).

Thanks
Sebastian
Post by d***@reed.com
-----Original Message-----
Sent: Wednesday, August 15, 2012 1:23pm
Subject: Re: [Cerowrt-devel] cerowrt 3.3.8-17 is released
Hi Dave,
great work, as always I upgraded my production router to the latest and greatest (since I only have one router…). And it works quite well for normal usage…
Netalyzr reports around 2800 ms of uplink buffering, yet saturating the uplink does not affect ping times to a remote target noticeably, basically the same as for all codellized cero versions I tested so far...
Aug 15 01:16:29 nacktmulle kern.err kernel: [175395.132812] ath: skbuff alloc of size 1926 failed
(and plenty of those…).
Stopping isc-bind
/etc/chroot/named//var/run/named/named.pid not found, trying brute force
killall: named: no process killed
Kicking isc-bind in xinetd
rndc: connect failed: 127.0.0.1#953: connection refused
And bind does not start again and the router becomes less than useful. Now I assume I am doing something wrong, but what, if you have any idea how to solve this short of a reboot of the router (my current method) I would be happy to learn
best regards
sebastian
Dave Taht
2012-08-16 04:58:05 UTC
Permalink
Firstly, fq_codel will always stay very flat relative to your workload
for sparse streams such as ping, voip, dns, or gaming...

It's good stuff.

And I think the source of your 2.8 second thing is fq_codel's current
reaction time, the non-responsiveness of the udp flooding netalyzr
uses, and the huge default queue depth in openwrt's qos scripts.

Try this:

***@snapon:~/src/Cerowrt-3.3.8/package/qos-scripts/files/usr/lib/qos$ git diff tcrules.awk
diff --git a/package/qos-scripts/files/usr/lib/qos/tcrules.awk b/package/qos-scripts/files/usr/lib/qos/tcrules
index a19b651..f3e0d3f 100644
--- a/package/qos-scripts/files/usr/lib/qos/tcrules.awk
+++ b/package/qos-scripts/files/usr/lib/qos/tcrules.awk
@@ -79,7 +79,7 @@ END {
 # leaf qdisc
 avpkt = 1200
 for (i = 1; i <= n; i++) {
- print "tc qdisc add dev "device" parent 1:"class[i]"0 handle "class[i]"00: fq_codel"
+ print "tc qdisc add dev "device" parent 1:"class[i]"0 handle "class[i]"00: fq_codel limit 1200"
 }

 # filter rule
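
For readers who do not speak awk, here is a minimal Python sketch of what the patched loop prints: one leaf qdisc command per class, now with an explicit packet limit. The function name, the device name `ge00`, and the class numbers 1 to 4 are assumptions for illustration only; the real values come from the qos-scripts configuration.

```python
def leaf_qdisc_commands(device, classes, limit=1200):
    """Build one 'tc qdisc add' line per class, mirroring the patched awk loop.

    device and classes are illustrative inputs, not values read from a router.
    """
    return [
        f"tc qdisc add dev {device} parent 1:{c}0 handle {c}00: fq_codel limit {limit}"
        for c in classes
    ]


# Hypothetical device and class numbers, purely for illustration.
for line in leaf_qdisc_commands("ge00", [1, 2, 3, 4]):
    print(line)
```

Passing a different `limit` shows how the same template would look with a smaller or larger hard packet cap.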
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
Sebastian Moeller
2012-08-16 06:09:47 UTC
Permalink
Hi Dave,

marvelous.
Post by Dave Taht
Firstly fq_codel will always stay very flat relative to your workload
for sparse streamss such as a ping or voip dns or gaming...
It's good stuff.
And, I think the source of your 2.8 second thing is fq_codel's current
reaction time, the non-responsiveness of the udp flooding netanylzer
uses
and huge default queue depth in openwrt's qos scripts.
git diff tcrules.awk
diff --git a/package/qos-scripts/files/usr/lib/qos/tcrules.awk
b/package/qos-scripts/files/usr/lib/qos/tcrules
index a19b651..f3e0d3f 100644
--- a/package/qos-scripts/files/usr/lib/qos/tcrules.awk
+++ b/package/qos-scripts/files/usr/lib/qos/tcrules.awk
@@ -79,7 +79,7 @@ END {
# leaf qdisc
avpkt = 1200
for (i = 1; i <= n; i++) {
- print "tc qdisc add dev "device" parent 1:"class[i]"0
handle "class[i]"00: fq_codel"
+ print "tc qdisc add dev "device" parent 1:"class[i]"0
handle "class[i]"00: fq_codel limit 1200
}
# filter rule
So openwrt's qos is still at the 10k packet limit for fq_codel? That means a worst-case queue of 14.3 MB (at 1500-byte packets) and a best case of about 0.61 MB (64-byte packets); the worst case would take around 3 seconds to drain, so maybe that is my issue. I will immediately try your patch. Done: netalyzr now reports 1100ms buffering, down from 2800ms (and no "ath: skbuff alloc of size 1926 failed" messages in dmesg, but these did not show up during netalyzr runs). The other UDP stress test now works much better too (reporting around 1200ms uplink buffering), producing no ath allocation failures. Switching to the hifgr downlink version of the test gave me:

[75755.714843] hostapd: page allocation failure: order:0, mode:0x4020
[75755.714843] Call Trace:
[75755.714843] [<80287200>] dump_stack+0x8/0x34
[75755.714843] [<800b4e28>] warn_alloc_failed+0xe8/0x10c
[75755.714843] [<800b712c>] __alloc_pages_nodemask+0x5a0/0x600
[75755.714843] [<800da950>] new_slab+0xa8/0x280
[75755.714843] [<80288c74>] __slab_alloc.isra.60.constprop.63+0x25c/0x2fc
[75755.714843] [<800db4f8>] kmem_cache_alloc+0x38/0xe0
[75755.714843] [<801d1b68>] ag71xx_fill_rx_buf+0x34/0xd8
[75755.714843] [<801d2458>] ag71xx_poll+0x464/0x5f4
[75755.714843] [<801ea3d0>] net_rx_action+0x88/0x1c8
[75755.714843] [<80077458>] __do_softirq+0xa0/0x154
[75755.714843] [<80077668>] do_softirq+0x48/0x68
[75755.714843] [<8007789c>] irq_exit+0x4c/0xb4
[75755.714843] [<80062f8c>] ret_from_irq+0x0/0x4
[75755.714843] [<801757a8>] lzma_main+0x9ec/0xbec
[75755.714843] [<80175ef4>] xz_dec_lzma2_run+0x54c/0x824
[75755.714843] [<801744bc>] xz_dec_run+0x31c/0x8f4
[75755.714843] [<80132e74>] squashfs_xz_uncompress+0x164/0x274
[75755.714843] [<8012f368>] squashfs_read_data+0x4a8/0x660
[75755.714843] [<8012f6f4>] squashfs_cache_get+0x1d4/0x30c
[75755.714843] [<80130be8>] squashfs_readpage+0x56c/0x804
[75755.714843] [<800ba130>] __do_page_cache_readahead+0x1b0/0x22c
[75755.714843] [<800ba4b4>] ra_submit+0x28/0x34
[75755.714843] [<800b2e68>] filemap_fault+0x184/0x3cc
[75755.714843] [<800c7fd4>] __do_fault+0xcc/0x450
[75755.714843] [<800cad5c>] handle_pte_fault+0x330/0x6d4
[75755.714843] [<800cb1b4>] handle_mm_fault+0xb4/0xe0
[75755.714843] [<8006c210>] do_page_fault+0x110/0x350
[75755.714843] [<80062f80>] ret_from_exception+0x0/0xc
[75755.714843]
[75755.714843] Mem-Info:
[75755.714843] Normal per-cpu:
[75755.714843] CPU 0: hi: 18, btch: 3 usd: 5
[75755.714843] active_anon:1493 inactive_anon:2534 isolated_anon:0
[75755.714843] active_file:1623 inactive_file:1944 isolated_file:0
[75755.714843] unevictable:0 dirty:0 writeback:16 unstable:0
[75755.714843] free:95 slab_reclaimable:589 slab_unreclaimable:4876
[75755.714843] mapped:1030 shmem:25 pagetables:163 bounce:0
[75755.714843] Normal free:380kB min:1016kB low:1268kB high:1524kB active_anon:5972kB inactive_anon:10136kB active_file:6492kB inactive_file:7776kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:65024kB mlocked:0kB dirty:0kB writeback:64kB mapped:4120kB shmem:100kB slab_reclaimable:2356kB slab_unreclaimable:19504kB kernel_stack:552kB pagetables:652kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[75755.714843] lowmem_reserve[]: 0 0
[75755.714843] Normal: 57*4kB 19*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 380kB
[75755.714843] 4204 total pagecache pages
[75755.714843] 611 pages in swap cache
[75755.714843] Swap cache stats: add 1899, delete 1288, find 802/926
[75755.714843] Free swap = 973548kB
[75755.714843] Total swap = 976560kB
[75755.714843] 16384 pages RAM
[75755.714843] 973 pages reserved
[75755.714843] 4143 pages shared
[75755.714843] 13118 pages non-shared
[75755.714843] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[75755.714843] cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 2, min order: 0
[75755.714843] node 0: slabs: 0, objs: 0, free: 0
[75755.718750] ge00: out of memory
(I would have loved to try again, but that specific application restricts me to 2 or 3 invocations per 24-hour period, which I already used up; I really need to find another stress tester one of these days.)

But bind survived intact. So thanks for the quick surgery on QoS, which surely improved things by a lot. Shall I try to request this change in openWRT proper? I think that for most home routers, allowing >14MB queues to build up in the device can surely cause havoc to stability (I shudder while thinking about routers with 32 or even 16MB ram, and even these could/should profit from codel; so my take is the limit needs to be scaled with available memory, with a potential ceiling at 10k :) )
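
To make the queue-size arithmetic and the "scale the limit with memory" idea concrete, here is a small Python sketch. The first function reproduces the 14.3 MB and 0.61 MB figures from the 10k packet limit. The second is a purely hypothetical policy: the 5% memory fraction and the 1500-byte packet size are assumptions chosen for illustration, not anything openwrt or cerowrt implements.

```python
MIB = 1024 * 1024


def worst_case_queue_bytes(limit_packets, packet_bytes):
    """Bytes held if the queue fills to its packet limit with packets of one size."""
    return limit_packets * packet_bytes


def limit_for_memory(ram_bytes, fraction=0.05, packet_bytes=1500, ceiling=10000):
    """Hypothetical policy: let a full queue use at most `fraction` of RAM.

    The 5% fraction is an assumption; only the 10k ceiling comes from the thread.
    """
    budget_bytes = int(ram_bytes * fraction)
    return max(1, min(ceiling, budget_bytes // packet_bytes))


print(worst_case_queue_bytes(10000, 1500) / MIB)  # full-size packets
print(worst_case_queue_bytes(10000, 64) / MIB)    # minimum-size packets

for ram_mib in (16, 32, 64):
    print(ram_mib, "MiB RAM ->", limit_for_memory(ram_mib * MIB), "packets")
```

Under these assumed numbers a 64 MB router would end up in the low thousands of packets, and only machines with far more memory would reach the 10k ceiling.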

Thanks again & best regards
Sebastian
Sebastian Moeller
2012-08-20 18:13:25 UTC
Permalink
Hi Dave,

thanks to your patch/instructions below I managed to figure out that, in my case, netalyzr's upstream buffering number depends on the size of the fq_codel limit:

fq_codel limit    reported upstream buffering    netalyzr flag
default (10k?)    2800ms                         (red)
10000             2800ms                         (red)
1200              1300ms                         (red)
600               580ms                          (yellow)
100               97ms                           (green)

(I run 3.3.8-17 with 30000/4000 cable shaped to 29100/3880, 97% of line rate, with the default QOS system)

With line rate at 90% the numbers increase slightly:
10000             2900ms                         (red)

With line rate at 50% the numbers increase massively at the default codel limit:
10000             3900ms                         (red)
1200              1300ms                         (red)

these values are pretty reliable (with no inter-run variability at all; or better it looks like netalyzr quantizes the reported values and suppresses the variation).

With 3.3.8-17's simple_qos.sh (at 97% line rate) I get:
600?              580ms                          (yellow)

With neither simple_qos.sh nor QOS active after a reboot I get:
NA                490ms                          (yellow)
NA                500ms                          (yellow)
(lower than with any qos scheme down to an estimated limit of 500)


Now it looks that, at least in my setup, netalyzr still is able to fill the codel queue somehow (otherwise why would the reported buffering scale with fq_codel limit up to a ceiling?)
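
One way to see the "scales with the limit up to a ceiling" pattern is to divide each reported buffering figure by the configured limit. The Python sketch below does that for the 97% line-rate measurements quoted in this message; the function name is made up for illustration, and the numbers are only as precise as netalyzr's (apparently quantized) reports.

```python
# (fq_codel limit in packets, netalyzr-reported upstream buffering in ms),
# taken from the 97% line-rate measurements in this message.
MEASUREMENTS = [(10000, 2800), (1200, 1300), (600, 580), (100, 97)]


def ms_per_packet(limit_packets, buffering_ms):
    """Reported buffering divided by the configured packet limit."""
    return buffering_ms / limit_packets


for limit, ms in MEASUREMENTS:
    ratio = ms_per_packet(limit, ms)
    print(f"limit {limit:>5}: {ms:>4} ms reported, {ratio:.2f} ms per packet of limit")
```

For limits up to 1200 the ratio stays near 1 ms per packet, which fits a queue that is filled to its limit; at 10000 it falls to well under a third of that, which fits the test ending (or something else capping the queue) before the limit is reached.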
The fq_codel bin the UDP test lands in reports that it is dropping packets (this is with limit 1200):
class fq_codel 400:d7 parent 400:
(dropped 3097, overlimits 0 requeues 0)
backlog 1292522b 1199p requeues 0
deficit 29 count 1403 lastcount 1 ldelay 1.2s dropping drop_next 1.3ms
… and the delay seems spot on with 1.2s or 1200ms. I have no better idea than that, in the short testing period and given the non-responsiveness of the UDP test stream, fq_codel simply does not drop enough packets to make a noticeable dent in the queued-up packet load. Back-of-envelope calculation: it takes around 2500 milliseconds to drop the first ~200 packets in a backlogged fq_codel flow; at netalyzr's ~1000 packets per second rate that leaves roughly 1000 * 2.5 - 200 = 2300 packets in the queue. Since netalyzr will adapt its UDP creation rate somewhat to the available bandwidth, it might not be visible at typical DSL speeds…
The nice thing about fq_codel is that other flows still stay responsive, which is pretty impressive.
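
The back-of-envelope estimate can be checked with a short Python sketch of CoDel's drop spacing, where the gap after the k-th drop is interval / sqrt(k). This is a simplification under stated assumptions: a 100 ms interval, a flow that never leaves the dropping state, and, like the estimate in this message, no accounting for packets that the link itself drains. The function names are made up for illustration.

```python
from math import sqrt


def time_to_drop(n_drops, interval_s=0.100):
    """Seconds CoDel needs for n_drops if the k-th gap is interval / sqrt(k).

    Simplified model: the flow stays in the dropping state the whole time.
    """
    return sum(interval_s / sqrt(k) for k in range(1, n_drops + 1))


def packets_left(n_drops, arrival_pps, interval_s=0.100):
    """Packets that arrived during that time minus the ones dropped.

    Ignores packets drained by the link, as the estimate in the message does.
    """
    return arrival_pps * time_to_drop(n_drops, interval_s) - n_drops


t = time_to_drop(200)
print(f"{t:.2f} s for the first 200 drops")
print(f"about {packets_left(200, 1000):.0f} packets arrived but were not dropped")
print(f"{time_to_drop(200, interval_s=0.050):.2f} s with half the interval")
```

The last line shows that halving the interval, as the ns2 code is said to do elsewhere in this thread, halves the time needed for the same number of drops.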


QUESTION: how do I interpret the following tc output for a fq_codel flow:
class fq_codel 400:1b5 parent 400:
(dropped 3201, overlimits 0 requeues 0)
backlog 1292522b 1199p requeues 0
deficit -891 count 921 lastcount 1 ldelay 1.5s dropping drop_next -4.3ms

Why did drop_next turn negative?
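
As an aid to reading such output, here is a Python sketch that pulls the interesting numbers out of the per-flow statistics quoted above. The parsing is an assumption based only on the text shown in this message, not a general parser for tc output. The interpretation in the last line is likewise an assumption: if drop_next is printed relative to the current time, a negative value would mean the next scheduled drop is already overdue and will happen on the next dequeue.

```python
import re

# Per-flow statistics copied from the message above.
SAMPLE = """\
class fq_codel 400:1b5 parent 400:
 (dropped 3201, overlimits 0 requeues 0)
 backlog 1292522b 1199p requeues 0
 deficit -891 count 921 lastcount 1 ldelay 1.5s dropping drop_next -4.3ms
"""

TO_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}


def parse_flow(text):
    """Extract drop count, backlog and drop_next from one fq_codel class entry."""
    backlog_bytes, backlog_pkts = map(
        int, re.search(r"backlog (\d+)b (\d+)p", text).groups()
    )
    dropped = int(re.search(r"dropped (\d+)", text).group(1))
    value, unit = re.search(r"drop_next (-?[\d.]+)(us|ms|s)", text).groups()
    return {
        "dropped": dropped,
        "backlog_bytes": backlog_bytes,
        "backlog_pkts": backlog_pkts,
        "avg_pkt_bytes": backlog_bytes / backlog_pkts,
        "drop_next_ms": float(value) * TO_MS[unit],
    }


flow = parse_flow(SAMPLE)
print(flow)
# Assumed reading: a negative drop_next means the scheduled drop time has passed.
print("next drop already overdue:", flow["drop_next_ms"] < 0)
```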



best
Sebastian
Dave Taht
2012-08-16 04:08:50 UTC
Permalink
re: ath: skbuff alloc of size 1926 failed

as for the ath skbuff problem, I've seen that a lot. I had put hard
packet limits (~600) on fq_codel in -11 and prior that were too low
and it mostly went away, but I hit tail drop behavior everywhere,
instead of codel behavior. What I have now (typically 1200) may well
be too high, but not as overly high as the default (10k packets).
There may be another means of increasing the size of that slab pool or
making it less onerous.

I would like it if codel "kicked in" earlier than it currently does.
The code in ns2 is currently using half the period that the linux code
is. This would control things better, or so I hope (planning on trying
this as I get time)

I am also considering means of artificially upscaling the drop
scheduler when we get close to queue limits.

See some discussions on the codel list for these issues. (sims are
easier to deal with than cerowrt, too!)

as for bind, it should be automagically restarted from xinetd, no need
to fiddle with anything. However, since you are already under massive
memory pressure, it may well fail to start up that way, too. At the
moment, I've largely given up on bind on anything but a more core home
gw, and am running dnsmasq on everything (3700v2, picostations,
nanostations) but the 3800s. (and the ones I run it on, aren't being
used for wifi right now).

Lastly: Swap space won't help you on exhausting kernel limits.

I'm glad you can reproduce the ath: slab problem - I can get it too at
high rates using netperf over wifi. I will try a 3700v2 with and
without bind to see if it's still there in 3.3.8-17. In the meantime
if anyone knows how to get more allocations in that (2048? 4096?) slab
by default, perhaps that will help?
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
Sebastian Moeller
2012-08-16 05:15:35 UTC
Permalink
Hi Dave,

thanks for the detailed response...
Post by Dave Taht
re: ath: skbuff alloc of size 1926 failed
as for the ath skbuff problem, I've seen that a lot. I had put hard
packet limits (~600) on fq_codel in -11 and prior that were too low
and it mostly went away, but I hit tail drop behavior everywhere,
instead of codel behavior. What I have now (typically 1200) may well
be too high, but not as overly high as the default (10k packets).
Question: is this limit per interface, per flow, or per fq bin?
Post by Dave Taht
There may be another means of increasing the size of that slab pool or
making it less onerous.
Interesting idea, I will have a look at that...
Post by Dave Taht
I would like it if codel "kicked in" earlier than it currently does.
The code in ns2 is currently using half the period that the linux code
is. This would control things better, or so I hope (planning on trying
this as I get time)
I am also considering means of artificially upscaling the drop
scheduler when we get close to queue limits.
See some discussions on the codel list for these issues. (sims are
easier to deal with than cerowrt, too!)
Ah great, more goodness on the way to cerowrt I hope :)
Post by Dave Taht
as for bind, it should be automagically restarted from xinetd, no need
to fiddle with anything. However, since you are already under massive
memory pressure, it may well fail to start up that way, too.
Well, once bind is gone and the measurement is over, the memory pressure is gone and there should be enough memory for bind to start (will check that hypothesis later). But trying to start it manually with something like 23MB free did not work, so certainly I was doing something wrong (or the OOM killer took out more than just bind, but that is hard to say as nothing showed up in dmesg or in logread -f about the OOM killer, so maybe bind died from other causes).
Post by Dave Taht
At the
moment, I've largely given up on bind on anything but a more core home
gw, and am running dnsmasq on everything (3700v2, picostations,
nanostations) but the 3800s. (and the ones I run it on, aren't being
used for wifi right now).
Ah, that should free some MBs for queues to grow in :)
Post by Dave Taht
Lastly: Swap space won't help you on exhausting kernel limits.
I had the naive hope that the swap would allow to push bind's memory out to the page file and give the kernel some more room to breathe, but that did only work to some degree. (In 3.3.8-6 one of the UDP storm tests I did made the router reboot like every other day, adding swap turned this into survival with killed bind and non-functional DNS; I am not sure in retrospect whether adding swap was such a good idea, as after the sudden reboots the router was at least functional again :))
Post by Dave Taht
I'm glad you can reproduce the ath: slab problem - I can get it too at
high rates using netperf over wifi.
I always wanted to stress this with netperf, but somehow was never able to find a netperf server outside of my cable modem with which to recreate my failure mode...
Post by Dave Taht
I will try a 3700v2 with and
without bind to see if it's still there in 3.3.8-17. In the meantime
if anyone knows how to get more allocations in that (2048? 4096?) slab
by default, perhaps that will help?
Thanks so much for all the hard work and such a fun toy to play with…

Sebastian
Sebastian Moeller
2012-08-20 18:24:12 UTC
Permalink
Hi Dave,

so I went to play around with this a bit more. I turned to UDP flooding my cable modem through the router, and this surely allows me to create enough load on the wndr3700v2 to cause the allocation errors and, as a "bonus", also to drive the router to reboot (driven by the watchdog timer?). Here is the script I used over 5GHz wireless (from http://blog.ioshints.info/2008/03/udp-flood-in-perl.html):

#!/usr/bin/perl
##############
# udp flood.
##############

use strict;
use warnings;
use Socket;

if ($#ARGV != 3) {
    print "flood.pl <ip> <port> <size> <time>\n\n";
    print " port=0: use random ports\n";
    print " size=0: use random size between 64 and 1024\n";
    print " time=0: continuous flood\n";
    exit(1);
}

my ($ip, $port, $size, $time) = @ARGV;

my ($iaddr, $endtime, $psize, $pport);

$iaddr = inet_aton("$ip") or die "Cannot resolve hostname $ip\n";
# time=0 means "run (practically) forever"
$endtime = time() + ($time ? $time : 1000000);

# protocol number 17 is UDP
socket(my $flood, PF_INET, SOCK_DGRAM, 17) or die "Cannot create socket: $!\n";

print "Flooding $ip " . ($port ? $port : "random") . " port with " .
    ($size ? "$size-byte" : "random size") . " packets" .
    ($time ? " for $time seconds" : "") . "\n";
print "Break with Ctrl-C\n" unless $time;

for (; time() <= $endtime;) {
    # size=0 / port=0 pick a fresh random size / destination port per packet
    $psize = $size ? $size : int(rand(1024-64)+64);
    $pport = $port ? $port : int(rand(65500))+1;

    send($flood, pack("a$psize", "flood"), 0, pack_sockaddr_in($pport, $iaddr));
}

called as either
udp_flood.pl 192.168.100.1 0 1024 240
or
udp_flood.pl 192.168.100.1 32000 1024 240

The first version with randomized port number spreads the load nicely over many fq_codel bins/flows and seems slightly more likely to cause allocation errors and reboots than the 2nd invocation which restricts itself to port 32000 and presumably just one flow.
I wonder how to make cerowrt survive this kind of stress test…
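
The difference between the two invocations can be illustrated with a toy model of how fq_codel assigns packets to flow queues: it hashes each packet's addresses, ports and protocol into one of (by default) 1024 buckets. The Python sketch below uses CRC32 as a stand-in hash, which is an assumption made for the sake of a deterministic example and not the hash the kernel uses; the source address and source port are also made up.

```python
import random
import zlib

FLOWS = 1024  # fq_codel's default number of flow buckets


def bucket(src, dst, sport, dport, proto=17, flows=FLOWS):
    """Map a 5-tuple to a bucket. CRC32 is a stand-in for the kernel's flow hash."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return zlib.crc32(key) % flows


def buckets_used(dports, src="172.30.42.10", dst="192.168.100.1", sport=40000):
    """Number of distinct buckets hit by packets that differ only in dest port.

    The source address and source port are hypothetical placeholders.
    """
    return len({bucket(src, dst, sport, p) for p in dports})


rng = random.Random(1)
# Same port range as the script's int(rand(65500))+1.
random_ports = [rng.randint(1, 65500) for _ in range(5000)]

print("random ports:", buckets_used(random_ports), "buckets")
print("port 32000:  ", buckets_used([32000] * 5000), "bucket")
```

In this model the fixed-port flood stays in a single bucket, while the random-port flood touches most of the 1024 buckets, which matches the observation that the first invocation spreads the load over many fq_codel flows.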

best
Sebastian
d***@reed.com
2012-08-21 02:33:17 UTC
Permalink
I'm confused. Fq_codel does not (to my knowledge) fix bufferbloat *in your cable modem* or *in the CMTS head-end*.

How could it? In order for that to be fixed, you need to manage the buffer in the cable modem itself.

-----Original Message-----
From: "Sebastian Moeller" <***@gmx.de>
Sent: Monday, August 20, 2012 2:24pm
To: "Dave Taht" <***@gmail.com>
Cc: cerowrt-***@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] cerowrt 3.3.8-17 is released



Hi Dave,

so I went to play around with this a bit more. I turned to UDP flooding my cable modem through the router and this surely allows me to create enough load on the wndr3700v2 to cause the allocation errors and as a "bonus" also to drive the router to reboot (driven by the watchdog timer?). Here is the script I used over 5G wireless (from http://blog.ioshints.info/2008/03/udp-flood-in-perl.html)

#!/usr/bin/perl
##############

# udp flood.
##############

use Socket;
use strict;

if ($#ARGV != 3) {
print "flood.pl <ip> <port> <size> <time>\n\n";
print " port=0: use random ports\n";
print " size=0: use random size between 64 and 1024\n";
print " time=0: continuous flood\n";
exit(1);
}

my ($ip,$port,$size,$time) = @ARGV;

my ($iaddr,$endtime,$psize,$pport);

$iaddr = inet_aton("$ip") or die "Cannot resolve hostname $ip\n";
$endtime = time() + ($time ? $time : 1000000);

socket(flood, PF_INET, SOCK_DGRAM, 17);


print "Flooding $ip " . ($port ? $port : "random") . " port with " .
($size ? "$size-byte" : "random size") . " packets" .
($time ? " for $time seconds" : "") . "\n";
print "Break with Ctrl-C\n" unless $time;

for (;time() <= $endtime;) {
$psize = $size ? $size : int(rand(1024-64)+64) ;
$pport = $port ? $port : int(rand(65500))+1;

send(flood, pack("a$psize","flood"), 0, pack_sockaddr_in($pport, $iaddr));}

called as either
udp_flood.pl 192.168.100.1 0 1024 240
or
udp_flood.pl 192.168.100.1 32000 1024 240

The first version with randomized port number spreads the load nicely over many fq_codel bins/flows and seems slightly more likely to cause allocation errors and reboots than the 2nd invocation which restricts itself to port 32000 and presumably just one flow.
I wonder how to make cerowrt survive this kind of stress test


best
Sebastian
Post by Dave Taht
re: ath: skbuff alloc of size 1926 failed
as for the ath skbuff problem, I've seen that a lot. I had put hard
packet limits (~600) on fq_codel in -11 and prior that were too low
and it mostly went away, but I hit tail drop behavior everywhere,
instead of codel behavior. What I have now (typically 1200) may well
be too high, but not as overly high as the default (10k packets).
There may be another means of increasing the size of that slab pool or
making it less onerous.
I would like it if codel "kicked in" earlier than it currently does.
The code in ns2 is currently using half the period that the linux code
is. This would control things better, or so I hope (planning on trying
this as I get time)
I am also considering means of artificially upscaling the drop
scheduler when we get close to queue limits.
See some discussions on the codel list for these issues. (sims are
easier to deal with than cerowrt, too!)
as for bind, it should be automagically restarted from xinetd, no need
to fiddle with anything. However, since you are already under massive
memory pressure, it may well fail to start up that way, too. At the
moment, I've largely given up on bind on anything but a more core home
gw, and am running dnsmasq on everything (3700v2, picostations,
nanostations) but the 3800s. (and the ones I run it on, aren't being
used for wifi right now).
Lastly: Swap space won't help you on exhausting kernel limits.
I'm glad you can reproduce the ath: slab problem - I can get it too at
high rates using netperf over wifi. I will try a 3700v2 with and
without bind to see if it's still there in 3.3.8-17. In the meantime
if anyone knows how to get more allocations in that (2048? 4096?) slab
by default, perhaps that will help?
Post by Maciej Soltysiak
Hi Dave,
great work, as always. I upgraded my production router to the latest and greatest (since I only have one router…). And it works quite well for normal usage…
Netalyzr reports around 2800 ms of uplink buffering, yet saturating the uplink does not noticeably affect ping times to a remote target, basically the same as for all codellized cero versions I tested so far...
Aug 15 01:16:29 nacktmulle kern.err kernel: [175395.132812] ath: skbuff alloc of size 1926 failed
(and plenty of those…).
Stopping isc-bind
/etc/chroot/named//var/run/named/named.pid not found, trying brute force
killall: named: no process killed
Kicking isc-bind in xinetd
rndc: connect failed: 127.0.0.1#953: connection refused
And bind does not start again and the router becomes less than useful. I assume I am doing something wrong, but what? If you have any idea how to solve this short of rebooting the router (my current method), I would be happy to learn.
best regards
sebastian
Post by Dave Taht
I'm too tired to write up a full set of release notes, but I've been
testing it all day,
and it looks better than -10 and certainly better than -11, but I won't know
until some more folk sit down and test it, so here it is.
http://huchra.bufferbloat.net/~cero1/3.3/3.3.8-17/
fresh merge with openwrt, fix to a bind CVE, fixes for 6in4 and quagga
routing problems,
and a few tweaks to fq_codel setup that might make voip better.
Go forth and break things!
Van Jacobson gave a great talk about bufferbloat, BQL, codel, and fq_codel
at last week's ietf meeting. Well worth watching. At the end he outlines
the deployment problems in particular.
http://recordings.conf.meetecho.com/Recordings/watch.jsp?recording=IETF84_TSVAREA&chapter=part_3
Far more interesting than this email!
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
_______________________________________________
Cerowrt-devel mailing list
https://lists.bufferbloat.net/listinfo/cerowrt-devel
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
Marchon
2012-08-21 02:44:52 UTC
Permalink
The fq_codel settings prevent excess buffering as data is pushed out over the cable modem itself. They also prevent unnecessary, premature reduction of the TCP sliding window size, which in turn prevents a cascading backlog that would shrink the window further and slow down the overall outbound transfer rate.

A buffering problem in the cable modem only happens if you feed it data too quickly.
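
One common way to avoid feeding a modem too quickly is to shape outbound traffic to slightly below its uplink rate, so the queue forms in the router instead of in the modem. The token bucket below is a minimal JavaScript sketch of that idea. It is not how cerowrt's scripts are implemented, and the class name, rate and burst values are hypothetical.

```javascript
// Token-bucket shaper sketch: release packets no faster than a configured rate.
// Times are in milliseconds and the rate in bytes per millisecond to keep the math exact.
class TokenBucket {
  constructor(rateBytesPerMs, burstBytes) {
    this.rate = rateBytesPerMs;
    this.burst = burstBytes;
    this.tokens = burstBytes; // start with a full bucket
    this.last = 0;            // time of the last refill, in ms
  }

  // Returns true if a packet of `size` bytes may be sent at time `nowMs`.
  trySend(size, nowMs) {
    const elapsed = nowMs - this.last;
    this.tokens = Math.min(this.burst, this.tokens + elapsed * this.rate);
    this.last = nowMs;
    if (size > this.tokens) return false; // packet waits in the router's own queue
    this.tokens -= size;
    return true;
  }
}

// Hypothetical numbers: 125 bytes/ms (about 1 Mbit/s) with a 3000-byte burst.
const shaper = new TokenBucket(125, 3000);
console.log(shaper.trySend(1500, 0));  // burst allows the first packet
console.log(shaper.trySend(1500, 0));  // and the second
console.log(shaper.trySend(1500, 0));  // bucket is empty, packet must wait
console.log(shaper.trySend(1500, 12)); // 12 ms later 1500 bytes have refilled
```

Packets that are refused stay queued in the router, which is where a queue discipline such as fq_codel can then manage them.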
Post by d***@reed.com
I'm confused. Fq_codel does not (to my knowledge) fix bufferbloat *in your cable modem* or *in the CMTS head-end*.
How could it? In order for that to be fixed, you need to manage the buffer in the cable modem itself.
Sebastian Moeller
2012-08-21 05:28:20 UTC
Permalink
Hi there,

again, the UDP flood test was not about the cable modem, just about resilience under (extreme) load: plain bread-and-butter router functionality, nothing fancy :) As it stands, the test causes a mix of very slow processing due to allocation errors, occasional OOM situations, and full reboots. IMHO surviving the stress without these unfortunate outcomes would be preferable…


best
Sebastian
Post by Marchon
The fq_codel settings prevent excess buffering as data is pushed out over the cable modem itself. They also prevent unnecessary, premature reduction of the TCP sliding window size, which in turn prevents a cascading backlog that would shrink the window further and slow down the overall outbound transfer rate.
A buffering problem in the cable modem only happens if you feed it data too quickly.
d***@reed.com
2012-08-22 18:23:42 UTC
Permalink
Ahhh, thanks, Sebastian. Now I understand what you are doing is a stress test.

-----Original Message-----
From: "Sebastian Moeller" <***@gmx.de>
Sent: Tuesday, August 21, 2012 1:28am
To: "Marchon" <***@gmail.com>
Cc: "***@reed.com" <***@reed.com>, "cerowrt-***@lists.bufferbloat.net" <cerowrt-***@lists.bufferbloat.net>
Subject: Re: [Cerowrt-devel] cerowrt 3.3.8-17 is released



Hi there,

again, the UDP flood test was not about the cable modem, just about resilience under (extreme) load: plain bread-and-butter router functionality, nothing fancy :) As it stands, the test causes a mix of very slow processing due to allocation errors, occasional OOM situations, and full reboots. IMHO surviving the stress without these unfortunate outcomes would be preferable…



best
Sebastian
Dave Taht
2012-08-22 18:54:15 UTC
Permalink
Yes. As it is highly important to me to only get bug reports like this one

http://www.bufferbloat.net/issues/330

only every 5 years or so, surviving stress testing is paramount.

To wit, I'm planning on dropping bind from the next release,
switching to dnsmasq and busybox ntp, and disabling or dropping the
underused polipo proxy.

This would free up somewhere around 8-12MB of memory on boot.

in addition, eric dumazet has proposed two patches on the codel
mailing list that should reduce skb (packet buffer) size when needed
and possible.

https://lists.bufferbloat.net/pipermail/codel/2012-August/000422.html

I'm also looking into a possible memory leak on an error path...

And:

As noted in a prior message there are also some improvements to codel
that could be made, particularly to have a shared buffer cache across
multiple hardware queues.

I would like to be able to thoroughly stress test codel and fq_codel
in the next release of cerowrt.

on 128MB systems such as the 3800 I would hope that we'd have enough
memory for things like bind, but as there are also 32MB systems in the
openwrt mix, doing what we can, throughout the stack, to not run out,
is a good thing.
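
For a rough sense of scale, the JavaScript sketch below estimates the worst-case memory one full queue can pin at the packet limits mentioned earlier in the thread (about 600, 1200, and the 10240-packet fq_codel default). It assumes each queued packet occupies one 2048-byte slab object, which is where a 1926-byte allocation would round up to. Real per-packet overhead is higher, so treat the numbers as a lower bound; queueMemoryMiB is just an illustrative helper name.

```javascript
// Rough worst-case memory held by one completely full queue.
// Assumption: one 2048-byte slab object per queued packet (skb overhead ignored).
const SLAB_BYTES = 2048;
const MIB = 1024 * 1024;

function queueMemoryMiB(packetLimit) {
  return (packetLimit * SLAB_BYTES) / MIB;
}

for (const limit of [600, 1200, 10240]) {
  console.log(`limit ${limit}: ~${queueMemoryMiB(limit).toFixed(2)} MiB`);
}
// limit 600: ~1.17 MiB
// limit 1200: ~2.34 MiB
// limit 10240: ~20.00 MiB
```

On a 32MB system a single queue at the default limit could in principle hold well over half of RAM, and that is per queue, before counting anything else on the box.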
Post by d***@reed.com
Ahhh, thanks, Sebastian. Now I understand what you are doing is a stress test.
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
Kenneth Finnegan
2012-08-22 19:23:39 UTC
Permalink
and disabling or dropping the underused polipo proxy -
I think the proxy being under-used could be fixed if we had CeroWRT
optionally advertise wpad when you start Polipo. When enabled, we
would just need the router to resolve wpad.local.domain the same as
gw.local.domain, and serve a gw.local.domain:80/wpad.dat file
containing something like:

function FindProxyForURL(url, host){
  if (isInNet(host, "172.30.42.0", "255.255.255.0")) {
    return "DIRECT";
  }
  return "PROXY gw.local.domain:3128; DIRECT";
}
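
To try this PAC logic outside a browser, a minimal JavaScript harness could look like the following. Browsers supply isInNet() themselves; the stand-in here is an assumption that only understands dotted-quad IPv4 literals and does no DNS lookups, and ipToInt is a hypothetical helper.

```javascript
// Convert "a.b.c.d" to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split(".").reduce((acc, octet) => (acc * 256 + Number(octet)) >>> 0, 0);
}

// Stand-in for the browser-provided isInNet(); IPv4 literals only.
// (A real browser would first resolve a hostname to an address.)
function isInNet(host, pattern, mask) {
  if (!/^\d{1,3}(\.\d{1,3}){3}$/.test(host)) return false;
  const m = ipToInt(mask);
  return ((ipToInt(host) & m) >>> 0) === ((ipToInt(pattern) & m) >>> 0);
}

// Same logic as the wpad.dat above: LAN hosts go direct, everything else via the proxy.
function FindProxyForURL(url, host) {
  if (isInNet(host, "172.30.42.0", "255.255.255.0")) {
    return "DIRECT";
  }
  return "PROXY gw.local.domain:3128; DIRECT";
}

console.log(FindProxyForURL("http://172.30.42.1/", "172.30.42.1"));
console.log(FindProxyForURL("http://example.com/", "example.com"));
```

The trailing "; DIRECT" in the proxy string is the fallback browsers use when the proxy does not answer.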

WPAD is really how the proxy-on-a-LAN experience should be. The HUGE
issue with WPAD is that browsers (at least Firefox) switch to
resolving all DNS queries synchronously instead of async when they
detect a wpad configured network. Any gains from caching what little
web content is (advertised) as cacheable are lost many times over when
every DNS request causes the Firefox UI to FREEZE. Hit a page with
several different domains on it (and what websites don't make you
resolve analytics.google.com, twitter.com, plus.google.com, digg.com,
reddit.com, etc etc) and the entire Firefox GUI locks up for several
seconds.

https://bugzilla.mozilla.org/show_bug.cgi?id=769764

Just some food for thought. I would agree that in the face of memory
pressure, it should be one of the first things to go; the vast
majority of web servers aren't even configured correctly to mark
cacheable content, so caching is usually forced by writing
pattern-matching rules which over-ride the (non-existent) caching
meta-data.

Kenneth Finnegan
blog.thelifeofkenneth.com
Dave Taht
2012-08-22 20:44:54 UTC
Permalink
On Wed, Aug 22, 2012 at 12:23 PM, Kenneth Finnegan
Post by Kenneth Finnegan
and disabling or dropping the underused polipo proxy -
I think the proxy being under-used could be fixed if we had CeroWRT
optionally advertise wpad when you start Polipo. When enabled, we
would just need the router to resolve wpad.local.domain the same as
gw.local.domain, and serve a gw.local.domain:80/wpad.dat file
function FindProxyForURL(url, host){
if (isInNet(host, "172.30.42.0", "255.255.255.0")) {
return "DIRECT";
}
return "PROXY gw.local.domain:3128; DIRECT";
}
I note that the dns entry wpad.home.lan is enabled by default in
cero's implementation of bind, and cero is distributing this
information via dhcp as well, but dhcp alone seems not enough.
Enabling the pac file makes sense...
Post by Kenneth Finnegan
WPAD is really how the proxy-on-a-LAN experience should be. The HUGE
issue with WPAD is that browsers (at least Firefox) switch to
resolving all DNS queries synchronously instead of async when they
detect a wpad configured network. Any gains from caching what little
web content is (advertised) as cacheable are lost many times over when
every DNS request causes the Firefox UI to FREEZE. Hit a page with
several different domains on it (and what websites don't make you
resolve analytics.google.com, twitter.com, plus.google.com, digg.com,
reddit.com, etc etc) and the entire Firefox GUI locks up for several
seconds.
https://bugzilla.mozilla.org/show_bug.cgi?id=769764
DNS queries should be resolved on the proxy, methinks. I'm not sure if
what this bug describes is the blocking you are describing.
Post by Kenneth Finnegan
Just some food for thought. I would agree that in the face of memory
pressure, it should be one of the first things to go; the vast
majority of web servers aren't even configured correctly to mark
cacheable content, so caching is usually forced by writing
pattern-matching rules which over-ride the (non-existent) caching
meta-data.
My principal reasons for wanting to bring the concept of proxying back
into the realm of the home router are multi-fold, but don't actually
involve caching (as that would require setting up a usb memory stick
to do well).

In the age when proxies ruled the earth, and wireless would actually
drop packets (1995-2005), it made a lot of sense to have a web proxy
on the wired/wifi boundary.

1) short RTTs compensate for excessive delays and packet loss on the
wireless side, while providing an accurate RTT (and some buffering) to
the wired-to-the-internet side
2) it makes ipv6 to ipv4 translation much easier - the
wpad method can just as easily point to an ipv6 address.

There were huge threads regarding the advantages and disadvantages of
"split tcp" in the early days of the bloat list. Example:

https://lists.bufferbloat.net/pipermail/bloat/2011-February/000101.html

Now that we have the beginnings of a sane drop strategy in place, and
bloat has been thoroughly smashed through the stack (I am one line
away from backporting "TCP small queues" btw), I think the overhead of
running a web proxy on the router is low, and it could show benefit in
the general case - keeping dns queries local, smoothing out wifi
access patterns, and making possible the more native ipv6 transition
(and testing) noted above. I really, really, really want to beat up on
ipv6 as hard as possible...

That said, what I care about right now in this upcoming release is
that it not crash under stress, and that I can get some good data back
as to codel's behavior when not in such a tightly constrained memory
environment. And/or find a memory leak.

I will probably leave polipo enabled, if I can convince someone to
test the current configuration... (hint, hint)
Post by Kenneth Finnegan
Kenneth Finnegan
blog.thelifeofkenneth.com
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
Sebastian Moeller
2012-08-21 05:23:53 UTC
Permalink
Hi,
Post by d***@reed.com
I'm confused. Fq_codel does not (to my knowledge) fix bufferbloat *in your cable modem* or *in the CMTS head-end*.
How could it? In order for that to be fixed, you need to manage the buffer in the cable modem itself.
Oh, I agree; the implicit topic of the quoted mail was a report on my ongoing struggle with cerowrt's instability under load. Dave has a number of issues in the cerowrt bug tracker in that regard as well. I had just figured out how to UDP-stress my router without having to use the web service I used before. UDP flooding my modem is not intended to tell me anything about the CPE or CMTS; I just need an IP under my control on the other side of the router that I can direct the load test at without it being considered a DoS attempt, hence my own cable modem as the endpoint…
I have the feeling that a home router is so much more useful if it does not crash or reboot itself when just asked to do its job, namely routing packets. And cerowrt is close to that goal if the memory allocation issues can be fixed, at least that is my hope… From my layman's perspective it sounds so simple: just drop all incoming packets that the router will not be able to handle, and keep hanging in there awaiting better times with less traffic. So I am fine with the router not doing anything useful during the UDP flood, but I sure hope it recovers quickly thereafter.
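
As a layman's sketch of that "just drop what you cannot handle" idea, here is a bounded FIFO in JavaScript that discards new arrivals once it is full instead of allocating more memory. This is plain tail drop, the behavior Dave would rather replace with codel's smarter dropping, so it only illustrates how a hard limit bounds memory. The class name and the limit of 1200 (borrowed from the thread) are illustrative.

```javascript
// Bounded FIFO with tail drop: when the queue is full, new arrivals are
// discarded rather than growing the queue.
class BoundedQueue {
  constructor(limit) {
    this.limit = limit;
    this.items = [];
    this.dropped = 0;
  }

  // Returns true if the packet was queued, false if it was dropped.
  enqueue(pkt) {
    if (this.items.length >= this.limit) {
      this.dropped++;
      return false;
    }
    this.items.push(pkt);
    return true;
  }

  // Removes and returns the oldest packet, or undefined if empty.
  dequeue() {
    return this.items.shift();
  }
}

// Simulate a flood of 5000 packets arriving faster than they can be sent.
const floodQueue = new BoundedQueue(1200);
for (let i = 0; i < 5000; i++) floodQueue.enqueue({ id: i });
console.log(floodQueue.items.length, "queued,", floodQueue.dropped, "dropped");
```

Memory use stays capped at the limit no matter how long the flood lasts, and the queue drains normally once traffic eases.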

best
Sebastian
Post by d***@reed.com
-----Original Message-----
Sent: Monday, August 20, 2012 2:24pm
Subject: Re: [Cerowrt-devel] cerowrt 3.3.8-17 is released
Hi Dave,
so I went to play around with this a bit more. I turned to UDP flooding my cable modem through the router, and this surely allows me to create enough load on the wndr3700v2 to cause the allocation errors and, as a "bonus", also to drive the router to reboot (driven by the watchdog timer?). Here is the script I used over 5 GHz wireless (from http://blog.ioshints.info/2008/03/udp-flood-in-perl.html):
#!/usr/bin/perl
##############
# udp flood.
##############
use Socket;
use strict;
if ($#ARGV != 3) {
print "flood.pl <ip> <port> <size> <time>\n\n";
print " port=0: use random ports\n";
print " size=0: use random size between 64 and 1024\n";
print " time=0: continuous flood\n";
exit(1);
}
# take the four arguments checked above from the command line
my ($ip,$port,$size,$time) = @ARGV;
my ($iaddr,$endtime,$psize,$pport);
$iaddr = inet_aton("$ip") or die "Cannot resolve hostname $ip\n";
$endtime = time() + ($time ? $time : 1000000);
socket(flood, PF_INET, SOCK_DGRAM, 17);
print "Flooding $ip " . ($port ? $port : "random") . " port with " .
($size ? "$size-byte" : "random size") . " packets" .
($time ? " for $time seconds" : "") . "\n";
print "Break with Ctrl-C\n" unless $time;
for (;time() <= $endtime;) {
$psize = $size ? $size : int(rand(1024-64)+64) ;
$pport = $port ? $port : int(rand(65500))+1;
send(flood, pack("a$psize","flood"), 0, pack_sockaddr_in($pport, $iaddr));
}
called as either
udp_flood.pl 192.168.100.1 0 1024 240
or
udp_flood.pl 192.168.100.1 32000 1024 240
The first version with randomized port number spreads the load nicely over many fq_codel bins/flows and seems slightly more likely to cause allocation errors and reboots than the 2nd invocation which restricts itself to port 32000 and presumably just one flow.
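To illustrate why the random-port run lands in many fq_codel flows while the fixed-port run lands in one, here is a minimal sketch. It assumes flows are chosen by hashing the packet's addresses, protocol and ports into 1024 buckets (fq_codel's default flow count, as far as I know). CRC32 only stands in for the kernel's real hash, and the source address and source port are hypothetical.

```python
import random
import zlib

NUM_BUCKETS = 1024  # assumed number of fq_codel flow buckets


def bucket(src_ip, dst_ip, proto, src_port, dst_port, buckets=NUM_BUCKETS):
    """Map a flow 5-tuple to a bucket index (CRC32 stands in for the kernel hash)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % buckets


def buckets_used(dst_ports, src_ip="192.0.2.10", src_port=40000):
    """Count distinct buckets hit by UDP packets sent to 192.168.100.1."""
    return len({bucket(src_ip, "192.168.100.1", 17, src_port, p) for p in dst_ports})


rng = random.Random(1)  # fixed seed so the run is repeatable
random_ports = [rng.randint(1, 65500) for _ in range(5000)]  # like "port=0"
fixed_port = [32000] * 5000                                   # like "port=32000"

print("random ports hit", buckets_used(random_ports), "buckets")
print("fixed port hits ", buckets_used(fixed_port), "bucket")
```

With one destination port every packet hashes to the same bucket, so the whole flood shares a single queue; with random ports the load is spread across most of the table.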
I wonder how to make cerowrt survive this kind of stress test…
best
Sebastian
Post by Dave Taht
re: ath: skbuff alloc of size 1926 failed
as for the ath skbuff problem, I've seen that a lot. I had put hard
packet limits (~600) on fq_codel in -11 and prior that were too low
and it mostly went away, but I hit tail drop behavior everywhere,
instead of codel behavior. What I have now (typically 1200) may well
be too high, but not as overly high as the default (10k packets).
There may be another means of increasing the size of that slab pool or
making it less onerous.
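A rough worked example of why the packet limit matters for memory. This is only a sketch: it assumes every queued packet pins one 2048-byte slab object (the failed allocation above was 1926 bytes) and it ignores per-skb overhead, so real usage will be higher. It takes "10k" to mean 10240 packets.

```python
SLAB_OBJECT_BYTES = 2048  # assumption: a 1926-byte buffer is served from a 2048-byte slab


def queue_memory_mib(packet_limit, bytes_per_packet=SLAB_OBJECT_BYTES):
    """Worst-case memory held by one full queue, in MiB."""
    return packet_limit * bytes_per_packet / (1024 * 1024)


for label, limit in [("-11 and prior", 600), ("current", 1200), ("default", 10240)]:
    print(f"{label:>13}: {limit:>5} packets -> {queue_memory_mib(limit):5.2f} MiB")
```

That is per qdisc instance; with several interfaces each carrying its own queue, the worst case multiplies accordingly.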
I would like it if codel "kicked in" earlier than it currently does.
The code in ns2 is currently using half the period that the linux code
is. This would control things better, or so I hope (planning on trying
this as I get time)
I am also considering means of artificially upscaling the drop
scheduler when we get close to queue limits.
See some discussions on the codel list for these issues. (sims are
easier to deal with than cerowrt, too!)
as for bind, it should be automagically restarted from xinetd, no need
to fiddle with anything. However, since you are already under massive
memory pressure, it may well fail to start up that way, too. At the
moment, I've largely given up on bind on anything but a more core home
gw, and am running dnsmasq on everything (3700v2, picostations,
nanostations) but the 3800s. (and the ones I run it on, aren't being
used for wifi right now).
Lastly: Swap space won't help you on exhausting kernel limits.
I'm glad you can reproduce the ath: slab problem - I can get it too at
high rates using netperf over wifi. I will try a 3700v2 with and
without bind to see if it's still there in 3.3.8-17. In the meantime
if anyone knows how to get more allocations in that (2048? 4096?) slab
by default, perhaps that will help?
Post by Maciej Soltysiak
Hi Dave,
great work, as always. I upgraded my production router to the latest and greatest (since I only have one router…). And it works quite well for normal usage…
Netalyzr reports around 2800 ms of uplink buffering, yet saturating the uplink does not affect ping times to a remote target noticeably, basically the same as for all codel-enabled cero versions I have tested so far...
Aug 15 01:16:29 nacktmulle kern.err kernel: [175395.132812] ath: skbuff alloc of size 1926 failed
(and plenty of those…).
Stopping isc-bind
/etc/chroot/named//var/run/named/named.pid not found, trying brute force
killall: named: no process killed
Kicking isc-bind in xinetd
rndc: connect failed: 127.0.0.1#953: connection refused
And bind does not start again and the router becomes less than useful. I assume I am doing something wrong, but what? If you have any idea how to solve this short of rebooting the router (my current method), I would be happy to learn.
best regards
sebastian
Post by Dave Taht
I'm too tired to write up a full set of release notes, but I've been
testing it all day,
and it looks better than -10 and certainly better than -11, but I won't know
until some more folk sit down and test it, so here it is.
http://huchra.bufferbloat.net/~cero1/3.3/3.3.8-17/
fresh merge with openwrt, fix to a bind CVE, fixes for 6in4 and quagga
routing problems,
and a few tweaks to fq_codel setup that might make voip better.
Go forth and break things!
Van Jacobson gave a great talk about bufferbloat, BQL, codel, and fq_codel
at last week's ietf meeting. Well worth watching. At the end he outlines
the deployment problems in particular.
http://recordings.conf.meetecho.com/Recordings/watch.jsp?recording=IETF84_TSVAREA&chapter=part_3
Far more interesting than this email!
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
_______________________________________________
Cerowrt-devel mailing list
https://lists.bufferbloat.net/listinfo/cerowrt-devel
Török Edwin
2012-08-17 08:52:28 UTC
Post by Dave Taht
I'm too tired to write up a full set of release notes, but I've been
testing it all day,
and it looks better than -10 and certainly better than -11, but I won't know
until some more folk sit down and test it, so here it is.
http://huchra.bufferbloat.net/~cero1/3.3/3.3.8-17/
fresh merge with openwrt, fix to a bind CVE, fixes for 6in4 and quagga
routing problems,
and a few tweaks to fq_codel setup that might make voip better.
Go forth and break things!
Hi,

This is the first cerowrt that I tried on my router (was using Openwrt before), and I'm quite happy
with the latency improvements on WiFi (see below).

However I've encountered some issues with bind. After powering on the router this morning DNS wasn't working,
and logread showed a lot of errors from bind about a broken trust chain on every domain name.
Unfortunately the syslog buffer was at its default 16k (increased it to 256k now) so the exact sequence of errors was lost.
I restarted bind (/etc/init.d/named restart), and then DNS was working again. Ping times from the router to google were between 30ms and 450ms,
but ping times from my desktop to google (through the same router) were 8ms.
I rebooted the router to try to reproduce the bind issue, but it didn't occur again. Pings from the router to google are also normal again.

Any idea what could've caused this behaviour?

Note that my internet connection is through PPPoE, so when bind starts on boot it might not have IPv4 network connectivity yet.
There's also a tiny delay between IPv4 and IPv6 connectivity, because IPv6 prefix is obtained using dhcp6c after PPPoE has connected.

Another minor issue is that p910nd and luci-app-p910nd were not available via opkg install, but I found them on openwrt.org, so that works now.
DHCPv6-PD had to be configured manually of course, same as with openwrt, the difference is that I only get IPv6 on wired interfaces now,
and not on wireless.
That seems to be by design because the interfaces are not bridged anymore and I get only a /64 from my ISP (slan_len 0), so can't really create
more sub-networks from it.

Onto the good news, here are some measurements (ping time / bandwidth from my laptop connected through WiFi to my desktop connected through GbitE):

no fq_codel on laptop, openwrt, wlan0 5Ghz: 0.859/174.859/923.768/198.308 ms; 120 - 140Mbps
w/ fq_codel on laptop, openwrt, wlan0 5Ghz: 1.693/ 26.727/ 54.936/ 11.746 ms; 120 - 140Mbps
no fq_codel on laptop, cerowrt, wlan0 5Ghz: 2.310/ 15.183/140.495/ 30.337 ms; 75 - 85 Mbps
w/ fq_codel on laptop, cerowrt, wlan0 5Ghz: 1.464/ 1.981/ 2.223/ 0.221 ms; 75 - 85 Mbps

The latency improvement is awesome, and I don't really mind the sacrificed bandwidth to accomplish it.
Is the bandwidth drop intended though? When enabling fq_codel just on my laptop I didn't notice any bandwidth drop at all.
AFAICT my router is the only radio on 5Ghz and it is configured the same way as openwrt was (HT40+).

Note: I use WPA2-PSK, and I disabled the other two SSIDs on the 5Ghz.

Best regards,
--Edwin
Dave Taht
2012-08-17 18:05:27 UTC
I'm widening the distribution of this email a little bit in light of
the benchmark results (somewhat too far) below, and some of the other
issues raised.

On Fri, Aug 17, 2012 at 1:52 AM, Török Edwin
Post by Török Edwin
Post by Dave Taht
I'm too tired to write up a full set of release notes, but I've been
testing it all day,
and it looks better than -10 and certainly better than -11, but I won't know
until some more folk sit down and test it, so here it is.
http://huchra.bufferbloat.net/~cero1/3.3/3.3.8-17/
fresh merge with openwrt, fix to a bind CVE, fixes for 6in4 and quagga
routing problems,
and a few tweaks to fq_codel setup that might make voip better.
Go forth and break things!
Hi,
This is the first cerowrt that I tried on my router (was using Openwrt before), and I'm quite happy
with the latency improvements on WiFi (see below).
However I've encountered some issues with bind. After powering on the router this morning DNS wasn't working,
and logread showed a lot of errors from bind about a broken trust chain on every domain name.
Any idea what could've caused this behaviour?
This is http://www.bufferbloat.net/issues/113 (relevant bugs have also
been filed in the dnssec and ntp databases)

a long standing circular problem between getting accurate time via ntp
and dns, so that dnssec can be enabled. dnssec requires time be
accurate within an hour. I have tried multiple ways to fix it, and the
workaround in place doesn't always succeed (and has a bug parsing
ntpdc output, now, too).

It's been my hope that ntp would evolve to do more of the right thing
or that bind would, or that the issues with the dnsval patches would
get resolved, but that involves getting someone to step up to address
them, and I have been too focused on the bufferbloat issue personally
to fight this one of late.

It's a PITA. The workarounds are:

0) in case of failure -

rndc validation disable
/etc/init.d/ntp restart

(this is basically what the workaround attempts to do. The patch to
ntp is supposed to disable dnssec validation but doesn't work under
some scenarios)

1) Disable dnssec entirely in bind

turn off validation in the conf file

2) Use dnsmasq instead of bind (no dnssec there, either) - how is documented here:

https://plus.google.com/101384639386588513837/posts/Cgvfn8m9XuC

3) Other workarounds and patches gladly accepted. (this sort of work
can be done on a conventional x86 box). The simplest thought I have is
to hammer validation off and get initial time via something other than
ntp - some web service. It would be better if ntp did the work
directly.


I note that regardless, if your ISP provided DNS forwarder can be
trusted, it's a good idea to point bind's forwarders.conf to that, so
as to get best DNS performance out of bind. Automating "is my local
ISP's DNS trustable" is something also on the very long outstanding
"todo" list....
Post by Török Edwin
Note that my internet connection is through PPPoE, so when bind starts on boot it might not have IPv4 network connectivity yet.
There's also a tiny delay between IPv4 and IPv6 connectivity, because IPv6 prefix is obtained using dhcp6c after PPPoE has connected.
Hmm. This makes the ongoing issues with getting accurate time on boot
even more severe.

A battery backed up clock, or gps provided time, would be good, too.
Using GPS provided time is one of the solutions under consideration
for the edge to edge measurement project.
Post by Török Edwin
Another minor issue is that p910nd and luci-app-p910nd were not available via opkg install, but I found them on openwrt.org, so that works now.
I don't know what they are but I can enable them in the next build.
Post by Török Edwin
DHCPv6-PD had to be configured manually of course, same as with openwrt, the difference is that I only get IPv6 on wired interfaces now,
and not on wireless.
That seems to be by design because the interfaces are not bridged anymore and I get only a /64 from my ISP (slan_len 0), so can't really create
more sub-networks from it.
Multiple providers seem to think that a single /64 "is all you
need", despite the prevalence of guest and other sorts of secondary
networks on ipv4. This is a HUGE problem on the current native ipv6
deployments.

Note that it's not exactly fair to blame the providers, most of the
home CPE gear they are dealing with can barely handle ipv6 in the
first place, being based on ancient kernels and specifications. That
gear is improving, all too slowly, with things like openwrt/cerowrt in
the lead.... (apple seems to be doing fairly well, too)

Having only a single /64 delegated makes ipv6 unusable IMHO.

I (or rather juliusz) solved the single /64-only problem years ago by
switching to using babel and ahcp, which pushes out ipv6 /128 ips.
This method has the added benefit of making switching between multiple
wired and wireless APs utterly transparent, even for long held TCP
connections.

I run my own networks this way whenever possible, as it's *really
nice* to be able to unplug and not lose 20 ssh connections, and plug
back in, to get bandwidth, and have babel figure out the right way to
go automagically.

However fixing both the APs and the hosts (via adding ahcp and babel)
is kind of fixing a global infrastructure issue that is hard to get
the rest of the world to agree to, and things like network manager
don't think this way, either... But I'm glad to see progress being
made in homenet towards having a flooding prefix distribution protocol
based on something like ahcp, this will cut down on NAT usage in ipv4
and lead to a more flexible network in the future. - and I'm sure more
and more native deployments will delegate /60s or better in the
future.

Using dhcpv6 it is also possible to do allocations of /80s but this
breaks the 95% of all devices that only can do SLAAC.

It is best to get at least a /60 delegation from the ISP.
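The arithmetic behind that, as a small sketch using the 2001:db8:: documentation prefix (not anyone's real delegation): SLAAC needs a /64 per network, so the number of networks you can number is however many /64s fit in what the ISP hands out.

```python
import ipaddress


def count_64s(prefix):
    """Number of /64 networks that fit inside a delegated prefix."""
    net = ipaddress.ip_network(prefix)
    return 2 ** (64 - net.prefixlen) if net.prefixlen <= 64 else 0


for prefix in ("2001:db8::/64", "2001:db8::/60", "2001:db8::/48"):
    print(f"{prefix:<16} -> {count_64s(prefix):>5} x /64")

# The first few /64s carved out of a /60, e.g. one each for wired, wifi and guest:
for net in list(ipaddress.ip_network("2001:db8::/60").subnets(new_prefix=64))[:3]:
    print(net)
```

A single /64 gives exactly one such network, which is why unbridged wired, wireless and guest interfaces cannot each get their own.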

My way of coping with the half-arsed single /64 delegation ipv6 native
deployments I've dealt with thus far has been 6to4 and 6in4, which do
/48s. And kvetching, loudly, in every direction. And trying to make
dhcpv6 work better, as well as ahcp, and many other aspects of ipv6,
such as classification.
Post by Török Edwin
no fq_codel on laptop, openwrt, wlan0 5Ghz: 0.859/174.859/923.768/198.308 ms; 120 - 140Mbps
w/ fq_codel on laptop, openwrt, wlan0 5Ghz: 1.693/ 26.727/ 54.936/ 11.746 ms; 120 - 140Mbps
no fq_codel on laptop, cerowrt, wlan0 5Ghz: 2.310/ 15.183/140.495/ 30.337 ms; 75 - 85 Mbps
w/ fq_codel on laptop, cerowrt, wlan0 5Ghz: 1.464/ 1.981/ 2.223/ 0.221 ms; 75 - 85 Mbps
The latency improvement is awesome, and I don't really mind the sacrificed bandwidth to accomplish it.
A man after my own heart.

Thx! The industry as a whole has been focused on "bandwidth at any
cost, including massive latency", which leads to things like the ~1
second delays you observed on your fq_codel-less test. (and far worse
has been observed in the field) We're focused on improving latency,
because as stuart cheshire says: "once you have latency, you can't get
it back"

We hope that once some other concepts prove out, we can keep the low
latency and add even more bandwidth back.

http://www.bufferbloat.net/projects/cerowrt/wiki/Fq_Codel_on_Wireless

In day-to-day use the lowered latency and jitter in cero currently can
be really "felt" particularly in applications like skype and google
hangouts, and web pages (under load) feel much faster, as DNS lookups
happen really fast...

and (as another example), things like youtube far more rarely "stall out".

It's kind of hard to measure "feel", though. I wish we had better
benchmarks to show what we're accomplishing.
Post by Török Edwin
Is the bandwidth drop intended though? When enabling fq_codel just on my laptop I didn't notice any bandwidth drop at all.
The core non-fq_codel change on cerowrt vs openwrt and/or your laptop
is that the aggregation buffer size at the driver level has been
severely reduced in cerowrt, from its default of 128 buffers, to 3.
This means that the maximum aggregate size has been cut to 3 packets
from ~42, but more importantly, total outstanding buffers not managed
by codel to 3, rather than 128....
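To put rough numbers on that, a back-of-the-envelope sketch of how long a full set of driver buffers takes to drain. It assumes each buffer holds one 1500-byte packet and that the link drains at a steady rate; the two rates are illustrative (80 Mbit/s is about what was measured above, 6 Mbit/s stands for a poor wifi link).

```python
def drain_ms(buffers, rate_mbps, packet_bytes=1500):
    """Time in ms to transmit a full set of buffers at a steady link rate."""
    bits = buffers * packet_bytes * 8
    return bits / (rate_mbps * 1e6) * 1e3


for buffers in (128, 3):
    for rate in (80, 6):
        print(f"{buffers:>3} buffers at {rate:>2} Mbit/s: {drain_ms(buffers, rate):7.2f} ms")
```

None of that delay is visible to codel, since it sits below the qdisc, which is why shrinking it matters even with fq_codel in place.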

The fact that this costs so little bandwidth (40%) in exchange for
reducing latency and jitter by 25x (or 400x compared to no fq_codel at
all) suggests that in the long run, once we come up with fixes to the
mac80211 layer, we will be able to achieve better utilization,
latency, AND jitter overall than the current hw deployed everywhere.
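For anyone checking the arithmetic, here is one reading of where those figures come from. It is a sketch that assumes the ratios are taken on the maximum ping time and that bandwidth is compared at the midpoint of each reported range; the numbers themselves are copied from the table above.

```python
results = {
    "openwrt, no fq_codel": {"max_ms": 923.768, "mbps": (120, 140)},
    "openwrt, fq_codel on laptop": {"max_ms": 54.936, "mbps": (120, 140)},
    "cerowrt, fq_codel on laptop": {"max_ms": 2.223, "mbps": (75, 85)},
}
BEST = "cerowrt, fq_codel on laptop"


def latency_ratio(name):
    """How many times worse the max ping time is than the best cerowrt run."""
    return results[name]["max_ms"] / results[BEST]["max_ms"]


def bandwidth_loss(name):
    """Fraction of midpoint throughput given up by the best cerowrt run."""
    mid = sum(results[name]["mbps"]) / 2
    best_mid = sum(results[BEST]["mbps"]) / 2
    return (mid - best_mid) / mid


for name in ("openwrt, fq_codel on laptop", "openwrt, no fq_codel"):
    print(f"{name}: {latency_ratio(name):.0f}x max latency, "
          f"{bandwidth_loss(name):.0%} more bandwidth")
```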

IF you'd like to have more bandwidth back, you can jiggle the qlen_*
variables in the debloat script up, but remember that tcp's reaction
time is quadratic as to the amount of buffering. I'd be interested in
you repeating your benchmark after doing that? The difference between
3 buffers and 8 is pretty dramatic...

Personally I'd be happy if we could hold wifi jitter below 30ms, and
typical latency below 10ms, in most (home) scenarios. I think that is
eminently doable, and a reasonable compromise between cero's all-out
assault on latency and the marketing need for more bandwidth. fq_codel
all by itself gets close (the fair queuing part helps a lot)

'Course I'd love it if low latency became the subject of all out
marketing wars between home gateway makers, and between ISPs, with
1/100 the technical resources thrown into the problem as has been
expended on raw bandwidth.

Possible themes:

"Hetgear": Frag your friends, faster!
"Binksys": Torrents and TV? no problem.
"Chuckalo": making DNS zippy!

"Boogle fiber: now with 2ms cross town latency!"
"Merizon: Coast to coast in under 60ms!"
"Nomcast: Making webex work better"

But we're not living in that alternate reality (yet), although I think
we're beginning to see some light at the end of the tunnel.

That said, there are infrastructural problems regarding the overuse of
aggregation everywhere, in gpon, in cable modems and CMTSes, and in
other backbone technologies, in addition to the queue management
problem. It's going to be a hard slog to get back to where the
distance between your couch and the internet is consistently less than
between here and the moon.

But worth it, in terms of the global productivity gain, and lowered
annoyance levels, worldwide.

...
Post by Török Edwin
AFAICT my router is the only radio on 5Ghz and it is configured the same way as openwrt was (HT40+).
Note: I use WPA2-PSK, and I disabled the other two SSIDs on the 5Ghz.
Best regards,
--Edwin
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
Török Edwin
2012-08-17 19:05:45 UTC
Post by Dave Taht
I'm widening the distribution of this email a little bit in light of
the benchmark results (somewhat too far) below, and some of the other
issues raised.
[ok, will write a separate reply for the benchmark numbers]
Post by Dave Taht
On Fri, Aug 17, 2012 at 1:52 AM, Török Edwin
Post by Török Edwin
However I've encountered some issues with bind. After powering on the router this morning DNS wasn't working,
and logread showed a lot of errors from bind about a broken trust chain on every domain name.
Any idea what could've caused this behaviour?
This is http://www.bufferbloat.net/issues/113 (relevant bugs have also
been filed in the dnssec and ntp databases)
a long standing circular problem between getting accurate time via ntp
and dns, so that dnssec can be enabled.
I was using unbound on openwrt for dnssec before and I haven't noticed this problem.
However I had some .ro time servers configured, and apparently they use quite a wide range
for their RRSIG, so maybe I was just lucky not to hit a situation where both .ro and .org would fail to validate.
RRSIG NS 5 2 7200 20120819122953 20120720122953....
RRSIG NSEC 8 1 86400 20120824000000 20120816230000 ...

While the .org RRSIG has quite a recent timestamp:
org. 900 IN RRSIG SOA 7 1 900 20120907184119 20120817174119

Added the .ro timeservers to cerowrt now, and will see if the problem occurs again.
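A small sketch of the validity-window check, using the three signatures quoted above. In RRSIG presentation format the expiration timestamp comes before the inception timestamp. The two "stale" clock values are hypothetical examples of what a router without a battery-backed clock might believe at boot.

```python
from datetime import datetime, timezone

# (expiration, inception) as they appear in the RRSIG records above
SIGS = {
    "NS": ("20120819122953", "20120720122953"),
    "NSEC": ("20120824000000", "20120816230000"),
    "org SOA": ("20120907184119", "20120817174119"),
}


def parse(ts):
    """Parse an RRSIG YYYYMMDDHHmmSS timestamp as UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def valid_names(now):
    """Names of the signatures whose validity window contains `now`."""
    return [name for name, (exp, inc) in SIGS.items() if parse(inc) <= now <= parse(exp)]


clocks = {
    "epoch (no saved time)": datetime(1970, 1, 1, tzinfo=timezone.utc),
    "stale (saved Aug 1)": datetime(2012, 8, 1, tzinfo=timezone.utc),
    "correct": datetime(2012, 8, 17, 19, 5, 45, tzinfo=timezone.utc),
}
for label, now in clocks.items():
    print(f"{label:<22} -> valid: {valid_names(now)}")
```

The wide NS window tolerates a clock that is a couple of weeks stale, while the freshly signed org SOA does not, which fits the "maybe I was just lucky" guess.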
Post by Dave Taht
Post by Török Edwin
Another minor issue is that p910nd and luci-app-p910nd were not available via opkg install, but I found them on openwrt.org, so that works now.
I don't know what they are but I can enable them in the next build.
Thanks, they are a simple way to share your USB printer across the network without running CUPS on the router itself.
Post by Dave Taht
Post by Török Edwin
DHCPv6-PD had to be configured manually of course, same as with openwrt, the difference is that I only get IPv6 on wired interfaces now,
and not on wireless.
That seems to be by design because the interfaces are not bridged anymore and I get only a /64 from my ISP (slan_len 0), so can't really create
more sub-networks from it.
As multiple providers seem to think that a single /64 "is all you
need", despite the prevalence of guest and other sorts of secondary
networks on ipv4. This is a HUGE problem on the current native ipv6
deployments.
Note that it's not exactly fair to blame the providers, most of the
home CPE gear they are dealing with can barely handle ipv6 in the
first place, being based on ancient kernels and specifications. That
gear is improving, all too slowly, with things like openwrt/cerowrt in
the lead.... (apple seems to be doing fairly well, too)
Having only a single /64 delegated makes ipv6 unusable IMHO.
Still usable on my main machine, a pain to make it work with VMs though.
(only way I could make it work was to bridge them to host's ethernet).
Post by Dave Taht
I (or rather juliusz) solved the single /64-only problem years ago by
switching to using babel and ahcp, which pushes out ipv6 /128 ips.
This method has the added benefit of making switching between multiple
wired and wireless APs utterly transparent, even for long held TCP
connections.
Thanks, will have to try that out.
Post by Dave Taht
I run my own networks this way whenever possible, as it's *really
nice* to be able to unplug and not lose 20 ssh connections, and plug
back in, to get bandwidth, and have babel figure out the right way to
go automagically.
However fixing both the APs and the hosts (via adding ahcp and babel)
is kind of fixing a global infrastructure issue that is hard to get
the rest of the world to agree to, and things like network manager
don't think this way, either... But I'm glad to see progress being
made in homenet towards having a flooding prefix distribution protocol
based on something like ahcp, this will cut down on NAT usage in ipv4
and lead to a more flexible network in the future. - and I'm sure more
and more native deployments will delegate /60s or better in the
future.
Using dhcpv6 it is also possible to do allocations of /80s but this
breaks the 95% of all devices that only can do SLAAC.
It is best to get at least a /60 delegation from the ISP.
They probably won't listen. For the amount I pay them I'm happy I get IPv6 at all :)
Post by Dave Taht
My way of coping with the half-arsed single /64 delegation ipv6 native
deployments I've dealt with thus far has been 6to4 and 6in4, which do
/48s.
I get slightly higher throughput in IPv6 though (75 Mbps vs 45 Mbps).
Apparently the ISP forgot to shape my IPv6 traffic the same way it does with my IPv4.

Although... if I have QoS enabled in cerowrt and I set download speed limit to 47000 kbit/s
I'm still able to do 62000kbit/s with wget.
If I set the limit to 27000 then wget -4 speed drops to 24000 kbit/s, but -6 speed stays the same.

Doesn't QoS apply to IPv6 as well?
Post by Dave Taht
IF you'd like to have more bandwidth back, you can jiggle the qlen_*
variables in the debloat script up, but remember that tcp's reaction
time is quadratic as to the amount of buffering. I'd be interested in
you repeating your benchmark after doing that? The difference between
3 buffers and 8 is pretty dramatic...
Will do, and report back (to both ML).

Thanks for the detailed reply.

--Edwin
Dave Taht
2012-08-17 19:52:53 UTC
On Fri, Aug 17, 2012 at 12:05 PM, Török Edwin
Post by Török Edwin
Post by Dave Taht
I'm widening the distribution of this email a little bit in light of
the benchmark results (somewhat too far) below, and some of the other
issues raised.
[ok, will write a separate reply for the benchmark numbers]
Post by Dave Taht
On Fri, Aug 17, 2012 at 1:52 AM, Török Edwin
Post by Török Edwin
However I've encountered some issues with bind. After powering on the router this morning DNS wasn't working,
and logread showed a lot of errors from bind about a broken trust chain on every domain name.
Any idea what could've caused this behaviour?
This is http://www.bufferbloat.net/issues/113 (relevant bugs have also
been filed in the dnssec and ntp databases)
a long standing circular problem between getting accurate time via ntp
and dns, so that dnssec can be enabled.
I was using unbound on openwrt for dnssec before and I haven't noticed this problem.
How is that on memory and configurability?
Post by Török Edwin
However I had some .ro time servers configured, and apparently they use quite a wide range
for their RRSIG, so maybe I was just lucky not to hit a situation where both .ro and .org would fail to validate.
RRSIG NS 5 2 7200 20120819122953 20120720122953....
RRSIG NSEC 8 1 86400 20120824000000 20120816230000 ...
org. 900 IN RRSIG SOA 7 1 900 20120907184119 20120817174119
Added the .ro timeservers to cerowrt now, and will see if the problem occurs again.
You were lucky, and it will. openwrt/cerowrt can periodically write
the current time to flash, but not often enough for dnssec on a fresh
boot, and more often would be mildly bad on flash wear.

I wasn't aware however that some timeservers were available that
Post by Török Edwin
Post by Dave Taht
Post by Török Edwin
Another minor issue is that p910nd and luci-app-p910nd were not available via opkg install, but I found them on openwrt.org, so that works now.
I don't know what they are but I can enable them in the next build.
Thanks, they are a simple way to share your USB printer across the network without running CUPS on the router itself.
Useful. OK. Added to next build (as modules)
Post by Török Edwin
Post by Dave Taht
Post by Török Edwin
DHCPv6-PD had to be configured manually of course, same as with openwrt, the difference is that I only get IPv6 on wired interfaces now,
and not on wireless.
That seems to be by design because the interfaces are not bridged anymore and I get only a /64 from my ISP (slan_len 0), so can't really create
more sub-networks from it.
As multiple providers seem to think that a single /64 "is all you
need", despite the prevalence of guest and other sorts of secondary
networks on ipv4. This is a HUGE problem on the current native ipv6
deployments.
Note that it's not exactly fair to blame the providers, most of the
home CPE gear they are dealing with can barely handle ipv6 in the
first place, being based on ancient kernels and specifications. That
gear is improving, all too slowly, with things like openwrt/cerowrt in
the lead.... (apple seems to be doing fairly well, too)
Having only a single /64 delegated makes ipv6 unusable IMHO.
Still usable on my main machine, a pain to make it work with VMs though.
(only way I could make it work was to bridge them to host's ethernet).
one of the many use cases where a single /64 is not enough.
Post by Török Edwin
Post by Dave Taht
I (or rather juliusz) solved the single /64-only problem years ago by
switching to using babel and ahcp, which pushes out ipv6 /128 ips.
This method has the added benefit of making switching between multiple
wired and wireless APs utterly transparent, even for long held TCP
connections.
Thanks, will have to try that out.
http://www.bufferbloat.net/projects/cerowrt/wiki/Mesh has some info on
that. Although it makes it look overly complex. On linux with network
manager off, it's 3-5 lines of script...

There is a #babel channel on irc and #bufferbloat to help get that going.
Post by Török Edwin
Post by Dave Taht
I run my own networks this way whenever possible, as it's *really
nice* to be able to unplug and not lose 20 ssh connections, and plug
back in, to get bandwidth, and have babel figure out the right way to
go automagically.
However fixing both the APs and the hosts (via adding ahcp and babel)
is kind of fixing a global infrastructure issue that is hard to get
the rest of the world to agree to, and things like network manager
don't think this way, either... But I'm glad to see progress being
made in homenet towards having a flooding prefix distribution protocol
based on something like ahcp, this will cut down on NAT usage in ipv4
and lead to a more flexible network in the future. - and I'm sure more
and more native deployments will delegate /60s or better in the
future.
Using dhcpv6 it is also possible to do allocations of /80s but this
breaks the 95% of all devices that only can do SLAAC.
It is best to get at least a /60 delegation from the ISP.
They probably won't listen. For the amount I pay them I'm happy I get IPv6 at all :)
Enough noise and it'll happen. There is enough demand from small
business and so on for it to happen. Eventually.
Post by Török Edwin
Post by Dave Taht
My way of coping with the half-arsed single /64 delegation ipv6 native
deployments I've dealt with thus far has been 6to4 and 6in4, which do
/48s.
I get slightly higher throughput in IPv6 though (75 Mbps vs 45 Mbps).
Apparently the ISP forgot to shape my IPv6 traffic the same way it does with my IPv4.
Although... if I have QoS enabled in cerowrt and I set download speed limit to 47000 kbit/s
I'm still able to do 62000kbit/s with wget.
If I set the limit to 27000 then wget -4 speed drops to 24000 kbit/s, but -6 speed stays the same.
Doesn't QoS apply to IPv6 as well?
Openwrt's QoS system has a dependency on conntrack, so it does NOT
handle native ipv6, and does not shape ipv6 at all. This is a bad
thing,
and I've not found a way to fix it given the current structure and
arcane-ness (awk??) of the existing openwrt qos script. I'm not happy
with how it does prioritization of smaller packets, either....

The need to treat ipv6 traffic as well as ipv4 traffic in a
shaper/limiter is in part why the simple_qos.sh script exists. It does
ipv6 and diffserv...

I have been trying to come up with a ipv6 and ipv4 enabled alternative
to std openwrt qos for a while, but that one is buggy (on restart),
and not gui enabled, and has issues at high and low bandwidths in htb
setup as to quantum size. Also evolving something towards the
available complexity/utility of the openwrt one is hard.

So I've been building ever simpler models here in my own quest for the
"one true bandwidth limiter/shaper", but working out bugs in the
underlying fq_codel system(s) has been taking priority. I THINK a 2
tier model will work almost as well as a 3 or 4 tier one, and it helps
on queue length, or perhaps using qfq-esque weighting rather than
htb tiers will be better... and then there's wifi to fix which has 4
tiers... and the issues are too long to go into here....

PLEASE feel free to hack on either openwrt qos or simple qos or any of
the many alternatives available (dan siemon had a good start at one),
gargoyle is doing interesting stuff, etc.

Most of the time these days, I think the ultimate goal will be to do a
tbf version of fq_codel. htb and hfsc add interesting sorts of
overhead, and fq_codel handles multiple queues well already, so...
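
For readers unfamiliar with the term, a token bucket filter can be sketched in a few lines. This is only a generic illustration of the token-bucket idea in Python, not the Linux tbf qdisc and not anything from simple_qos.sh; the class and parameter names are made up for the example.

```python
class TokenBucket:
    """Generic token bucket: tokens are bytes, refilled at a fixed rate."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s   # long-term rate limit
        self.burst = burst_bytes       # bucket depth (largest allowed burst)
        self.tokens = burst_bytes      # start with a full bucket
        self.last = 0.0                # time of the last refill, in seconds

    def allow(self, packet_len, now):
        """Return True if a packet of packet_len bytes may be sent at time now."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, but never beyond the bucket depth.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if packet_len <= self.tokens:
            self.tokens -= packet_len
            return True
        return False


tb = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
print(tb.allow(1500, now=0.0))  # full bucket, packet passes
print(tb.allow(1500, now=0.5))  # only 500 bytes refilled so far, packet waits
```

Nothing here looks at addresses or conntrack state, which may be part of the appeal: a single rate limiter of this kind would treat ipv4 and ipv6 packets alike and leave per-flow fairness to fq_codel.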
Post by Török Edwin
Post by Dave Taht
IF you'd like to have more bandwidth back, you can jiggle the qlen_*
variables in the debloat script up, but remember that tcp's reaction
time is quadratic as to the amount of buffering. I'd be interested in
you repeating your benchmark after doing that? The difference between
3 buffers and 8 is pretty dramatic...
Will do, and report back (to both ML).
Yes, that last message was rather overlong and unfocused for the bloat
list. Thx.
Post by Török Edwin
Thanks for the detailed reply.
--Edwin
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
Török Edwin
2012-08-17 20:13:13 UTC
Permalink
Post by Dave Taht
On Fri, Aug 17, 2012 at 12:05 PM, Török Edwin
Post by Török Edwin
I was using unbound on openwrt for dnssec before and I haven't noticed this problem.
How is that on memory and configurability?
It was quite easy to configure, and I didn't need to touch it since the initial setup.
I think I just followed the instructions for Debian:
http://wiki.debian.org/DNSSEC#Unbound

I've attached my unbound.conf here if you want to see what it knows. According to the config file
it should use a 4M cache by default.
I didn't measure memory usage, or do any other benchmark to compare it against bind.
Post by Dave Taht
Post by Török Edwin
However I had some .ro time servers configured, and apparently they use quite a wide range
for their RRSIG, so maybe I was just lucky not to hit a situation where both .ro and .org would fail to validate.
RRSIG NS 5 2 7200 20120819122953 20120720122953....
RRSIG NSEC 8 1 86400 20120824000000 20120816230000 ...
org. 900 IN RRSIG SOA 7 1 900 20120907184119 20120817174119
Added the .ro timeservers to cerowrt now, and will see if the problem occurs again.
You were lucky, and it will. openwrt/cerowrt can periodically write
the current time to flash, but not often enough for dnssec on a fresh
boot, and more often would be mildly bad on flash wear.
I wasn't aware however that some timeservers were available that
[this sentence seems to have been cut off]
Post by Dave Taht
Post by Török Edwin
Post by Török Edwin
Another minor issue is that p910nd and luci-app-p910nd were not available via opkg install, but I found them on openwrt.org, so that works now.
Best regards,
--Edwin
Michael Richardson
2012-08-18 20:16:54 UTC
Permalink
Post by Török Edwin
I was using unbound on openwrt for dnssec before and I haven't noticed this problem.
Dave> How is that on memory and configurability?
Post by Török Edwin
However I had some .ro time servers configured, and apparently
they use quite a wide range for their RRSIG, so maybe I was just
lucky not to hit a situation where both .ro and .org would fail
to validate. RRSIG NS 5 2 7200 20120819122953 20120720122953....
RRSIG NSEC 8 1 86400 20120824000000 20120816230000 ...
While the .org RRSIG has quite a recent timestamp: org. 900 IN
RRSIG SOA 7 1 900 20120907184119 20120817174119
Added the .ro timeservers to cerowrt now, and will see if the
problem occurs again.
Dave> You were lucky, and it will. openwrt/cerowrt can periodically
Dave> write the current time to flash, but not often enough for
Dave> dnssec on a fresh boot, and more often would be mildly bad on
Dave> flash wear.

My opinion is that we should
a) either turn off DNSSEC validation until we find a time server
on first boot.
b) ignore signatures that do not validate because they are too "new"

If we are writing the file system such that time can really never go
backwards, then we are pretty much immune to the most egregious replay
attacks.
(b) would require a new option to BIND/unbound.
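
To make the failure mode concrete, here is a rough Python sketch of the validity-window check, using the .org SOA signature times quoted above (in an RRSIG record the expiration time comes before the inception time). The function names and the `tolerate_too_new` switch are hypothetical; they only illustrate option (b) and do not correspond to any existing BIND or unbound setting.

```python
from datetime import datetime, timezone


def parse_rrsig_time(stamp):
    """Parse an RRSIG YYYYMMDDHHMMSS timestamp as UTC."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def check_rrsig_window(now, inception, expiration):
    """Classify the local clock against a signature's validity window."""
    if now < parse_rrsig_time(inception):
        return "not-yet-valid"   # clock is behind: signature looks too "new"
    if now > parse_rrsig_time(expiration):
        return "expired"         # clock is ahead: signature looks too old
    return "valid"


def accept_signature(now, inception, expiration, tolerate_too_new=False):
    """Hypothetical policy switch sketching option (b)."""
    state = check_rrsig_window(now, inception, expiration)
    if state == "valid":
        return True
    return tolerate_too_new and state == "not-yet-valid"


# org. SOA signature from the thread: expiration 20120907184119,
# inception 20120817174119.
fresh_boot = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed: no saved time
print(check_rrsig_window(fresh_boot, "20120817174119", "20120907184119"))
```

With a clock that has fallen back to the epoch, every current signature is "not-yet-valid", which is why nothing validates until NTP has run. A clock set into the future fails the other way, as "expired".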
d***@lang.hm
2012-08-20 20:16:42 UTC
Permalink
Post by Michael Richardson
If we are writing the file system such that time can really never go
backwards, then we are pretty much immune to most replay attacks
If time cannot go backwards, what do you do if someone accidentally sets the
time in the future?

David Lang
George Lambert
2012-08-20 20:41:48 UTC
Permalink
Check and set the time by syncing to NTP Servers - not user supplied times
if the network
is available. to see if they have set times > those set by NTP Server

http://tf.nist.gov/tf-cgi/servers.cgi

The global address *time.nist.gov* is resolved to all of the server
addresses below in a round-robin sequence to equalize the load across all
of the servers.


Network Time Protocol (RFC-1305)

The Network Time Protocol (NTP) is the most commonly used Internet time
protocol, and the one that provides the best performance. Large computers
and workstations often include NTP software with their operating systems.
The client software runs continuously as a background task that
periodically gets updates from one or more servers. The client software
ignores responses from servers that appear to be sending the wrong time,
and averages the results from those that appear to be correct.

Many of the available NTP software clients for personal computers don’t do
any averaging at all. Instead, they make a single timing request to a
single server (just like a Daytime or Time client) and then use this
information to set their computer’s clock. The proper name for this type of
client is SNTP (Simple Network Time Protocol).

The NIST servers listen for a NTP request on port 123, and respond by
sending a udp/ip data packet in the NTP format. The data packet includes a
64-bit timestamp containing the time in UTC seconds since January 1, 1900
with a resolution of 200 ps.
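
As a small illustration of that timestamp layout, the 64 bits split into 32 bits of whole seconds since 1900 and 32 bits of binary fraction, so converting to Unix time is a subtraction and a division. The sketch below is Python with made-up names; one fraction unit is 2^-32 s, roughly 233 ps, which the text rounds to 200 ps.

```python
# Seconds between 1900-01-01 and 1970-01-01: 70 years including 17 leap days.
NTP_UNIX_OFFSET = (70 * 365 + 17) * 86400  # 2208988800


def ntp_to_unix(seconds, fraction):
    """Convert the two 32-bit halves of an NTP timestamp to Unix seconds."""
    return (seconds - NTP_UNIX_OFFSET) + fraction / 2**32


# Half a second past the first second of 1970:
print(ntp_to_unix(NTP_UNIX_OFFSET + 1, 2**31))
# Size of one fraction unit in picoseconds:
print(1e12 / 2**32)
```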

Most of the NIST time servers do not require any authentication when
requesting the time in NTP format, and no keys or passwords are needed to
use this service. In addition to this standard NTP service (which will not
be modified), we have begun testing an authenticated version of NTP using a
single time server that implements the symmetric key encryption method
defined in the NTP documentation. In order to use this server, you must
apply to NIST for an encryption key, which will be linked to the network
address of your system. This service is being offered on an experimental
basis only, and it may not be continued after the initial testing period.
For more details, please see the authenticated ntp description
<http://www.nist.gov/pml/div688/grp40/auth-ntp.cfm>.

Daytime Protocol (RFC-867)

This protocol is widely used by small computers running MS-DOS and similar
operating systems. The server listens on port 13, and responds to requests
in either tcp/ip or udp/ip formats. The standard does not specify an exact
format for the Daytime Protocol, but requires that the time is sent using
standard ASCII characters. NIST chose a time code format similar to the one
used by its dial-up Automated Computer Time Service (ACTS)
<http://www.nist.gov/pml/div688/grp40/acts.cfm>, as shown below:

*JJJJJ YR-MO-DA HH:MM:SS TT L H msADV UTC(NIST) OTM*

where:

- JJJJJ is the Modified Julian Date (MJD). The MJD has a starting point
of midnight on November 17, 1858. You can obtain the MJD by subtracting
exactly 2 400 000.5 days from the Julian Date, which is an integer day
number obtained by counting days from the starting point of noon on 1
January 4713 B.C. (Julian Day zero).

- YR-MO-DA is the date. It shows the last two digits of the year, the
month, and the current day of month.

- HH:MM:SS is the time in hours, minutes, and seconds. The time is
always sent as Coordinated Universal Time (UTC). An offset needs to be
applied to UTC to obtain local time. For example, Mountain Time in the
U.S. is 7 hours behind UTC during Standard Time, and 6 hours behind UTC
during Daylight Saving Time.

- TT is a two digit code (00 to 99) that indicates whether the United
States is on Standard Time (ST) or Daylight Saving Time (DST). It also
indicates when ST or DST is approaching. This code is set to 00 when ST is
in effect, or to 50 when DST is in effect. During the month in which the
time change actually occurs, this number will decrement every day until the
change occurs. For example, during the month of November, the U.S. changes
from DST to ST. On November 1, the number will change from 50 to the actual
number of days until the time change. It will decrement by 1 every day
until the change occurs at 2 a.m. local time when the value is 1. Likewise,
the spring change is at 2 a.m. local time when the value reaches 51.

- L is a one-digit code that indicates whether a leap second will be
added or subtracted at midnight on the last day of the current month. If
the code is 0, no leap second will occur this month. If the code is 1, a
positive leap second will be added at the end of the month. This means that
the last minute of the month will contain 61 seconds instead of 60. If the
code is 2, a second will be deleted on the last day of the month. Leap
seconds occur at a rate of about one per year. They are used to correct for
irregularity in the earth's rotation. The correction is made just before
midnight UTC (not local time).

- H is a health digit that indicates the health of the server. If H = 0,
the server is healthy. If H = 1, then the server is operating properly but
its time may be in error by up to 5 seconds. This state should change to
fully healthy within 10 minutes. If H = 2, then the server is operating
properly but its time is known to be wrong by more than 5 seconds. If H =
3, then a hardware or software failure has occurred and the amount of the
time error is unknown. If H = 4 the system is operating in a special
maintenance mode and both its accuracy and its response time may be
degraded. This value is not used for production servers except in special
circumstances. The transmitted time will still be correct to within ±1
second in this mode.

- msADV displays the number of milliseconds that NIST advances the time
code to partially compensate for network delays. The advance is currently
set to 50.0 milliseconds.

- The label UTC(NIST) is contained in every time code. It indicates that
you are receiving Coordinated Universal Time (UTC) from the National
Institute of Standards and Technology (NIST).

- OTM (on-time marker) is an asterisk (*). The time values sent by the
time code refer to the arrival time of the OTM. In other words, if the time
code says it is 12:45:45, this means it is 12:45:45 when the OTM arrives.
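
The field list above maps fairly directly onto a parser. The Python below is a sketch only: the sample line is constructed for illustration (it is not a captured server response), the function and key names are invented, and a leap-second line with 60 in the seconds field is not handled.

```python
from datetime import date, datetime, timedelta, timezone

MJD_EPOCH = date(1858, 11, 17)  # MJD 0, per the description above


def parse_daytime(line):
    """Split a NIST-style daytime line into its documented fields."""
    mjd, ymd, hms, tt, leap, health, ms_adv, label, otm = line.split()
    day = MJD_EPOCH + timedelta(days=int(mjd))
    yy, mo, da = (int(x) for x in ymd.split("-"))
    # The MJD and the two-digit date must describe the same day.
    if (day.year % 100, day.month, day.day) != (yy, mo, da):
        raise ValueError("MJD and YR-MO-DA disagree")
    hh, mm, ss = (int(x) for x in hms.split(":"))
    return {
        "utc": datetime(day.year, day.month, day.day, hh, mm, ss,
                        tzinfo=timezone.utc),
        "dst_code": int(tt),        # 00 = standard time, 50 = DST in effect
        "leap": int(leap),          # 0 none, 1 insert, 2 delete
        "health": int(health),      # 0 = healthy
        "advance_ms": float(ms_adv),
    }


# Hypothetical example line, built by hand from the format description.
sample = "56159 12-08-20 20:41:48 50 0 0 50.0 UTC(NIST) *"
print(parse_daytime(sample)["utc"])
```

The MJD carries the full date, so it is used here to recover the century that the two-digit year omits.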
Post by Michael Richardson
If we are writing the file system such that time can really never go
Post by Michael Richardson
backwards, then we are pretty much immune to most replay attacks
If time cannot go backwards, what do you do if someone accidentally sets the
time in the future?
David Lang
_______________________________________________
Cerowrt-devel mailing list
https://lists.bufferbloat.net/listinfo/cerowrt-devel
--
P THINK BEFORE PRINTING: is it really necessary?

This e-mail and its attachments are confidential and solely for the
intended addressee(s). Do not share or use them without approval. If
received in error, contact the sender
and delete them.
d***@lang.hm
2012-08-20 20:48:34 UTC
Permalink
Post by George Lambert
Check and set the time by syncing to NTP Servers - not user supplied times
if the network
is available. to see if they have set times > those set by NTP Server
In theory you are right, in practice you are not.

it's not uncommon for systems to point at a local set of timeservers (GPS
based for example), sometimes things go wrong with those servers, and so
people configure a local fallback (because they need the clocks on the
systems to remain consistent for things like kerberos to keep working).
This leads to a failure mode where if something goes wrong on that system,
the time can get set via NTP to some time in the future.

There needs to be a way to recover from such conditions.

The recent problems that people had with leap seconds is an indication
that even if you do use Internet NTP servers, sometimes things go wrong.

David Lang
George Lambert
2012-08-20 21:27:07 UTC
Permalink
Good point - that is just how I do it on my server network, sorry.

reminds me of a good point

Q: what is the difference between in theory and in practice?
A: In theory, NOTHING.

;-) so we test and iterate until we get it right in practice.


G.
Post by George Lambert
Check and set the time by syncing to NTP Servers - not user supplied times
Post by George Lambert
if the network
is available. to see if they have set times > those set by NTP Server
In theory you are right, in practice you are not.
it's not uncommon for systems to point at a local set of timeservers (GPS
based for example), sometimes things go wrong with those servers, and so
people configure a local fallback (because they need the clocks on the
systems to remain consistent for things like kerberos to keep working).
This leads to a failure mode where if something goes wrong on that system,
the time can get set via NTP to some time in the future.
There needs to be a way to recover from such conditions.
The recent problems that people had with leap seconds is an indication
that even if you do use Internet NTP servers, sometimes things go wrong.
David Lang
Michael Richardson
2012-08-20 23:19:27 UTC
Permalink
George> Check and set the time by syncing to NTP Servers - not user supplied times
George> if the network
George> is available. to see if they have set times > those set by NTP Server

George> http://tf.nist.gov/tf-cgi/servers.cgi

George> The global address *time.nist.gov* is resolved to all of the server
George> addresses below in a round-robin sequence to equalize the load across all
George> of the servers.

Good idea, but you need DNS to find that server, and you need
time to do DNSSEC.

If the time is set years into the future, then DNSSEC may also fail, as
the signatures would be too old. Accepting that might be a problem.

If the time can be set like this by an operator, then there is a
problem, and an operator will have to deal with it. It's best to stick
to what we can do automatically.

--
Michael Richardson
-at the cottage-
Maciej Soltysiak
2012-08-21 22:03:42 UTC
Permalink
Post by Michael Richardson
Good idea, but you need DNS to find that server, and you need
time to do DNSSEC.
How about this:
1) do a 1 time `host time.nist.gov 8.8.8.8` and feed that to NTP config file
2) make NTP get time from the IP of time.nist.gov resolved from step 1
3) start bind with dnssec

p.s. you could change 8.8.8.8 to DNS server you got from ISP DHCP if
you have it, if not, fall back this one time to 8.8.8.8
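
The ordering above can be written down as a tiny planner. This Python sketch only decides which resolver to ask and lists the steps in order; it runs no commands, and the function names and step labels are invented for the example. The `host` invocation is the one given in step 1.

```python
FALLBACK_RESOLVER = "8.8.8.8"


def pick_resolver(dhcp_dns):
    """Prefer a DNS server learned via DHCP, else fall back to 8.8.8.8."""
    return dhcp_dns[0] if dhcp_dns else FALLBACK_RESOLVER


def bootstrap_steps(dhcp_dns, ntp_name="time.nist.gov"):
    """Return the boot-time steps in the order they must happen."""
    resolver = pick_resolver(dhcp_dns)
    return [
        ("resolve", f"host {ntp_name} {resolver}"),
        ("sync-time", "point the NTP client at the address from step 1"),
        ("start-dns", "start bind with dnssec"),
    ]


for name, action in bootstrap_steps(dhcp_dns=[]):
    print(name, "->", action)
```

The point of the ordering is that the one-time lookup bypasses the local validating resolver, so it does not need a correct clock.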

Regards,
Maciej
George Lambert
2012-08-21 22:31:10 UTC
Permalink
8.8.8.8 is anycast - so that seems rational to me too.

George.
Post by Maciej Soltysiak
Post by Michael Richardson
Good idea, but you need DNS to find that server, and you need
time to do DNSSEC.
1) do a 1 time `host time.nist.gov 8.8.8.8` and feed that to NTP config file
2) make NTP get time from the IP of time.nist.gov resolved from step 1
3) start bind with dnssec
p.s. you could change 8.8.8.8 to DNS server you got from ISP DHCP if
you have it, if not, fall back this one time to 8.8.8.8
Regards,
Maciej
Michael Richardson
2012-08-22 01:21:29 UTC
Permalink
Post by Michael Richardson
Good idea, but you need DNS to find that server, and you need
time to do DNSSEC.
Maciej> How about this:
Maciej> 1) do a 1 time `host time.nist.gov 8.8.8.8` and feed that to NTP config file
Maciej> 2) make NTP get time from the IP of time.nist.gov resolved from step 1
Maciej> 3) start bind with dnssec

Sure, you could do this.

There is no significant security advantage of doing this, vs starting
bind with DNSSEC time validation disabled. A malicious attacker who
wants to attack you also controls the answer that 8.8.8.8 returns, and
also controls the NTP answer on port 123. The bad guy owns your uplink.
It's as easy as plugging a *WRT box in front of yours, or any place
upstream. (if you are paranoid, you are paranoid)

Or turn off DNSSEC validation until you have some notion of time.
That way, you wouldn't claim to have done validation.
--
Michael Richardson
-at the cottage-
Török Edwin
2012-08-18 09:38:01 UTC
Permalink
Post by Dave Taht
I'm widening the distribution of this email a little bit in light of
the benchmark results (somewhat too far) below, and some of the other
issues raised.
On Fri, Aug 17, 2012 at 1:52 AM, Török Edwin
Post by Török Edwin
no fq_codel on laptop, openwrt, wlan0 5Ghz: 0.859/174.859/923.768/198.308 ms; 120 - 140Mbps
w/ fq_codel on laptop, openwrt, wlan0 5Ghz: 1.693/ 26.727/ 54.936/ 11.746 ms; 120 - 140Mbps
no fq_codel on laptop, cerowrt, wlan0 5Ghz: 2.310/ 15.183/140.495/ 30.337 ms; 75 - 85 Mbps
w/ fq_codel on laptop, cerowrt, wlan0 5Ghz: 1.464/ 1.981/ 2.223/ 0.221 ms; 75 - 85 Mbps
The latency improvement is awesome, and I don't really mind the sacrificed bandwidth to accomplish it.
A man after my own heart.
Thx! The industry as a whole has been focused on "bandwidth at any
cost, including massive latency", which leads to things like the ~1
second delays you observed on your fq_codel-less test. (and far worse
has been observed in the field) We're focused on improving latency,
because as stuart cheshire says: "once you have latency, you can't get
it back"
Yeah, latency is the most annoying thing on wireless (even more so on 3G),
if I really want more LAN bandwidth I can just plug in the Ethernet cable :)
Although I measured nttcp -r too now (see below) and the bandwidth drop is quite significant there,
can only do 39 Mbps.
Post by Dave Taht
We hope that once some other concepts prove out, we can keep the low
latency and add even more bandwidth back.
http://www.bufferbloat.net/projects/cerowrt/wiki/Fq_Codel_on_Wireless
In day-to-day use the lowered latency and jitter in cero currently can
be really "felt" particularly in applications like skype and google
hangouts, and web pages (under load) feel much faster, as DNS lookups
happen really fast...
and (as another example), things like youtube far more rarely "stall out".
It's kind of hard to measure "feel", though. I wish we had better
benchmarks to show what we're accomplishing.
Post by Török Edwin
Is the bandwidth drop intended though? When enabling fq_codel just on my laptop I didn't notice any bandwidth drop at all.
The core non-fq_codel change on cerowrt vs openwrt and/or your laptop
is that the aggregation buffer size at the driver level has been
severely reduced in cerowrt, from its default of 128 buffers, to 3.
This means that the maximum aggregate size has been cut to 3 packets
from ~42, but more importantly, total outstanding buffers not managed
by codel to 3, rather than 128....
The fact that this costs so little bandwidth (40%) in exchange for
reducing latency and jitter by 25x (or 400x compared to no fq_codel at
all) suggests that in the long run, once we come up with fixes to the
mac80211 layer, we will be able to achieve better utilization,
latency, AND jitter overall than the current hw deployed everywhere.
IF you'd like to have more bandwidth back, you can jiggle the qlen_*
variables in the debloat script up, but remember that tcp's reaction
time is quadratic as to the amount of buffering. I'd be interested in
you repeating your benchmark after doing that? The difference between
3 buffers and 8 is pretty dramatic...
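
The 25x, 400x and 40% figures quoted above appear to come from the maximum ping times and throughput ranges in the four-row 5 GHz table earlier in this message. A quick Python check, assuming that reading and taking the midpoints of the quoted throughput ranges:

```python
def parse_ping(summary):
    """Split a ping min/avg/max/mdev summary (ms) into floats."""
    return tuple(float(x.strip()) for x in summary.split("/"))


MAX = 2  # index of the max field in min/avg/max/mdev

openwrt_plain = parse_ping("0.859/174.859/923.768/198.308")  # no fq_codel
openwrt_fq = parse_ping("1.693/ 26.727/ 54.936/ 11.746")     # fq_codel on laptop
cerowrt_fq = parse_ping("1.464/  1.981/  2.223/  0.221")     # fq_codel + cerowrt

ratio_vs_fq = openwrt_fq[MAX] / cerowrt_fq[MAX]
ratio_vs_plain = openwrt_plain[MAX] / cerowrt_fq[MAX]

# Midpoints of "120 - 140Mbps" (openwrt) and "75 - 85 Mbps" (cerowrt).
bandwidth_drop = 1 - 80 / 130

print(f"{ratio_vs_fq:.1f}x, {ratio_vs_plain:.1f}x, {bandwidth_drop:.0%} less bandwidth")
```

On these numbers the worst-case latency improves by about 25x against openwrt with fq_codel on the laptop and by a bit over 400x against no fq_codel at all, for a bandwidth cost just under 40%.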
Here are some measurements (50 pings, nttcp -D -n200000)

Setup:
laptop with Linux 3.5.0, iwl4965 AGN, 5.22 GHz (channel 44), only radio on the 5 GHz spectrum.
desktop with Linux 3.5.1 running nttcp -i, connected through GbE to cerowrt router

Baseline (only ping, no other traffic): 0.806/ 1.323/ 8.753/ 1.333 ms
no fq_codel on laptop, cerowrt defaults, nttcp -t: 1.192/16.605/107.351/25.265 ms; 94 Mbps
no fq_codel on laptop, cerowrt qlen_*=4, nttcp -t: 1.285/25.108/105.519/22.607 ms; 107 Mbps
no fq_codel on laptop, cerowrt qlen_*=5, nttcp -t: 1.118/15.538/ 68.262/15.773 ms; 104 Mbps
no fq_codel on laptop, cerowrt qlen_*=6, nttcp -t: 1.353/15.592/116.026/18.782 ms; 113 Mbps
no fq_codel on laptop, cerowrt qlen_*=7, nttcp -t: 1.344/18.719/112.757/23.691 ms; 113 Mbps
no fq_codel on laptop, cerowrt qlen_*=8, nttcp -t: 1.717/20.842/ 59.412/16.120 ms; 127 Mbps
no fq_codel on laptop, cerowrt qlen_*=9, nttcp -t: 1.663/21.406/141.338/26.270 ms; 123 Mbps
no fq_codel on laptop, cerowrt qlen_*=10,nttcp -t: 1.818/21.361/117.023/20.727 ms; 120 Mbps
no fq_codel on laptop, cerowrt qlen_*=12,nttcp -t: 2.195/24.277/131.490/21.161 ms; 127 Mbps

no fq_codel on laptop, cerowrt defaults, nttcp -r: 1.332/ 3.129/ 41.900/ 5.221 ms; 39 Mbps
no fq_codel on laptop, cerowrt qlen_*=4, nttcp -r: 1.514/ 3.205/ 8.595/ 1.817 ms; 46 Mbps
no fq_codel on laptop, cerowrt qlen_*=5, nttcp -r: 1.342/ 7.250/114.095/17.591 ms; 52 Mbps
no fq_codel on laptop, cerowrt qlen_*=6, nttcp -r: 1.299/ 3.200/ 12.080/ 1.972 ms; 58 Mbps
no fq_codel on laptop, cerowrt qlen_*=7, nttcp -r: 1.617/ 5.220/ 76.478/10.971 ms; 63 Mbps
no fq_codel on laptop, cerowrt qlen_*=8, nttcp -r: 1.810/ 4.267/ 24.560/ 4.065 ms; 67 Mbps
no fq_codel on laptop, cerowrt qlen_*=9, nttcp -r: 2.015/ 5.090/ 78.256/10.710 ms; 71 Mbps
no fq_codel on laptop, cerowrt qlen_*=10,nttcp -r: 1.931/ 4.107/ 15.141/ 3.049 ms; 75 Mbps
no fq_codel on laptop, cerowrt qlen_*=11,nttcp -r: 1.982/ 3.849/ 12.163/ 2.407 ms; 80 Mbps
no fq_codel on laptop, cerowrt qlen_*=12,nttcp -r: 2.025/ 5.173/ 16.890/ 3.763 ms; 81 Mbps
no fq_codel on laptop, cerowrt qlen_*=20,nttcp -r: 1.525/ 3.492/ 13.737/ 2.465 ms; 82 Mbps
no fq_codel on laptop, cerowrt qlen_*=30,nttcp -r: 1.907/11.243/142.129/24.017 ms; 104 Mbps
no fq_codel on laptop, cerowrt qlen_*=35,nttcp -r: 2.893/ 7.895/130.859/17.621 ms; 119 Mbps
no fq_codel on laptop, cerowrt qlen_*=40,nttcp -r: 2.917/ 7.766/ 98.252/13.105 ms; 120 Mbps
no fq_codel on laptop, cerowrt qlen_*=50,nttcp -r: 0.951/ 7.810/ 47.646/ 6.428 ms; 131 Mbps
no fq_codel on laptop, cerowrt qlen_*=100,nttcp -r:5.149/ 8.766/ 14.371/ 2.191 ms; 128 Mbps

To get twice the speed a qlen=11 is enough already, and to get all the speed back a qlen=35 is needed.
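
The two thresholds in that sentence can be read straight off the nttcp -r column. The Python sketch below does so under two assumptions that are not in the measurements themselves: the "cerowrt defaults" row is treated as qlen 3, and "all the speed back" is taken to mean within 10% of the best rate observed.

```python
# qlen_* value -> nttcp -r throughput in Mbps, copied from the table above.
RX_MBPS = {
    3: 39, 4: 46, 5: 52, 6: 58, 7: 63, 8: 67, 9: 71, 10: 75,
    11: 80, 12: 81, 20: 82, 30: 104, 35: 119, 40: 120, 50: 131, 100: 128,
}


def first_qlen_reaching(table, target_mbps):
    """Smallest tested qlen whose throughput is at least target_mbps."""
    for qlen in sorted(table):
        if table[qlen] >= target_mbps:
            return qlen
    return None


double_default = first_qlen_reaching(RX_MBPS, 2 * RX_MBPS[3])
near_best = first_qlen_reaching(RX_MBPS, 0.9 * max(RX_MBPS.values()))
print(double_default, near_best)
```

With those assumptions the result agrees with the figures stated above, 11 and 35.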

And here are the results with fq_codel on the laptop too (just nttcp -t as that's the one affected):

fq_codel on laptop, cerowrt defaults, nttcp -t: 1.248/12.960/108.490/16.733 ms; 90 Mbps
fq_codel on laptop, cerowrt qlen_*=4, nttcp -t: 1.205/10.843/ 76.983/12.460 ms; 105 Mbps
fq_codel on laptop, cerowrt qlen_*=8, nttcp -t: 4.034/16.088/ 98.611/17.050 ms; 120 Mbps
fq_codel on laptop, cerowrt qlen_*=11, nttcp -t: 3.766/15.687/ 56.684/11.135 ms; 114 Mbps
fq_codel on laptop, cerowrt qlen_*=35, nttcp -t: 11.360/26.742/ 48.051/ 7.489 ms; 113 Mbps

Shouldn't wireless N be able to do 200 - 300 Mbps though? If I enable debugging in iwl4965 I see that it
starts TX aggregation, so not sure what's wrong (router or laptop?). With encryption off I can get at most 160 Mbps.

iw dev sw10 station dump shows:
...
signal: -56 [-60, -59] dBm
signal avg: -125 [-65, -58] dBm
tx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
rx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI

On laptop:
tx bitrate: 300.0 Mbit/s MCS 15 40Mhz short GI
Post by Dave Taht
Personally I'd be happy if we could hold wifi jitter below 30ms, and
typical latency below 10ms, in most (home) scenarios. I think that is
eminently doable, and a reasonable compromise between cero's all-out
assault on latency and the marketing need for more bandwidth. fq_codel
all by itself gets close (the fair queuing part helps a lot)
'Course I'd love it if low latency became the subject of all out
marketing wars between home gateway makers, and between ISPs, with
1/100 the technical resources thrown into the problem as has been
expended on raw bandwidth.
"Hetgear": Frag your friends, faster!
Yeah gamers would probably care about latency, game server admins too.
Post by Dave Taht
"Binksys": Torrents and TV? no problem.
"Chuckalo": making DNS zippy!
"Boogle fiber: now with 2ms cross town latency!"
"Merizon: Coast to coast in under 60ms!"
"Nomcast: Making webex work better"
But we're not living in that alternate reality (yet), although I think
we're beginning to see some light at the end of the tunnel.
That said, there are infrastructural problems regarding the overuse of
aggregation everywhere, in gpon, in cable modems and CMTSes, and in
other backbone technologies, in addition to the queue management
problem. It's going to be a hard slog to get back to where the
distance between your couch and the internet is consistently less than
between here and the moon.
But worth it, in terms of the global productivity gain, and lowered
annoyance levels, worldwide.
:)

Best regards,
--Edwin
Jonathan Morton
2012-08-18 10:20:52 UTC
Permalink
Post by Török Edwin
Shouldn't wireless N be able to do 200 - 300 Mbps though? If I enable debugging in iwl4965 I see that it
starts TX aggregation, so not sure whats wrong (router or laptop?). With encryption off I can get at most 160 Mbps.
That's only the raw data rate - many non-ideal effects conspire to reduce this by at least half in practice.

I don't think anyone has ever seen the full theoretical throughput on wireless - at this point it's just a marketing number to indicate "this one is newer and better" to the technically illiterate.

- Jonathan
Dave Taht
2012-08-18 17:07:45 UTC
Permalink
Thx again for the benchmarks on your hardware! Can I get you to go one
more time to the well?

There's a subtle point to be made which basically involves the
difference between testing in lab conditions and in the real world.

On Sat, Aug 18, 2012 at 2:38 AM, Török Edwin
Post by Török Edwin
Baseline (only ping, no other traffic): 0.806/ 1.323/ 8.753/ 1.333 ms
no fq_codel on laptop, cerowrt defaults, nttcp -t: 1.192/16.605/107.351/25.265 ms; 94 Mbps
no fq_codel on laptop, cerowrt qlen_*=4, nttcp -t: 1.285/25.108/105.519/22.607 ms; 107 Mbps
no fq_codel on laptop, cerowrt qlen_*=12,nttcp -t: 2.195/24.277/131.490/21.161 ms; 127 Mbps
Stripping out some of the incremental steps will save you some time
on benchmarking, so lets go with 3,4,12,35,100. Wireless data is
incredibly noisy and I usually end up going with cdf plots like this
old one

http://www.teklibre.com/~d/bloat/hoqvssfqred.ps

to cope with noisy data with tons and tons of voip-like pings

http://www.teklibre.com/~d/bloat/ping_log.ps (also old)

but moving forward, we can do some stuff with this, so see below..

(to explain the first plot: sfqred was the predecessor to fq_codel,
and the first showed a distinct advantage towards optimizing for new
streams, which ended up (more elegantly) in fq_codel. The second plot
shows the effect of a small bandwidth change on latency, when the
underlying buffering was large. Yes, I need to get around to newer
plots but we still have some analysis and optimization to do of the
underlying codel algo)
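
For anyone wanting to produce the same kind of cdf plot from their own ping logs, the empirical CDF is simple to compute. This is a generic Python sketch with invented function names and sample values, not the script used for the plots linked above.

```python
def ecdf(samples):
    """Empirical CDF: sorted (value, fraction of samples <= value) pairs."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


def percentile(samples, p):
    """Smallest sample whose cumulative fraction is at least p (0 < p <= 1)."""
    for value, frac in ecdf(samples):
        if frac >= p:
            return value
    return None


# Hypothetical RTTs in ms: mostly quiet, with one large outlier.
rtts = [1.0, 2.0, 3.0, 4.0, 100.0]
print(ecdf(rtts))
print(percentile(rtts, 0.8))
```

Plotting the pairs from ecdf() shows the whole distribution at once, so a single outlier does not dominate the picture the way it does in a min/avg/max/mdev summary.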
Post by Török Edwin
no fq_codel on laptop, cerowrt defaults, nttcp -r: 1.332/ 3.129/ 41.900/ 5.221 ms; 39 Mbps
no fq_codel on laptop, cerowrt qlen_*=4, nttcp -r: 1.514/ 3.205/ 8.595/ 1.817 ms; 46 Mbps
no fq_codel on laptop, cerowrt qlen_*=12,nttcp -r: 2.025/ 5.173/ 16.890/ 3.763 ms; 81 Mbps
no fq_codel on laptop, cerowrt qlen_*=35,nttcp -r: 2.893/ 7.895/130.859/17.621 ms; 119 Mbps
no fq_codel on laptop, cerowrt qlen_*=50,nttcp -r: 0.951/ 7.810/ 47.646/ 6.428 ms; 131 Mbps
no fq_codel on laptop, cerowrt qlen_*=100,nttcp -r:5.149/ 8.766/ 14.371/ 2.191 ms; 128 Mbps
To get twice the speed a qlen=11 is enough already, and to get all the speed back a qlen=35 is needed.
This is an incomplete conclusion. It is incomplete in that A) these
tests were done under laboratory conditions at the highest data rate
(MCS15), and B), it was with a single point to point link to an AP
which normally would be handling more than one client. C) it tests a
single full throttle TCP stream when typical websites and usage
involve 70+ dns lookups and 70 separate short streams.

I can live with B and C) for now, although I note that the chrome
benchmark while doing a full blown stream test as you are doing now in
the background and ping is quite useful for looking at C. Let's tackle
A...
Post by Török Edwin
fq_codel on laptop, cerowrt defaults, nttcp -t: 1.248/12.960/108.490/16.733 ms; 90 Mbps
fq_codel on laptop, cerowrt qlen_*=4, nttcp -t: 1.205/10.843/ 76.983/12.460 ms; 105 Mbps
fq_codel on laptop, cerowrt qlen_*=8, nttcp -t: 4.034/16.088/ 98.611/17.050 ms; 120 Mbps
fq_codel on laptop, cerowrt qlen_*=11, nttcp -t: 3.766/15.687/ 56.684/11.135 ms; 114 Mbps
fq_codel on laptop, cerowrt qlen_*=35, nttcp -t: 11.360/26.742/ 48.051/ 7.489 ms; 113 Mbps
So, if you could move your laptop to where it gets MCS4 on a fairly
reliable basis, and repeat the tests? a wall or three will do it.

I will predict several things:

1) the bulk of the buffering problem is going to move to your laptop,
as it has weaker antennas than the wndrs. Most likely you will end up
with tx on the one side higher than rx on the other.

2) you will see much higher jitter and latency and much lower
throughput. Your results will also get wildly more variable run to
run. (I tend to run tests for 2 minutes or longer and toss out the
first few seconds)

3) The lower fixed buffering sizes on cero's qlens will start making a
lot more sense, but it may be hard to see due to 1 and 2.

The thing I don't honestly know is how well fq_codel reacts to sudden
bandwidth changes when the underlying device driver (the iwl in this
case) is overbuffered or how well codel's target idea really works in
the wifi case in general. It would be nice to have some data on it.
(hint, hint)

Some work was done on debloating the iwl last year, I don't know if
any of the work made it into mainline.

Lastly, I put a version of Linux 3.6-rc2 up here.

http://snapon.lab.bufferbloat.net/~cero1/deb/

It has a fix to codel in it that was needed (I think but have not
checked to see if it's in 3.5.1), and it also incorporates "TCP small
queues", which reduces tcp-related buffering in pfifo_fast enormously,
and helps on other qdiscs as well. Switching to it will invalidate the
testing you've done so far...

(another reason why I'm reluctant to post graphs on codel/fq_codel
right now is that good stuff keeps happening above/below it in Linux),

please don't change your kernel out before trying that test... (and I
make no warranties about the reliability/usefulness of a rc2!)
Post by Török Edwin
Shouldn't wireless N be able to do 200 - 300 Mbps though? If I enable debugging in iwl4965 I see that it
starts TX aggregation, so not sure whats wrong (router or laptop?). With encryption off I can get at most 160 Mbps.
A UDP test will get you in the 270Mbit range usually.
Post by Török Edwin
...
signal: -56 [-60, -59] dBm
signal avg: -125 [-65, -58] dBm
tx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
rx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
tx bitrate: 300.0 Mbit/s MCS 15 40Mhz short GI
In non-lab conditions you generally don't lock into a rate. The
minstrel algorithm tries various strategies to get the packets
through, so you can
get a grip on what's really happening by looking at the rc_stats file
for your particular device.

example here:


http://www.bufferbloat.net/projects/cerowrt/wiki/Minstrel_Wireless_Rate_Selection
Török Edwin
2012-08-25 13:56:04 UTC
Post by Dave Taht
Thx again for the benchmarks on your hardware! Can I get you to go one
more time to the well?
Yes, but you have to wait until I have some time to do it.
Post by Dave Taht
Stripping out the incremental steps some will save you some time
on benchmarking, so lets go with 3,4,12,35,100. Wireless data is
incredibly noisy and I usually end up going with cdf plots like this
old one
Post by Török Edwin
To get twice the speed a qlen=11 is enough already, and to get all the speed back a qlen=35 is needed.
This is an incomplete conclusion. It is incomplete in that A) these
tests were done under laboratory conditions at the highest data rate
(MCS15), and B), it was with a single point to point link to an AP
which normally would be handling more than one client. C) it tests a
single full throttle TCP stream when typical websites and usage
involve 70+ dns lookups and 70 separate short streams.
I can live with B and C) for now, although I note that the chrome
benchmark while doing a full blown stream test as you are doing now in
the background and ping is quite useful for looking at C. Let's tackle
A...
Post by Török Edwin
fq_codel on laptop, cerowrt defaults, nttcp -t: 1.248/12.960/108.490/16.733 ms; 90 Mbps
fq_codel on laptop, cerowrt qlen_*=4, nttcp -t: 1.205/10.843/ 76.983/12.460 ms; 105 Mbps
fq_codel on laptop, cerowrt qlen_*=8, nttcp -t: 4.034/16.088/ 98.611/17.050 ms; 120 Mbps
fq_codel on laptop, cerowrt qlen_*=11, nttcp -t: 3.766/15.687/ 56.684/11.135 ms; 114 Mbps
fq_codel on laptop, cerowrt qlen_*=35, nttcp -t: 11.360/26.742/ 48.051/ 7.489 ms; 113 Mbps
So, if you could move your laptop to where it gets MCS4 on a fairly
reliable basis, and repeat the tests? a wall or three will do it.
I've put my laptop in a place where I got MCS4 on TX most of the time.
RX is MCS4 most of the time too, but it is switching to MCS5, 7, 11, 12 and back to MCS4
quite a lot.
Post by Dave Taht
please don't change your kernel out before trying that test... (and I
make no warranties about the reliability/usefulness of a rc2!)
Here are the results with fq_codel on the laptop, and same 3.5.0 kernel:

qlen 100, nttcp -t: 5.966/57.104/192.017/26.674 ms; 52.2376 Mbps
qlen 35, nttcp -t: 15.636/54.823/108.921/19.762 ms; 52.4675 Mbps
qlen 12, nttcp -t: 4.768/29.439/132.924/27.159 ms; 51.2619 Mbps
qlen 4, nttcp -t: 2.631/20.500/152.741/31.549 ms; 40.3949 Mbps
qlen def, nttcp -t: 2.010/21.851/317.085/49.323 ms; 35.8268 Mbps

qlen 100, nttcp -r: 23.225/44.101/142.835/21.181 ms; 36.6789 Mbps
qlen 35, nttcp -r: 3.755/23.413/ 83.530/15.329 ms; 35.4602 Mbps
qlen 12, nttcp -r: 4.318/10.251/ 96.773/12.008 ms; 31.1557 Mbps
qlen 4, nttcp -r: 2.733/ 4.507/ 16.353/ 1.917 ms; 24.6688 Mbps
qlen def, nttcp -r: 2.119/ 4.999/ 64.968/ 7.275 ms; 27.3645 Mbps
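
In case it helps with the cdf plots or with comparing runs later, here is a minimal Python sketch that turns result lines of this shape into records. The regular expression and the field names (min_ms, avg_ms, max_ms, mdev_ms, mbps) are my own assumptions about the layout, not anything nttcp or ping defines; the sample data is a subset of the numbers above.

```python
import re

# A few of the result lines from this mail, used as sample input.
RESULTS = """\
qlen 100, nttcp -t: 5.966/57.104/192.017/26.674 ms; 52.2376 Mbps
qlen 35, nttcp -t: 15.636/54.823/108.921/19.762 ms; 52.4675 Mbps
qlen 12, nttcp -t: 4.768/29.439/132.924/27.159 ms; 51.2619 Mbps
qlen 4, nttcp -t: 2.631/20.500/152.741/31.549 ms; 40.3949 Mbps
qlen def, nttcp -t: 2.010/21.851/317.085/49.323 ms; 35.8268 Mbps
qlen 100, nttcp -r: 23.225/44.101/142.835/21.181 ms; 36.6789 Mbps
qlen 35, nttcp -r: 3.755/23.413/ 83.530/15.329 ms; 35.4602 Mbps
qlen 12, nttcp -r: 4.318/10.251/ 96.773/12.008 ms; 31.1557 Mbps
qlen 4, nttcp -r: 2.733/ 4.507/ 16.353/ 1.917 ms; 24.6688 Mbps
qlen def, nttcp -r: 2.119/ 4.999/ 64.968/ 7.275 ms; 27.3645 Mbps
"""

# Assumed layout: "qlen <N|def>, nttcp <-t|-r>: min/avg/max/mdev ms; <rate> Mbps"
LINE_RE = re.compile(
    r"qlen\s+(?P<qlen>\w+),\s+nttcp\s+(?P<direction>-[tr]):\s*"
    r"(?P<min>[\d.]+)/\s*(?P<avg>[\d.]+)/\s*(?P<max>[\d.]+)/\s*(?P<mdev>[\d.]+)"
    r"\s+ms;\s*(?P<mbps>[\d.]+)\s+Mbps"
)


def parse_results(text):
    """Return one dict per line that matches the assumed layout."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip anything that does not look like a result line
        rows.append({
            "qlen": m.group("qlen"),
            "direction": m.group("direction"),
            "min_ms": float(m.group("min")),
            "avg_ms": float(m.group("avg")),
            "max_ms": float(m.group("max")),
            "mdev_ms": float(m.group("mdev")),
            "mbps": float(m.group("mbps")),
        })
    return rows


def print_table(rows):
    """Print average latency, max latency and throughput per run."""
    for r in rows:
        print(f"{r['direction']} qlen={r['qlen']:>4} "
              f"avg={r['avg_ms']:8.3f} ms max={r['max_ms']:8.3f} ms "
              f"{r['mbps']:8.4f} Mbps")


if __name__ == "__main__":
    print_table(parse_results(RESULTS))
```

The records can be fed to whatever plotting tool is preferred; the sketch only covers the parsing.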

Note that the laptop was on battery this time, so that may add some jitter
(CPU freq switching, wifi power saving?), but shouldn't matter for >10ms quantities.

Looks like the iwl4965 is somewhat bloated, with those 100ms+ latencies.

I don't know what happened there, but with the default qlen (2,3,3,3) I get the 317 ms max latency,
whereas with qlen 4 I get 152 ms max latency on TX. The average is also better with qlen 4.
Same observation goes for the RX side.
Post by Dave Taht
1) the bulk of the buffering problem is going to move to your laptop,
as it has weaker antennas than the wndrs. Most likely you will end up
with tx on the one side higher than rx on the other.
Yes the laptop TX latencies are worse.
Post by Dave Taht
2) you will see much higher jitter and latency and much lower
throughput. Your results will also get wildly more variable run to
run. (I tend to run tests for 2 minutes or longer and toss out the
first few seconds)
On TX it is quite consistently in MCS4 (according to watch iw wlan0 station dump),
but on RX it's jumping quite a lot.
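
To log the rate instead of eyeballing it under watch, something like the following Python sketch could pull the tx/rx bitrate and MCS index out of a station dump. It assumes the "tx bitrate: ... MBit/s MCS n ..." line format quoted earlier in this thread; the sample text is copied from there, and parse_bitrates is a hypothetical helper name. Live use would pass it the output of iw wlan0 station dump.

```python
import re

# Matches lines such as "tx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI".
# The MCS part is optional because legacy (non-HT) rates carry no MCS index.
BITRATE_RE = re.compile(
    r"^\s*(tx|rx) bitrate:\s*([\d.]+)\s*MBit/s(?:\s+MCS\s+(\d+))?",
    re.IGNORECASE | re.MULTILINE,
)


def parse_bitrates(dump):
    """Return {"tx": (mbit_per_s, mcs_or_None), "rx": (...)} from dump text."""
    rates = {}
    for direction, rate, mcs in BITRATE_RE.findall(dump):
        rates[direction.lower()] = (float(rate), int(mcs) if mcs else None)
    return rates


SAMPLE = """\
signal: -56 [-60, -59] dBm
tx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
rx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
"""

if __name__ == "__main__":
    print(parse_bitrates(SAMPLE))
```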
Post by Dave Taht
3) The lower fixed buffering sizes on cero's qlens will start making a
lot more sense, but it may be hard to see due to 1 and 2.
qlen 12 and 4 look good. The default looks worse though.
Post by Dave Taht
The thing I don't honestly know is how well fq_codel reacts to sudden
bandwidth changes when the underlying device driver (the iwl in this
case) is overbuffered or how well codel's target idea really works in
the wifi case in general. It would be nice to have some data on it.
(hint, hint)
The bandwidth varies quite a lot on RX even if both the laptop and router
are perfectly still. So the -r numbers above should be what you are looking for.
If you want some other data let me know.
Post by Dave Taht
Some work was done on debloating the iwl last year, I don't know if
any of the work made it into mainline.
Lastly, I put a version of Linux 3.6-rc2 up here.
http://snapon.lab.bufferbloat.net/~cero1/deb/
It has a fix to codel in it that was needed (I think but have not
checked to see if it's in 3.5.1), and it also incorporates "TCP small
queues", which reduces tcp-related buffering in pfifo_fast enormously,
and helps on other qdiscs as well. Switching to it will invalidate the
testing you've done so far...
I assume these are in the upstream 3.6-rc3 too, right?

Here is just one measurement done with 3.6-rc3 on the laptop and fq_codel
(same location as above tests, approx MCS4):
qlen def, nttcp -t, 2.871/15.655/375.777/44.212 ms; 35.2776 Mbps
qlen def, nttcp -r, 1.406/ 3.434/ 12.763/ 1.649 ms; 24.3334 Mbps

It looks somewhat better.
Post by Dave Taht
(another reason why I'm reluctant to post graphs on codel/fq_codel
right now is that good stuff keeps happening above/below it in Linux),
Post by Török Edwin
Shouldn't wireless N be able to do 200 - 300 Mbps though? If I enable debugging in iwl4965 I see that it
starts TX aggregation, so not sure what's wrong (router or laptop?). With encryption off I can get at most 160 Mbps.
A UDP test will get you in the 270Mbit range usually.
nttcp -T -u -D -n2000 gives ~180 Mbps at most, and with -r I can't make sense of it (looks like most gets dropped):
      Bytes  Real s  CPU s  Real-MBit/s  CPU-MBit/s  Calls  Real-C/s   CPU-C/s
l     16384    0.08   0.00       1.6090  13107.2000      5     61.38  500000.0
1   8192000    0.08   0.04     845.8113   1820.6973   2003  25850.83   55646.6
Post by Dave Taht
Post by Török Edwin
...
signal: -56 [-60, -59] dBm
signal avg: -125 [-65, -58] dBm
tx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
rx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
tx bitrate: 300.0 Mbit/s MCS 15 40Mhz short GI
In non-lab conditions you generally don't lock into a rate. The
minstrel algorithm tries various strategies to get the packets
through, so you can
get a grip on what's really happening by looking at the rc_stats file
for your particular device.
http://www.bufferbloat.net/projects/cerowrt/wiki/Minstrel_Wireless_Rate_Selection
I looked at the rc_stats file by cd-ing into the stations dir on the router. After disabling/enabling the radio
the stations subdir was gone though:
***@OpenWrt:~# ls /sys/kernel/debug/ieee80211/phy1/netdev\:sw10/stations/ -al
drwxr-xr-x 2 root root 0 Aug 25 10:28 .
drwxr-xr-x 3 root root 0 Aug 25 10:28 ..

So unfortunately I'm without an rc_stats now (until I reboot the router probably?).
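
For checking this from a script rather than by hand, a minimal Python sketch follows. It assumes the debugfs layout seen above (phy*/netdev:*/stations/) with one subdirectory per associated station holding an rc_stats file; find_rc_stats is a hypothetical helper name, and an empty result simply means no station directories were found, as in the listing above.

```python
import glob
import os

# Root of the mac80211 debugfs tree, as in the path listed above.
DEBUGFS_ROOT = "/sys/kernel/debug/ieee80211"


def find_rc_stats(root=DEBUGFS_ROOT):
    """Return sorted paths of all rc_stats files below root.

    Assumed layout: <root>/phy*/netdev:*/stations/<station>/rc_stats
    """
    pattern = os.path.join(root, "phy*", "netdev:*", "stations", "*", "rc_stats")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    paths = find_rc_stats()
    if not paths:
        print("no rc_stats files found (no stations listed, or debugfs not mounted)")
    for path in paths:
        print(path)
```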

Best regards,
--Edwin
Dave Taht
2012-08-25 18:09:10 UTC
On Sat, Aug 25, 2012 at 6:56 AM, Török Edwin
Post by Török Edwin
Post by Dave Taht
Thx again for the benchmarks on your hardware! Can I get you to go one
more time to the well?
Yes, but you have to wait until I have some time to do it.
No worries. Doing good science takes time.
Post by Török Edwin
Post by Dave Taht
Stripping out the incremental steps some will save you some time
on benchmarking, so lets go with 3,4,12,35,100. Wireless data is
incredibly noisy and I usually end up going with cdf plots like this
old one
Post by Török Edwin
To get twice the speed a qlen=11 is enough already, and to get all the speed back a qlen=35 is needed.
This is an incomplete conclusion. It is incomplete in that A) these
tests were done under laboratory conditions at the highest data rate
(MCS15), and B), it was with a single point to point link to an AP
which normally would be handling more than one client. C) it tests a
single full throttle TCP stream when typical websites and usage
involve 70+ dns lookups and 70 separate short streams.
I can live with B and C) for now, although I note that the chrome
benchmark while doing a full blown stream test as you are doing now in
the background and ping is quite useful for looking at C. Let's tackle
A...
Post by Török Edwin
fq_codel on laptop, cerowrt defaults, nttcp -t: 1.248/12.960/108.490/16.733 ms; 90 Mbps
fq_codel on laptop, cerowrt qlen_*=4, nttcp -t: 1.205/10.843/ 76.983/12.460 ms; 105 Mbps
fq_codel on laptop, cerowrt qlen_*=8, nttcp -t: 4.034/16.088/ 98.611/17.050 ms; 120 Mbps
fq_codel on laptop, cerowrt qlen_*=11, nttcp -t: 3.766/15.687/ 56.684/11.135 ms; 114 Mbps
fq_codel on laptop, cerowrt qlen_*=35, nttcp -t: 11.360/26.742/ 48.051/ 7.489 ms; 113 Mbps
So, if you could move your laptop to where it gets MCS4 on a fairly
reliable basis, and repeat the tests? a wall or three will do it.
I've put my laptop in a place where I got MCS4 on TX most of the time.
RX is MCS4 most of the time too, but it is switching to MCS5, 7, 11, 12 and back to MCS4
quite a lot.
Post by Dave Taht
please don't change your kernel out before trying that test... (and I
make no warranties about the reliability/usefulness of a rc2!)
qlen 100, nttcp -t: 5.966/57.104/192.017/26.674 ms; 52.2376 Mbps
qlen 35, nttcp -t: 15.636/54.823/108.921/19.762 ms; 52.4675 Mbps
qlen 12, nttcp -t: 4.768/29.439/132.924/27.159 ms; 51.2619 Mbps
qlen 4, nttcp -t: 2.631/20.500/152.741/31.549 ms; 40.3949 Mbps
qlen def, nttcp -t: 2.010/21.851/317.085/49.323 ms; 35.8268 Mbps
qlen 100, nttcp -r: 23.225/44.101/142.835/21.181 ms; 36.6789 Mbps
qlen 35, nttcp -r: 3.755/23.413/ 83.530/15.329 ms; 35.4602 Mbps
qlen 12, nttcp -r: 4.318/10.251/ 96.773/12.008 ms; 31.1557 Mbps
qlen 4, nttcp -r: 2.733/ 4.507/ 16.353/ 1.917 ms; 24.6688 Mbps
qlen def, nttcp -r: 2.119/ 4.999/ 64.968/ 7.275 ms; 27.3645 Mbps
Note that the laptop was on battery this time, so that may add some jitter
(CPU freq switching, wifi power saving?), but shouldn't matter for >10ms quantities.
Thank you for so clearly showing the trendline and relationship between
overbuffering, bandwidth, latency, and jitter on linux wifi in this
combination of these two drivers and OSes!

(I am inclined to throw out the second qlen 4 result as anomalous however)

(Did I add enough qualifications to the above statement?)

It does look like qlen 12 (presently) fits within my overall goals.
However (to me) the next step is switching the ath9k driver's buffering itself
from a straight fifo to something (a tree?) that inspects its queue(s)
for possible aggregate-able packets and fq-s (again) the result.

a better method (probably) would be for it to tell the overlying qdisc
"I want up to x packets or y bytes for station z", and the overlying
qdisc to be doing that job, and thus the codel notion of "maxpacket"
could apply to each station.
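
To make that idea concrete, here is a rough Python sketch of the interface, not of any existing kernel API. The driver asks for "up to x packets or y bytes for station z" and the queueing layer hands back a batch. PerStationQueues, enqueue and pull are hypothetical names, packets are plain byte strings, and there is no codel or fairness logic here at all.

```python
from collections import deque


class PerStationQueues:
    """Toy model of a qdisc that keeps one FIFO per destination station."""

    def __init__(self):
        self._queues = {}  # station id -> deque of packets (bytes)

    def enqueue(self, station, packet):
        """Queue a packet for the given station."""
        self._queues.setdefault(station, deque()).append(packet)

    def pull(self, station, max_packets, max_bytes):
        """Return up to max_packets packets totalling at most max_bytes.

        Packets leave in FIFO order; the batch stops at the first packet
        that would push it over either limit.
        """
        queue = self._queues.get(station)
        batch = []
        total = 0
        while queue and len(batch) < max_packets:
            size = len(queue[0])
            if total + size > max_bytes:
                break
            batch.append(queue.popleft())
            total += size
        return batch


if __name__ == "__main__":
    q = PerStationQueues()
    for _ in range(5):
        q.enqueue("station-z", b"x" * 1500)
    # The driver asks for one aggregate: at most 4 packets or 4000 bytes.
    aggregate = q.pull("station-z", max_packets=4, max_bytes=4000)
    print(len(aggregate), "packets,", sum(len(p) for p in aggregate), "bytes")
```

The point of the shape is that the byte limit is stated per station, per request, so the layer above the driver knows the size of the next aggregate rather than guessing at it through a fixed qlen.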

"maxpacket" is kind of misnamed, what it means is the maximum number
of bytes that can be delivered in one go - so it is MTU for devices
that don't have TSO or GSO enabled, size of a TSO (something less than
64k) for TSO/GSO, and should (probably!!! we're not there yet) be
equal to
"proposed next aggregate-able size/bytes for this destination" in wifi.
Post by Török Edwin
Looks like the iwl4965 is somewhat bloated, with those 100ms+ latencies.
Ya think? It turns out one of my laptops has the 5100AGN, which is similar.
Somewhere on this list last year was a long discussion and some
proposed patches for the iwl series... I think the guy working on it
got swamped by some phd work, though.

right now that box is used as a wired endpoint (and has a SR71 card in it).

A few x86 boxes were just donated that I can replace that with, and do
a bit more wireless testing next week than I presently do. (and the
x86 boxes will dramatically expand testing longer RTTs, which I care
about a lot)

(THANK YOU VYATTA!)

Regrettably, first I have to get those boxes here, then set up a
working OS and kernel on them, and then get netem running...
Post by Török Edwin
I don't know what happened there, but with the default qlen (2,3,3,3) I get the 317 ms max latency,
whereas with qlen 4 I get 152 ms max latency on TX. The average is also better with qlen 4.
Same observation goes for the RX side.
We have a potential interaction with the default quantums I'm using on cero,
which are 256, rather than 1500, (which is the default). In that case,
we can end up with 3 timestamped ipv4 acks in a row, but not 4, so a
given stream can "leak over" into the next potential aggregate, which
might be arbitrarily shortened by incorporating another portion of a
stream for another destination.
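
The arithmetic behind "3 but not 4", as a small Python sketch. The 66-byte ack size is my assumption: 14 bytes of ethernet header, 20 of IPv4, 20 of TCP and 12 for the timestamp option with padding, with the link-layer header counted against the quantum. This is plain floor division; the actual deficit accounting in fq_codel may behave a little differently at the edges.

```python
# Assumed on-the-wire size of a pure IPv4 TCP ack carrying timestamps.
ETH_HEADER = 14
IPV4_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMP_OPTION = 12  # 10-byte option padded to a 4-byte boundary
ACK_BYTES = ETH_HEADER + IPV4_HEADER + TCP_HEADER + TCP_TIMESTAMP_OPTION


def whole_packets_per_quantum(quantum, packet_bytes):
    """How many whole packets of packet_bytes fit into one quantum."""
    return quantum // packet_bytes


if __name__ == "__main__":
    for quantum in (256, 1500):
        n = whole_packets_per_quantum(quantum, ACK_BYTES)
        print(f"quantum {quantum}: {n} acks of {ACK_BYTES} bytes")
```

With a quantum of 256 three such acks use 198 bytes and a fourth would need 264, while a quantum of 1500 has room for 22 of them.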

So, I'm inclined to bump it up to 12 for the cerowrt userbase as the
cost in normal usage is low and the benefit high, (note the same
problem above will occur, just slightly less often, and on average the
aggregates will be larger, so it's a win)

but in the interest of science and continuing to analyze codel's
behavior I'm going to keep it at 3 for a while longer. (feel free to use
values that make you happy, just clearly tell me when you do, please!)

fiddling with the qlen is a very blunt hammer for the real job that
needs to be happening in the qdisc and driver, regardless. I hope we
can get much smarter about it soon, but at least in my case that
requires more insight into the ath9k than I have currently. Felix is
probably pretty wrapped up in the openwrt freeze, Andrew has another
day job...
Post by Török Edwin
Post by Dave Taht
1) the bulk of the buffering problem is going to move to your laptop,
as it has weaker antennas than the wndrs. Most likely you will end up
with tx on the one side higher than rx on the other.
Yes the laptop TX latencies are worse.
Post by Dave Taht
2) you will see much higher jitter and latency and much lower
throughput. Your results will also get wildly more variable run to
run. (I tend to run tests for 2 minutes or longer and toss out the
first few seconds)
On TX it is quite consistently in MCS4 (according to watch iw wlan0 station dump),
but on RX it's jumping quite a lot.
As good as the minstrel algorithm is, I've often felt it could be improved
with deeper analysis of what really happens in the wireless-n cases,
particularly in the case of retries and within-aggregate packet loss.

Tuning it for -g took a year of data collection and a ton of analysis
and cash... and the (excellent) paper on it is unpublishable because
it so far exceeds the MPU. I doubt there are 12 people in the world
that deeply understand how minstrel works, and I wish there were
thousands... there is a wealth of information in it that could be used
for other things, like improving the behavior of mesh routing
protocols.
Post by Török Edwin
Post by Dave Taht
3) The lower fixed buffering sizes on cero's qlens will start making a
lot more sense, but it may be hard to see due to 1 and 2.
qlen 12 and 4 look good. The default looks worse though.
Post by Dave Taht
The thing I don't honestly know is how well fq_codel reacts to sudden
bandwidth changes when the underlying device driver (the iwl in this
case) is overbuffered or how well codel's target idea really works in
the wifi case in general. It would be nice to have some data on it.
(hint, hint)
The bandwidth varies quite a lot on RX even if both the laptop and router
are perfectly still. So the -r numbers above should be what you are looking for.
If you want some other data let me know.
I'll try not to abuse your time, but if I can convince you to be able to
duplicate your experiments exactly, when needed, it would be an enormous help.
Post by Török Edwin
Post by Dave Taht
Some work was done on debloating the iwl last year, I don't know if
any of the work made it into mainline.
Lastly, I put a version of Linux 3.6-rc2 up here.
http://snapon.lab.bufferbloat.net/~cero1/deb/
It has a fix to codel in it that was needed (I think but have not
checked to see if it's in 3.5.1), and it also incorporates "TCP small
queues", which reduces tcp-related buffering in pfifo_fast enormously,
and helps on other qdiscs as well. Switching to it will invalidate the
testing you've done so far...
I assume these are in the upstream 3.6-rc3 too, right?
Yes. The rc3 I just put up there, however, has some subtle changes to
codel in it that differ from mainline. I'll have to clearly
distinguish between that and mainline better in the future.
Post by Török Edwin
Here is just one measurement done with 3.6-rc3 on the laptop and fq_codel
qlen def, nttcp -t, 2.871/15.655/375.777/44.212 ms; 35.2776 Mbps
qlen def, nttcp -r, 1.406/ 3.434/ 12.763/ 1.649 ms; 24.3334 Mbps
It looks somewhat better.
12 remains the sanest win right now. But too early to change it this month.

thx again!
Post by Török Edwin
Post by Dave Taht
(another reason why I'm reluctant to post graphs on codel/fq_codel
right now is that good stuff keeps happening above/below it in Linux),
Post by Török Edwin
Shouldn't wireless N be able to do 200 - 300 Mbps though? If I enable debugging in iwl4965 I see that it
starts TX aggregation, so not sure what's wrong (router or laptop?). With encryption off I can get at most 160 Mbps.
A UDP test will get you in the 270Mbit range usually.
Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s
l 16384 0.08 0.00 1.6090 13107.2000 5 61.38 500000.0
1 8192000 0.08 0.04 845.8113 1820.6973 2003 25850.83 55646.6
I'll think about this on another day. Feel free to do pfifo_fast on this test
a couple times in either direction to get a baseline.

Doing badly on this test right now doesn't bother me at all...
Post by Török Edwin
Post by Dave Taht
Post by Török Edwin
...
signal: -56 [-60, -59] dBm
signal avg: -125 [-65, -58] dBm
tx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
rx bitrate: 300.0 MBit/s MCS 15 40Mhz short GI
tx bitrate: 300.0 Mbit/s MCS 15 40Mhz short GI
In non-lab conditions you generally don't lock into a rate. The
minstrel algorithm tries various strategies to get the packets
through, so you can
get a grip on what's really happening by looking at the rc_stats file
for your particular device.
http://www.bufferbloat.net/projects/cerowrt/wiki/Minstrel_Wireless_Rate_Selection
I looked at the rc_stats file by cd-ing into the stations dir on the router. After disabling/enabling the radio
drwxr-xr-x 2 root root 0 Aug 25 10:28 .
drwxr-xr-x 3 root root 0 Aug 25 10:28 ..
So unfortunately I'm without an rc_stats now (until I reboot the router probably?).
Best regards,
--Edwin
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"