Discussion:
[Cerowrt-devel] Comcast upped service levels -> WNDR3800 can't cope...
Aaron Wood
2014-08-29 16:57:26 UTC
Comcast has upped the download rates in my area, from 50Mbps to 100Mbps.
This morning I tried to find the limit of the WNDR3800. And I found it.
50Mbps is still well within capabilities, 100Mbps isn't.

And as I've seen Dave say previously, it's right around 80Mbps total
(download + upload).

http://burntchrome.blogspot.com/2014/08/new-comcast-speeds-new-cerowrt-sqm.html

I tried disabling downstream shaping to see what the result was, and it
wasn't pretty. I also tried using the "simplest.qos" script, and that
didn't really gain me anything, so I went back to the simple.qos script
(those results aren't included above).

It looks like it's definitely time for a new router platform (for me).

Or, we need to find a way to implement the system such that it doesn't max
out a 680MHz mips core just to push 100Mbps of data. That's roughly 10K
cpu cycles per packet, which seems like an awful lot. Unless the other
problem is that the memory bus just can't keep up. My experience of a lot
of these processors is that the low-level offload engines have great DMA
capabilities for "wire-speed" operation, but that the processor core itself
can't move data to save its life.

What's the limit of the EdgeRouter Lite?

Or should I start looking for something like this:

http://www.gateworks.com/product/item/ventana-gw5310-network-processor

(although that's an expensive board, given the very low production volume,
for the same cost I could probably build a small passively-cooled
mini/micro-atx setup running x86 and dual NICs).

-Aaron
Rick Jones
2014-08-29 17:03:14 UTC
On 08/29/2014 09:57 AM, Aaron Wood wrote:
> Or, we need to find a way to implement the system such that it doesn't
> max out a 680MHz mips core just to push 100Mbps of data. That's roughly
> 10K cpu cycles per packet, which seems like an awful lot. Unless the
> other problem is that the memory bus just can't keep up. My experience
> of a lot of these processors is that the low-level offload engines have
> great DMA capabilities for "wire-speed" operation, but that the
> processor core itself can't move data to save it's life.

In the long ago and far away, it used to be opined that one
could/would/should get 1 Mbit/s per MHz. Though that may have been for
a situation where there wasn't much besides just the plain TCP/IP stack
running (eg without firewall bits etc going).

Does "perf" run on MIPS in the kernel you are running?
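
If it does, even a rough whole-system profile taken during a test run would be telling; a typical invocation (assuming a perf package is available for that kernel) would be along the lines of:

opkg install perf                # if packaged for this build
perf record -a -g -- sleep 30    # sample everything while the test runs
perf report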

rick jones
Jonathan Morton
2014-08-29 17:15:05 UTC
On 29 Aug, 2014, at 7:57 pm, Aaron Wood wrote:

> That's roughly 10K cpu cycles per packet, which seems like an awful lot.

I could analyse the chief algorithms to see how many clock cycles per packet are theoretically possible - a number one could approach with an embedded core in the NIC, rather than as part of a full kernel.

- Jonathan Morton
Jonathan Morton
2014-08-29 17:19:30 UTC
On 29 Aug, 2014, at 7:57 pm, Aaron Wood wrote:

> Comcast has upped the download rates in my area, from 50Mbps to 100Mbps.

FWIW, it looks like the unshaped latency has about halved with the doubling of capacity. That's consistent with the buffer size and (lack of) management remaining the same.

If PIE were enabled, it'd look a whole lot better than that, I'm sure.

- Jonathan Morton
Sebastian Moeller
2014-08-29 17:25:14 UTC
Hi Aaron,


On Aug 29, 2014, at 18:57 , Aaron Wood <***@gmail.com> wrote:

> Comcast has upped the download rates in my area, from 50Mbps to 100Mbps. This morning I tried to find the limit of the WNDR3800. And I found it. 50Mbps is still well within capabilities, 100Mbps isn't.
>
> And as I've seen Dave say previously, it's right around 80Mbps total (download + upload).
>
> http://burntchrome.blogspot.com/2014/08/new-comcast-speeds-new-cerowrt-sqm.html
>
> I tried disabling downstream shaping to see what the result was, and it wasn't pretty.

You could try to set the interface to 100Mbps with ethtool and exercise cerowrt's BQL implementation a bit ;)
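
Something like this on the WAN port, assuming the PHY (and the modem on the other end) are happy being forced to 100, and that BQL is wired up in this driver (ge00 being cerowrt's WAN interface):

ethtool -s ge00 speed 100 duplex full autoneg off
# then watch BQL adapt while a transfer runs:
cat /sys/class/net/ge00/queues/tx-0/byte_queue_limits/limit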

> I also tried using the "simplest.qos" script, and that didn't really gain me anything, so I went back to the simple.qos script (those results aren't included above).
>
> It looks like it's definitely time for a new router platform (for me).
>
> Or, we need to find a way to implement the system such that it doesn't max out a 680MHz mips core just to push 100Mbps of data. That's roughly 10K cpu cycles per packet, which seems like an awful lot. Unless the other problem is that the memory bus just can't keep up. My experience of a lot of these processors is that the low-level offload engines have great DMA capabilities for "wire-speed" operation, but that the processor core itself can't move data to save it's life.

Could you try simplest.qos and replace HTB with TBF? We still do not know whether there is a cheaper option than HTB that still works okay-ish (I only have 16Mbps down / 2Mbps up, so can not easily test myself). I guess that TBF is just as expensive as HTB, since both share more or less the same token bucket algorithm...
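
Roughly, the egress shaper would go from the first form to the second (interface and rate illustrative, not the literal script contents):

# HTB-based, more or less what simplest.qos sets up today:
tc qdisc add dev ge00 root handle 1: htb default 10
tc class add dev ge00 parent 1: classid 1:10 htb rate 12mbit
tc qdisc add dev ge00 parent 1:10 fq_codel
# TBF-based alternative to compare against:
tc qdisc add dev ge00 root handle 1: tbf rate 12mbit burst 8k latency 300ms
tc qdisc add dev ge00 parent 1:1 fq_codel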

>
> What's the limit of the EdgeRouter Lite?

I think this tops out at ~ 80-90Mbps combined, but there is no BQL yet. Given the price of the unit it would be really nice if it would work for the 150-200Mbps combined that seems to be needed in the near future.

>
> Or should I start looking for something like this:
>
> http://www.gateworks.com/product/item/ventana-gw5310-network-processor
>
> (although that's an expensive board, given the very low production volume, for the same cost I could probably build a small passively-cooled mini/micro-atx setup running x86 and dual NICs).
>
> -Aaron
Dave Taht
2014-08-29 18:06:32 UTC
On Fri, Aug 29, 2014 at 9:57 AM, Aaron Wood <***@gmail.com> wrote:
> Comcast has upped the download rates in my area, from 50Mbps to 100Mbps.
> This morning I tried to find the limit of the WNDR3800. And I found it.
> 50Mbps is still well within capabilities, 100Mbps isn't.
>
> And as I've seen Dave say previously, it's right around 80Mbps total
> (download + upload).
>
> http://burntchrome.blogspot.com/2014/08/new-comcast-speeds-new-cerowrt-sqm.html

Thank you very much, as always, for doing public benchmarking with a good setup!

Yes, we hit kind of an unexpected wall on everything shipped with a processor originally designed in 1989, and the prevalence of hardware offloads used to bridge the gap between 100Mbit and GigE while keeping costs down is a real PITA.

I had only had data on x86 (good to 100s or 1000s of mbits) and a few
mips platforms to go on until recently.

> I tried disabling downstream shaping to see what the result was, and it
> wasn't pretty.

Well, I'll argue that only seeing an increase of 20ms or so with only the upstream shaped and fq_codeled (vs 120ms unshaped) is not bad, and within tolerances of most applications, even voip. Secondly, the characteristics of normal traffic, as opposed to the benchmark, make it pretty hard to hit that 100Mbit download limit, so a mere outbound rate limiter will suffice.

As for a comment in that blog, nothing is running pie yet. Some enhancements to how buffering is configured on the modem may well have finally been put into play in your default configuration. The best short-term option the cablecos have had has been to increase bandwidth, as it's the only easy knob they can tweak without further co-operation from the CMTS vendors and cablemodem makers, no matter how much it adds to the cost of their cable plant.

> I also tried using the "simplest.qos" script, and that
> didn't really gain me anything, so I went back to the simple.qos script
> (those results aren't included above).
>
> It looks like it's definitely time for a new router platform (for me).

At the moment the rangeley platforms look like a good bet for future work,
but they are terribly pricey. A plus is that openwrt already supports them.
I got one earlier this week but didn't get the right case and power supply
for it...

Less pricey are the EdgeRouter Pro and some of the new ARM-based routers with 1GHz dual cores. In the first case that involves switching to Debian, which is a headache, and the htb implementation appears to be inaccurate at higher rates so far; why, I don't know.

In the second case there are tons of binary blobs and beta code to deal with.

I lean, personally, towards taking a vacation.

>
> Or, we need to find a way to implement the system such that it doesn't max
> out a 680MHz mips core just to push 100Mbps of data.

An option is to explore conventional policing of the downstream, rather than rate shaping it. Policing is much lighter weight, but has really drastic effects when limits are hit, and you have to tune the burst parameter sanely.
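
For reference, a bare-bones ingress policer looks something like this (rate and burst are exactly the knobs that need the sane tuning):

tc qdisc add dev ge00 handle ffff: ingress
tc filter add dev ge00 parent ffff: protocol ip prio 1 u32 \
   match ip src 0.0.0.0/0 \
   police rate 100mbit burst 250k drop flowid :1

Everything over the configured rate is dropped outright with no queueing, which is where the drastic effects come from.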

> That's roughly 10K cpu
> cycles per packet, which seems like an awful lot. Unless the other problem
> is that the memory bus just can't keep up. My experience of a lot of these
> processors is that the low-level offload engines have great DMA capabilities
> for "wire-speed" operation, but that the processor core itself can't move
> data to save it's life.

The cpu caches are 32k/32k, the memory interface 16 bits wide. The rate limiter (the thing eating all the cycles, not the fq_codel algorithm!) is single-threaded, has global locks, and is at least partially interrupt-bound at 100Mbits/sec.

One thing to look at tuning might be the htb burst parameter in the sqm system. We tune the quantum presently, but not the burst.
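
i.e. something along these lines on the shaper class (class id and numbers purely illustrative):

tc class change dev ge00 parent 1: classid 1:1 htb rate 100mbit burst 30k cburst 30k quantum 1514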

I would very much like to find a better inbound software rate limiting algorithm, and/or to improve what exists today. In the future, finding something that could be easily implemented in hardware would be good.
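
For anyone following along, "what exists today" on the inbound side is roughly this ifb redirect dance (interface names as the sqm scripts construct them, rate illustrative):

ip link add name ifb4ge00 type ifb
ip link set dev ifb4ge00 up
tc qdisc add dev ge00 handle ffff: ingress
tc filter add dev ge00 parent ffff: protocol all prio 10 u32 \
   match u32 0 0 action mirred egress redirect dev ifb4ge00
tc qdisc add dev ifb4ge00 root handle 1: htb default 10
tc class add dev ifb4ge00 parent 1: classid 1:10 htb rate 100mbit
tc qdisc add dev ifb4ge00 parent 1:10 fq_codel

Every inbound packet takes that extra trip through the ifb on top of the normal forwarding path, which doesn't help the cost.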

I have some thoughts towards adding tbf to bql directly, which might be a win, but that is outbound only.

>
> What's the limit of the EdgeRouter Lite?

It's a little higher, but not good. It is a dual-core 680MHz MIPS platform, but has a lot of single-threadedness and possibly other problems. Work on it continues over on the relevant ubnt forums, but I haven't had time to profile the kernel in any case to look hard at where the limits are coming from. Hope someone else with more time and chops does.

> Or should I start looking for something like this:
>
> http://www.gateworks.com/product/item/ventana-gw5310-network-processor
>
> (although that's an expensive board, given the very low production volume,
> for the same cost I could probably build a small passively-cooled
> mini/micro-atx setup running x86 and dual NICs).

There is that option as well. I would certainly like to find a low end x86 box
that could rate limit + fq_codel at up to 300Mbits/sec. Toke's x86 boxes
have proven out to do 100Mbit/10Mbit correctly, but I don't remember their
specs, nor has he tried to push them past that, yet.

> -Aaron



--
Dave Täht

Jonathan Morton
2014-08-30 11:02:57 UTC
On 29 Aug, 2014, at 9:06 pm, Dave Taht wrote:

> The cpu caches are 32k/32k, the memory interface 16 bit. The rate limiter
> (the thing eating all the cycles, not the fq_codel algorithm!) is
> single threaded and has global locks,
> and is at least partially interrupt bound at 100Mbits/sec.

Looking at the code, HTB is considerably more complex than TBF in Linux, and not all of the added complexity is due to being classful (though a lot of it is). It seems that TBF has dire warnings all over it about having limited packet-rate capacity which depends on the value of HZ, while HTB has some sort of solution to that problem.

Meanwhile, FQ has per-flow throttling which looks like it could be torn out and used as a simple replacement for TBF. I should take a closer look and check whether it would just suffer from the same problems, but if it won't, then that could be a potential life-extender for the 3800.
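
Presumably the relevant knob is sch_fq's maxrate, which paces each flow individually rather than the aggregate, e.g.:

tc qdisc replace dev eth0 root fq maxrate 90mbit   # caps each flow, not the link as a whole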

- Jonathan Morton
Toke Høiland-Jørgensen
2014-08-30 13:03:19 UTC
Jonathan Morton <***@gmail.com> writes:

> Looking at the code, HTB is considerably more complex than TBF in
> Linux, and not all of the added complexity is due to being classful
> (though a lot of it is). It seems that TBF has dire warnings all over
> it about having limited packet-rate capacity which depends on the
> value of HZ, while HTB has some sort of solution to that problem.

Last I checked, those warnings were out-dated. Everything is in
nanosecond resolution now, including TBF. I've been successfully using
TBF in my experiments at bandwidths up to 100Mbps (on Intel Core2 x86
boxes, that is).

-Toke
Jonathan Morton
2014-08-30 17:33:45 UTC
On 30 Aug, 2014, at 4:03 pm, Toke Høiland-Jørgensen wrote:

> Jonathan Morton <***@gmail.com> writes:
>
>> Looking at the code, HTB is considerably more complex than TBF in
>> Linux, and not all of the added complexity is due to being classful
>> (though a lot of it is). It seems that TBF has dire warnings all over
>> it about having limited packet-rate capacity which depends on the
>> value of HZ, while HTB has some sort of solution to that problem.
>
> Last I checked, those warnings were out-dated. Everything is in
> nanosecond resolution now, including TBF. I've been successfully using
> TBF in my experiments at bandwidths up to 100Mbps (on Intel Core2 x86
> boxes, that is).

Closer inspection of the kernel code does trace to the High Resolution Timers, which is good. I wish they'd update the comments to go with that sort of thing.

I've managed to run some tests now, and my old PowerBook G4 can certainly handle either HTB or TBF in the region of 200Mbps, at least for simple tests over a LAN. The ancient Sun GEM chipset (integrated into the PowerBook's northbridge, actually) doesn't seem willing to push more than about 470Mbps outbound, even without filtering - but that might be normal for a decidedly PCI/AGP-era machine. I'll need to investigate more closely to see whether there's a CPU load difference between HTB and TBF in practice.

I have two other machines which are able to talk to each other at ~980Mbps. They're both AMD based, and one of them is a "nettop" style MiniITX system, based around the E-450 APU. The choice of NIC, and more specifically the way it is attached to the system, seems to matter most - these both use an RTL8111-family PCIe chipset.

- Jonathan Morton
Jonathan Morton
2014-08-30 20:47:57 UTC
On 30 Aug, 2014, at 8:33 pm, Jonathan Morton wrote:

> I'll need to investigate more closely to see whether there's a CPU load difference between HTB and TBF in practice.

Replying to myself, but...

The surprising result is that TBF seems to consume about TWICE the CPU time, at the same bandwidth, as HTB does. Even FQ in flow-rate-limit mode is roughly on par with HTB on that score. There is clearly something very wrong with how TBF is doing it.

HTB (running with only a default class) and FQ are both able to send and regulate 200Mbps using about an eighth of a 1.5GHz PowerPC, all told. It doesn't even seem to matter which qdisc I slap on top of HTB. TBF needs almost a quarter of the same CPU to do the same thing.

- Jonathan Morton
Dave Taht
2014-08-30 22:30:28 UTC
Could I get you to also try HFSC?
Jonathan Morton
2014-08-31 10:18:46 UTC
On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:

> Could I get you to also try HFSC?

Once I got a kernel running that included it, and figured out how to make it do what I wanted...

...it seems to be indistinguishable from HTB and FQ in terms of CPU load.
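
For reference, the minimal HFSC setup amounts to roughly this (interface, rate and leaf qdisc illustrative):

tc qdisc add dev eth0 root handle 1: hfsc default 1
tc class add dev eth0 parent 1: classid 1:1 hfsc sc rate 200mbit ul rate 200mbit
tc qdisc add dev eth0 parent 1:1 fq_codel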

Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves. Something about TBF causes more overhead - it goes through periods of lower CPU use similar to the other shapers, but then spends periods at considerably higher CPU load, all without changing the overall throughput.

The flip side of this is that TBF might be producing a smoother stream of packets. The receiving computer (which is fast enough to notice such things) reports a substantially larger number of recv() calls are required to take in the data from TBF than from anything else - averaging about 4.4KB rather than 9KB or so. But at these data rates, it probably matters little.

FWIW, apparently Apple's variant of the GEM chipset doesn't support jumbo frames. This does, however, mean that I'm definitely working with an MTU of 1500, similar to what would be sent over the Internet.

These tests were all run using nttcp. I wanted to finally try out RRUL, but the wrappers fail to install via pip on my Gentoo boxes. I'll need to investigate further before I can make pretty graphs like everyone else.

- Jonathan Morton
Toke Høiland-Jørgensen
2014-08-31 10:21:29 UTC
Jonathan Morton <***@gmail.com> writes:

> These tests were all run using nttpc. I wanted to finally try out
> RRUL, but the wrappers fail to install via pip on my Gentoo boxes.
> I'll need to investigate further before I can make pretty graphs like
> everyone else.

The pip package of netperf-wrapper is, unfortunately, quite outdated. I
keep meaning to do a new release and upload it; will try to get around
to it real soon now! :)

-Toke
Dave Taht
2014-09-01 17:01:45 UTC
On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton <***@gmail.com> wrote:
>
> On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:
>
>> Could I get you to also try HFSC?
>
> Once I got a kernel running that included it, and figured out how to make it do what I wanted...
>
> ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.

If you are feeling really inspired, try cbq. :) One thing I sort of like about cbq is that it (I think, unlike htb presently) operates off an estimated size for the next packet (which isn't dynamic, sadly), where the others buffer up an extra packet until it can be delivered.

In my quest for absolutely minimal latency I'd love to be rid of that
last extra non-in-the-fq_codel-qdisc packet... either with a "peek"
operation or with a running estimate. I think this would (along with
killing the maxpacket check in codel) allow for a faster system with
less tuning (no tweaks below 2.5mbit in particular) across the entire
operational range of ethernet.

There would also need to be some support for what I call "GRO
slicing", where a large receive is split back into packets if a drop
decision could be made.

It would be cool to be able to program the ethernet hardware itself to
return completion interrupts at a given transmit rate (so you could
program the hardware to be any bandwidth not just 10/100/1000). Some
hardware so far as I know supports this with a "pacing" feature.

This doesn't help on inbound rate limiting, unfortunately, just egress.

> Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.

You will see it bound by the softirq thread, but what, exactly, inside that is eating the time is kind of unknown. (I presently lack time to build up profilable kernels on these low-end arches.)

> Something about TBF causes more overhead - it goes through periods of lower CPU use similar to the other shapers, but then spends periods at considerably higher CPU load, all without changing the overall throughput.

> The flip side of this is that TBF might be producing a smoother stream of packets. The receiving computer (which is fast enough to notice such things) reports a substantially larger number of recv() calls are required to take in the data from TBF than from anything else - averaging about 4.4KB rather than 9KB or so. But at these data rates, it probably matters little.

Well, htb has various tuning options (see quantum and burst) that alter its behavior along the lines of what you're seeing from tbf.

>
> FWIW, apparently Apple's variant of the GEM chipset doesn't support jumbo frames. This does, however, mean that I'm definitely working with an MTU of 1500, similar to what would be sent over the Internet.
>
> These tests were all run using nttpc. I wanted to finally try out RRUL, but the wrappers fail to install via pip on my Gentoo boxes. I'll need to investigate further before I can make pretty graphs like everyone else.
>
> - Jonathan Morton
>



--
Dave Täht

Jonathan Morton
2014-09-01 18:06:56 UTC
On 1 Sep, 2014, at 8:01 pm, Dave Taht wrote:

> On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton <***@gmail.com> wrote:
>>
>> On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:
>>
>>> Could I get you to also try HFSC?
>>
>> Once I got a kernel running that included it, and figured out how to make it do what I wanted...
>>
>> ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.
>
> If you are feeling really inspired, try cbq. :) One thing I sort of like about cbq is that it (I think)
> (unlike htb presently) operates off an estimated size for the next packet (which isn't dynamic, sadly),
> where the others buffer up an extra packet until they can be delivered.

It's also hilariously opaque to configure, which is probably why nobody uses it - the RED problem again - and the top link when I Googled for best practice on it gushes enthusiastically about Linux 2.2! The idea of manually specifying an "average packet size" in particular feels intuitively wrong to me. Still, I might be able to try it later on.

Most class-based shapers are probably more complex to set up for simple needs than they need to be. I have to issue three separate 'tc' invocations for a minimal configuration of each of them, repeating several items of data between them. They scale up reasonably well to complex situations, but such uses are relatively rare.
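
For HTB, for instance, the minimal incantation is roughly the following - note the device name and the 1:/1:1 handles repeated across all three lines:

tc qdisc add dev eth0 root handle 1: htb default 1
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
tc qdisc add dev eth0 parent 1:1 fq_codel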

> In my quest for absolutely minimal latency I'd love to be rid of that
> last extra non-in-the-fq_codel-qdisc packet... either with a "peek"
> operation or with a running estimate.

I suspect that something like fq_codel which included its own shaper (with the knobs set sensibly by default) would gain more traction via ease of use - and might even answer your wish.

> It would be cool to be able to program the ethernet hardware itself to
> return completion interrupts at a given transmit rate (so you could
> program the hardware to be any bandwidth not just 10/100/1000). Some
> hardware so far as I know supports this with a "pacing" feature.

Is there a summary of hardware features like this anywhere? It'd be nice to see what us GEM and RTL proles are missing out on. :-)

>> Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.
>
> You will see it bound by the softirq thread, but, what, exactly,
> inside that, is kind of unknown. (I presently lack time to build up
> profilable kernels on these low end arches. )

When I eventually got RRUL running (on one of the AMD boxes, so the PowerBook only has to run the server end of netperf), the bandwidth maxed out at about 300Mbps each way, and the softirq was bouncing around 60% CPU. I'm pretty sure most of that is shoving stuff across the PCI bus (even though it's internal to the northbridge), or at least waiting for it to go there. I'm happy to assume that the rest was mostly kernel-userspace interface overhead to the netserver instances.

But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without. The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.

It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a "proper" desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?
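
A couple of quick checks on the router itself would answer that (standard Linux paths; /proc/config.gz only exists if it was enabled in the build):

zcat /proc/config.gz | grep -E 'HIGH_RES_TIMERS|NO_HZ'
grep -E 'resolution|event_handler' /proc/timer_list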

- Jonathan Morton
Dave Taht
2014-09-01 18:32:18 UTC
On Mon, Sep 1, 2014 at 11:06 AM, Jonathan Morton <***@gmail.com> wrote:
>
> On 1 Sep, 2014, at 8:01 pm, Dave Taht wrote:
>
>> On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton <***@gmail.com> wrote:
>>>
>>> On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:
>>>
>>>> Could I get you to also try HFSC?
>>>
>>> Once I got a kernel running that included it, and figured out how to make it do what I wanted...
>>>
>>> ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.
>>
>> If you are feeling really inspired, try cbq. :) One thing I sort of like about cbq is that it (I think)
>> (unlike htb presently) operates off an estimated size for the next packet (which isn't dynamic, sadly),
>> where the others buffer up an extra packet until they can be delivered.
>
> It's also hilariously opaque to configure, which is probably why nobody uses it - the RED problem again - and the top link when I Googled for best practice on it gushes enthusiastically about Linux 2.2! The idea of manually specifying an "average packet size" in particular feels intuitively wrong to me. Still, I might be able to try it later on.

I felt a ewma of egress packet sizes would be a better estimator, yes.

> Most class-based shapers are probably more complex to set up for simple needs than they need to be. I have to issue three separate 'tc' invocations for a minimal configuration of each of them, repeating several items of data between them.
> They scale up reasonably well to complex situations, but such uses are relatively rare.

>> In my quest for absolutely minimal latency I'd love to be rid of that
>> last extra non-in-the-fq_codel-qdisc packet... either with a "peek"
>> operation or with a running estimate.
>
> I suspect that something like fq_codel which included its own shaper (with the knobs set sensibly by default) would gain more traction via ease of use - and might even answer your wish.

I agree that a simpler to use qdisc would be good. I'd like something
that preserves multiple (3-4) service classes (as pfifo_fast and
sch_fq do) using drr, deals with diffserv, and could be invoked with a
command line like:

tc qdisc add dev eth0 cake bandwidth 50mbit diffservmap std

I had started at that (basically pouring cerowrt's "simple.qos" code into C with a simple lookup table for diffserv) many moons ago, but the contents of the yurtlab, and that code, were stolen - and I was (and remain) completely stuck on how to do soft rate limiting saner, particularly in asymmetric scenarios.

("cake" stood for "Common Applications Kept Enhanced". fq_codel is not a drop-in replacement for pfifo_fast due to its classless nature. sch_fq comes closer, but it's more server oriented. QFQ with 4 weighted bands + fq_codel can be made to do the levels-of-service stuff fairly straightforwardly at line rate, but the tc "filter" code tends to get rather long to handle all the diffserv classes...)
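
The skeleton itself is short enough (weights and handles illustrative); it's the per-DSCP filters after it that balloon:

tc qdisc add dev eth0 root handle 1: qfq
for i in 1 2 3 4; do
  tc class add dev eth0 parent 1: classid 1:$i qfq weight $((i * 10))
  tc qdisc add dev eth0 parent 1:$i fq_codel
done
# ...followed by one 'tc filter ... match ip tos ...' rule per diffserv class you care about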

So... we keep polishing the sqm system, and I keep tracking progress
in how diffserv classification will be done in the future (in ietf
groups like rmcat and dart), and figuring out how to deal better with
aggregating macs in general is what keeps me awake nights, more than
finishing cake...

We'll get there, eventually.

>> It would be cool to be able to program the ethernet hardware itself to
>> return completion interrupts at a given transmit rate (so you could
>> program the hardware to be any bandwidth not just 10/100/1000). Some
>> hardware so far as I know supports this with a "pacing" feature.
>
> Is there a summary of hardware features like this anywhere? It'd be nice to see what us GEM and RTL proles are missing out on. :-)

I'd like one. There are certain 3rd party firmwares like octeon's
where it seems possible to add more features to the firmware
co-processor, in particular.

>
>>> Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.
>>
>> You will see it bound by the softirq thread, but, what, exactly,
>> inside that, is kind of unknown. (I presently lack time to build up
>> profilable kernels on these low end arches. )
>
> When I eventually got RRUL running (on one of the AMD boxes, so the PowerBook only has to run the server end of netperf), the bandwidth maxed out at about 300Mbps each way, and the softirq was bouncing around 60% CPU. I'm pretty sure most of that is shoving stuff across the PCI bus (even though it's internal to the northbridge), or at least waiting for it to go there. I'm happy to assume that the rest was mostly kernel-userspace interface overhead to the netserver instances.

perf and the older oprofile are our friends here.

> But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without. The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.
>
> It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a "proper" desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?

Both good questions worth further exploration.

> - Jonathan Morton
>



--
Dave Täht

Aaron Wood
2014-09-01 20:25:08 UTC
> > But this doesn't really answer the question of why the WNDR has so much
> lower a ceiling with shaping than without. The G4 is powerful enough that
> the overhead of shaping simply disappears next to the overhead of shoving
> data around. Even when I turn up the shaping knob to a value quite close
> to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the
> shapers stick to the requested limit like glue, and even the worst offender
> is within 10%. I estimate that it's using only about 500 clocks per packet
> *unless* it saturates the PCI bus.
> >
> > It's possible, however, that we're not really looking at a CPU
> limitation, but a timer problem. The PowerBook is a "proper" desktop
> computer with hardware to match (modulo its age). If all the shapers now
> depend on the high-resolution timer, how high-resolution is the WNDR's
> timer?
>
> Both good questions worth further exploration.


Doing some napkin math and some spec reading, I think that the memory bus is a likely factor. The G4 had a fairly impressive memory bus for the day (64-bit?). The WNDR3800's RAM appears to be used in an x16 configuration (based on the numbers on the memory parts). It may have *just* enough bandwidth to push concurrent 3x3 802.11n through the software bridge interface, which short-circuits a lot of processing (IIRC).

The typical way I've seen a home router being benchmarked for the "marketing numbers" is to flow TCP data between a wifi client and a wired client. A single socket is used, for a uni-directional stream of data. So long as they can hit peak rates (peak MCS), it will get marked as good for "up to 900Mbps!!" or whatever they want to say.

The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB): the various buffers for fq_codel and htb may stay in L2 on the G4, but there simply isn't room in the AR7161 for that, which puts further pressure on the bus.

-Aaron
Jonathan Morton
2014-09-01 21:43:28 UTC
On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:

>>> But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without. The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.
>>>
>>> It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a "proper" desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?

>> Both good questions worth further exploration.

> Doing some napkin math and some spec reading, I think that the memory bus is a likely factory. The G4 had a fairly impressive memory bus for the day (64-bit?). The WNDR3800 appears to be used in an x16 configuration (based on the numbers on the memory parts). It may have *just* enough bw to push concurrent 3x3 802.11n through the software bridge interface, which short-circuits a lot of processing (IIRC).
>
> The typical way I've seen a home router being benchmarked for the "marketing numbers" is to flow tcp data to/from a wifi client to a wired client. Single socket is used, for a uni-directional stream of data. So long as they can hit peak rates (peak MCS), it will get marked as good for "up to 900Mbps!!" or whatever they want to say.
>
> The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB) the various buffers for fq_codel and htb may stay in L2 on the G4, but there simply isn't room in the AR7161 for that, which puts further pressure on the bus.

I don't think that's it.

First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't have the external L3 cache interface, so it only has the 256KB or 512KB internal L2 cache (I forget which). The desktop version (7457A) used external cache. The G4 was considered to be *crippled* by its FSB by the end of its run, since it never adopted high-performance signalling techniques, nor moved the memory controller on-die; it was quoted that the G5 (970) could move data using *single-byte* operations faster than the *peak* throughput of the G4's FSB. The only reason the G5 never made it into a PowerBook was because it wasn't battery-friendly in the slightest.

But that makes little difference to your argument - compared to a cheap CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even if it is already a decade old.

More compelling is that even at 16-bit width, the WNDR's RAM should have more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x 32-bit, and I can push a steady 30MB/sec in both directions simultaneously, which corresponds in total to about half the PCI bus's theoretical capacity. (The GEM reports 66MHz capability, but it shares the bus with an IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit RAM should be able to match PCI if it runs at 66MHz, which is the lower limit of JEDEC standards for SDRAM.

The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies at least 200MHz unless the integrator was colossally stingy. Further, a little digging suggests that the memory bus should be 32-bit wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed. For an embedded SoC, that's really not too bad - it should be able to sustain 1GB/sec, in one direction at a time.

So that takes care of the argument for simply moving the payload around. In any case, the WNDR demonstrably *can* cope with the available bandwidth if the shaping is turned off.

For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.

And while the MIPS 24K core is old, it's also been die-shrunk over the intervening years, so it runs a lot faster than it originally did. I very much doubt that it's as refined as my G4, but it could probably hold its own relative to a comparable ARM SoC such as the Raspberry Pi. (Unfortunately, the latter doesn't have the I/O capacity to do high-speed networking - USB only.) Atheros publicity materials indicate that they increased the I-cache to 64KB for performance reasons, but saw no need to increase the D-cache at the same time.

Which brings me back to the timers, and other items of black magic.

Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.

- Jonathan Morton
Aaron Wood
2014-09-01 22:14:42 UTC
Luckily, I don't mind being wrong (or even _way_ off the mark).


I don't think that's it.
>
> First a nitpick: the PowerBook version of the late-model G4 (7447A)
> doesn't have the external L3 cache interface, so it only has the 256KB or
> 512KB internal L2 cache (I forget which). The desktop version (7457A) used
> external cache. The G4 was considered to be *crippled* by its FSB by the
> end of its run, since it never adopted high-performance signalling
> techniques, nor moved the memory controller on-die; it was quoted that the
> G5 (970) could move data using *single-byte* operations faster than the
> *peak* throughput of the G4's FSB. The only reason the G5 never made it
> into a PowerBook was because it wasn't battery-friendly in the slightest.
>

And the specs on the G4 that I'd dug up were desktop specs.


> But that makes little difference to your argument - compared to a cheap
> CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware,
> even if it is already a decade old.
>
> More compelling is that even at 16-bit width, the WNDR's RAM should have
> more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x
> 32-bit, and I can push a steady 30MB/sec in both directions simultaneously,
> which corresponds in total to about half the PCI bus's theoretical
> capacity. (The GEM reports 66MHz capability, but it shares the bus with an
> IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit
> RAM should be able to match PCI if it runs at 66MHz, which is the lower
> limit of JEDEC standards for SDRAM.
>
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which
> implies at least 200MHz unless the integrator was colossally stingy.
> Further, a little digging suggests that the memory bus should be 32-bit
> wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz,
> half the CPU core speed. For an embedded SoC, that's really not too bad -
> it should be able to sustain 1GB/sec, in one direction at a time.
>

The kernel boot messages report 170MHz DDR operation, for 340MHz data-rates.

But, I don't think it's 32-bit, I think it's running two banks of 64MB
chips in x16 mode. That's based on my experiences with other, similar
chips. The AR7161 datasheet here:
https://wikidevi.com/files/Atheros/specsheets/AR7161.pdf notes it's DDR1,
but not the bus width.

But even if it had an 8-bit bus, it sounds like it would have the ability
to move packets pretty well, so that's not the case (2Gbps vs. 8Gbps)



> So that takes care of the argument for simply moving the payload around.
> In any case, the WNDR demonstrably *can* cope with the available bandwidth
> if the shaping is turned off.
>
> For the purposes of shaping, the CPU shouldn't need to touch the majority
> of the payload - only the headers, which are relatively small. The bulk of
> the payload should DMA from one NIC to RAM, then DMA back out of RAM to the
> other NIC. It has to do that anyway to route them, and without shaping
> there'd be more of them to handle. The difference might be in the data
> structures used by the shaper itself, but I think those are also reasonably
> compact. It doesn't even have to touch userspace, since it's not acting as
> the endpoint as my PowerBook was during my tests.
>

In an ideal case, yes. But is that how this gets managed? (I have no
idea, I'm certainly not a kernel developer).

If the packet data is getting moved about from buffer to buffer (for
instance to do the htb calculations?) could that substantially change the
processing load?



> And while the MIPS 24K core is old, it's also been die-shrunk over the
> intervening years, so it runs a lot faster than it originally did. I very
> much doubt that it's as refined as my G4, but it could probably hold its
> own relative to a comparable ARM SoC such as the Raspberry Pi.
> (Unfortunately, the latter doesn't have the I/O capacity to do high-speed
> networking - USB only.) Atheros publicity materials indicate that they
> increased the I-cache to 64KB for performance reasons, but saw no need to
> increase the D-cache at the same time.
>

But, they also have a core that's designed to do little to no processing of the data, just DMA from one side to the other, while validating the firewall rules... So it may have sufficient d-cache for that, without having the capacity to do anything else.


> Which brings me back to the timers, and other items of black magic.
>

Which would point to under-utilizing the processor core, while still having
high load? (I'm not seeing that, I'm curious if that would be the case).


>
> Incidentally, transfer speed benchmarks involving wireless will certainly
> be limited by the wireless link. I assume that's not a factor here.
>

That's the usual suspicion. But these are RF-chamber, short-range lab
setups where the radios are running at full speed in perfect environments...

======

What this makes me realize is that I should go instrument the cpu stats with each of the various operating modes (a rough capture loop is sketched after the list):

* no shaping, anywhere
* egress shaping
* egress and ingress shaping at various limited levels:
  * 10Mbps
  * 20Mbps
  * 50Mbps
  * 100Mbps
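
Something along these lines, left running on the router during each test, should give enough to post-process later (filename and the test invocation are purely illustrative):

while sleep 1; do date +%s; cat /proc/stat /proc/softirqs; done > cpu-100mbit.log &
CAPTURE=$!
# ... run the rrul / tcp_download / tcp_upload test at this shaping level ...
kill $CAPTURE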

-Aaron
David Lang
2014-09-02 09:09:59 UTC
On Mon, 1 Sep 2014, Aaron Wood wrote:

> What this makes me realize is that I should go instrument the cpu stats
> with each of the various operating modes:
>
> * no shaping, anywhere
> * egress shaping
> * egress and ingress shaping at various limited levels:
> * 10Mbps
> * 20Mbps
> * 50Mbps
> * 100Mbps

Please do, my understanding is that it's the ingress shaping that's the biggest
problem.

but getting good numbers would help a lot.

David Lang
Jonathan Morton
2014-09-02 09:27:06 UTC
On 2 Sep, 2014, at 1:14 am, Aaron Wood wrote:

>> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.
>
> In an ideal case, yes. But is that how this gets managed? (I have no idea, I'm certainly not a kernel developer).

It would be monumentally stupid to integrate two GigE MACs onto an SoC, and then to call it a "network processor", without adequate DMA support. I don't think Atheros are that stupid.

Here's a more detailed datasheet:
http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SW6/DSASW00118777.pdf

"Another memory factor is the ability to support multiple I/O operations in parallel via the WNPU's various ports. The on-chip SRAM in AR7100 WNPUs has 5 ports that enable simultaneous access to and from five sources: the two gigabit Ethernet ports, the PCI port, the USB 2.0 port and the MIPS processor."

It's a reasonable question, however, whether the driver uses that support properly. Mainline Linux kernel code seems to support the SoC but not the Ethernet; if it were just a minor variant of some other Atheros hardware, I'd have expected to see it integrated into one of the existing drivers. Or maybe it is, and my greps just aren't showing it.

At minimum, however, there are MMIO ranges reported for each MAC during OpenWRT's boot sequence. That's where the ring buffers are. The most the CPU has to do is read each packet from RAM and write it into those buffers, or vice versa for receive - I think that's what my PowerBook has to do. Ideally, a bog-standard DMA engine would take over that simple duty. Either way, that's something that has to happen whether it's shaped or not, so it's unlikely to be our problem.

The same goes for the wireless MACs, incidentally. These are standard ath9k mini-PCI cards, and the drivers *are* in mainline. There shouldn't be any surprises with them.

> If the packet data is getting moved about from buffer to buffer (for instance to do the htb calculations?) could that substantially change the processing load?

The qdiscs only deal with packet and socket headers, not the full packet data. Even then, they largely pass pointers around, inserting the headers into linked lists rather than copying them into arrays. I believe a lot of attention has been directed at cache-friendliness in this area, and the MIPS caches are of conventional type.

>> Which brings me back to the timers, and other items of black magic.
>
> Which would point to under-utilizing the processor core, while still having high load? (I'm not seeing that, I'm curious if that would be the case).

It probably wouldn't manifest as high system load. Rather, poor timer resolution or latency would show up as excessive delays between packets, during which the CPU is idle. The packet egress times may turn out to be quantised - that would be a smoking gun, if detectable.
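
One rough way to look for that, capturing on the receiving machine (interface, filter and packet count assumed):

tcpdump -i eth0 -c 20000 -w egress.pcap tcp
tshark -r egress.pcap -T fields -e frame.time_delta > gaps.txt
# histogram gaps.txt; a few dominant, evenly spaced values would be the smoking gun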

>> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.
>
> That's the usual suspicion. But these are RF-chamber, short-range lab setups where the radios are running at full speed in perfect environments...

Sure. But even turbocharged 'n' gear tops out at 450Mbps signalling, and much less than that is available even theoretically for TCP/IP throughput. My point is that you're probably not running *your* tests over wireless.

> What this makes me realize is that I should go instrument the cpu stats with each of the various operating modes:
>
> * no shaping, anywhere
> * egress shaping
> * egress and ingress shaping at various limited levels:
> * 10Mbps
> * 20Mbps
> * 50Mbps
> * 100Mbps

Smaller increments at the high end of the range may prove to be useful. I would expect the CPU usage to climb nonlinearly (busy-waiting) if there's a bottleneck in a peripheral device, such as the PCI bus. The way the kernel classifies that usage may also be revealing.

> Heck, what about running HTB simply from a 1ms timer instead of from a data driven timer?

That might be what's already happening. We have to figure that out before we can work out a solution.

- Jonathan Morton
Joel Wirāmu Pauling
2014-09-02 10:05:21 UTC
On a somewhat related note - I've just received my NZ/AU region Almond+, which is a dual-core ARM router based on the Cortina CS75xx SoC:

https://www.cortina-systems.com/product/digital-home-processors/16-products/996-cs7542-cs7522

More details :

Joel Wirāmu Pauling
2014-09-02 10:05:42 UTC
https://wikidevi.com/wiki/Securifi_Almond%2B

On 2 September 2014 22:05, Joel Wirāmu Pauling <***@aenertia.net> wrote:
> On a somewhat related note - I've just received my NZ/AU Region
> Almond+ which is an arm9 Dual core router based on the Cortina CSC SoC
> :
>
> https://www.cortina-systems.com/product/digital-home-processors/16-products/996-cs7542-cs7522
>
> More details :
>
> On 2 September 2014 21:27, Jonathan Morton <***@gmail.com> wrote:
>>
>> On 2 Sep, 2014, at 1:14 am, Aaron Wood wrote:
>>
>>>> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.
>>>
>>> In an ideal case, yes. But is that how this gets managed? (I have no idea, I'm certainly not a kernel developer).
>>
>> It would be monumentally stupid to integrate two GigE MACs onto an SoC, and then to call it a "network processor", without adequate DMA support. I don't think Atheros are that stupid.
>>
>> Here's a more detailed datasheet:
>> http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SW6/DSASW00118777.pdf
>>
>> "Another memory factor is the ability to support multiple I/O operations in parallel via the WNPU's various ports. The on-chip SRAM in AR7100 WNPUs has 5 ports that enable simultaneous access to and from five sources: the two gigabit Ethernet ports, the PCI port, the USB 2.0 port and the MIPS processor."
>>
>> It's a reasonable question, however, whether the driver uses that support properly. Mainline Linux kernel code seems to support the SoC but not the Ethernet; if it were just a minor variant of some other Atheros hardware, I'd have expected to see it integrated into one of the existing drivers. Or maybe it is, and my greps just aren't showing it.
>>
>> At minimum, however, there are MMIO ranges reported for each MAC during OpenWRT's boot sequence. That's where the ring buffers are. The most the CPU has to do is read each packet from RAM and write it into those buffers, or vice versa for receive - I think that's what my PowerBook has to do. Ideally, a bog-standard DMA engine would take over that simple duty. Either way, that's something that has to happen whether it's shaped or not, so it's unlikely to be our problem.
>>
>> The same goes for the wireless MACs, incidentally. These are standard ath9k mini-PCI cards, and the drivers *are* in mainline. There shouldn't be any surprises with them.
>>
>>> If the packet data is getting moved about from buffer to buffer (for instance to do the htb calculations?) could that substantially change the processing load?
>>
>> The qdiscs only deal with packet and socket headers, not the full packet data. Even then, they largely pass pointers around, inserting the headers into linked lists rather than copying them into arrays. I believe a lot of attention has been directed at cache-friendliness in this area, and the MIPS caches are of conventional type.
>>
>>>> Which brings me back to the timers, and other items of black magic.
>>>
>>> Which would point to under-utilizing the processor core, while still having high load? (I'm not seeing that, I'm curious if that would be the case).
>>
>> It probably wouldn't manifest as high system load. Rather, poor timer resolution or latency would show up as excessive delays between packets, during which the CPU is idle. The packet egress times may turn out to be quantised - that would be a smoking gun, if detectable.
>>
>>>> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.
>>>
>>> That's the usual suspicion. But these are RF-chamber, short-range lab setups where the radios are running at full speed in perfect environments...
>>
>> Sure. But even turbocharged 'n' gear tops out at 450Mbps signalling, and much less than that is available even theoretically for TCP/IP throughput. My point is that you're probably not running *your* tests over wireless.
>>
>>> What this makes me realize is that I should go instrument the cpu stats with each of the various operating modes:
>>>
>>> * no shaping, anywhere
>>> * egress shaping
>>> * egress and ingress shaping at various limited levels:
>>> * 10Mbps
>>> * 20Mbps
>>> * 50Mbps
>>> * 100Mbps
>>
>> Smaller increments at the high end of the range may prove to be useful. I would expect the CPU usage to climb nonlinearly (busy-waiting) if there's a bottleneck in a peripheral device, such as the PCI bus. The way the kernel classifies that usage may also be revealing.
>>
>>> Heck, what about running HTB simply from a 1ms timer instead of from a data driven timer?
>>
>> That might be what's already happening. We have to figure that out before we can work out a solution.
>>
>> - Jonathan Morton
>>
Aaron Wood
2014-09-03 06:15:25 UTC
Permalink
>
> What this makes me realize is that I should go instrument the cpu stats
> with each of the various operating modes:
>
> * no shaping, anywhere
> * egress shaping
> * egress and ingress shaping at various limited levels:
> * 10Mbps
> * 20Mbps
> * 50Mbps
> * 100Mbps
>

So I set this up tonight, and have a big pile of data to go through. But
the headline finding is that the WNDR3800 can't do more than 200Mbps
ingress, with shaping turned off. The GbE switch fabric and my setup were
just fine (pushed some very nice numbers through those interfaces when on
the switch), but going through the routing engine (NATing), 200Mbps is
about all it could do.

I took tcp captures of it shaping past its limit (configured for 150/12),
with the rrul, tcp_download, and tcp_upload tests.

And I took a series of tests walking down from 100/12, 90/12, 80/12, ...
down to 40/12, while capturing /proc/stat and /proc/softirqs once a second
(roughly), so that can be processed to pull out where the load might be
(initial peeking hints that it's all time spent in softirq).
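
A minimal sketch of that kind of capture loop, for reference (not necessarily the exact script used; the log path is illustrative):

    while true; do
        date +%s >> /tmp/cpu-stats.log
        cat /proc/stat /proc/softirqs >> /tmp/cpu-stats.log
        sleep 1
    done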

If anyone wants the raw data, let me know, I'll upload it somewhere. The
rrul pcap is large, the rest of it can be e-mailed easily.

-Aaron
David Lang
2014-09-03 06:36:41 UTC
Permalink
On Tue, 2 Sep 2014, Aaron Wood wrote:

>>
>> What this makes me realize is that I should go instrument the cpu stats
>> with each of the various operating modes:
>>
>> * no shaping, anywhere
>> * egress shaping
>> * egress and ingress shaping at various limited levels:
>> * 10Mbps
>> * 20Mbps
>> * 50Mbps
>> * 100Mbps
>>
>
> So I set this up tonight, and have a big pile of data to go through. But
> the headline finding is that the WNDR3800 can't do more than 200Mbps
> ingress, with shaping turned off. The GbE switch fabric and my setup were
> just fine (pushed some very nice numbers through those interfaces when on
> the switch), but going through the routing engine (NATing), and 200Mbps is
> about all it could do.

it's actually probably the connection tracking, not the routing engine or
iptables. I've seen this a lot on high-traffic systems. I saw something earlier
this week about how the connection tracking has a global lock, so it's
effectively single threaded, but there is work being done to fix this. Now, lock
contention isn't an issue on a single-core box like the 3800, but the rest of
the work is.

If you can find a place to set it up without NAT, (or with 1:1 NAT that doesn't
need connection tracking), you will see much better performance from it.

For the Scale conference, I disable connection tracking and run the APs as
bridges to a dedicated VLAN per SSID, doing the firewalling and NAT upstream
from the APs.
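
A minimal sketch of one way to take connection tracking out of the path (illustrative only; this also disables NAT and stateful filtering for the matched traffic, so it only makes sense where those are handled upstream):

    iptables -t raw -A PREROUTING -j NOTRACK
    iptables -t raw -A OUTPUT -j NOTRACK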

David Lang

> I took tcp captures of it shaping past it's limit (configured for 150/12),
> with then rrul, tcp_download, tcp_upload tests.
>
> And I took a series of tests walking down from 100/12, 90/12, 80/12, ...
> down to 40/12, while capturing /proc/stats and /proc/softirqs once a second
> (roughly), so that can be processed to pull out where the load might be
> (initial peeking hints that it's all time spent in softirq).
>
> If anyone wants the raw data, let me know, I'll upload it somewhere. The
> rrul pcap is large, the rest of it can be e-mailed easily.
>
> -Aaron
>
Jonathan Morton
2014-09-03 11:08:02 UTC
Permalink
On 3 Sep, 2014, at 9:15 am, Aaron Wood wrote:

> What this makes me realize is that I should go instrument the cpu stats with each of the various operating modes:
>
> * no shaping, anywhere
> * egress shaping
> * egress and ingress shaping at various limited levels:
> * 10Mbps
> * 20Mbps
> * 50Mbps
> * 100Mbps
>
> So I set this up tonight, and have a big pile of data to go through. But the headline finding is that the WNDR3800 can't do more than 200Mbps ingress, with shaping turned off. The GbE switch fabric and my setup were just fine (pushed some very nice numbers through those interfaces when on the switch), but going through the routing engine (NATing), and 200Mbps is about all it could do.
>
> I took tcp captures of it shaping past it's limit (configured for 150/12), with then rrul, tcp_download, tcp_upload tests.
>
> And I took a series of tests walking down from 100/12, 90/12, 80/12, ... down to 40/12, while capturing /proc/stats and /proc/softirqs once a second (roughly), so that can be processed to pull out where the load might be (initial peeking hints that it's all time spent in softirq).
>
> If anyone wants the raw data, let me know, I'll upload it somewhere. The rrul pcap is large, the rest of it can be e-mailed easily.

Given that the CPU load is confirmed as high, the pcap probably isn't as useful. The rest would be interesting to look at.

Are you able to test with smaller packet sizes? That might help to isolate packet-throughput (ie. connection tracking) versus byte-throughput problems.
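
A sketch of what such a small-packet test might look like with netperf (the target address is a placeholder; exact options would depend on the setup):

    # flood of small UDP packets, stressing per-packet costs rather than byte throughput
    netperf -H 192.168.1.1 -t UDP_STREAM -l 30 -- -m 64
    # small request/response transactions, which also exercise connection tracking
    netperf -H 192.168.1.1 -t UDP_RR -l 30 -- -r 64,64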

- Jonathan Morton
Aaron Wood
2014-09-03 15:12:30 UTC
Permalink
On Wed, Sep 3, 2014 at 4:08 AM, Jonathan Morton <***@gmail.com>
wrote:

> Given that the CPU load is confirmed as high, the pcap probably isn't as
> useful. The rest would be interesting to look at.
>
> Are you able to test with smaller packet sizes? That might help to
> isolate packet-throughput (ie. connection tracking) versus byte-throughput
> problems.
>
> - Jonathan Morton
>

Doing another test setup will take a few days (maybe not until the
weekend). But I can get the data uploaded, and do some preliminary
crunching on it.

-Aaron
Sebastian Moeller
2014-09-03 19:22:58 UTC
Permalink
Hi Aaron,


On Sep 3, 2014, at 17:12 , Aaron Wood <***@gmail.com> wrote:

> On Wed, Sep 3, 2014 at 4:08 AM, Jonathan Morton <***@gmail.com> wrote:
> Given that the CPU load is confirmed as high, the pcap probably isn't as useful. The rest would be interesting to look at.
>
> Are you able to test with smaller packet sizes? That might help to isolate packet-throughput (ie. connection tracking) versus byte-throughput problems.
>
> - Jonathan Morton
>
> Doing another test setup will take a few days (maybe not until the weekend). But I can get the data uploaded, and do some preliminary crunching on it.

	So the current SQM system allows shaping on multiple interfaces, so you could set up the shaper on se00 and test between sw10 and se00 (should work if you reliably get a fast enough wifi connection; something like combined shaped bandwidth <= 70% of the wifi rate should work). That would avoid the whole firewall and connection tracking logic.
	My home wifi environment is quite variable/noisy and not well-suited for this test: with rrul_be I got stuck at around 70Mbps combined bandwidth, with different distributions of the up- and down-leg for no shaping, shaping to 50Mbps/10Mbps, and shaping to 100Mbps/50Mbps. SIRQ got pretty much pegged at 96-99% during all netperf-wrapper runs, so I assume this to be the bottleneck (the radio was in the > 200Mbps range during the test with occasional drops to 150Mbps). So my conclusion would be: it really is the shaping that is limited on my wndr3700v2 with cerowrt 3.10.50-1; again, that is if I were confident about the measurement, which I am not (but EOUTOFTIME). That, or my rf environment might only allow for roughly 70-80Mbps combined throughput. For what it is worth: tests were performed between a macbook running macosx 10.9.4 and an hp proliant n54l running 64bit openSuse 13.1, kernel 3.11.10-17 (AMD turion with tg3 gbit ethernet adapter (BQL enabled), running fq_codel on eth0), with shaping on the se00 interface.
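
For concreteness, a rough hand-rolled sketch of shaping on se00 as suggested above (SQM's simple.qos does considerably more than this; the rate is an example only):

    tc qdisc add dev se00 root handle 1: htb default 11
    tc class add dev se00 parent 1: classid 1:11 htb rate 100mbit
    tc qdisc add dev se00 parent 1:11 handle 110: fq_codel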

Best Regards
Sebastian


Dave Taht
2014-09-03 19:30:42 UTC
Permalink
On Wed, Sep 3, 2014 at 12:22 PM, Sebastian Moeller <***@gmx.de> wrote:
> Hi Aaron,
>
>
> On Sep 3, 2014, at 17:12 , Aaron Wood <***@gmail.com> wrote:
>
>> On Wed, Sep 3, 2014 at 4:08 AM, Jonathan Morton <***@gmail.com> wrote:
>> Given that the CPU load is confirmed as high, the pcap probably isn't as useful. The rest would be interesting to look at.
>>
>> Are you able to test with smaller packet sizes? That might help to isolate packet-throughput (ie. connection tracking) versus byte-throughput problems.
>>
>> - Jonathan Morton
>>
>> Doing another test setup will take a few days (maybe not until the weekend). But I can get the data uploaded, and do some preliminary crunching on it.
>
> So the current SQM system allows to shape on multiple interfaces, so you could set up the shaper on se00 and test between sw10 and se00 (should work if you reliably get fast enough wifi connection, something like combined shaped bandwidth <= 70% of wifi rate should work). That would avoid the whole firewall and connection tracking logic.
> My home wifi environment is quite variable/noisy and not well-suited for this test: with rrul_be I got stuck at around 70Mbps combined bandwidth, with different distributions of the up and down-leg for no-shaping, shaping to 50Mbps10Mbps, and shaping to 100Mbps50Mbps. SIRQ got pretty much pegged at 96-99% during all netperf-wrapper runs, so I assume this to be the bottleneck (the radio was in the > 200mbps range during the test with occasional drops to 150mbps). So my conclusion would: be it really is the shaping that is limited on my wndr3700v2 with cerowrt 3.10.50-1, again if I would be confident about the measurement which I am not (but EOUTOFTIME). That or my rf environment might only allow for roughly 70-80Mbps combined throughput. For what it is worth: test where performed between macbook running macosx 10.9.4 and hp proliant n54l running 64bit openSuse 13.1, kernel 3.11.10-17 (AMD turion with tg3 gbit ethernet adapter (BQL enabled), running fq_codel on eth0), with sha
> ping on the se00 interface.


A note on wifi throughput. CeroWrt routes, rather than bridges,
between interfaces. So I would expect that for simple benchmarks, openwrt
(which bridges) might show much better wifi <-> ethernet behavior.

We route, rather than bridge, wifi because 1) it made it easier to
debug, and 2) we theorize that multicast on busier networks messes
up wifi far more than not-bridging slows it down. We have not accumulated
a lot of proof of this, but this
was kind of enlightening:
http://tools.ietf.org/html/draft-desmouceaux-ipv6-mcast-wifi-power-usage-00

I note that my regular benchmarking environment has mostly been 2 or
more routers with nat and firewalling disabled.

Given the trend towards looking at iptables and nat overhead on this
thread, an ipv6 benchmark on this box might be revealing.

--
Dave Täht

https://www.bufferbloat.net/projects/make-wifi-fast
Bill Ver Steeg (versteb)
2014-09-03 23:17:09 UTC
Permalink
Speaking of IPv6 performance testing- In a recent FTTH field deployment, the network operator deployed an IPv6-only network and tunneled all subscriber IPv4 traffic over an IPv6 tunnel to the upstream network edge. It then unpacked the IPv4 traffic from the IPv6 tunnel and sent it on its merry way.

Long story short, the 4o6 tunneling code in the residential gateway was not nearly as performant as the IPv6 forwarding code. I actually got better IPv4 throughput running an IPv6 VPN on my end device, then sending my IPv4 traffic through that tunnel - thus avoiding the tunnel code on the gateway. If I recall correctly, the tunnel code capped out at about 20 Mbps and the IPv6 code went up to the 50Mbps SLA rate. I stumbled into this while running some IPTV video tests while running throughput benchmarks on my PC (with apparently pseudo-random results, until we figured out the various tunnels). Took me a while to figure it out. Delay also spiked when the gateway got bogged down......

More capable gateways were deployed in the latter stages of the deployment, and they seemed to keep up with the 50 Mbps SLA rate.


Bill Ver Steeg





Dave Taht
2014-09-04 00:33:00 UTC
Permalink
On Wed, Sep 3, 2014 at 4:17 PM, Bill Ver Steeg (versteb)
<***@cisco.com> wrote:
> Speaking of IPv6 performance testing- In a recent FTTH field deployment, the network operator deployed an IPv6-only network and tunneled all subscriber IPv4 traffic over an IPv6 tunnel to the upstream network edge. It then unpacked the IPv4 traffic from the IPv6 tunnel and sent it on its merry way.

I tried to deploy ipv4 over ipv6 encapsulation when I was in Nicaragua
6 years back (the alternative was triple nat;
IPv4 addresses were really scarce on the ground), and got beaten by the
encapsulation overhead, performance problems, multiple
bugs and bufferbloat at the time. I figure that most of that has improved -
in particular I imagine that their encapsulated traffic
still has a 1500 MTU for the ipv4 traffic?

The original gear I had for the experiment could do a 2k MTU, which
was very helpful in keeping the encapsulated ipv4 at the
1500 bytes much of the internet expects, but a later version
couldn't quite get past 1540 bytes without problems.
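
(The arithmetic is simply the 40-byte IPv6 header on top of a full 1500-byte IPv4 packet, i.e. 1540 bytes on the underlying link. On gear that allows it, something like the following keeps the inner IPv4 MTU at 1500; the interface name is illustrative:)

    ip link set dev eth1 mtu 1540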

> Long story short, the 4o6 tunneling code in the residential gateway was not nearly as performant as the IPv6 forwarding code. I actually got better IPv4 throughput running an IPv6 VPN on my end device, then sending my IPv4 traffic through that tunnel - thus avoiding the tunnel code on the gateway. If I recall correctly, the tunnel code capped out at about 20 Mbps and the IPv6 code went up to the 50Mbps SLA rate. I stumbled into this while running some IPTV video tests while running throughput benchmarks on my PC (with apparently pseudo-random results, until we figured out the various tunnels). Took me a while to figure it out. Delay also spiked when the gateway got bogged down......

I can believe it. I have seen many "bumps in the wire" do bad things
when run past their limits. Notable were several
PPPoE and PPPoA boxes. Older cablemodems and last-generation access
points are all going to have similar problems
when hooked up at these higher speeds. In the future, stuff that does
this sort of tunneling or encapsulation, or that
converts from one media type to another (say ethernet->cable,
ethernet->gpon, etc), may also run into it when the provider ups their
access speeds from one band to another, as both comcast and verizon
have.

This is of course, both a problem and an opportunity. A problem
because it will generate more support calls, and an opportunity to
sell better gear into the marketplace as ISP speeds are upgraded.

Some enterprising manufacturer could make a point of pitching their
product(s) as actually capable of modern transfer speeds on modern
ISPs, doing benchmarks, etc.

Given the mass delusional product naming in the home ap marketplace,
where nearly every product is named and pitched
over the base capability of the standards used, rather than the sordid
reality, I don't think anything short of a consumer reports, or legal
action, will result in sanity here.

Gigabit "routers", indeed, when only the switch is cable of that!
Nothing I've tried below 100 bucks can forward, well, at a gigabit,
with a number of real-world firewall rules. Even using x86 gear is
kind of problematic thus far.

http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html

> More capable gateways were deployed in the latter stages of the deployment, and they seemed to keep up with the 50 Mbps SLA rate.

What was the measured latency under load?
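
(One way to get a latency-under-load number of the kind used elsewhere in this thread is a netperf-wrapper rrul run; the server name below is a placeholder:)

    netperf-wrapper -H netperf.example.org -l 60 rrul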




--
Dave Täht

https://www.bufferbloat.net/projects/make-wifi-fast
Jonathan Morton
2014-09-04 03:36:20 UTC
Permalink
On 4 Sep, 2014, at 3:33 am, Dave Taht wrote:

> Gigabit "routers", indeed, when only the switch is cable of that!

I have long thought that advertising regulators need to have a *lot* more teeth. Right now, even when a decision comes down that an advert is blatantly misleading, all they can really do is say "please don't do it again". Here's a reasonably typical example:

http://www.asa.org.uk/Rulings/Adjudications/2014/8/British-Telecommunications-plc/SHP_ADJ_265259.aspx

Many adverts and marketing techniques that I believe are misleading (at best) are never even considered by the regulators, probably because few people outside the technical community even understand that a problem exists, and those that do tend to seriously bungle the solution (not least because they get lobbied by the special interests).

It's bad enough that there's an ISO standard inexplicably defining a megabyte as 1,024,000 bytes, for storage-media purposes. Yes, that's not a typo - it's 2^10 * 10^3. That official standard supposedly justifies all those "1.44MB" floppy disks (with a raw unformatted capacity of 1440KB), and the "terabyte" hard disks that are actually a full 10% smaller than 2^40 bytes. SSDs often use the "slack" between the definitions to implement the necessary error-correction and wear-levelling overhead without changing the marketable number (so 256GB of flash chips installed, 256GB capacity reported to the consumer, but there's a 7% difference between the two).

Honestly though, they can get away with calling them "gigabit routers" because they have "gigabit" external interfaces. They can also point to all the PCI GigE NICs that can only do 750Mbps, because that's where the PCI bus saturates, but nobody prevents *them* from being labelled 1000base-T and therefore "gigabit ethernet".

It's worse in the wireless world because the headline rate is the maximum signalling rate under ideal conditions. The actual throughput under typical home/office/conference conditions bears zero resemblance to that figure for any number of reasons, but even under ideal conditions the actual throughput is a surprisingly small fraction of the signalling rate.

Consumer reports type stuff could be interesting, though. I haven't seen any of the big tech-review sites take on networking seriously, except for basic throughput checks on bare Ethernet (which mostly reveal whether a GigE chipset is attached via PCI or PCIe). It's a complicated subject; Anandtech conceded that accurate tests of the KillerNIC's marketing claims were particularly difficult to arrange, but they did a lot of subjective testing in an attempt to compensate.

One could, in principle, give out a bronze award for equipment which fails to meet (the spirit of) its marketing claims, but is still useful in the real world. A silver award for equipment which *does* meet its marketing claims and generally works as it should. A gold award would be reserved for equipment which both merits a silver award and genuinely stands out in the market. And at the opposite end of the scale, a "rusty pipe" award for truly execrable efforts, similar to LowEndMac's "Road Apple" award. All protected by copyright and trademark laws, which are rather easier to enforce in a legally binding manner than advertising regulations.

Incidentally, for those amused (or frustrated) by embedded hardware design decisions, the "Road Apple" awards list is well worth a read - and potentially eye-opening. Watch out for the PowerPC Mac with dual 16-bit I/O buses.

- Jonathan Morton
Bill Ver Steeg (versteb)
2014-09-04 14:05:04 UTC
Permalink
Dave-

Inline responses with bvs ---


Bill Ver Steeg





-----Original Message-----
From: Dave Taht [mailto:***@gmail.com]
Sent: Wednesday, September 03, 2014 8:33 PM
To: Bill Ver Steeg (versteb)
Cc: Sebastian Moeller; cerowrt-***@lists.bufferbloat.net; bloat
Subject: Re: [Bloat] [Cerowrt-devel] Comcast upped service levels -> WNDR3800 can't cope...

On Wed, Sep 3, 2014 at 4:17 PM, Bill Ver Steeg (versteb) <***@cisco.com> wrote:
> Speaking of IPv6 performance testing- In a recent FTTH field deployment, the network operator deployed an IPv6-only network and tunneled all subscriber IPv4 traffic over an IPv6 tunnel to the upstream network edge. It then unpacked the IPv4 traffic from the IPv6 tunnel and sent it on its merry way.

I tried to deploy ipv4 over ipv6 encapsulation when I was in Nicaragua
6 years back (the alternative was triple nat,
IPv4 addresses were really scarce on the ground), and got beat by the encapsulation overhead, performance, multiple bugs and bufferbloat, then. I figure that most of that has improved - in particular I imagine that their encapsulated traffic still has a 1500 MTU for the ipv4 traffic?


Bvs --- I do not recall the MTU, but it was almost certainly 1500+. The MTU was not an issue in any testing we did.

The original gear I'd had for the experiment could do a 2k MTU, which was very helpful in making the ipv4 encapsulation
1500 bytes as much of the internet expects, but a later version couldn't quite get past 1540 bytes without problems.

> Long story short, the 4o6 tunneling code in the residential gateway was not nearly as performant as the IPv6 forwarding code. I actually got better IPv4 throughput running an IPv6 VPN on my end device, then sending my IPv4 traffic through that tunnel - thus avoiding the tunnel code on the gateway. If I recall correctly, the tunnel code capped out at about 20 Mbps and the IPv6 code went up to the 50Mbps SLA rate. I stumbled into this while running some IPTV video tests while running throughput benchmarks on my PC (with apparently pseudo-random results, until we figured out the various tunnels). Took me a while to figure it out. Delay also spiked when the gateway got bogged down......

I can believe it. I have seen many "bumps in the wire" do bad things when run past their limits. Notable were several PPPoe and PPPoA boxes. Older cablemodems, and last generation access points are going to all have similar problems when hooked up at these higher speeds. In the future, stuff that does this sort of tunneling or encapsulation, or while coverting from one media type to another, (say ethernet->cable,
ethernet->gpon, etc) may also run into it when the provider ups their
access speeds from one band to another, as both comcast and verizon have.

This is of course, both a problem and an opportunity. A problem because it will generate more support calls, and an opportunity to sell better gear into the marketplace as ISP speeds are upgraded.

Bvs --- speaking as an employee of a vendor of such equipment, problems in the field are never an "opportunity". If anything, they are an opportunity for the network operator to buy gear from somebody else 8-(.

Some enterprising manufacturer could make a point of pitching their
product(s) as actually capable of modern transfer speeds on modern ISPs, doing benchmarks, etc.

Given the mass delusional product naming in the home ap marketplace, where nearly every product is named and pitched over the base capability of the standards used, rather than the sordid reality, I don't think anything short of a consumer reports, or legal action, will result in sanity here.

Gigabit "routers", indeed, when only the switch is cable of that!
Nothing I've tried below 100 bucks can forward, well, at a gigabit, with a number of real-world firewall rules. Even using x86 gear is kind of problematic thus far.

http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html

> More capable gateways were deployed in the latter stages of the deployment, and they seemed to keep up with the 50 Mbps SLA rate.

What was the measured latency under load?

Bvs --- We saw what one would expect from a tail-drop based system. The delay was basically constant at some 10s of ms until we hit the upstream/downstream SLA rate, then it started to creep up. If I recall correctly, it actually did not creep up much - maybe by 40ms or so..... If I had to guess (and I am guessing because we did not provide this piece of equipment in this deployment - we just supplied the Set Top Box and the core network) I would say that the buffer sizes were set to a fixed size that was based on a range of downstream SLA rates. Because we were running at quite high speeds, the delay associated with a full buffer was fairly modest. Had we been running at a lower SLA rate, I suspect that our delay would have been much higher.
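
That matches the usual fixed-buffer arithmetic. The 256 KB figure below is made up, but it shows why the same buffer hurts far more at a lower SLA rate:

    256 KB buffer: 256 * 1024 * 8 bits / 50 Mbit/s ~= 42 ms of standing delay when full
                   256 * 1024 * 8 bits / 10 Mbit/s ~= 210 ms when full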



>
> Bill Ver Steeg
>
>
>
>
>
> -----Original Message-----
> From: bloat-***@lists.bufferbloat.net
> [mailto:bloat-***@lists.bufferbloat.net] On Behalf Of Dave Taht
> Sent: Wednesday, September 03, 2014 3:31 PM
> To: Sebastian Moeller
> Cc: cerowrt-***@lists.bufferbloat.net; bloat
> Subject: Re: [Bloat] [Cerowrt-devel] Comcast upped service levels -> WNDR3800 can't cope...
>
> On Wed, Sep 3, 2014 at 12:22 PM, Sebastian Moeller <***@gmx.de> wrote:
>> Hi Aaron,
>>
>>
>> On Sep 3, 2014, at 17:12 , Aaron Wood <***@gmail.com> wrote:
>>
>>> On Wed, Sep 3, 2014 at 4:08 AM, Jonathan Morton <***@gmail.com> wrote:
>>> Given that the CPU load is confirmed as high, the pcap probably isn't as useful. The rest would be interesting to look at.
>>>
>>> Are you able to test with smaller packet sizes? That might help to isolate packet-throughput (ie. connection tracking) versus byte-throughput problems.
>>>
>>> - Jonathan Morton
>>>
>>> Doing another test setup will take a few days (maybe not until the weekend). But I can get the data uploaded, and do some preliminary crunching on it.
>>
>> So the current SQM system allows to shape on multiple interfaces, so you could set up the shaper on se00 and test between sw10 and se00 (should work if you reliably get fast enough wifi connection, something like combined shaped bandwidth <= 70% of wifi rate should work). That would avoid the whole firewall and connection tracking logic.
>> My home wifi environment is quite variable/noisy and not
>> well-suited for this test: with rrul_be I got stuck at around 70Mbps combined bandwidth, with different distributions of the up and down-leg for no-shaping, shaping to 50Mbps10Mbps, and shaping to 100Mbps50Mbps. SIRQ got pretty much pegged at 96-99% during all netperf-wrapper runs, so I assume this to be the bottleneck (the radio was in the > 200mbps range during the test with occasional drops to 150mbps). So my conclusion would: be it really is the shaping that is limited on my wndr3700v2 with cerowrt 3.10.50-1, again if I would be confident about the measurement which I am not (but EOUTOFTIME). That or my rf environment might only allow for roughly 70-80Mbps combined throughput. For what it is worth: test where performed between macbook running macosx 10.9.4 and hp proliant n54l running 64bit openSuse 13.1, kernel 3.11.10-17 (AMD turion with tg3 gbit ethernet adapter (BQL enabled), running fq_codel on eth0), with sha ping on the se00 interface.
>
>
> A note on wifi throughput. CeroWrt routes, rather than bridges, between interfaces. So I would expect for simple benchmarks, openwrt (which bridges) might show much better wifi<-> ethernet behavior.
>
> We route, rather than bridge wifi, because of 1) it made it easier to debug it, and 2) the theory that multicast on busier networks messes up wifi far more than not-bridging slows it down. Have not accumulated a lot of proof of this, but this was kind of enlightening:
> http://tools.ietf.org/html/draft-desmouceaux-ipv6-mcast-wifi-power-usa
> ge-00
>
> I note that my regular benchmarking environment has mostly been 2 or more routers with nat and firewalling disabled.
>
> Given the trend towards looking at iptables and nat overhead on this thread, an ipv6 benchmark on this box might be revealing.
>
>> Best Regards
>> Sebastian
>>
>>
>>>
>>> -Aaron
>>> _______________________________________________
>>> Cerowrt-devel mailing list
>>> Cerowrt-***@lists.bufferbloat.net
>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>>
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-***@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>
>
>
> --
> Dave Täht
>
> https://www.bufferbloat.net/projects/make-wifi-fast
> _______________________________________________
> Bloat mailing list
> ***@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat



--
Dave Täht

https://www.bufferbloa
Michael Richardson
2014-09-04 15:10:06 UTC
Permalink
Bill Ver Steeg (versteb) <***@cisco.com> wrote:
> Long story short, the 4o6 tunneling code in the residential gateway was
> not nearly as performant as the IPv6 forwarding code. I actually got

So, you are saying that the v6 network was faster than the v4 network?
I don't see what the problem is :-)

--
] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] ***@sandelman.ca http://www.sandelman.ca/ | ruby on rails [
Sebastian Moeller
2014-09-04 07:04:18 UTC
Permalink
Hi Dave,

On Sep 3, 2014, at 21:30 , Dave Taht <***@gmail.com> wrote:

> On Wed, Sep 3, 2014 at 12:22 PM, Sebastian Moeller <***@gmx.de> wrote:
>> Hi Aaron,
>>
>>
>> On Sep 3, 2014, at 17:12 , Aaron Wood <***@gmail.com> wrote:
>>
>>> On Wed, Sep 3, 2014 at 4:08 AM, Jonathan Morton <***@gmail.com> wrote:
>>> Given that the CPU load is confirmed as high, the pcap probably isn't as useful. The rest would be interesting to look at.
>>>
>>> Are you able to test with smaller packet sizes? That might help to isolate packet-throughput (ie. connection tracking) versus byte-throughput problems.
>>>
>>> - Jonathan Morton
>>>
>>> Doing another test setup will take a few days (maybe not until the weekend). But I can get the data uploaded, and do some preliminary crunching on it.
>>
>> So the current SQM system allows to shape on multiple interfaces, so you could set up the shaper on se00 and test between sw10 and se00 (should work if you reliably get fast enough wifi connection, something like combined shaped bandwidth <= 70% of wifi rate should work). That would avoid the whole firewall and connection tracking logic.
>> My home wifi environment is quite variable/noisy and not well-suited for this test: with rrul_be I got stuck at around 70Mbps combined bandwidth, with different distributions of the up and down-leg for no-shaping, shaping to 50Mbps10Mbps, and shaping to 100Mbps50Mbps. SIRQ got pretty much pegged at 96-99% during all netperf-wrapper runs, so I assume this to be the bottleneck (the radio was in the > 200mbps range during the test with occasional drops to 150mbps). So my conclusion would: be it really is the shaping that is limited on my wndr3700v2 with cerowrt 3.10.50-1, again if I would be confident about the measurement which I am not (but EOUTOFTIME). That or my rf environment might only allow for roughly 70-80Mbps combined throughput. For what it is worth: test where performed between macbook running macosx 10.9.4 and hp proliant n54l running 64bit openSuse 13.1, kernel 3.11.10-17 (AMD turion with tg3 gbit ethernet adapter (BQL enabled), running fq_codel on eth0), with sha
>> ping on the se00 interface.
>
>
> A note on wifi throughput. CeroWrt routes, rather than bridges,
> between interfaces. So I would expect for simple benchmarks, openwrt
> (which bridges) might show much better wifi<-> ethernet behavior.

	Interesting, I just tried to make a quick and dirty test with the goal of getting rid of NAT and firewalling from the test path, so I am very happy that cerowrt routes by default. That way shaping on se00 is quite a good test of the internet routing performance.

>
> We route, rather than bridge wifi, because of 1) it made it easier to
> debug it, and 2) the theory that multicast on busier networks messes
> up wifi far more than not-bridging slows it down.

I am already sold on this idea! I think there should be a good reason to call it a “home router” and not a home bridge ;) (though some of the stock firmwares make me feel someone “had a bridge to sell” ;) )


> Have not accumulated
> a lot of proof of this, but this
> was kind of enlightening:
> http://tools.ietf.org/html/draft-desmouceaux-ipv6-mcast-wifi-power-usage-00
>
> I note that my regular benchmarking environment has mostly been 2 or
> more routers with nat and firewalling disabled.

	I would love to recreate that, but my home setup is not really wired to test this (upstream of cerowrt sits only the 100Mbit switch of the ISP’s DSL modem/router combination, so there is no way to plug in a faster receiver there)

>
> Given the trend towards looking at iptables and nat overhead on this
> thread, an ipv6 benchmark on this box might be revealing.

I would love to test this as well, but I have not gotten IPv6 to work reliably at my home.

Best Regards
Sebastian

IPv6 NOTE: Everyone with a real dual-stack IPv6 and IPv4 connection to the internet (so not tunneled over IPv4) and an ATM-based DSL connection (might be the empty set...) needs to use the htb-private method for link layer adjustments, as the tc-stab method currently does not take the different header sizes for IPv4 and IPv6 into account (pure IPv6 connections, or ones where IPv4 is tunneled in IPv6 packets, should be fine; they just need to increase the per-packet overhead by 20 bytes over the IPv4 recommendation…).
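
For reference, a sketch of the tc-stab form being discussed (interface name and overhead value are illustrative; per the note, an IPv4-in-IPv6 tunnel would want roughly 20 bytes more overhead than the plain IPv4 recommendation):

    tc qdisc add dev ge00 root stab linklayer atm overhead 40 htb default 10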

Jonathan Morton
2014-09-04 11:15:30 UTC
Permalink
On 4 Sep, 2014, at 10:04 am, Sebastian Moeller wrote:

> IPv6 NOTE: Everyone with a real dual-stack IPv6 and IPv4 connection to the internet (so not tunneled over IPv4) and an ATM-based DSL connection (might be the empty set...)

I believe at least A&A (Andrews & Arnold) in the UK have that setup. They are unusual among ADSL ISPs in supporting IPv6 properly - among other things.

- Jonathan Morton
Sebastian Moeller
2014-09-04 11:23:23 UTC
Permalink
Hi Jonathan,


On Sep 4, 2014, at 13:15 , Jonathan Morton <***@gmail.com> wrote:

>
> On 4 Sep, 2014, at 10:04 am, Sebastian Moeller wrote:
>
>> IPv6 NOTE: Everyone with a real dual-stack IPv6 and IPv4 connection to the internet (so not tunneled over IPv4) and an ATM-based DSL connection (might be the empty set...)
>
> I believe at least A&A (Andrews & Arnold) in the UK have that setup. They are unusual among ADSL ISPs in supporting IPv6 properly - among other things.

	Oh golly, there I go; I guess I have to resort to hoping that the set of A&A and SQM users is empty then ;) (Really the stab code in the kernel needs to be fixed, but I doubt that this is going to count as worth backporting to stable kernels…)

Best Regards
Sebastian


>
> - Jonathan Morton
>
Sebastian Moeller
2014-09-02 08:55:19 UTC
Permalink
Hi Jonathan, hi List,


On Sep 1, 2014, at 23:43 , Jonathan Morton <***@gmail.com> wrote:

>
> On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:
>
>>>> But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without. The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.
>>>>
>>>> It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a "proper" desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?
>
>>> Both good questions worth further exploration.
>
>> Doing some napkin math and some spec reading, I think that the memory bus is a likely factor. The G4 had a fairly impressive memory bus for the day (64-bit?). The WNDR3800 appears to be used in an x16 configuration (based on the numbers on the memory parts). It may have *just* enough bw to push concurrent 3x3 802.11n through the software bridge interface, which short-circuits a lot of processing (IIRC).
>>
>> The typical way I've seen a home router being benchmarked for the "marketing numbers" is to flow tcp data to/from a wifi client to a wired client. Single socket is used, for a uni-directional stream of data. So long as they can hit peak rates (peak MCS), it will get marked as good for "up to 900Mbps!!" or whatever they want to say.
>>
>> The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB) the various buffers for fq_codel and htb may stay in L2 on the G4, but there simply isn't room in the AR7161 for that, which puts further pressure on the bus.
>
> I don't think that's it.
>
> First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't have the external L3 cache interface, so it only has the 256KB or 512KB internal L2 cache (I forget which). The desktop version (7457A) used external cache. The G4 was considered to be *crippled* by its FSB by the end of its run, since it never adopted high-performance signalling techniques, nor moved the memory controller on-die; it was quoted that the G5 (970) could move data using *single-byte* operations faster than the *peak* throughput of the G4's FSB. The only reason the G5 never made it into a PowerBook was because it wasn't battery-friendly in the slightest.
>
> But that makes little difference to your argument - compared to a cheap CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even if it is already a decade old.
>
> More compelling is that even at 16-bit width, the WNDR's RAM should have more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x 32-bit, and I can push a steady 30MB/sec in both directions simultaneously, which corresponds in total to about half the PCI bus's theoretical capacity. (The GEM reports 66MHz capability, but it shares the bus with an IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit RAM should be able to match PCI if it runs at 66MHz, which is the lower limit of JEDEC standards for SDRAM.
>
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies at least 200MHz unless the integrator was colossally stingy. Further, a little digging suggests that the memory bus should be 32-bit wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed. For an embedded SoC, that's really not too bad - it should be able to sustain 1GB/sec, in one direction at a time.
>
> So that takes care of the argument for simply moving the payload around. In any case, the WNDR demonstrably *can* cope with the available bandwidth if the shaping is turned off.

	That makes me wonder, couldn’t some reasonable batching help here? We know we can shape roughly 50Mbps combined with no batching (which corresponds to batching at packet size, I guess), so get out the envelope, turn it around and go:
50*1000*1000 / (1500*8) = 4166.66666667, so that would be around 4000 packets per second (well, with small packets it might be more). So what about batching up enough packets to take at least 250µs to transfer, pushing the whole bunch to the tx queue, and then sleeping for 250µs? (Since I have not looked at the code, something like this might already be happening; I guess the idea is to figure out the highest sustainable shaping frequency, and just make sure we do not attempt to make shaping decisions more often. Sure, batching will introduce some latency, but if my silly numbers are roughly right, I would be quite happy to accept 1/4 ms added latency if the wndr3[7|8]00 would still last a bit in the future ;) ) Heck, what about running HTB simply from a 1ms timer instead of from a data-driven timer?
	Now, since HTB was quite costly for the machines of its time when it appeared on the scene, I wonder whether there are any computation-demand mitigation mechanisms somewhere in the code base…
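
For what it is worth, HTB already has a knob pointing in that direction: the burst/cburst sizes bound how many bytes a class may send per scheduling decision. A hedged sketch, with interface, class id and values purely illustrative (at 50Mbit/s, 6 kbyte of burst is roughly 1ms worth of traffic):

    tc class change dev ge00 parent 1: classid 1:1 htb rate 50mbit burst 6k cburst 6k

Whether a larger burst actually reduces CPU load on this hardware is exactly the kind of thing the /proc/stat captures discussed elsewhere in this thread could answer.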

Best Regards
Sebastian


>
> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.
>
> And while the MIPS 24K core is old, it's also been die-shrunk over the intervening years, so it runs a lot faster than it originally did. I very much doubt that it's as refined as my G4, but it could probably hold its own relative to a comparable ARM SoC such as the Raspberry Pi. (Unfortunately, the latter doesn't have the I/O capacity to do high-speed networking - USB only.) Atheros publicity materials indicate that they increased the I-cache to 64KB for performance reasons, but saw no need to increase the D-cache at the same time.
>
> Which brings me back to the timers, and other items of black magic.
>
> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.
>
> - Jonathan Morton
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-***@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
Jonathan Morton
2014-09-02 13:40:58 UTC
Permalink
On 1 Sep, 2014, at 9:32 pm, Dave Taht wrote:

>>> It would be cool to be able to program the ethernet hardware itself to
>>> return completion interrupts at a given transmit rate (so you could
>>> program the hardware to be any bandwidth not just 10/100/1000). Some
>>> hardware so far as I know supports this with a "pacing" feature.
>>
>> Is there a summary of hardware features like this anywhere? It'd be nice to see what us GEM and RTL proles are missing out on. :-)
>
> I'd like one.

Is there at least a list of drivers (both wired and wireless) which are BQL enabled? If GEM is not in that list, it might explain why the PCI bus gets jammed solid on my PowerBook.

> There are certain 3rd party firmwares like octeon's
> where it seems possible to add more features to the firmware
> co-processor, in particular.

Octeon is basically a powerful, multi-core MIPS64 SoC that happens to have Ethernet hardware attached, and is available in NIC form. These "NICs" look like miniature motherboards in PCIe-card format, complete with mini-SIMM slots. Utter overkill for normal applications; they're meant to do encryption on the fly, and were originally introduced as Ethernet-less coprocessor cards for that purpose. At least they represent a good example of what high-end MIPS is like these days.

The original Bigfoot KillerNIC was along those lines, too, but slightly less overdone. It still managed to cost $250+, and Newegg still lists a price in that general range despite being permanently out of stock. As well as running Linux on the card itself, the drivers apparently replaced large parts of the Windows network stack in the quest for efficiency and low latency. Results varied; Anandtech suggested that the biggest improvements probably came on cheaper PCs, whose owners wouldn't be able to justify such a high-priced NIC - and that was in 2007.

I can't tell what the newer products under the Killer brand (taken over by Qualcomm/Atheros) really are, but they are sufficiently reduced in cost, size and complexity to be integrated into "gamer" PC motherboards and laptops, and they respond to being driven like standard (newish) Atheros hardware. In particular, it's unclear whether they do most of their special sauce in software (so Windows-specific) or firmware.

Comments I hear sometimes seem to imply that *some* Atheros hardware runs internal firmware. Whether that is strictly wireless hardware, or whether it extends into Ethernet, I can't yet tell. Since it's widely deployed, it would theoretically be a good platform for experimentation - but in practice?

> tc qdisc add dev eth0 cake bandwidth 50mbit diffservmap std

Or even having the "diffservmap std" part be in the defaults. I try not to spend too much mental effort understanding diffserv - it's widely misunderstood, and most end-user applications ignore it. Supporting the basic eight precedences, and maybe some userspace effort to introduce marking, should be enough.
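
A minimal sketch of the kind of userspace marking meant here (the rule is illustrative only):

    # mark outbound ssh as CS6 so a diffserv-aware qdisc can prioritise it
    iptables -t mangle -A POSTROUTING -p tcp --dport 22 -j DSCP --set-dscp-class CS6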

I like the name, though. :-)

- Jonathan Morton
Sujith Manoharan
2014-09-02 13:49:40 UTC
Permalink
Jonathan Morton wrote:
> I can't tell what the newer products under the Killer brand (taken over by
> Qualcomm/Atheros) really are, but they are sufficiently reduced in cost, size
> and complexity to be integrated into "gamer" PC motherboards and laptops, and
> they respond to being driven like standard (newish) Atheros hardware. In
> particular, it's unclear whether they do most of their special sauce in
> software (so Windows-specific) or firmware.

That is correct. The newer Killer cards can be used as normal WLAN NICs in Linux,
they are detected by ath9k and no special code is needed for them.

Sujith
Dave Taht
2014-09-02 15:37:30 UTC
Permalink
On Sep 2, 2014 6:41 AM, "Jonathan Morton" <***@gmail.com> wrote:
>
>
> On 1 Sep, 2014, at 9:32 pm, Dave Taht wrote:
>
> >>> It would be cool to be able to program the ethernet hardware itself to
> >>> return completion interrupts at a given transmit rate (so you could
> >>> program the hardware to be any bandwidth not just 10/100/1000). Some
> >>> hardware so far as I know supports this with a "pacing" feature.
> >>
> >> Is there a summary of hardware features like this anywhere? It'd be
> >> nice to see what us GEM and RTL proles are missing out on. :-)
> >
> > I'd like one.
>
> Is there at least a list of drivers (both wired and wireless) which are
> BQL enabled? If GEM is not in that list, it might explain why the PCI bus
> gets jammed solid on my PowerBook.

A fairly current list (and the means to generate a more current one) is at:

https://www.bufferbloat.net/projects/codel/wiki/Best_practices_for_benchmarking_Codel_and_FQ_Codel
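
Two quick checks, as a sketch (assumes a kernel source tree for the grep, and a BQL-capable kernel for the sysfs path; interface name illustrative):

    # which in-tree drivers call the BQL completion hooks?
    grep -rlE 'netdev(_tx)?_completed_queue' drivers/net/
    # per-queue BQL state appears here whenever the kernel is built with CONFIG_BQL
    ls /sys/class/net/eth0/queues/tx-0/byte_queue_limits/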

>
> > There are certain 3rd party firmwares like octeon's
> > where it seems possible to add more features to the firmware
> > co-processor, in particular.
>
> Octeon is basically a powerful, multi-core MIPS64 SoC that happens to
> have Ethernet hardware attached, and is available in NIC form. These
> "NICs" look like miniature motherboards in PCIe-card format, complete with
> mini-SIMM slots. Utter overkill for normal applications; they're meant to
> do encryption on the fly, and were originally introduced as Ethernet-less
> coprocessor cards for that purpose. At least they represent a good example
> of what high-end MIPS is like these days.
>
> The original Bigfoot KillerNIC was along those lines, too, but slightly
> less overdone. It still managed to cost $250+, and Newegg still lists a
> price in that general range despite being permanently out of stock. As
> well as running Linux on the card itself, the drivers apparently replaced
> large parts of the Windows network stack in the quest for efficiency and
> low latency. Results varied; Anandtech suggested that the biggest
> improvements probably came on cheaper PCs, whose owners wouldn't be able to
> justify such a high-priced NIC - and that was in 2007.
>
> I can't tell what the newer products under the Killer brand (taken over
> by Qualcomm/Atheros) really are, but they are sufficiently reduced in cost,
> size and complexity to be integrated into "gamer" PC motherboards and
> laptops, and they respond to being driven like standard (newish) Atheros
> hardware. In particular, it's unclear whether they do most of their
> special sauce in software (so Windows-specific) or firmware.

The Octeon is also the chip in the EdgeRouter line of products and a few others.

>
> Comments I hear sometimes seem to imply that *some* Atheros hardware runs
> internal firmware. Whether that is strictly wireless hardware, or whether
> it extends into Ethernet, I can't yet tell. Since it's widely deployed, it
> would theoretically be a good platform for experimentation - but in
> practice?

The ath10k has a cpu and firmware. The ath9k does not.

>
> > tc qdisc add dev eth0 cake bandwidth 50mbit diffservmap std
>
> Or even having the "diffservmap std" part be in the defaults. I try not
> to spend too much mental effort understanding diffserv - it's widely
> misunderstood, and most end-user applications ignore it. Supporting the
> basic eight precedences, and maybe some userspace effort to introduce
> marking, should be enough.

The various IETF WGs seem to think AFxx is a useful concept.

>
> I like the name, though. :-)

It is partially a reference to a scene in 2010, the sequel to 2001.

>
> - Jonathan Morton
>
Jonathan Morton
2014-09-02 15:47:30 UTC
Permalink
On 2 Sep, 2014, at 6:37 pm, Dave Taht wrote:

> The ath10k has a cpu and firmware. The ath9k does not.

So what's this then? http://wireless.kernel.org/en/users/Drivers/ar9170.fw

- Jonathan Morton
Sujith Manoharan
2014-09-02 15:57:21 UTC
Permalink
Jonathan Morton wrote:
> So what's this then? http://wireless.kernel.org/en/users/Drivers/ar9170.fw

Those are for USB devices.

ar9170, carl9170 and ath9k_htc are drivers for USB cards which have Atheros
chipsets and all of them require firmware.

ath9k requires no firmware.

Sujith
Jonathan Morton
2014-09-02 17:36:02 UTC
Permalink
On 2 Sep, 2014, at 6:37 pm, Dave Taht wrote:

> > Is there at least a list of drivers (both wired and wireless) which are BQL enabled? If GEM is not in that list, it might explain why the PCI bus gets jammed solid on my PowerBook.
>
> A fairly current list (and the means to generate a more current one) is at:
>
> https://www.bufferbloat.net/projects/codel/wiki/Best_practices_for_benchmarking_Codel_and_FQ_Codel

Ah, so GEM doesn't have BQL.

...now it does. :-D
Dave Taht
2014-09-02 17:41:36 UTC
Permalink
unfortunately you may need to initialize things correctly with
netdev_reset_queue in the appropriate initialization or
recovery-from-error bits.

(this is the part that tends to be tricky)

On Tue, Sep 2, 2014 at 10:36 AM, Jonathan Morton <***@gmail.com> wrote:
>
> On 2 Sep, 2014, at 6:37 pm, Dave Taht wrote:
>
>> > Is there at least a list of drivers (both wired and wireless) which are BQL enabled? If GEM is not in that list, it might explain why the PCI bus gets jammed solid on my PowerBook.
>>
>> A fairly current list (and the means to generate a more current one) is at:
>>
>> https://www.bufferbloat.net/projects/codel/wiki/Best_practices_for_benchmarking_Codel_and_FQ_Codel
>
> Ah, so GEM doesn't have BQL.
>
> ...now it does. :-D
>
>
>
>
> - Jonathan Morton
>
>



--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Jonathan Morton
2014-09-02 18:28:27 UTC
Permalink
On 2 Sep, 2014, at 8:41 pm, Dave Taht wrote:

> unfortunately you may need to initialize things correctly with
> netdev_reset_queue in the appropriate initialization or
> recovery-from-error bits.
>
> (this is the part that tends to be tricky)

I poked around a bit and found that gem_clean_rings() seems to be called from everywhere relevant, including from gem_init_rings() and gem_do_stop(). I was therefore able to add a single call there.

I've taken the other suggestion at face value.

> Do you have a before/after test result?

At gigabit link speeds, there seems to be no measurable difference - the machine just isn't capable of filling the buffer fast enough. I have yet to try it at slower link rates.
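For anyone following along, here is a minimal sketch of the three BQL touch-points a driver needs. The example_* names are placeholders, not the actual sungem functions; the real change of course has to fit into the driver's existing ring handling and locking.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* 1. Transmit path: after a frame is handed to the hardware ring,
 *    tell BQL how many bytes are now in flight. */
static netdev_tx_t example_start_xmit(struct sk_buff *skb,
				      struct net_device *dev)
{
	/* ... map the skb and write the TX descriptor ... */
	netdev_sent_queue(dev, skb->len);
	return NETDEV_TX_OK;
}

/* 2. TX-completion path: after reclaiming finished descriptors, report
 *    how much has actually left the hardware.  BQL uses the difference
 *    to size the in-flight byte limit dynamically. */
static void example_tx_reclaim(struct net_device *dev,
			       unsigned int pkts, unsigned int bytes)
{
	netdev_completed_queue(dev, pkts, bytes);
}

/* 3. Ring (re)initialisation - open, reset, error recovery.  This is
 *    the step Dave warns about: if the counters aren't reset along
 *    with the rings, BQL's accounting goes permanently wrong. */
static void example_init_rings(struct net_device *dev)
{
	/* ... zero the descriptor rings ... */
	netdev_reset_queue(dev);
}
```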
Jonathan Morton
2014-09-03 11:04:05 UTC
Permalink
On 2 Sep, 2014, at 6:37 pm, Dave Taht wrote:

> > > tc qdisc add dev eth0 cake bandwidth 50mbit diffservmap std
> >
> > Or even having the "diffservmap std" part be in the defaults. I try not to spend too much mental effort understanding diffserv - it's widely misunderstood, and most end-user applications ignore it. Supporting the basic eight precedences, and maybe some userspace effort to introduce marking, should be enough.
>
> The various ietf wgs seem to think AFxx is a useful concept.

I'm sure they do. And I'm sure that certain networks make use of it internally. But the Internet does not support such fine distinctions in practice - at least, not at the moment. We have enough difficulty getting SQM of *any* colour deployed where it's needed.

A good default handling of Precedence would already be an improvement over the status quo, and I've worked out a CPU-efficient way of doing so. It takes explicit advantage of the fact that the overall shaping bandwidth is known, but degrades gracefully in case the actual bandwidth temporarily falls below that value. As I suggested previously, it gives weighted priority to higher-precedence packets, but limits their permitted bandwidth to prevent abuse.

As it happens, simply following the Precedence field, and ignoring the low-order bits of the Diffserv codepoint, satisfies the letter of the AF spec. The Class field is encoded as a Precedence value, and the drop-precedence subclasses then have equal drop probability, which the inequality equations permit. The same equations say nothing obvious about how a packet marked *only* with Precedence 1-4 should be treated relative to AF-marked packets in the same Precedence band, which is part of what gives me a headache about the whole idea.

EF is also neatly handled by ignoring the low-order bits, since its encoding has a high Precedence value. So, at the very least, more refined AF/EF handling can be deferred to a "version 2" implementation.
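As a concrete illustration of the "precedence only" idea, the classification step reduces to a couple of shifts. This is only a sketch of the mapping itself - the function name is mine, not cake's - and says nothing about the weighted scheduler wrapped around it.

```c
#include <stdint.h>

/*
 * Map a packet's TOS / traffic-class byte to one of eight precedence
 * bands by keeping only the top three bits of the DSCP and ignoring
 * the rest (so the AF drop-precedence subclasses collapse into one band).
 */
static inline unsigned int dscp_precedence(uint8_t tos)
{
	uint8_t dscp = tos >> 2;	/* DSCP is the upper six bits of the byte */
	return dscp >> 3;		/* keep the class-selector / Precedence bits */
}

/*
 * Resulting bands for some common codepoints (values given as DSCP,
 * not the raw TOS byte):
 *   CS0  = 0x00 -> band 0 (best effort)
 *   AF11 = 0x0a, AF12 = 0x0c, AF13 = 0x0e -> band 1
 *   AF41 = 0x22 -> band 4
 *   EF   = 0x2e -> band 5
 *   CS6  = 0x30 -> band 6, CS7 = 0x38 -> band 7
 */
```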

Reading the HTB code also gives me a headache. I have so far been unable to distinguish any fundamental timing differences in its single-class behaviour relative to TBF. The only clues I have so far are:

1) HTB uses a different timer call to schedule a future wakeup than TBF or FQ do.
2) FQ doesn't use a bucket of tokens, and explicitly avoids producing a "burst" of packets, but HTB and TBF both do and explicitly can (see the token-bucket sketch after this list).
3) TBF is the only one of the three to exhibit unusually high softirq load on egress. But this varies over time, even with constant bandwidth and packet size.
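For reference, the token-bucket mechanism behind point 2, as a minimal user-space sketch rather than the kernel's implementation: tokens accrue at the configured rate up to the bucket size, so an idle period lets the bucket fill and the queued packets that follow go out back-to-back as a burst.

```c
#include <stdint.h>
#include <stdbool.h>

struct tbucket {
	uint64_t bytes_per_sec;	/* refill rate */
	uint64_t bucket_b;	/* bucket size = maximum burst, in bytes */
	uint64_t tokens_b;	/* current fill, in bytes */
	uint64_t last_ns;	/* timestamp of the last refill */
};

static void tb_refill(struct tbucket *tb, uint64_t now_ns)
{
	uint64_t add = (now_ns - tb->last_ns) * tb->bytes_per_sec / 1000000000ULL;

	tb->tokens_b += add;
	if (tb->tokens_b > tb->bucket_b)
		tb->tokens_b = tb->bucket_b;	/* this cap is what bounds the burst */
	tb->last_ns = now_ns;
}

/* May a packet of 'len' bytes be sent right now?  If not, the caller
 * schedules a wakeup for when enough tokens will have accumulated. */
static bool tb_may_send(struct tbucket *tb, uint64_t now_ns, uint32_t len)
{
	tb_refill(tb, now_ns);
	if (tb->tokens_b < len)
		return false;
	tb->tokens_b -= len;
	return true;
}
```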

> > I like the name, though. :-)
>
> It is partially a reference to a scene in the 2010 sequel to 2001.


I need to re-watch that.

- Jonathan Morton
Stephen Hemminger
2014-08-30 21:53:15 UTC
Permalink
On Sat, 30 Aug 2014 14:02:57 +0300
Jonathan Morton <***@gmail.com> wrote:

> It seems that TBF has dire warnings all over it about having limited packet-rate capacity which depends on the value of HZ, while HTB has some sort of solution to that problem.

Packet scheduling has not depended on HZ for three or more years. It uses whatever clock
source is available, plus high-resolution timers. As often happens with legacy code, the
comments no longer match reality.
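To illustrate the point: a modern shaper arms a high-resolution watchdog timer for the exact nanosecond it next wants to transmit, rather than polling on the jiffy tick. The qdisc_watchdog helpers below are the real ones used by TBF/HTB/FQ (their exact signatures have shifted slightly between kernel versions); the qdisc around them is only a placeholder sketch.

```c
#include <net/pkt_sched.h>

struct example_sched_data {
	struct qdisc_watchdog watchdog;	/* wraps an hrtimer */
	u64 next_send_ns;		/* earliest permitted departure time */
};

static struct sk_buff *example_dequeue(struct Qdisc *sch)
{
	struct example_sched_data *q = qdisc_priv(sch);
	u64 now = ktime_to_ns(ktime_get());

	if (now < q->next_send_ns) {
		/* Too early: arm an hrtimer for the precise time instead
		 * of waiting for the next HZ tick. */
		qdisc_watchdog_schedule_ns(&q->watchdog, q->next_send_ns);
		return NULL;
	}

	/* ... dequeue a packet and advance next_send_ns by len / rate ... */
	return NULL;
}
```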
Jonathan Morton
2014-08-30 11:14:34 UTC
Permalink
On 29 Aug, 2014, at 9:06 pm, Dave Taht wrote:

> In the future, finding something that could be easily implemented in hardware would be good.

Does "implemented in firmware for a tiny ARM core" count? I imagine that NICs could be made with those, if they aren't already, and it would probably make the lead-time shorter and engineering risk smaller.

- Jonathan Morton
Aaron Wood
2014-08-30 17:19:17 UTC
Permalink
On Fri, Aug 29, 2014 at 11:06 AM, Dave Taht <***@gmail.com> wrote:

> On Fri, Aug 29, 2014 at 9:57 AM, Aaron Wood <***@gmail.com> wrote:
> > Comcast has upped the download rates in my area, from 50Mbps to 100Mbps.
> > This morning I tried to find the limit of the WNDR3800. And I found it.
> > 50Mbps is still well within capabilities, 100Mbps isn't.
> >
> > And as I've seen Dave say previously, it's right around 80Mbps total
> > (download + upload).
> >
> >
> http://burntchrome.blogspot.com/2014/08/new-comcast-speeds-new-cerowrt-sqm.html
>
> Thank you very much, as always, for doing public benchmarking with a good
> setup!
>

No problem - I find this sort of investigation a lot of fun, even if it is
somewhat maddening at times.


> Yes we hit kind of an unexpected wall on everything shipped with a
> processor
> originally designed in 1989, and the prevalance of hardware offloads to
> bridge
> the gap and lower costs between 100mbit and a gige is a real PITA.
>

Do you think this is a limitation of MIPS as a whole, or just the
particular MIPS cores in use on these platforms?

OTOH, I have noticed that MIPS is losing ground to ARM as bandwidths go up.
The router SoCs that I'm seeing from the usual suspects have been
switching from MIPS to ARM over the last year or two.  The WNDR is in the
top tier for SOHO SoCs, but as a product family it is getting long in the
tooth.


> > I tried disabling downstream shaping to see what the result was, and it
> > wasn't pretty.
>
> Well, I'll argue that only seeing an increase of 20ms or so with the
> upstream
> only, fq_codeled, (vs 120ms not) is not bad and within tolerances of most
> applications, even voip. Secondly the characteristics of normal
> traffic, as opposed
> to the benchmark, make it pretty hard to hit that 100mbit download limit,
> so a mere outbound rate limiter will suffice.
>

Well, yes...  I have considered just turning it off entirely, as the
extra latency isn't awful.  And frankly, the laptops (individually) never
see that sort of bandwidth, but the AppleTV might when downloading video (I
need to go see what the downloads are capped at by Apple).


> The cpu caches are 32k/32k, the memory interface 16 bit.  The rate limiter
> (the thing eating all the cycles, not the fq_codel algorithm!) is
> single threaded and has global locks,
> and is at least partially interrupt bound at 100Mbits/sec.
>

This is interesting, and lines up with Sebastian's idea about perhaps using
ethtool to lock the upstream interface to 100Mbps. Except that moves the
bottleneck to the next upstream device... (modem), with its buffer mgmt,
so maybe that's not a great idea, either. Upstream is certainly where the
biggest issues are.


> > Or should I start looking for something like this:
> >
> > http://www.gateworks.com/product/item/ventana-gw5310-network-processor
> >
> > (although that's an expensive board, given the very low production
> volume,
> > for the same cost I could probably build a small passively-cooled
> > mini/micro-atx setup running x86 and dual NICs).
>
> There is that option as well. I would certainly like to find a low end x86
> box
> that could rate limit + fq_codel at up to 300Mbits/sec. Toke's x86 boxes
> have proven out to do 100Mbit/10Mbit correctly, but I don't remember their
> specs, nor has he tried to push them past that, yet.
>

If I do get my hands on a Ventana board (I may still for work purposes),
I'll certainly set it up and see what it does in this scenario, too.

-Aaron
Jonathan Morton
2014-08-30 18:01:46 UTC
Permalink
On 30 Aug, 2014, at 8:19 pm, Aaron Wood wrote:

> Do you think this is a limitation of MIPS as a whole, or just the particular MIPS cores in use on these platforms?

There were historically a great many MIPS designs. Several of the high-end designs were 64-bit and used in famous workstations. The one we see in CPE today, however, is the MIPS equivalent of the AMD Geode, based on an old version of the MIPS architecture, and further crippled by embedded-style hardware choices. It would have been a good CPU in 1989, considering that it would have competed against the 486 in the PC space, but it wouldn't have been hobbled by a 16-bit memory bus back then.

I'm not sure how much effort is going into improving the embeddable versions of MIPS cores, but certainly ARM seems to be a more active participant in the embedded space. Their current range of embeddable cores scales from the Cortex-M0 (whose chief selling point is that it takes only a fraction of a square millimetre of die space) to some quite decent 64-bit multicore CPUs (which AMD is developing a server platform for), with a number of intermediate points along that continuum catered for.

So if a particular core works but proves to have inadequate performance, a better one can be integrated into the next version of the hardware, without any risk of having to rewrite all the software. That future-proofing is probably important to manufacturers, and isn't very obviously available with MIPS cores.

I wouldn't be surprised to see something like a Cortex-A5, or possibly even a multicore Cortex-A7 in CPE. These are capable of running conventional multitasking OSes like Linux (and hence OpenWRT), and have a lot of fully-mature toolchain support. But perhaps they would leave out the FPU, or configure only the most basic type of FPU (VFPv3-D16), to save money compared to the NEON unit you'd normally find in a smartphone.

- Jonathan Morton
Sebastian Moeller
2014-08-30 18:21:04 UTC
Permalink
Hi Aaron

On August 30, 2014 7:19:17 PM CEST, Aaron Wood <***@gmail.com> wrote:
>On Fri, Aug 29, 2014 at 11:06 AM, Dave Taht <***@gmail.com>
>wrote:
>
>> On Fri, Aug 29, 2014 at 9:57 AM, Aaron Wood <***@gmail.com>
>wrote:
>> > Comcast has upped the download rates in my area, from 50Mbps to
>100Mbps.
>> > This morning I tried to find the limit of the WNDR3800. And I
>found it.
>> > 50Mbps is still well within capabilities, 100Mbps isn't.
>> >
>> > And as I've seen Dave say previously, it's right around 80Mbps
>total
>> > (download + upload).
>> >
>> >
>>
>http://burntchrome.blogspot.com/2014/08/new-comcast-speeds-new-cerowrt-sqm.html
>>
>> Thank you very much, as always, for doing public benchmarking with a
>good
>> setup!
>>
>
>No problem, I find this sort of investigation a lot of fun. Even if it
>is
>somewhat maddening at times.
>
>
>> Yes we hit kind of an unexpected wall on everything shipped with a
>> processor
>> originally designed in 1989, and the prevalance of hardware offloads
>to
>> bridge
>> the gap and lower costs between 100mbit and a gige is a real PITA.
>>
>
>Do you think this is a limitation of MIPS as a whole, or just the
>particular MIPS cores in use on these platforms?
>

I think that the last MIPS-based SGI systems could saturate at least Gbit Ethernet, so the limitation does not seem to be generic to MIPS as an architecture...

>OTOH, I have noticed that MIPS is losing ground to ARM as bandwidths go
>up.

I have a gut-feeling that this change is driven by economics and not technical superiority...


> The router SoCs that I'm seeing from the usual suspects have been
>switching from MIPS to ARM over the last year or two. The WNDR is in
>the
>top-tier for SOHO SoCs, but at a product family is getting long in the
>tooth.
>
>
>> > I tried disabling downstream shaping to see what the result was,
>and it
>> > wasn't pretty.
>>
>> Well, I'll argue that only seeing an increase of 20ms or so with the
>> upstream
>> only, fq_codeled, (vs 120ms not) is not bad and within tolerances of
>most
>> applications, even voip. Secondly the characteristics of normal
>> traffic, as opposed
>> to the benchmark, make it pretty hard to hit that 100mbit download
>limit,
>> so a mere outbound rate limiter will suffice.
>>
>
>Well, yes... I have considered just turning it off entirely, as the
>the
>extra latency isn't awful. And frankly, the laptops (individually)
>never
>see that sort of bandwidth, but the AppleTV might when downloading
>video (I
>need to go see what the downloads are capped at by Apple).
>
>
>The cpu caches are 32k/32k, the memory interface 16 bit. The rate
>limiter
>> (the thing eating all the cycles, not the fq_codel algorithm!) is
>> single threaded and has global locks,
>> and is at least partially interrupt bound at 100Mbits/sec.
>>
>
>This is interesting, and lines up with Sebastian's idea about perhaps
>using
>ethtool to lock the upstream interface to 100Mbps. Except that moves
>the
>bottleneck to the next upstream device... (modem), with it's buffer
>mgmt,
>so maybe that's not a great idea, either. Upstream is certainly where
>the
>biggest issues are.

Well, my proposal (actually it's an idea Dave floated some time ago) in full is to set the WAN interface to 100Mbps and only run SQM on the egress... This should keep both bottlenecks under control. I assume that you get 100 plus some slack from your ISP, so that the provisioned rate is still a bit faster than the 100Mbps Ethernet port.


Best Regards

Sebastian

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.