New hardware, old problem: stuck beacon when here is WiFi traffic

Discussion:

New hardware, old problem: stuck beacon when here is WiFi traffic

Lev Serebryakov

2013-04-23 12:41:37 UTC

Hello, Freebsd-wireless.

Now my router/AP is new hardware in new case, and it is not
overheating for sure.
AP's WiFi card is Ubiquity SR-71E MiniPCIe card. Client is Intel
Advanced-N card. When AP sends stable stream of data (like RDesktop
connection from WiFi client to computer on wired network), AP gives
big pauses sometime, with "ath0: stuck beacon; resetting (bmiss count
4)" message. These pauses long enough to RDesktop client notice them
and try to reconnect. But no client dissociation occurs.

Here is card:

ath0: <Atheros 9280> mem 0xd0000000-0xd000ffff irq 18 at device 0.0 on pci3
ath0: [HT] enabling HT modes
ath0: [HT] 1 stream STBC receive enabled
ath0: [HT] 1 stream STBC transmit enabled
ath0: [HT] 2 RX streams; 2 TX streams
ath0: AR9280 mac 128.2 RF5133 phy 13.0
ath0: 2GHz radio: 0x0000; 5GHz radio: 0x00c0

Here is AP settings:

> ifconfig ath0
ath0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 2290
ether 00:15:6d:85:5f:fc
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
media: IEEE 802.11 Wireless Ethernet autoselect mode 11ng <hostap>
status: running
> ifconfig wlan0
wlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 00:15:6d:85:5f:fc
inet 192.168.135.1 netmask 0xffffff00 broadcast 192.168.135.255
inet6 fe80::215:6dff:fe85:5ffc%wlan0 prefixlen 64 scopeid 0x6
inet6 2001:470:923f:2::1 prefixlen 64
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: IEEE 802.11 Wireless Ethernet autoselect mode 11ng <hostap>
status: running
ssid home.serebryakov.spb.ru channel 3 (2422 MHz 11g ht/40+) bssid 00:15:6d:85:5f:fc
regdomain ROW country DE ecm authmode WPA2/802.11i privacy MIXED
deftxkey 3 AES-CCM 2:128-bit AES-CCM 3:128-bit txpower 30 scanvalid 60
protmode CTS ampdulimit 64k ampdudensity 8 shortgi wme burst
dtimperiod 1 -dfs

Here is ath-related dmesg:

ath0: stuck beacon; resetting (bmiss count 4)
ath0: stuck beacon; resetting (bmiss count 4)
ath0: stuck beacon; resetting (bmiss count 4)
ath0: ath_bar_response: huh? bar_tx=0, bar_wait=0
ath0: stuck beacon; resetting (bmiss count 4)

Here is my sources: "10.0-CURRENT FreeBSD 10.0-CURRENT #0 r249767M:
Mon Apr 22 22:22:47 MSK 2013".

I have old router with AR9220 card, which I could use as spectrum
analyzer, if I'm giving proper instructions how to setup in in such
way :)

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-23 14:43:45 UTC

Try dropping it down to a HT20 channel.

adrian

On 23 April 2013 05:41, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Freebsd-wireless.
>
> Now my router/AP is new hardware in new case, and it is not
> overheating for sure.
> AP's WiFi card is Ubiquity SR-71E MiniPCIe card. Client is Intel
> Advanced-N card. When AP sends stable stream of data (like RDesktop
> connection from WiFi client to computer on wired network), AP gives
> big pauses sometime, with "ath0: stuck beacon; resetting (bmiss count
> 4)" message. These pauses long enough to RDesktop client notice them
> and try to reconnect. But no client dissociation occurs.
>
> Here is card:
>
> ath0: <Atheros 9280> mem 0xd0000000-0xd000ffff irq 18 at device 0.0 on pci3
> ath0: [HT] enabling HT modes
> ath0: [HT] 1 stream STBC receive enabled
> ath0: [HT] 1 stream STBC transmit enabled
> ath0: [HT] 2 RX streams; 2 TX streams
> ath0: AR9280 mac 128.2 RF5133 phy 13.0
> ath0: 2GHz radio: 0x0000; 5GHz radio: 0x00c0
>
> Here is AP settings:
>
>> ifconfig ath0
> ath0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 2290
> ether 00:15:6d:85:5f:fc
> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> media: IEEE 802.11 Wireless Ethernet autoselect mode 11ng <hostap>
> status: running
>> ifconfig wlan0
> wlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> ether 00:15:6d:85:5f:fc
> inet 192.168.135.1 netmask 0xffffff00 broadcast 192.168.135.255
> inet6 fe80::215:6dff:fe85:5ffc%wlan0 prefixlen 64 scopeid 0x6
> inet6 2001:470:923f:2::1 prefixlen 64
> nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
> media: IEEE 802.11 Wireless Ethernet autoselect mode 11ng <hostap>
> status: running
> ssid home.serebryakov.spb.ru channel 3 (2422 MHz 11g ht/40+) bssid 00:15:6d:85:5f:fc
> regdomain ROW country DE ecm authmode WPA2/802.11i privacy MIXED
> deftxkey 3 AES-CCM 2:128-bit AES-CCM 3:128-bit txpower 30 scanvalid 60
> protmode CTS ampdulimit 64k ampdudensity 8 shortgi wme burst
> dtimperiod 1 -dfs
>
> Here is ath-related dmesg:
>
> ath0: stuck beacon; resetting (bmiss count 4)
> ath0: stuck beacon; resetting (bmiss count 4)
> ath0: stuck beacon; resetting (bmiss count 4)
> ath0: ath_bar_response: huh? bar_tx=0, bar_wait=0
> ath0: stuck beacon; resetting (bmiss count 4)
>
>
> Here is my sources: "10.0-CURRENT FreeBSD 10.0-CURRENT #0 r249767M:
> Mon Apr 22 22:22:47 MSK 2013".
>
> I have old router with AR9220 card, which I could use as spectrum
> analyzer, if I'm giving proper instructions how to setup in in such
> way :)
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-wireless
> To unsubscribe, send any mail to "freebsd-wireless-***@freebsd.org"

Lev Serebryakov

2013-04-23 14:45:28 UTC

Hello, Adrian.
You wrote 23 апреля 2013 г., 18:43:45:

AC> Try dropping it down to a HT20 channel.
AP or client or both? Is it enough to disable HT40 channels on
client?

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Petko Bordjukov

2013-04-23 19:57:15 UTC

Falling back to a ht:20- channel did considerably lower the number of stuck
beacon messages on my AR9220. It still periodically hangs though (is this
why ``baseband hangs?'' is listed in the wiki?).

On Tue, Apr 23, 2013 at 5:45 PM, Lev Serebryakov <***@freebsd.org> wrote:

> Hello, Adrian.
> You wrote 23 апреля 2013 г., 18:43:45:
>
> AC> Try dropping it down to a HT20 channel.
> AP or client or both? Is it enough to disable HT40 channels on
> client?
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-wireless
> To unsubscribe, send any mail to "freebsd-wireless-***@freebsd.org
> "
>

Adrian Chadd

2013-04-23 20:01:55 UTC

On 23 April 2013 12:57, Petko Bordjukov <***@gmail.com> wrote:
> Falling back to a ht:20- channel did considerably lower the number of stuck
> beacon messages on my AR9220. It still periodically hangs though (is this
> why ``baseband hangs?'' is listed in the wiki?).

Well, partially.

I'm seeing stuck beacons in my testing with my modified AP power save
code running, so it's also possible that there's something odd going
on with the hardware programming that's locking things up. Luckily I
can easy reproduce it, so I'm going to dig into it a bit more and see
what I find.

Adrian

Adrian Chadd

2013-04-25 20:27:03 UTC

Hiya,

On 25 April 2013 13:14, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 24 апреля 2013 г., 0:55:52:
>
> AC> Turn on reset debugging - sysctl dev.ath.0.debug=0x20
> Log attached. Debug was set to BAR + reset flags, UDP stream from AP
> to client, at some moment AP stops to work and I've bring interface
> down.

Thanks.

Yeah, the reset path isn't logging the contents of the TX queue. I'll
go fix that in -HEAD and email you back when it's there.

Thanks,

adrian

Adrian Chadd

2013-04-26 00:11:44 UTC

Ok. Next time it happens, force a stuck beacon:

sysctl dev.ath.X.forcebstuck=1

That will cause the right kind of reset in order to log the frames in
the TX queue.

Make sure you have the right debug mask though!

Adrian

Lev Serebryakov

2013-04-26 08:14:51 UTC

Hello, Adrian.
You wrote 26 апреля 2013 г., 4:11:44:

AC> Ok. Next time it happens, force a stuck beacon:
AC> sysctl dev.ath.X.forcebstuck=1
AC> That will cause the right kind of reset in order to log the frames in
AC> the TX queue.
AC> Make sure you have the right debug mask though!
Without reboot (iface dow/up) it showed very low speed (~30Mbit UDP)
with clear environment (RSSI 28, RATE 270M according to list sta).
After router reboot it is 140-160Mbit/s UDP (2x2 MIMO, HT40 channel),
but with drops to 20-50M for 1-2 seconds.

But I cannot repeat previous hang in this "clean" conditions :)

It looks like I need wait night :)

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-26 08:45:41 UTC

Hi,

Wait. sysctl dev.ath.0.forcebstuck=1 didn't fix it?

adrian

On 26 April 2013 01:25, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 26 апреля 2013 г., 4:11:44:
>
> AC> Ok. Next time it happens, force a stuck beacon:
> GOT IT!
>
> You could use "sudo" messages as markers in log file.
> It was:
>
> sysctl dev.ath.0.debug=0x900000020
> <lot out outgoing UDP traffic, about 12 minutes of iperf>
> <hang, client disassotiation>
> sysctl dev.ath.0.forsebstuck=1
> ifconfig wlan0 down
> <lot of debug output>
> ifconfig wlan0 up
> /etc/rc.d/hostapd restart
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-26 08:51:17 UTC

On 26 April 2013 01:45, Adrian Chadd <***@freebsd.org> wrote:
> Hi,
>
> Wait. sysctl dev.ath.0.forcebstuck=1 didn't fix it?

So the descriptors are mostly completed. I really need to hack the
reset path to continue printing everything in the queue, not just the
frames that were completed.

But the TXQ[1] head pointer is pointing to the wrong descriptor. It's
pointing to a descriptor that's finished, whose next descriptor is
also finished. That makes no sense.

I'll have to add some more debugging to -HEAD so to dump out the full
set of TX descriptors in the queue. That hopefully will shed some
light here.

Adrian

Lev Serebryakov

2013-04-26 21:28:51 UTC

Hello, Adrian.
You wrote 26 àïðåëÿ 2013 ã., 12:51:17:

>> Wait. sysctl dev.ath.0.forcebstuck=1 didn't fix it?
Ok, several such commands helps without down/up of interface.
Log attached: a lot of traffic, several BAR retransmits, interface
hangs, several "forcebstuck=1", client could associate again and
interface passes traffic.

Maybe, there were too many "forcebstuck=1", but one ro two doesn't
fix situation for sure, client was not able to assotate after first
two of them.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-26 21:34:24 UTC

You're not doing something like leaving bgscan enabled, or not
defining a channel up front?

There's a lot of resets hre.

adrian

On 26 April 2013 14:28, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 26 апреля 2013 г., 12:51:17:
>
>>> Wait. sysctl dev.ath.0.forcebstuck=1 didn't fix it?
> Ok, several such commands helps without down/up of interface.
> Log attached: a lot of traffic, several BAR retransmits, interface
> hangs, several "forcebstuck=1", client could associate again and
> interface passes traffic.
>
> Maybe, there were too many "forcebstuck=1", but one ro two doesn't
> fix situation for sure, client was not able to assotate after first
> two of them.
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-26 21:52:06 UTC

Hi,

I've just committed something to -HEAD that logs -all- the descriptors
in the TX queue during a reset.

This requires ATH_DEBUG_RESET to be set (0x20), so use the same debug
mask as before.

Thanks!

Adrian

Lev Serebryakov

2013-04-27 08:33:52 UTC

Hello, Adrian.
You wrote 27 апреля 2013 г., 1:34:24:

AC> You're not doing something like leaving bgscan enabled, or not
AC> defining a channel up front?
Here is my config:

ifconfig_wlan0="inet 192.168.135.1 netmask 255.255.255.0 mode 11ng channel 3:ht/40 -bgscan ssid home.serebryakov.spb.ru country DE regdomain row txpower 30"

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-27 18:56:30 UTC

On 27 April 2013 01:33, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 27 апреля 2013 г., 1:34:24:
>
> AC> You're not doing something like leaving bgscan enabled, or not
> AC> defining a channel up front?
> Here is my config:
>
> ifconfig_wlan0="inet 192.168.135.1 netmask 255.255.255.0 mode 11ng channel 3:ht/40 -bgscan ssid home.serebryakov.spb.ru country DE regdomain row txpower 30"

Is it a bridge? If so, put an IP address on the bridge, not wlan0.

By putting inet <x> there early on, it's bringing the interface up
very early. It's a side effect of hte network stack - if you assign an
IP address, the VAP comes up immediately.

thus i do the inet line after _all_ other things are done, and on a
separate line.

Anyway, leave it how it is for now. Just update to the latest -HEAD
ath code and redo the testing. It will now log the whole TX queue
contents, not just the completed frames. Hopefully it'll be more
obvious.

Thanks,

adrian

Lev Serebryakov

2013-04-28 08:11:22 UTC

Hello, Adrian.
You wrote 27 апреля 2013 г., 22:56:30:

>> AC> You're not doing something like leaving bgscan enabled, or not
>> AC> defining a channel up front?
>> Here is my config:
>
>> ifconfig_wlan0="inet 192.168.135.1 netmask 255.255.255.0 mode 11ng channel 3:ht/40 -bgscan ssid home.serebryakov.spb.ru country DE regdomain row txpower 30"
AC> Is it a bridge? If so, put an IP address on the bridge, not wlan0.
I don't have bridge, it is true router, to have strong separation
between WiFi and wired network.

AC> thus i do the inet line after _all_ other things are done, and on a
AC> separate line.
Two lines for one interface? Or just move "inet" part to end of this
line?

AC> Anyway, leave it how it is for now. Just update to the latest -HEAD
AC> ath code and redo the testing. It will now log the whole TX queue
AC> contents, not just the completed frames. Hopefully it'll be more
AC> obvious.
Ok, today later.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Lev Serebryakov

2013-04-28 19:32:05 UTC

Hello, Adrian.
You wrote 28 апреля 2013 г., 23:15:23:

AC> Yeah, can you see why 11n wasn't negotiated and repeat the test?
Ok, rebooted both AP and client.
Client/Windows says (via connection properties) it is "240Mbit/s",
AP/FreeBSD says (via ifconfig list sta):

ADDR AID CHAN RATE RSSI IDLE TXSEQ RXSEQ CAPS FLAG
c4:85:08:3f:9e:c2 1 3 270M 22.0 0 14972 4880 EPS AQEHTRM SSID<home.serebryakov.spb.ru> RATES<B2,B4,B11,B22,12,18,24,36> XRATES<48,72,96,108> RSN<v1 mc:AES-CCMP uc:AES-CCMP km:8021X-PSK, caps 0x3c> HTCAP<cap 0x976 param 0x17 mcsset[0-15,32] extcap 0x0 txbf 0x0 antenna 0x0> WME<version 0x1 info 0x0>

But throughput is no more than 30Mbit again... It is very strange...

AC> I'll go over what you gave me but what you did give me has 500 frames
AC> in the TX queue. That's a lot of frames to go over. :)
When I stop "sending" iperf "receiving" part shows some traffic for
about 5 seconds after it :)

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Lev Serebryakov

2013-04-28 18:50:06 UTC

Hello, Adrian.
You wrote 27 Ð°Ð¿ÑÐµÐ»Ñ 2013 Ð³., 22:56:30:

AC> Anyway, leave it how it is for now. Just update to the latest -HEAD
AC> ath code and redo the testing. It will now log the whole TX queue
AC> contents, not just the completed frames. Hopefully it'll be more
AC> obvious.
Ok, same test: UDP stream from AP to client (with iperf), limited to
300M. Only 11-30M seen by client now, but no beacon stucks.
After AP hangs several forced beacon stucks + hostaptd restart
allow client to re-associate.

Log with dev.ath.0.debug=0x900000020 attached ("sudo" messages are left as
"time marks").

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-28 19:38:20 UTC

On 28 April 2013 11:50, Lev Serebryakov <***@freebsd.org> wrote:

> 300M. Only 11-30M seen by client now, but no beacon stucks.
> After AP hangs several forced beacon stucks + hostaptd restart
> allow client to re-associate.

Right. That's likely because 11n aggregation wasn't enabled. I don't know why.

> Log with dev.ath.0.debug=0x900000020 attached ("sudo" messages are left as
> "time marks").

Cool, thanks.

It's odd. The descriptor list(s) look fine - ie, the descriptors
aren't out of order or anything. (Ie, the descriptor and link pointers
line up with the order that the ath_buf's are in the TXQ.)

And the descriptors in your particular setup aren't chained into
multi-descriptor lists, they're one descriptor per frame. There used
to be bugs with my 11n TX code and which ath_buf to check the status
of in a list of ath_bufs, but this last test is with no aggregation.
So that's not a problem.

The next step is seeing whether the hardware queue is actually just
plain stuck, or whether it's idle. If it's idle, it may just be
waiting for another call to ath_hal_txstart() or whatever the call is
to re-trigger it. The only reason it would be idle like that though is
if it hit the end of the descriptor list at about the same time we
appended a frame to the list. However, every time we append a frame to
the hardware queue, we also call ath_hal_txstart() to re-kick TX.

There's some race condition hack that Sam threw in that gets enabled
only if you compile things with TDMA support enabled. Would you mind
compiling in TDMA support (add options IEEE80211_SUPPORT_TDMA) to your
kernel config and rebuild? I'd like to see if that TX queue workaround
is effective at helping us out here.

Does this happen with tcp iperf? or only udp iperf?

adrian

Lev Serebryakov

2013-04-28 19:41:33 UTC

Hello, Adrian.
You wrote 28 апреля 2013 г., 23:38:20:

AC> There's some race condition hack that Sam threw in that gets enabled
AC> only if you compile things with TDMA support enabled. Would you mind
AC> compiling in TDMA support (add options IEEE80211_SUPPORT_TDMA) to your
AC> kernel config and rebuild? I'd like to see if that TX queue workaround
AC> is effective at helping us out here.
I'm building it right now. It is good, that only kernel rebuild is
fast.

AC> Does this happen with tcp iperf? or only udp iperf?
First time it was with tcp (some time ago). I could check this too.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-28 19:49:38 UTC

.. how's this never triggered a clang warning on me before?

Just add casts for now or something so it compiles.

Adrian

On 28 April 2013 12:48, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 28 апреля 2013 г., 23:38:20:
>
> AC> There's some race condition hack that Sam threw in that gets enabled
> AC> only if you compile things with TDMA support enabled. Would you mind
> AC> compiling in TDMA support (add options IEEE80211_SUPPORT_TDMA) to your
> AC> kernel config and rebuild? I'd like to see if that TX queue workaround
> AC> is effective at helping us out here.
> Oooopps... And, yes, it is really annoying, that there is no good way
> to specify (u_)?int\d\d?_t in printf() :(
>
> cc -c -O2 -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -Wmissing-include-dirs -fdiagnostics-show-option -Wno-error-tautological-compare -Wno-error-empty-body -Wno-error-parentheses-equality -nostdinc -I. -I/data/src/sys -I/data/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-omit-frame-pointer -mno-aes -mno-avx -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float -fno-asynchronous-unwind-tables -ffreestanding -fstack-protector -Werror /data/src/sys/dev/ath/if_ath_tdma.c -I/data/src/sys/dev/ath
> /data/src/sys/dev/ath/if_ath_tdma.c:442:42: error: format specifies type 'unsigned long long' but the argument has type 'u_int64_t' (aka 'unsigned long') [-Werror,-Wformat]
> (unsigned long long) rs->rs_tstamp, rstamp, tsf, txtime,
> ^~~~~~
> /data/src/sys/dev/ath/if_ath_debug.h:97:34: note: expanded from macro 'DPRINTF'
> device_printf(sc->sc_dev, fmt, __VA_ARGS__); \
> ^
> /data/src/sys/dev/ath/if_ath_tdma.c:442:50: error: format specifies type 'unsigned long long' but the argument has type 'u_int64_t' (aka 'unsigned long') [-Werror,-Wformat]
> (unsigned long long) rs->rs_tstamp, rstamp, tsf, txtime,
> ^~~
> /data/src/sys/dev/ath/if_ath_debug.h:97:34: note: expanded from macro 'DPRINTF'
> device_printf(sc->sc_dev, fmt, __VA_ARGS__); \
> ^
> /data/src/sys/dev/ath/if_ath_tdma.c:443:6: error: format specifies type 'unsigned long long' but the argument has type 'u_int64_t' (aka 'unsigned long') [-Werror,-Wformat]
> nextslot, nextslottu, TSF_TO_TU(nextslot >> 32, nextslot));
> ^~~~~~~~
> /data/src/sys/dev/ath/if_ath_debug.h:97:34: note: expanded from macro 'DPRINTF'
> device_printf(sc->sc_dev, fmt, __VA_ARGS__); \
> ^
> /data/src/sys/dev/ath/if_ath_tdma.c:446:6: error: format specifies type 'unsigned long long' but the argument has type 'uint64_t' (aka 'unsigned long') [-Werror,-Wformat]
> le64toh(ni->ni_tstamp.tsf),
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> /data/src/sys/sys/endian.h:80:20: note: expanded from macro 'le64toh'
> #define le64toh(x) ((uint64_t)(x))
> ^
> /data/src/sys/dev/ath/if_ath_debug.h:97:34: note: expanded from macro 'DPRINTF'
> device_printf(sc->sc_dev, fmt, __VA_ARGS__); \
> ^
> /data/src/sys/dev/ath/if_ath_tdma.c:447:6: error: format specifies type 'unsigned long long' but the argument has type 'uint64_t' (aka 'unsigned long') [-Werror,-Wformat]
> le64toh(ni->ni_tstamp.tsf));
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> /data/src/sys/sys/endian.h:80:20: note: expanded from macro 'le64toh'
> #define le64toh(x) ((uint64_t)(x))
> ^
> /data/src/sys/dev/ath/if_ath_debug.h:97:34: note: expanded from macro 'DPRINTF'
> device_printf(sc->sc_dev, fmt, __VA_ARGS__); \
> ^
> /data/src/sys/dev/ath/if_ath_tdma.c:451:6: error: format specifies type 'unsigned long long' but the argument has type 'u_int64_t' (aka 'unsigned long') [-Werror,-Wformat]
> nexttbtt,
> ^~~~~~~~
> /data/src/sys/dev/ath/if_ath_debug.h:97:34: note: expanded from macro 'DPRINTF'
> device_printf(sc->sc_dev, fmt, __VA_ARGS__); \
> ^
> /data/src/sys/dev/ath/if_ath_tdma.c:545:7: error: format specifies type 'unsigned long long' but the argument has type 'uint64_t' (aka 'unsigned long') [-Werror,-Wformat]
> tsf,
> ^~~
> /data/src/sys/dev/ath/if_ath_debug.h:97:34: note: expanded from macro 'DPRINTF'
> device_printf(sc->sc_dev, fmt, __VA_ARGS__); \
> ^
> 7 errors generated.
> *** [if_ath_tdma.o] Error code 1
> 1 error
> *** [buildkernel] Error code 2
> 1 error
> *** [buildkernel] Error code 2
> 1 error
>
>
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>
>

Lev Serebryakov

2013-04-28 20:14:47 UTC

Hello, Adrian.
You wrote 28 апреля 2013 г., 23:49:38:

AC> .. how's this never triggered a clang warning on me before?
AC> Just add casts for now or something so it compiles.
Testing... One more datapoint: with this option bsnmpd with wlan
plugin complains:

Apr 29 00:09:46 gateway snmpd[1712]: iface wlan0 - get param: ioctl(201) failed: Operation not supported
Apr 29 00:09:46 gateway snmpd[1712]: iface wlan0 - get param: ioctl(202) failed: Operation not supported
Apr 29 00:09:46 gateway snmpd[1712]: iface wlan0 - get param: ioctl(203) failed: Operation not supported
Apr 29 00:09:46 gateway snmpd[1712]: iface wlan0 - get param: ioctl(204) failed: Operation not supported

And bandwidth still low...

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Lev Serebryakov

2013-04-28 20:43:52 UTC

Hello, Adrian.
You wrote 28 àïðåëÿ 2013 ã., 23:38:20:

AC> There's some race condition hack that Sam threw in that gets enabled
AC> only if you compile things with TDMA support enabled. Would you mind
AC> compiling in TDMA support (add options IEEE80211_SUPPORT_TDMA) to your
AC> kernel config and rebuild? I'd like to see if that TX queue workaround
AC> is effective at helping us out here.
Two logs for UDP, first time I forgot to stop sender, and second
don't, but it doesnt' help :)

AC> Does this happen with tcp iperf? or only udp iperf?
Same with UDP. Cannot repeat with TCP.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-29 07:29:32 UTC

Please update to -HEAD first!

And erm, run iperf tcp first, before udp. That way aggregation is
negotiated. :-)

adrian

Adrian Chadd

2013-04-29 07:30:00 UTC

.. well, please update to -HEAD; I've just added some more debugging
to the reset path to see what the state of the hardware queue is.

Adrian

Lev Serebryakov

2013-04-29 07:36:23 UTC

Hello, Adrian.
You wrote 29 апреля 2013 г., 11:29:32:

AC> Please update to -HEAD first!
Ok :) These test were on r250017.

AC> And erm, run iperf tcp first, before udp. That way aggregation is
AC> negotiated. :-)
Ok. But on previous revisions it was not needed. But previous tests
were done no at Sunday night, and at Sunday night environment is
worst here.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-29 07:49:03 UTC

There's likely some bugs with the AMPDU aggregation negotiating
kicking in when all the TX buffers are full. I thought that was fixed,
alas.

I'll investigate that later. I have to acquire some programmable
attenuators to do this.

adrian

Adrian Chadd

2013-04-30 04:53:00 UTC

Any follow-up with the latest stuff in -HEAD? I'd like to see if the
hardware queue is behaving how I think it is.

Adrian

On 29 April 2013 00:49, Adrian Chadd <***@freebsd.org> wrote:
> There's likely some bugs with the AMPDU aggregation negotiating
> kicking in when all the TX buffers are full. I thought that was fixed,
> alas.
>
> I'll investigate that later. I have to acquire some programmable
> attenuators to do this.
>
>
>
> adrian

Lev Serebryakov

2013-04-30 08:24:59 UTC

Hello, Adrian.
You wrote 30 àïðåëÿ 2013 ã., 8:53:00:

AC> Any follow-up with the latest stuff in -HEAD? I'd like to see if the
AC> hardware queue is behaving how I think it is.
Sorry, was busy yesterday.
Sources: r250041, TDMA is enabled in kernel config.

Warm-up: 120 seconds of TCP, throughput osculate between 50 and
100Mbit/, several BAR resets, no hangs.

Test: 600 seconds of UDP. Throughput is 20Mbit/s. Hang on ~327s.
traffic generator stopped, single "forced bstuck" helps.

Log attached.

Please, note: low throughput for UDP on these revisions. It was not
so on older revisions, before all this additional debug stuff was
added, latest good I tried was something like r249333, and I don't
remember about r249767 (but I could try, firmware image is not
removed, and I don't have images between r249333 and r249333).

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-30 15:41:12 UTC

On 30 April 2013 01:24, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 30 апреля 2013 г., 8:53:00:
>
> AC> Any follow-up with the latest stuff in -HEAD? I'd like to see if the
> AC> hardware queue is behaving how I think it is.
> Sorry, was busy yesterday.
> Sources: r250041, TDMA is enabled in kernel config.

Ok.

> Warm-up: 120 seconds of TCP, throughput osculate between 50 and
> 100Mbit/, several BAR resets, no hangs.

Ok. But did it negotiate A-MPDU?

> Test: 600 seconds of UDP. Throughput is 20Mbit/s. Hang on ~327s.
> traffic generator stopped, single "forced bstuck" helps.
>
> Log attached.
>
>
> Please, note: low throughput for UDP on these revisions. It was not
> so on older revisions, before all this additional debug stuff was
> added, latest good I tried was something like r249333, and I don't
> remember about r249767 (but I could try, firmware image is not
> removed, and I don't have images between r249333 and r249333).

It's fine. We'll worry about aggregation later.

So here we go:

Apr 30 12:21:26 gateway kernel: ath0: ath_stoptxdma: tx queue [9]
0xd8c3000, link 0
Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [0] 0,
active=0, hwpending=0, flags 0x00000000, link 0
Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [1]
0xc720ed00, active=0, hwpending=0, flags 0x00000000, link 0
Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [2] 0,
active=0, hwpending=0, flags 0x00000000, link 0
Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [3]
0xd8bec00, active=0, hwpending=0, flags 0x00000000, link 0
Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [8]
0xc720d2c0, active=0, hwpending=0, flags 0x00000000, link 0

Notice how the hardware queue isn't active at all. There's frames in
the TXQ and the hardware points to the next frame to transmit (but you
didn't provide the whole list of frames that were in the TXQ, so I'm
going on what it said before.)

Hm..

Adrian

Lev Serebryakov

2013-04-30 20:07:14 UTC

Hello, Adrian.
You wrote 30 апреля 2013 г., 19:41:12:

>> Warm-up: 120 seconds of TCP, throughput osculate between 50 and
>> 100Mbit/, several BAR resets, no hangs.
AC> Ok. But did it negotiate A-MPDU?
I'm not sure. Both ends showed typical N speeds (Windows in
connection properties, FreeBSD in "list sta").

AC> It's fine. We'll worry about aggregation later.
Yep.

AC> So here we go:
AC> Apr 30 12:21:26 gateway kernel: ath0: ath_stoptxdma: tx queue [9]
AC> 0xd8c3000, link 0
AC> Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [0] 0,
AC> active=0, hwpending=0, flags 0x00000000, link 0
AC> Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [1]
AC> 0xc720ed00, active=0, hwpending=0, flags 0x00000000, link 0
AC> Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [2] 0,
AC> active=0, hwpending=0, flags 0x00000000, link 0
AC> Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [3]
AC> 0xd8bec00, active=0, hwpending=0, flags 0x00000000, link 0
AC> Apr 30 12:21:26 gateway kernel: ath0: ath_tx_stopdma: tx queue [8]
AC> 0xc720d2c0, active=0, hwpending=0, flags 0x00000000, link 0

AC> Notice how the hardware queue isn't active at all. There's frames in
AC> the TXQ and the hardware points to the next frame to transmit (but you
AC> didn't provide the whole list of frames that were in the TXQ, so I'm
AC> going on what it said before.)
I've provided everything I've had in log :)

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-30 22:17:52 UTC

Well, right now I'm trying to figure out whether there's a race
condition in setting up / enabling the TX queue that I haven't yet
seen before, or whether we've hit another one of those corner case
bugs in the TX queue handling.

I may add in a hack to the completion code that checks if the queue
isn't active and if it isn't (but there's frames in the TX queue), it
restarts TX.

adrian

Lev Serebryakov

2013-05-01 09:32:20 UTC

Hello, Adrian.
You wrote 1 мая 2013 г., 2:17:52:

AC> Well, right now I'm trying to figure out whether there's a race
AC> condition in setting up / enabling the TX queue that I haven't yet
AC> seen before, or whether we've hit another one of those corner case
AC> bugs in the TX queue handling.
I could do some additional experiments, if you need more data :)
Really, I could add a tons of debug output if you send me a patch,
to avoid spamming SVN with code for this particular case.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-01 16:07:14 UTC

On 1 May 2013 02:32, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 1 мая 2013 г., 2:17:52:
>
> AC> Well, right now I'm trying to figure out whether there's a race
> AC> condition in setting up / enabling the TX queue that I haven't yet
> AC> seen before, or whether we've hit another one of those corner case
> AC> bugs in the TX queue handling.
> I could do some additional experiments, if you need more data :)
> Really, I could add a tons of debug output if you send me a patch,
> to avoid spamming SVN with code for this particular case.

I have enough data for now. :) and I'm all for adding more debuggig to
the driver.

ADrian

Adrian Chadd

2013-05-01 16:18:10 UTC

I'm trying to plan out how to potentially address this.

I don't know whether it'll be enough to just set TXE to 1 for that
queue and restart TX that way. I was hoping you'd post some more
results with aggregate frames, but it doesn't seem to be happening.

I'm hoping I don't have to stop the DMA for that queue and re-start it
again. That's going to be relatively error-prone.

ath9k and the reference driver have a bunch of workarounds to notice a
stalled TX queue and restart things. I'm not really happy doing that;
I'd rather try and fix the problem(s) before 10.0.

Adrian

Adrian Chadd

2013-05-03 01:54:54 UTC

.. and I've now (mostly) successfully reproduced this here at home.

I'll do some further digging. It may be that I just give in and add
the TXQ stall workaround. I'd however like to avoid doing that if
possible. :-)

Adrian

Adrian Chadd

2013-05-03 04:22:40 UTC

On 2 May 2013 18:54, Adrian Chadd <***@freebsd.org> wrote:
> .. and I've now (mostly) successfully reproduced this here at home.
>
> I'll do some further digging. It may be that I just give in and add
> the TXQ stall workaround. I'd however like to avoid doing that if
> possible. :-)

Hm, I'm reproducing it here and I see some odd descriptor behaviour.
I'll tinker with it a bit more locally.

There's at least one bug in the TX code to do with descriptor
handling, but it didn't fix everything. I'll have to re-review things
to make sure everything is handled in the correct order.

I'll let you know when I have something to test!

Thanks,

Adrian

Lev Serebryakov

2013-05-03 08:28:11 UTC

Hello, Adrian.
You wrote 3 мая 2013 г., 8:22:40:

>> I'll do some further digging. It may be that I just give in and add
>> the TXQ stall workaround. I'd however like to avoid doing that if
>> possible. :-)
AC> Hm, I'm reproducing it here and I see some odd descriptor behaviour.
AC> I'll tinker with it a bit more locally.

AC> There's at least one bug in the TX code to do with descriptor
AC> handling, but it didn't fix everything. I'll have to re-review things
AC> to make sure everything is handled in the correct order.

AC> I'll let you know when I have something to test!
Adrian! I want to thank you for your hard and fantastic work!
If I visit USA (when I visit?) and will be near your place of living
I want to buy your at least beer :) Or two :)

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-04 04:06:01 UTC

Hi!

Please update to the head -HEAD and test!

I'm still debugging all of this and trying to understand what the
other failure cases. Right now I'm seeing STA disassocications during
UDP testing but it doesn't look like a stalled queue.
Thanks,

Adrian

Adrian Chadd

2013-05-04 05:09:14 UTC

Hi,

So doing this testing as a station doing the UDP transmitting (to a
FreeBSD AP) is showing some interesting behaviour. No, i don't think
it's your bug.

The key re-negotiation seems to be failing somewhere; I bet it's out
of sequence or out of order crypto frames. I've seen this happen
before but I only fixed it for the AP case (by overriding the QOS for
said frames so it appeared in a different TID, thus different sequence
number space.) Since it's reauthing, it's transitioning the node
through disassociate / re-auth / re-associate, but the current code in
-HEAD isn't resetting the local state (aggregation, BAR state,
filtered frames state, etc.) So things can and do go weird.

I'll do some further digging into this and see if I can resolve it. I
thought I had fixed it but .. well, obviously not. Once I have
resolved this rekey / disconnect issue I'll keep testing things out
and see if I've fixed your issue.

(And I'll do some testing in parallel in open mode, so I don't bite this bug.)

adrian

Adrian Chadd

2013-05-08 02:53:22 UTC

Hi Lev (and others);

I've fixed the UDP TX + rekey session drop bug.

Please re-test your testcase on the latest -HEAD and let me know how it goes.

I'd really like to sneak in these power save and reassociate changes
of mine, but first I want to make sure that your issue is correctly
resolved.

Thanks!

Adrian

On 3 May 2013 22:09, Adrian Chadd <***@freebsd.org> wrote:
> Hi,
>
> So doing this testing as a station doing the UDP transmitting (to a
> FreeBSD AP) is showing some interesting behaviour. No, i don't think
> it's your bug.
>
> The key re-negotiation seems to be failing somewhere; I bet it's out
> of sequence or out of order crypto frames. I've seen this happen
> before but I only fixed it for the AP case (by overriding the QOS for
> said frames so it appeared in a different TID, thus different sequence
> number space.) Since it's reauthing, it's transitioning the node
> through disassociate / re-auth / re-associate, but the current code in
> -HEAD isn't resetting the local state (aggregation, BAR state,
> filtered frames state, etc.) So things can and do go weird.
>
> I'll do some further digging into this and see if I can resolve it. I
> thought I had fixed it but .. well, obviously not. Once I have
> resolved this rekey / disconnect issue I'll keep testing things out
> and see if I've fixed your issue.
>
> (And I'll do some testing in parallel in open mode, so I don't bite this bug.)
>
>
>
> adrian

Lev Serebryakov

2013-05-08 20:49:31 UTC

Hello, Adrian.
You wrote 8 ÐŒÐ°Ñ 2013 Ð³., 6:53:22:

AC> Hi Lev (and others);
AC> I've fixed the UDP TX + rekey session drop bug.
AC> Please re-test your testcase on the latest -HEAD and let me know how it goes.
Revision 250333, TDMA compiled-in.

The same hang. I've stopped traffic generator almost instanteously,
but it takes about 20 "forced resets" and 5 minutes to cleanup mess
and allow client to associate.

Log (huge one! Sorry!) attached.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-08 21:00:06 UTC

Yeah, -HEAD right now is still busted.

I'm testing out some locking changes right now that fix how I was
handling the holding descriptors. Once that's in -HEAD, I'll tell you
to re-test.

Thanks!

Adrian

On 8 May 2013 13:49, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 8 мая 2013 г., 6:53:22:
>
> AC> Hi Lev (and others);
> AC> I've fixed the UDP TX + rekey session drop bug.
> AC> Please re-test your testcase on the latest -HEAD and let me know how it goes.
> Revision 250333, TDMA compiled-in.
>
> The same hang. I've stopped traffic generator almost instanteously,
> but it takes about 20 "forced resets" and 5 minutes to cleanup mess
> and allow client to associate.
>
> Log (huge one! Sorry!) attached.
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Lev Serebryakov

2013-05-08 21:03:43 UTC

Hello, Adrian.
You wrote 9 мая 2013 г., 1:00:06:

AC> Yeah, -HEAD right now is still busted.
AC> I'm testing out some locking changes right now that fix how I was
AC> handling the holding descriptors. Once that's in -HEAD, I'll tell you
AC> to re-test.
We had long holidays (4 days!) here, so I could be AFK and be slow
with re-testing :)

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-08 21:24:12 UTC

Ok, please update to the latest -HEAD as of right now and re-test.

Adrian

On 8 May 2013 14:03, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 9 мая 2013 г., 1:00:06:
>
> AC> Yeah, -HEAD right now is still busted.
> AC> I'm testing out some locking changes right now that fix how I was
> AC> handling the holding descriptors. Once that's in -HEAD, I'll tell you
> AC> to re-test.
> We had long holidays (4 days!) here, so I could be AFK and be slow
> with re-testing :)
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>
>

Lev Serebryakov

2013-05-09 17:18:49 UTC

Hello, Adrian.
You wrote 9 ÐŒÐ°Ñ 2013 Ð³., 1:24:12:

AC> Ok, please update to the latest -HEAD as of right now and re-test.
r250406: problems after long UDP stream. But it looks like it recover
fast now.

BUT, BUT! After that, at second run, I got 802.11n rates with UDP at
last! They were lost long time ago :) I've checked twice :)

170Mbit/s :) It is almost half of maximum, but it much better, that
25-30Mbit/s :)

So, here are two logs.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-09 17:23:20 UTC

When you say "recover fast", what's that mean?

adrian

On 9 May 2013 10:18, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 9 мая 2013 г., 1:24:12:
>
> AC> Ok, please update to the latest -HEAD as of right now and re-test.
> r250406: problems after long UDP stream. But it looks like it recover
> fast now.
>
> BUT, BUT! After that, at second run, I got 802.11n rates with UDP at
> last! They were lost long time ago :) I've checked twice :)
>
> 170Mbit/s :) It is almost half of maximum, but it much better, that
> 25-30Mbit/s :)
>
> So, here are two logs.
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Lev Serebryakov

2013-05-09 17:26:04 UTC

Hello, Adrian.
You wrote 9 мая 2013 г., 21:23:20:

AC> When you say "recover fast", what's that mean?
First time I call forced bstuck several times, but it seems, that
client re-associate after first one. For second time (AP was
rebooted after first experiment) client lost association (According
to interface status on Windows), but I was not fast enough to call
"sysctl" by hands, it re-associated by itself.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-09 17:33:22 UTC

ok, I'll go over hte logs again soon.

thanks! we're getting there!

adrian

On 9 May 2013 10:26, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 9 мая 2013 г., 21:23:20:
>
> AC> When you say "recover fast", what's that mean?
> First time I call forced bstuck several times, but it seems, that
> client re-associate after first one. For second time (AP was
> rebooted after first experiment) client lost association (According
> to interface status on Windows), but I was not fast enough to call
> "sysctl" by hands, it re-associated by itself.
>
> --
> // Black Lion AKA Lev Serebryakov <***@FreeBSD.org>
>

Adrian Chadd

2013-05-09 19:24:07 UTC

Hi,

Next time you do this, can you do this:

wlandebug +crypto +assoc +state

I'd like to see what's actually going on behind the scenes.

so:

May 9 21:08:59 gateway sudo: lev : TTY=pts/1 ;
PWD=/usr/home/lev/bin ; USER=root ; COMMAND=/sbin/sysctl
dev.ath.0.debug=0x900000020

.. then a bunch of time, which I guess is you doing UDP:
May 9 21:16:25 gateway kernel: ath0: ath_tx_tid_drain_print: norm:
node 0xffffff8000d7f000: bf=0xffffff800085d8e0: addbaw=0, dobaw=0,
seqno=715, retry=0
May 9 21:16:25 gateway kernel: ath0: ath_tx_tid_drain_print: node
0xffffff8000d7f000: bf=0xffffff800085d8e0: txq[1] axq_depth=502,
axq_aggr_depth=0
May 9 21:16:25 gateway kernel: ath0: ath_tx_tid_drain_print: node
0xffffff8000d7f000: bf=0xffffff800085d8e0: tid txq_depth=1
hwq_depth=503, bar_wait=0, isfiltered=0
May 9 21:16:25 gateway kernel: ath0: ath_tx_tid_drain_print: node
0xffffff8000d7f000: tid 0: sched=1, paused=0, incomp=0, baw_head=0,
baw_tail=0 txa_start=51018, ni_txseqs=61249

.. so:

* hardware queue depth is 502 frames
* no aggregates
* there's only one frame in the TID txq, but 503 frames from that TID
in the hardware queue
* it's scheduled for more frames but there's likely no space left in the TXQ.

But there's no hardware reset until you do a forcebstuck.

So I guess that first one is actually a disassociate/reassociate bug
more than anything.

I bet this is what is actually going on. Can you please do the
wlandebug above, so we can see if it's actually disassociating and not
transitioning back through SCAN/ASSOC states?

Thanks,

adrian

Lev Serebryakov

2013-05-09 19:37:31 UTC

Hello, Adrian.
You wrote 9 мая 2013 г., 23:24:07:

AC> Next time you do this, can you do this:
AC> wlandebug +crypto +assoc +state
AC> I'd like to see what's actually going on behind the scenes.

AC> I bet this is what is actually going on. Can you please do the
AC> wlandebug above, so we can see if it's actually disassociating and not
AC> transitioning back through SCAN/ASSOC states?
Yep, I'm doing it right now/ Here are alot of

May 9 23:36:43 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] pwr save q overflow, drops 8895 (size 50)

messages are generating with these srttings.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-09 19:39:49 UTC

On 9 May 2013 12:37, Lev Serebryakov <***@freebsd.org> wrote:

> May 9 23:36:43 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] pwr save q overflow, drops 8895 (size 50)
>
> messages are generating with these srttings.

... right; is that the MAC address of the test station?

adrian

Lev Serebryakov

2013-05-09 19:43:22 UTC

Hello, Adrian.
You wrote 9 мая 2013 г., 23:39:49:

>> May 9 23:36:43 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] pwr save q overflow, drops 8895 (size 50)
>> messages are generating with these srttings.
AC> ... right; is that the MAC address of the test station?
Yep!

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Lev Serebryakov

2013-05-09 19:42:32 UTC

Hello, Adrian.
You wrote 9 ÐŒÐ°Ñ 2013 Ð³., 23:24:07:

AC> I bet this is what is actually going on. Can you please do the
AC> wlandebug above, so we can see if it's actually disassociating and not
AC> transitioning back through SCAN/ASSOC states?
Here it is. Faster, than usual (~150 seconds, typical it takes about
450 second). Alot messages about full queues, but they could be
grepped out easily :)

Station re-associate automatically, I didn't use force bstuck.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Lev Serebryakov

2013-05-09 19:48:07 UTC

Hello, Lev.
You wrote 9 мая 2013 г., 23:42:32:

AC>> I bet this is what is actually going on. Can you please do the
AC>> wlandebug above, so we can see if it's actually disassociating and not
AC>> transitioning back through SCAN/ASSOC states?
LS> Here it is. Faster, than usual (~150 seconds, typical it takes about
LS> 450 second). Alot messages about full queues, but they could be
LS> grepped out easily :)
LS> Station re-associate automatically, I didn't use force bstuck.

f0:a2:25:ec:38:c6 is Kindle.
c4:85:08:3f:9e:c2 is Windows client, UDP stream receiver.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-09 19:54:08 UTC

On 9 May 2013 12:48, Lev Serebryakov <***@freebsd.org> wrote:

> f0:a2:25:ec:38:c6 is Kindle.
> c4:85:08:3f:9e:c2 is Windows client, UDP stream receiver.

Ok. So this is a different problem.

May 9 23:39:18 gateway kernel: wlan0: [f0:a2:25:ec:38:c6] RSN ie: mc
3/0 uc 3/0 key 2 caps 0x0
May 9 23:39:18 gateway kernel: wlan0: [f0:a2:25:ec:38:c6] switch
station to HT20 channel 2422/0x10480
May 9 23:39:18 gateway kernel: wlan0: [f0:a2:25:ec:38:c6] station
associated at aid 1: short preamble, short slot time, QoS, HT20
(+AMPDU)
May 9 23:39:18 gateway kernel: wlan0: [f0:a2:25:ec:38:c6] station
unauthorize via MLME
May 9 23:39:19 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] RSN ie: mc
3/0 uc 3/0 key 2 caps 0x3c
May 9 23:39:19 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] station
associated at aid 2: short preamble, short slot time, QoS, HT40
(+AMPDU) (+SMPS-DYN)
May 9 23:39:19 gateway kernel: wlan0: _ieee80211_crypto_delkey:
AES-CCM keyix 5 flags 0x103 rsc 0 tsc 222188 len 16
May 9 23:39:19 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] pwr save q
overflow, drops 59924 (size 50)
May 9 23:39:20 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] pwr save q
overflow, drops 59925 (size 50)
May 9 23:39:21 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] pwr save q
overflow, drops 59926 (size 50)
May 9 23:39:21 gateway kernel: wlan0: [f0:a2:25:ec:38:c6] station
deauth via MLME (reason 2)
May 9 23:39:21 gateway kernel: wlan0: [f0:a2:25:ec:38:c6] station
with aid 1 leaves
May 9 23:39:22 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] pwr save q
overflow, drops 59927 (size 50)
May 9 23:39:23 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] station
deauth via MLME (reason 2)
May 9 23:39:23 gateway kernel: wlan0: [c4:85:08:3f:9e:c2] station
with aid 2 leaves

.. now, see how it's kicking you off? The transmit queue is filled
_and_ the station is asleep. We're still trying to send it stupid
amounts of data even though we can't get to the sleeping station.

So:

Try setting wpa_group_rekey in hostapd.conf to something low, like
say 15 seconds.

See if that immediately triggers things to go pear-shaped.

Then, if it does, try running hostapd in debug mode in the foreground:

# hostapd -d -d /etc/hostapd.conf 2>&1 | tee /tmp/hostapd.log

.. and then watch what goes on with rekeying as it gets booted off.

I bet that it's missing the group rekey notification from the remote
station, and it's being disconnected.

If that's the case - great! We've fixed the hardware bug. You'll be a
perfect test candidate for my power-save queue handling changes, which
look to address this saxact problem.

(yay! I hope we've fixed the TX queue hangs!)

adrian

Adrian Chadd

2013-05-09 19:58:31 UTC

.. and basically, once you've grabbed some hostapd debugging, let me
know if you see transmit queue stalls. Yes, the queue handling w/
power-save handling in hostapd mode is rather suboptimal. That's high
up on my 'broken stuff that needs fixing' list.

Let me know how the data gathering goes and we'll submit a separate PR for this.

Thanks!

adrian

Lev Serebryakov

2013-05-09 20:13:01 UTC

Hello, Adrian.
You wrote 9 мая 2013 г., 23:58:31:

AC> .. and basically, once you've grabbed some hostapd debugging, let me
AC> know if you see transmit queue stalls. Yes, the queue handling w/
AC> power-save handling in hostapd mode is rather suboptimal. That's high
AC> up on my 'broken stuff that needs fixing' list.
AC> Let me know how the data gathering goes and we'll submit a separate PR for this.
I've sent new logs, hostpad and messages from kernel.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-05-09 20:33:55 UTC

Yup, looks like the TX queue is just too deep and the EAPOL messages
aren't making out.

I thought I fixed this.

I'll see if I can replicate this on holidays, but it may take a while.
But luckily this looks like I've fixed the hardware TX queue stall!

adrian

Lev Serebryakov

2013-05-09 20:11:50 UTC

Hello, Adrian.
You wrote 9 ìàÿ 2013 ã., 23:54:08:

AC> .. now, see how it's kicking you off? The transmit queue is filled
AC> _and_ the station is asleep. We're still trying to send it stupid
AC> amounts of data even though we can't get to the sleeping station.
At same moment, station prints report about received traffic every
second... And there is no power setting settings on Windows (or I
don't know how to find them, but I see big property page with
different settings, like roaming aggressiveness and preferred
frequency band).

AC> So:
AC> Try setting wpa_group_rekey in hostapd.conf to something low, like
AC> say 15 seconds.
Done!

AC> See if that immediately triggers things to go pear-shaped.
Almost immediately, with traffic. Without traffic it seems to work,
but how could I know for sure -- without traffic? :) It looks like
Windows doesn't update status if here are no traffic on interface!

AC> Then, if it does, try running hostapd in debug mode in the foreground:

AC> # hostapd -d -d /etc/hostapd.conf 2>&1 | tee /tmp/hostapd.log
AC> .. and then watch what goes on with rekeying as it gets booted off.

AC> I bet that it's missing the group rekey notification from the remote
AC> station, and it's being disconnected.

AC> If that's the case - great! We've fixed the hardware bug. You'll be a
AC> perfect test candidate for my power-save queue handling changes, which
AC> look to address this saxact problem.
Logs for several (3 or 4) dis- and re-associations is attached.

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Maciej Milewski

2013-05-10 09:57:55 UTC

On 09.05.2013 22:11, Lev Serebryakov wrote:
> And there is no power setting settings on Windows (or I
> don't know how to find them, but I see big property page with
> different settings, like roaming aggressiveness and preferred
> frequency band).
>
Lev,
depends of driver version/manufacturer. F.ex. I have Intel 4965AG near
me and power settings are under Network Card Properties -> Advanced. But
there you can have only three types to choose:
Default,Low,High.

--
Pozdrawiam,
Maciej Milewski

Lev Serebryakov

2013-04-26 21:14:23 UTC

Hello, Adrian.
You wrote 26 апреля 2013 г., 18:56:57:

AC> And please cc -wireless with your debugging results :)
Ok.
It is interesting how throughput is not connected to rate after
these problems. Now I have client at stable 270M rate with RSSI ~28,
and only 15-20Mbit/s of UDP traffic AP->Client... :(

It looks like throughput is bad after card reset without reboot...
But no ath_bar_suspend!

Yes, after full reboot throughput is osscilate between 65 and
120Mbit/s! But with bar_suspsends!

Maybe, it is coincidence, but I npotice this for second time...

--
// Black Lion AKA Lev Serebryakov <***@FreeBSD.org>

Adrian Chadd

2013-04-26 21:17:25 UTC

On 26 April 2013 14:14, Lev Serebryakov <***@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 26 апреля 2013 г., 18:56:57:
>
> AC> And please cc -wireless with your debugging results :)
> Ok.
> It is interesting how throughput is not connected to rate after
> these problems. Now I have client at stable 270M rate with RSSI ~28,
> and only 15-20Mbit/s of UDP traffic AP->Client... :(
>
> It looks like throughput is bad after card reset without reboot...
> But no ath_bar_suspend!
>
> Yes, after full reboot throughput is osscilate between 65 and
> 120Mbit/s! But with bar_suspsends!
>
> Maybe, it is coincidence, but I npotice this for second time...

All kinds of reasons for this behaviour. We'll look into it after we
nail down why the TX queue is stalling.

adrian

Adrian Chadd

2013-05-09 21:13:14 UTC

Ah. So these non aggregate tests are showing a problem - there shouldn't be 500 frames in the txq. Ever. That's just plain dumb.

Because of this, the logic I introduced to limit queue depth for a node isn't working - the node software queue depth never grows past a handful. They're immediately queued to the hardware.

I'll go and 'fix' the queue handling for non aggregate traffic sometime in the next few days. I'll modify it so I only queue up to some threshold of non aggregate frames into the hardware queue.

Thanks,

Adrian

Sent from my Palm Pre on AT&T
On May 9, 2013 1:33 PM, Adrian Chadd <***@freebsd.org> wrote:

Yup, looks like the TX queue is just too deep and the EAPOL messages

aren't making out.

I thought I fixed this.

I'll see if I can replicate this on holidays, but it may take a while.

But luckily this looks like I've fixed the hardware TX queue stall!

adrian

63 Replies
49 Views
Permalink to this page
Disable enhanced parsing

Thread Navigation

Lev Serebryakov 2013-04-23 12:41:37 UTC

Adrian Chadd 2013-04-23 14:43:45 UTC

Lev Serebryakov 2013-04-23 14:45:28 UTC

Petko Bordjukov 2013-04-23 19:57:15 UTC

Adrian Chadd 2013-04-23 20:01:55 UTC

Adrian Chadd 2013-04-25 20:27:03 UTC

Adrian Chadd 2013-04-26 00:11:44 UTC

Lev Serebryakov 2013-04-26 08:14:51 UTC

Adrian Chadd 2013-04-26 08:45:41 UTC

Adrian Chadd 2013-04-26 08:51:17 UTC

Lev Serebryakov 2013-04-26 21:28:51 UTC

Adrian Chadd 2013-04-26 21:34:24 UTC

Adrian Chadd 2013-04-26 21:52:06 UTC

Lev Serebryakov 2013-04-27 08:33:52 UTC

Adrian Chadd 2013-04-27 18:56:30 UTC

Lev Serebryakov 2013-04-28 08:11:22 UTC

Lev Serebryakov 2013-04-28 19:32:05 UTC

Lev Serebryakov 2013-04-28 18:50:06 UTC

Adrian Chadd 2013-04-28 19:38:20 UTC

Lev Serebryakov 2013-04-28 19:41:33 UTC

Adrian Chadd 2013-04-28 19:49:38 UTC

Lev Serebryakov 2013-04-28 20:14:47 UTC

Lev Serebryakov 2013-04-28 20:43:52 UTC

Adrian Chadd 2013-04-29 07:29:32 UTC

Adrian Chadd 2013-04-29 07:30:00 UTC

Lev Serebryakov 2013-04-29 07:36:23 UTC

Adrian Chadd 2013-04-29 07:49:03 UTC

Adrian Chadd 2013-04-30 04:53:00 UTC

Lev Serebryakov 2013-04-30 08:24:59 UTC

Adrian Chadd 2013-04-30 15:41:12 UTC

Lev Serebryakov 2013-04-30 20:07:14 UTC

Adrian Chadd 2013-04-30 22:17:52 UTC

Lev Serebryakov 2013-05-01 09:32:20 UTC

Adrian Chadd 2013-05-01 16:07:14 UTC

Adrian Chadd 2013-05-01 16:18:10 UTC

Adrian Chadd 2013-05-03 01:54:54 UTC

Adrian Chadd 2013-05-03 04:22:40 UTC

Lev Serebryakov 2013-05-03 08:28:11 UTC

Adrian Chadd 2013-05-04 04:06:01 UTC

Adrian Chadd 2013-05-04 05:09:14 UTC

Adrian Chadd 2013-05-08 02:53:22 UTC

Lev Serebryakov 2013-05-08 20:49:31 UTC

Adrian Chadd 2013-05-08 21:00:06 UTC

Lev Serebryakov 2013-05-08 21:03:43 UTC

Adrian Chadd 2013-05-08 21:24:12 UTC

Lev Serebryakov 2013-05-09 17:18:49 UTC

Adrian Chadd 2013-05-09 17:23:20 UTC

Lev Serebryakov 2013-05-09 17:26:04 UTC

Adrian Chadd 2013-05-09 17:33:22 UTC

Adrian Chadd 2013-05-09 19:24:07 UTC

Lev Serebryakov 2013-05-09 19:37:31 UTC

Adrian Chadd 2013-05-09 19:39:49 UTC

Lev Serebryakov 2013-05-09 19:43:22 UTC

Lev Serebryakov 2013-05-09 19:42:32 UTC

Lev Serebryakov 2013-05-09 19:48:07 UTC

Adrian Chadd 2013-05-09 19:54:08 UTC

Adrian Chadd 2013-05-09 19:58:31 UTC

Lev Serebryakov 2013-05-09 20:13:01 UTC

Adrian Chadd 2013-05-09 20:33:55 UTC

Lev Serebryakov 2013-05-09 20:11:50 UTC

Maciej Milewski 2013-05-10 09:57:55 UTC

Lev Serebryakov 2013-04-26 21:14:23 UTC

Adrian Chadd 2013-04-26 21:17:25 UTC

Adrian Chadd 2013-05-09 21:13:14 UTC

about - legalese

Loading...