Discussion:
Unicast packets stop being transmitted to a particular station, under load, when WPA2 is enabled
Avery Pennarun
2014-05-12 01:57:54 UTC
Permalink
Version: 3.15-rc1 and ath10k-stable-3.11-8 (both via backports to kernel 3.2.26)
Firmware: 10.1.467.2-1

Steps:
- Configure ath10k as AP on channel 149, width 80 MHz, WPA2 encryption
- Connect my 2009 macbook
- Start a ping of 8.8.8.8 in the background from my macbook
- Generate some traffic. The trigger varies, but running uTorrent or
quickly opening a lot of background tabs in Chrome usually seems to
set it off within a couple of minutes. iperf and isoping don't seem
to cause any trouble.

Expected:
- ping keeps pinging

Actual:
- tcpdump on the ath10k host shows ICMP requests coming in, and
responses going out.
- tcpdump on the macbook shows ICMP requests going out, but no
responses coming back.
- tcpdump -I (radiotap mode) on the macbook is a little hard to
understand since it's encrypted, but it shows some packets coming out
of the ath10k (broadcasts, I think) but no unicast packets to the
macbook.
- tcpdump on the macbook *does* show broadcast packets arriving. For
example, ARP requests and "ping -b 192.168.1.255" (my local subnet IP
address) get through.
- Disconnecting and then reconnecting the wifi on my macbook fixes the
problem until it next triggers.
- Other STAs connected to the AP are not affected when the one STA
isn't able to communicate (although each one has the potential to
trigger the problem)

Disabling encryption makes the problem go away permanently.

This is pretty quick for me to reproduce, but unfortunately I don't
have any steps to trigger it instantly, nor any command line tools
that seem to make it happen. (For example, I tried multiple parallel
'curl' processes in a loop, with no luck.)

Nothing interesting appears in the dmesg or hostapd logs at the time
of the problem.

This sounds like it could be a problem with crypto session keys, but I
don't understand why it would only be wrong in a single direction. I
also don't think my keys are rotating this quickly, so this shouldn't
be a key rotation problem (though I don't understand very well how
that works).

It might be my imagination, but it's possible that this triggers more
quickly if my macbook has been connected for a longer period of time
before generating the traffic burst.

Anything I can check to help narrow this down?

Thanks,

Avery
Dave Taht
2014-05-12 02:07:46 UTC
Permalink
Post by Avery Pennarun
Version: 3.15-rc1 and ath10k-stable-3.11-8 (both via backports to kernel 3.2.26)
Firmware: 10.1.467.2-1
- Configure ath10k as AP on channel 149, width 80 MHz, WPA2 encryption
- Connect my 2009 macbook
- Start a ping of 8.8.8.8 in the background from my macbook
- Generate some traffic. The trigger varies, but running uTorrent or
quickly opening a lot of background tabs in Chrome usually seems to
set it off within a couple of minutes. iperf and isoping don't seem
to cause any trouble.
- ping keeps pinging
- tcpdump on the ath10k host shows ICMP requests coming in, and
responses going out.
- tcpdump on the macbook shows ICMP requests going out, but no
responses coming back.
- tcpdump -I (radiotap mode) on the macbook is a little hard to
understand since it's encrypted, but it shows some packets coming out
of the ath10k (broadcasts, I think) but no unicast packets to the
macbook.
- tcpdump on the macbook *does* show broadcast packets arriving. For
example, ARP requests and "ping -b 192.168.1.255" (my local subnet IP
address) get through.
- Disconnecting and then reconnecting the wifi on my macbook fixes the
problem until it next triggers.
- Other STAs connected to the AP are not affected when the one STA
isn't able to communicate (although each one has the potential to
trigger the problem)
Disabling encryption makes the problem go away permanently.
This is pretty quick for me to reproduce, but unfortunately I don't
have any steps to trigger it instantly, nor any command line tools
that seem to make it happen. (For example I tried multiple parallel
'curl' processes in a loop, and no lock.)
Nothing interesting appears in the dmesg or hostapd logs at the time
of the problem.
This sounds like it could be a problem with crypto session keys, but I
don't understand why it would only be wrong in a single direction. I
also don't think my keys are rotating this quickly, so this shouldn't
be a key rotation problem (though I don't understand very well how
that works).
I have been failing to find and fix a very similar problem on the
ath9k for many months now. What I see happening there is that one or
more of the hardware queues locks up and stops transmitting traffic.
So, for example, I might get traffic destined for BK (the background
queue, traffic marked CS1) hung, but BE remains fine. Most recently I
was able to lock up the VO, VI, and BK queues by exercising it
overnight with multiple copies of the rrul test.

I don't know much about how the hardware queues are configured on
ath10k, but you can land stuff in each queue by marking with CS0, CS1,
CS5, and CS7 (BE,BK,VI,VO) on mac80211 based devices.
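
For reference, a minimal Python sketch of the mapping described above. It assumes the usual mac80211 behaviour, where the top three bits of the DS field become the 802.1d user priority and each priority falls into one WMM access category. The table and function names are illustrative only, not kernel identifiers.

```python
# Sketch: how a DSCP class selector ends up in a WMM access category.
# Assumption: priority = top 3 bits of the DS field, and priorities map
# as 1,2 -> BK; 0,3 -> BE; 4,5 -> VI; 6,7 -> VO.

UP_TO_AC = ["BE", "BK", "BK", "BE", "VI", "VI", "VO", "VO"]


def class_selector_to_ac(cs: int) -> str:
    """Return the access category for class selector CS<cs> (0..7)."""
    if not 0 <= cs <= 7:
        raise ValueError("class selector must be in 0..7")
    ds_field = cs << 5               # CSn occupies the top three bits
    user_priority = ds_field >> 5    # what the classifier keeps
    return UP_TO_AC[user_priority]


if __name__ == "__main__":
    for cs in (0, 1, 5, 7):
        print(f"CS{cs} -> {class_selector_to_ac(cs)}")
```

Under that assumption, CS0, CS1, CS5, and CS7 each select a different access category, which is why those four markings are enough to exercise all four queues.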

I can make it happen more often, and faster, if the associated station
is a considerable distance away and has a weaker signal than a nearby one.

My bug in the bug tracker is here:

http://www.bufferbloat.net/issues/442
Post by Avery Pennarun
It might be my imagination, but it's possible that this triggers more
quickly if my macbook has been connected for a longer period of time
before generating the traffic burst.
Anything I can check to help narrow this down?
Move it farther away.

Blow it up with netperf-wrappers -H someserver rrul...
Post by Avery Pennarun
Thanks,
Avery
_______________________________________________
ath10k mailing list
http://lists.infradead.org/mailman/listinfo/ath10k
--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Avery Pennarun
2014-05-12 02:29:06 UTC
Permalink
Post by Dave Taht
I have been failing to find and fix a very similar problem on the
ath9k for many months now. What I see happening there is that one or
more of the
hardware queues locks up, and stops transmitting traffic. So, for
example I might get traffic destined for the BK (background queue,
traffic marked CS1) hung,
but BE remains fine. Most recently I was able to lock up the VO, VI
AND BK queues by exercising it overnight with multiple copies of the
rrul test.
I don't know much about how the hardware queues are configured on
ath10k, but you can land stuff in each queue by marking with CS0, CS1,
CS5, and CS7 (BE,BK,VI,VO) on mac80211 based devices.
I think my problem may be something else. In particular, it seems to
affect each station separately, and doesn't seem to happen if I
disable encryption. (Does your ath9k problem trigger if encryption is
turned off?) I also have an ath9k device in the same AP on 2.4 GHz,
and it doesn't trigger there either. I haven't attempted to see if
your bug triggers on that one though :)
Post by Dave Taht
I can make it happen more often, faster, if the associated station has
considerable distance and less signal strength than nearby.
I just checked, and my bug seems to trigger more often when I'm at a
longer distance (my macbook says about -60 RSSI) and less often at a
closer distance (currently macbook reports RSSI of -41). Not sure if
this is related to increased retransmits or decreased speed or
something else.
Post by Dave Taht
Blow it up with netperf-wrappers -H someserver rrul...
That's not a bad idea... I really need to get netperf-wrappers going
for some stress testing :)

Have fun,

Avery
Dave Taht
2014-05-12 02:42:38 UTC
Permalink
Post by Avery Pennarun
Post by Dave Taht
I have been failing to find and fix a very similar problem on the
ath9k for many months now. What I see happening there is that one or
more of the
hardware queues locks up, and stops transmitting traffic. So, for
example I might get traffic destined for the BK (background queue,
traffic marked CS1) hung,
but BE remains fine. Most recently I was able to lock up the VO, VI
AND BK queues by exercising it overnight with multiple copies of the
rrul test.
I don't know much about how the hardware queues are configured on
ath10k, but you can land stuff in each queue by marking with CS0, CS1,
CS5, and CS7 (BE,BK,VI,VO) on mac80211 based devices.
I think my problem may be something else. In particular, it seems to
affect each station separately, and doesn't seem to happen if I
disable encryption. (Does your ath9k problem trigger if encryption is
turned off?)
No. WPA2 only so far.

I will try multiple stations to see if I can get it to occur only on a
per-station basis. (there are hardware queues for multiple forms of
traffic not just the visible VO, VI, BE, and BK queues)
Post by Avery Pennarun
I also have an ath9k device in the same AP on 2.4 GHz,
and it doesn't trigger there either. I haven't attempted to see if
your bug triggers on that one though :)
It really takes work to trigger it, and I can now do it on both
2.4 GHz and 5 GHz. Getting it down to under 6 hours of high traffic
recently was an accomplishment.
Post by Avery Pennarun
Post by Dave Taht
I can make it happen more often, faster, if the associated station has
considerable distance and less signal strength than nearby.
There are rarely executed code paths controlling how noise rejection
works, and all sorts of hardware issues in configuring it that vary
between chipset versions. A ton of patches landed in head with an
update to the ANI values that worked on newer versions of the ath9k
chipset and later had to be modified to deal with older ath9k chipsets.
Post by Avery Pennarun
I just checked, and my bug seems to trigger more often when I'm at a
longer distance (my macbook says about -60 RSSI) and less often at a
closer distance (currently macbook reports RSSI of -41). Not sure if
this is related to increased retransmits or decreased speed or
something else.
Post by Dave Taht
Blow it up with netperf-wrappers -H someserver rrul...
That's not a bad idea... I really need to get netperf-wrappers going
for some stress testing :)
The hardware queues are rarely tested.

If you just want to blow up one queue at a time, the syntax for netperf is

You can also arbitrarily do tos-setting with iptables.

dnsmasq uses CS6 by default, btw, so its DHCP packets land by default
in VO and then get shuffled over to the multicast hw queue.
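
As an alternative to iptables, an application can mark its own traffic. The following is a rough Python sketch, not the iptables approach mentioned above: it sets the IP_TOS socket option on a UDP socket so every datagram carries a chosen class selector. The function names, payload size, and packet count are made up for illustration.

```python
import socket


def tos_for_class_selector(cs: int) -> int:
    """Return the TOS/DS byte for class selector CS<cs> (0..7)."""
    if not 0 <= cs <= 7:
        raise ValueError("class selector must be in 0..7")
    return cs << 5  # class selectors use the top three bits of the byte


def send_marked_udp(host: str, port: int, cs: int,
                    payload: bytes = b"x" * 1200, count: int = 100) -> int:
    """Send `count` UDP datagrams to host:port marked with CS<cs>.

    Returns the TOS byte that was applied to the socket.
    """
    tos = tos_for_class_selector(cs)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # IP_TOS applies to every packet sent on this socket afterwards.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
        for _ in range(count):
            sock.sendto(payload, (host, port))
    return tos

# Example (hypothetical station address and port):
#   send_marked_udp("192.168.1.107", 9, cs=1)   # CS1, normally BK
```

This only generates load in one direction and is no substitute for netperf; it is just a quick way to land packets in a particular queue.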
Post by Avery Pennarun
Have fun,
Avery
--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Dave Taht
2014-05-12 02:46:10 UTC
Permalink
Post by Dave Taht
Post by Avery Pennarun
Post by Dave Taht
I have been failing to find and fix a very similar problem on the
ath9k for many months now. What I see happening there is that one or
more of the
hardware queues locks up, and stops transmitting traffic. So, for
example I might get traffic destined for the BK (background queue,
traffic marked CS1) hung,
but BE remains fine. Most recently I was able to lock up the VO, VI
AND BK queues by exercising it overnight with multiple copies of the
rrul test.
I don't know much about how the hardware queues are configured on
ath10k, but you can land stuff in each queue by marking with CS0, CS1,
CS5, and CS7 (BE,BK,VI,VO) on mac80211 based devices.
I think my problem may be something else. In particular, it seems to
affect each station separately, and doesn't seem to happen if I
disable encryption. (Does your ath9k problem trigger if encryption is
turned off?)
No. WPA2 only so far.
I will try multiple stations to see if I can get it to occur only on a
per-station basis. (there are hardware queues for multiple forms of
traffic not just the visible VO, VI, BE, and BK queues)
Post by Avery Pennarun
I also have an ath9k device in the same AP on 2.4 GHz,
and it doesn't trigger there either. I haven't attempted to see if
your bug triggers on that one though :)
It really takes work to trigger it, and I can can now do it on both
2.4ghz and 5. Getting it down to under 6 hours of high traffic
recently was an accomplishment.
Post by Avery Pennarun
Post by Dave Taht
I can make it happen more often, faster, if the associated station has
considerable distance and less signal strength than nearby.
There are not often executed code paths controlling how noise rejection
works, and all sorts of hardware issues on configuring it that vary between
chipset versions. Ton of patches had landed in head that had an update
to the ANI values
that worked on newer versions of the ath9k chipset that later had to be modified
to deal with older ath9k chipsets.
Post by Avery Pennarun
I just checked, and my bug seems to trigger more often when I'm at a
longer distance (my macbook says about -60 RSSI) and less often at a
closer distance (currently macbook reports RSSI of -41). Not sure if
this is related to increased retransmits or decreased speed or
something else.
Post by Dave Taht
Blow it up with netperf-wrappers -H someserver rrul...
That's not a bad idea... I really need to get netperf-wrappers going
for some stress testing :)
The hardware queues are rarely tested.
If you just want to blow up one queue at a time, the syntax for netperf is
netperf -H someserver -t the_test -Y CS1,CS1 # or CS5,CS5 or CS6,CS6

I have been flooding all the queues with both -t TCP_STREAM and TCP_MAERTS
to make it happen using the rrul test, but I have also made it happen
with BE only.

Getting one data point every day or so makes for slow debugging.
Post by Dave Taht
You can also arbitrarily do tos-setting with iptables.
dnsmasq uses CS6 by default, btw, so it's DHCP packets land by default
in VO and then get shuffled over to the multicast hw queue.
--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Adrian Chadd
2014-05-12 14:49:38 UTC
Permalink
Hi,

I've faced this in FreeBSD. It's only very recently that I found some
corner-case block-ack window tracking bugs that only occurred during
periods of extreme packet loss. What I would do when this happens:

* I'd dump out the entire software queue for the hung station and
hardware queue state. The hardware queue state tends to be quite small
on ath9k/FreeBSD, as we artificially limit the queue depth.
* I also hacked up a rolling log of all the transmit, transmit
completion, baw add, and baw remove log entries, so I could go back in
history to find where the hole came from.
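
To illustrate the rolling-log idea, here is a small Python sketch. It is not the FreeBSD code, which lives in the driver; the class name, event kinds, and fields are hypothetical. It keeps only the most recent events in a fixed-size buffer so the history leading up to a hang can be inspected afterwards.

```python
from collections import deque


class RollingLog:
    """Fixed-size log of transmit-path events; the oldest entries fall off."""

    def __init__(self, capacity: int = 1_000_000):
        self._events = deque(maxlen=capacity)
        self._counter = 0  # monotonically increasing event number

    def record(self, kind: str, tid: int, seqno: int) -> None:
        """Record one event, e.g. kind in {"tx", "tx_done", "baw_add", "baw_del"}."""
        self._counter += 1
        self._events.append((self._counter, kind, tid, seqno))

    def history(self, tid=None):
        """Return retained events, optionally filtered to a single TID."""
        return [e for e in self._events if tid is None or e[2] == tid]


if __name__ == "__main__":
    log = RollingLog(capacity=4)
    for seqno in range(6):
        log.record("baw_add", tid=0, seqno=seqno)
    for event in log.history(tid=0):
        print(event)
```

A bounded buffer matters here because a trigger can take hundreds of millions of events to appear; only the tail of the log is of interest once the station hangs.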

The last one I found took a day of active torrenting and around 500
million log line entries just to trigger. :-)

As for firmware, lemme respond to that separately.



-a
Ben Greear
2014-05-12 03:09:21 UTC
Permalink
Post by Avery Pennarun
Version: 3.15-rc1 and ath10k-stable-3.11-8 (both via backports to kernel 3.2.26)
Firmware: 10.1.467.2-1
- Configure ath10k as AP on channel 149, width 80 MHz, WPA2 encryption
- Connect my 2009 macbook
- Start a ping of 8.8.8.8 in the background from my macbook
- Generate some traffic. The trigger varies, but running uTorrent or
quickly opening a lot of background tabs in Chrome usually seems to
set it off within a couple of minutes. iperf and isoping don't seem
to cause any trouble.
- ping keeps pinging
- tcpdump on the ath10k host shows ICMP requests coming in, and
responses going out.
- tcpdump on the macbook shows ICMP requests going out, but no
responses coming back.
- tcpdump -I (radiotap mode) on the macbook is a little hard to
understand since it's encrypted, but it shows some packets coming out
of the ath10k (broadcasts, I think) but no unicast packets to the
macbook.
- tcpdump on the macbook *does* show broadcast packets arriving. For
example, ARP requests and "ping -b 192.168.1.255" (my local subnet IP
address) get through.
- Disconnecting and then reconnecting the wifi on my macbook fixes the
problem until it next triggers.
- Other STAs connected to the AP are not affected when the one STA
isn't able to communicate (although each one has the potential to
trigger the problem)
Can you reproduce with any other type of station
device?

Also, have you tried sniffing with a third device to see if
the AP actually puts the ICMP responses on the air?

Thanks,
Ben
--
Ben Greear <***@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Avery Pennarun
2014-05-12 03:54:08 UTC
Permalink
Post by Ben Greear
Can you reproduce with any other type of station
device?
I'll try. It's a little tricky since it seems to involve running
Chrome, etc, rather than my usual command-line tests for which I have
a wider variety of clients.
Post by Ben Greear
Also, have you tried sniffing with a third device to see if
the AP actually puts the ICMP responses on the air?
I did try that. As far as I can tell, the ICMP responses are simply
not being sent at all.
Ben Greear
2014-05-12 04:05:28 UTC
Permalink
Post by Avery Pennarun
Post by Ben Greear
Can you reproduce with any other type of station
device?
I'll try. It's a little tricky since it seems to involve running
Chrome, etc, rather than my usual command-line tests for which I have
a wider variety of clients.
Post by Ben Greear
Also, have you tried sniffing with a third device to see if
the AP actually puts the ICMP responses on the air?
I did try that. As far as I can tell, the ICMP responses are simply
not being sent at all.
I haven't dug into many of the stats yet, but it's possible the
ath10k debugfs file would show some types of transmit errors in this case?

Might be interesting to see how long it takes the AP to generate
the tx status response for the transmitted ICMP packets. My firmware
has some extended tx status, but I'm not sure it has anything overly
useful for your case...I was mostly concerned with the tx rate reporting
when writing it.

Thanks,
Ben
--
Ben Greear <***@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Avery Pennarun
2014-05-12 04:56:04 UTC
Permalink
Post by Ben Greear
Post by Avery Pennarun
Post by Ben Greear
Also, have you tried sniffing with a third device to see if
the AP actually puts the ICMP responses on the air?
I did try that. As far as I can tell, the ICMP responses are simply
not being sent at all.
I was incorrect about this because I was looking at the wrong data. I
tested again with a more obvious method, running "ping -i0.01
192.168.1.107" where 107 is the address of my macbook. With 100
packets per second, they overwhelm the rest of the traffic so it's
easy to see whether they're coming through.

The AP definitely *is* transmitting the packets on the air. Even my
macbook can see them in the radiotap tcpdump mode, but it doesn't see
them at the IP layer. So they are either being encrypted wrong or my
macbook is decrypting them wrong, I guess.

I think I'll try using wireshark and see what it thinks...
Post by Ben Greear
I haven't dug into many of the stats yet, but it's possible the
ath10k debugfs file would show some types of transmit errors in this case?
Possibly. Can you give me a hint of where to look? I don't really
know what these files do.
Post by Ben Greear
Might be interesting to see how long it takes the AP to generate
the tx status response for the transmitted ICMP packets. My firmware
has some extended tx status, but I'm not sure it has anything overly
useful for your case...I was mostly concerned with the tx rate reporting
when writing it.
I can try it with your firmware if you think there is useful data to
gather, although since it turned out my earlier statement wasn't true
maybe this is less important :)

Thanks,

Avery
Ben Greear
2014-05-12 05:05:53 UTC
Permalink
Post by Avery Pennarun
Post by Ben Greear
Post by Avery Pennarun
Post by Ben Greear
Also, have you tried sniffing with a third device to see if
the AP actually puts the ICMP responses on the air?
I did try that. As far as I can tell, the ICMP responses are simply
not being sent at all.
I was incorrect about this because I was looking at the wrong data. I
tested again with a more obvious method, running "ping -i0.01
192.168.1.107" where 107 is the address of my macbook. With 100
packets per second, they overwhelm the rest of the traffic so it's
easy to see whether they're coming through.
The AP definitely *is* transmitting the packets on the air. Even my
macbook can see them in the radiotap tcpdump mode, but it doesn't see
them at the IP layer. So they are either being encrypted wrong or my
macbook is decrypting them wrong, I guess.
I think I'll try using wireshark and see what it thinks...
Post by Ben Greear
I haven't dug into many of the stats yet, but it's possible the
ath10k debugfs file would show some types of transmit errors in this case?
Possibly. Can you give me a hint of where to look? I don't really
know what these files do.
Post by Ben Greear
Might be interesting to see how long it takes the AP to generate
the tx status response for the transmitted ICMP packets. My firmware
has some extended tx status, but I'm not sure it has anything overly
useful for your case...I was mostly concerned with the tx rate reporting
when writing it.
I can try it with your firmware if you think there is useful data to
gather, although since it turned out my earlier statement wasn't true
maybe this is less important :)
If it's getting on the air, then I think the only way to figure out
what is wrong is to decode the packets and see if they are encrypted
properly or not. I think there is a way to get wireshark to decode
pkts by feeding it the proper keys, but I have not ever actually tried
doing that.

If pkts do not get on the air, then possibly the tx status and/or tx
error counters could tell you why, but it seems that is not relevant
in this case.

Thanks,
Ben
--
Ben Greear <***@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Avery Pennarun
2014-05-12 05:19:24 UTC
Permalink
Post by Ben Greear
If it's getting on the air, then I think the only way to figure out
what is wrong is to decode the packets and see if they are encrypted
properly or not. I think there is a way to get wireshark to decode
pkts by feeding it the proper keys, but I have not ever actually tried
doing that.
If pkts do not get on the air, then possibly the tx status and/or tx
error counters could tell you why, but it seems that is not relevant
in this case.
Okay, in fact I just learned how to use that wireshark feature last
week, so I tried it just now and it worked. It clearly shows the
downstream packets from the AP *are* decryptable from wireshark, but
they get no replies from the macbook (and the macbook doesn't show
them at its IP layer).

Any guesses? Does that mean somehow the macbook got the wrong
*decryption* keys or something?

I'm currently trying to wade through the giant wireshark capture
trying to find the actual point where the dropout occurred, but it
seems to slow down a bit with 1.5 million frames captured :)
Avery Pennarun
2014-05-12 07:07:10 UTC
Permalink
Post by Ben Greear
If it's getting on the air, then I think the only way to figure out
what is wrong is to decode the packets and see if they are encrypted
properly or not. I think there is a way to get wireshark to decode
pkts by feeding it the proper keys, but I have not ever actually tried
doing that.
Okay, here is a fairly reduced capture of my wireshark trace:
http://apenwarr.ca/tmp/ath10k-utorrent-dropout-v2-reduced.pcapng.gz

To decode the wifi packets in wireshark, you need to follow these steps:
- open the pcap
- Edit | Preferences
- Protocols | IEEE 802.11
- Enable Decryption: checked
- Decryption Keys: Edit
- New
- Key type: wpa-pwd
- Key: my-password:my-ssid (I'll email these to Ben privately; anyone
else interested, let me know)
- Ok
- Ok
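
For anyone who would rather not enter a passphrase directly: the wpa-pwd key type takes a passphrase and SSID, from which the 256-bit pre-shared key is derived using the standard WPA2-Personal PBKDF2 computation. A minimal Python sketch of that derivation follows; the function name is made up, and "my-password" and "my-ssid" are the same placeholders as in the steps above.

```python
import hashlib


def wpa_psk(passphrase: str, ssid: str) -> str:
    """Derive the WPA/WPA2-Personal pre-shared key as 64 hex characters.

    Standard derivation: PBKDF2-HMAC-SHA1, with the SSID as the salt,
    4096 iterations, and a 32-byte output.
    """
    key = hashlib.pbkdf2_hmac(
        "sha1",
        passphrase.encode("utf-8"),
        ssid.encode("utf-8"),
        4096,
        32,
    )
    return key.hex()


if __name__ == "__main__":
    # Placeholder credentials, not the real ones for this capture.
    print(wpa_psk("my-password", "my-ssid"))
```

The resulting hex string can be entered with key type wpa-psk instead of wpa-pwd, which avoids sharing the passphrase itself.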

The capture contains a lot of simultaneous TCP and UDP sessions, since
I can only trigger the problem when there is quite a bit of stuff
going on. Luckily the air itself was relatively quiet so there isn't
too much noise other than my AP and laptop.

I believe I can narrow down the dropout to somewhere between rows
20035 and 20391. 20035 is a packet from a remote server that is ACKed
by my macbook. 20391 is a packet from the same remote server that is
*not* ACKed by my macbook, and then there are a bunch of retransmits
on that session after that.

I'm still looking at the capture to find any clues, but if anybody
else wants to take a peek, please do :)

Thanks,

Avery
Avery Pennarun
2014-05-12 08:21:49 UTC
Permalink
Post by Avery Pennarun
Post by Ben Greear
If it's getting on the air, then I think the only way to figure out
what is wrong is to decode the packets and see if they are encrypted
properly or not. I think there is a way to get wireshark to decode
pkts by feeding it the proper keys, but I have not ever actually tried
doing that.
http://apenwarr.ca/tmp/ath10k-utorrent-dropout-v2-reduced.pcapng.gz
[...]
Ben Greear
2014-05-12 14:10:29 UTC
Permalink
Post by Avery Pennarun
Post by Ben Greear
If it's getting on the air, then I think the only way to figure out
what is wrong is to decode the packets and see if they are encrypted
properly or not. I think there is a way to get wireshark to decode
pkts by feeding it the proper keys, but I have not ever actually tried
doing that.
http://apenwarr.ca/tmp/ath10k-utorrent-dropout-v2-reduced.pcapng.gz
[...]
Avery Pennarun
2014-05-14 19:07:15 UTC
Permalink
Nevertheless, I can certainly see things being ACKed in my wifi
traces. I think maybe some incorrect Add Block Ack Request packets
may be triggering a bug in the wifi driver on my macbook?
IEEE 802.11 wireless LAN management frame
  Fixed parameters
    Category code: Block Ack (3)
    Action code: Add Block Ack Request (0x00)
    Dialog token: 0x01
    Block Ack Parameters: 0x1017, A-MSDUs, Block Ack Policy
      .... .... .... ...1 = A-MSDUs: Permitted in QoS Data MPDUs
      .... .... .... ..1. = Block Ack Policy: Immediate Block Ack
      .... .... ..01 01.. = Traffic Identifier: 0x0005
      0001 0000 00.. .... = Number of Buffers (1 Buffer = 2304 Bytes): 64
    Block Ack Timeout: 0x0000
    Block Ack Starting Sequence Control (SSC): 0x0000
      .... .... .... 0000 = Fragment: 0
      0000 0000 0000 .... = Starting Sequence Number: 0
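
The Block Ack Parameters field in that dump can be unpacked by hand. A small Python sketch of the bit layout, as shown in the Wireshark decode above (the function and key names are illustrative):

```python
def decode_ba_params(value: int) -> dict:
    """Split the 16-bit Block Ack Parameters field into its subfields.

    Layout, following the dump above:
      bit 0      A-MSDU permitted
      bit 1      Block Ack policy (1 = immediate)
      bits 2-5   traffic identifier (TID)
      bits 6-15  number of buffers
    """
    return {
        "amsdu_permitted": bool(value & 0x0001),
        "immediate_policy": bool(value & 0x0002),
        "tid": (value >> 2) & 0x0F,
        "buffer_size": (value >> 6) & 0x3FF,
    }


if __name__ == "__main__":
    print(decode_ba_params(0x1017))
```

With the value from the dump, 0x1017, this yields TID 5 and 64 buffers, matching the decode. An otherwise identical request for TID 3 would carry 0x100F.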
Replicated this a few more times. Still on a Macbook, but a different
one (latest MacOS version this time) on a different setting.

Once again, the problem kicked in immediately following an ADDBA
request like the above. This time the TID was equal to 3 (which I
think was the problematic one last time too; I accidentally pasted the
ADDBA for TID=5, but in my earlier email I called out TID=3 as the one
that triggered the problem).

In this new test case, interestingly, unicast continued to work as
long as I was sending packets off-LAN (ie. through a gateway to
somewhere on the Internet); packets could be exchanged in both
directions. But I couldn't ping the gateway or any other
LAN-connected device. This is fishy since I might have blamed a lack
of multicast/broadcast and thus a failure of ARP (unicast was
obviously working) but my ARP table definitely contained the IP of the
gateway.

Anyway, one new clue: the ADDBA was triggered by a TCP ACK packet
being sent by the SPDY port on a server somewhere with a particular
packet priority: diffserv field = 0x60 (class selector 3, no ECN).
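
To make the connection between that diffserv value and TID=3 explicit, here is a short Python sketch that splits the DS field into its parts. It assumes the transmit-side classifier takes the top three bits of the DS field as the priority, and therefore the TID, which is how I understand the mac80211 default to work; the function name is made up.

```python
def decode_ds_field(ds: int) -> dict:
    """Split the 8-bit IP DS field (the old TOS byte) into its parts."""
    if not 0 <= ds <= 0xFF:
        raise ValueError("DS field must fit in one byte")
    return {
        "dscp": ds >> 2,   # upper six bits
        "ecn": ds & 0x03,  # lower two bits
        # Assumption: the wifi priority/TID is the top three bits.
        "tid": ds >> 5,
    }


if __name__ == "__main__":
    print(decode_ds_field(0x60))
```

For 0x60 this gives DSCP 24 (class selector 3), ECN 0, and TID 3, which would explain why the first packet with that marking caused an ADDBA request for a TID that had not been used before.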

I don't know why this would scramble my Macbook's wifi, since the
packets themselves looked fine, but it did. This was definitely the
first SPDY packet on the session and it triggered the ADDBA and died
instantly.

I really should take Dave Taht's advice and get that netperf thing
going and test all the QoS queues... but meanwhile, I think I'm in a
big hurry so I probably want to just rip out all support for mapping
diffserv to QoS on the ath10k transmit side. Does anyone have any
suggestions on where to look for my hack-and-slash operation?

Thanks,

Avery
Dave Taht
2014-05-14 19:26:42 UTC
Permalink
Post by Avery Pennarun
I really should take Dave Taht's advice and get that netperf thing
going and test all the QoS queues... but meanwhile, I think I'm in a
big hurry so I probably want to just rip out all support for mapping
diffserv to QoS on the ath10k transmit side. Does anyone have any
suggestions on where to look for my hack-and-slash operation?
Tis easy to hack and slash it out of cfg80211_classify8021d in
net/wireless/util.c

Recently someone added a direct mapping from vlan priorities to wifi
queues here - which is stupid, as the VO queue does not behave
anything like the equivalent vlan priority queue does. I'd like to see
that removed or mapped properly, also.

Or you can try telling hostapd to never negotiate wmm/802.11e.

Last night I got it to blow up the VI, VO, and BK queues on the ath9k
in about 4 hours, and just now I got the BK queue to misbehave in
under 20 minutes, driven by an iwl driver on Linux, on both IPv6 and
IPv4. The problem is probably elsewhere in the stack rather than in
the drivers themselves (although things are so intertwined down here
that it's hard to tell). Perhaps running on a noisier network triggers
it faster.

I logged last night's progress on http://www.bufferbloat.net/issues/442
- with pictures and logs - there are packet captures on that bug now,
but not on the monitoring interface. Adding that now.

What I just saw looks like this.

Command: /usr/local/bin/netperf -P 0 -v 0 -D -0.2 -6 -Y CS1,CS1 -H
fdfe:7b16:f8e2::1 -t TCP_STREAM -l 300 -f m
Program output:
netperf: send_omni: connect_data_socket failed: Connection timed out

Warning: Command produced no valid data.
Data series: Ping (ms) UDP BK
Runner: NetperfDemoRunner
Command: /usr/local/bin/netperf -P 0 -v 0 -D -0.2 -6 -Y CS1,CS1 -H
fdfe:7b16:f8e2::1 -t UDP_RR -l 310 -- -e 2
Standard error output:

+ :
+ expr 3 + 1
+ i=4
+ netperf-wrapper -l 300 -H 172.21.18.1 -t default-4 -o default-4.svg
-p all_scaled rrul
Warning: Program exited non-zero (1).
Command: /usr/local/bin/netperf -P 0 -v 0 -D -0.2 -4 -Y CS1,CS1 -H
172.21.18.1 -t TCP_MAERTS -l 300 -f m
Program output:
netperf: send_omni: connect_data_socket failed: Connection timed out

Warning: Program exited non-zero (1).
Command: /usr/local/bin/netperf -P 0 -v 0 -D -0.2 -4 -Y CS1,CS1 -H
172.21.18.1 -t TCP_STREAM -l 300 -f m
Program output:
netperf: send_omni: connect_data_socket failed: Connection timed out

The bk, be, and vi queues are currently still operational.
Post by Avery Pennarun
Thanks,
Avery
_______________________________________________
ath10k mailing list
http://lists.infradead.org/mailman/listinfo/ath10k
--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Adrian Chadd
2014-05-14 19:38:00 UTC
Permalink
[snip]

I found a lot of badly behaved WME/TID/QoS mapping code in FreeBSD's
atheros and net80211 code. It wouldn't surprise me to find the same in
Linux. There's not always a lot of driver/stack testing with
multiple overlapping TID traffic. (I found a lot out by actually
running lots of torrents whilst working; that triggers some concurrent
bulk/best effort/high priority traffic and it clued me in to how
screwed up it is.)

Ok, so the "seqno = 0" thing is something that various wireless stacks
do (including a couple of freebsd drivers, ugh) - when aggregation is
enabled, they reset the seqno back to 0 before continuing. It's kind
of odd, but it's typically because the non-aggregation seqno space was
done by the wireless stack (net80211 or derived) but then aggregation,
BAW tracking transmit/retransmit is done by the driver and/or
firmware. Thus it doesn't actually know about the sequence numbers in
question.

Yes, you negotiate a different block-ack setup with each TID. Same as
how each TID has a separate sequence number space. It gets more
interesting, as there's only one set of keys for all TIDs, yet each
TID can have a separate replay counter.
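
As a sanity check on the per-TID block-ack setup, here is a small sketch that decodes the two 16-bit fields from the ADDBA request in Avery's dump (Block Ack Parameters 0x1017 and Starting Sequence Control 0x0000) and tests whether a sequence number falls inside the block-ack window. The bit layout follows the field breakdown shown in that dump; the function names are invented for this example.

```python
# Sketch only: decodes ADDBA request fields using the bit layout from the dump.

def decode_ba_params(value):
    """Decode the 16-bit Block Ack Parameter Set field."""
    return {
        "amsdu_permitted": bool(value & 0x1),        # bit 0
        "immediate_ba": bool((value >> 1) & 0x1),    # bit 1
        "tid": (value >> 2) & 0xF,                   # bits 2-5
        "buffer_size": (value >> 6) & 0x3FF,         # bits 6-15
    }


def decode_ssc(value):
    """Decode the 16-bit Block Ack Starting Sequence Control field."""
    return {"fragment": value & 0xF, "start_seq": (value >> 4) & 0xFFF}


def in_ba_window(start_seq, seq, size):
    """True if seq lies in [start_seq, start_seq + size) modulo 4096."""
    return ((seq - start_seq) & 0xFFF) < size


if __name__ == "__main__":
    print(decode_ba_params(0x1017))
    print(decode_ssc(0x0000))
    print(in_ba_window(0, 63, 64), in_ba_window(0, 64, 64))
```

Decoding 0x1017 gives TID 5 with 64 buffers, matching the dump, and the starting sequence number of 0 is the "seqno = 0" reset described above.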

Anyway, lemme look at the trace. I'd like to see what's going on after
the negotiation.


-a
Avery Pennarun
2014-05-15 06:12:51 UTC
Permalink
Post by Dave Taht
Post by Avery Pennarun
but meanwhile, I think I'm in a
big hurry so I probably want to just rip out all support for mapping
diffserv to QoS on the ath10k transmit side. Does anyone have any
suggestions on where to look for my hack-and-slash operation?
Tis easy to hack and slash it out of cfg80211_classify8021d in
net/wireless/util.c
Recently someone added a direct mapping from vlan priorities to wifi
queues here - which is stupid, as the VO queue does not behave
anything like the equivalent vlan priority queue does. I'd like to see
that removed or mapped properly, also.
Yup, replaced this function with 'return 0' and my problems are gone.
(Well okay, this problem is gone, and you're probably not interested
in my personal problems. :))

Obviously just completely disabling QoS multiple queues is not the
right long term fix, but as Dave pointed out, it wasn't clear the code
there was exactly what we want anyway, so I won't feel too guilty
about it. Meanwhile, hopefully someone with deeper knowledge than me
can figure out what's with the IVs.

Thanks!

Avery
Dave Taht
2014-05-16 20:01:21 UTC
Permalink
Post by Avery Pennarun
Post by Dave Taht
Post by Avery Pennarun
but meanwhile, I think I'm in a
big hurry so I probably want to just rip out all support for mapping
diffserv to QoS on the ath10k transmit side. Does anyone have any
suggestions on where to look for my hack-and-slash operation?
Tis easy to hack and slash it out of cfg80211_classify8021d in
net/wireless/util.c
Recently someone added a direct mapping from vlan priorities to wifi
queues here - which is stupid, as the VO queue does not behave
anything like the equivalent vlan priority queue does. I'd like to see
that removed or mapped properly, also.
Yup, replaced this function with 'return 0' and my problems are gone.
(Well okay, this problem is gone, and you're probably not interested
in my personal problems. :))
The wifi alliance won't certify something that doesn't do 802.11e
"correctly". Not that I care.

A side effect of ripping this out is that some prioritized traffic, notably
multicast & dhcp (marked with CS6) now gets head of line blocked
in the BE queue until it can be delivered.
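
To make that side effect concrete, here is a toy model comparing a priority-aware classifier with one stubbed to always return 0. It is a sketch under assumptions: the "default" classifier here just uses the top three bits of the DS field, which may not match what any given kernel version does, and all names are hypothetical.

```python
# Sketch only: a toy comparison, not the cfg80211 implementation.

# Standard WMM mapping from user priority 0..7 to access category.
UP_TO_AC = ["BE", "BK", "BK", "BE", "VI", "VI", "VO", "VO"]

# DS field values for a few class selectors (DSCP shifted left by two bits).
MARKINGS = {"CS0": 0x00, "CS1": 0x20, "CS3": 0x60, "CS6": 0xC0}


def default_classifier(ds_field):
    """Assumed behaviour: user priority is the top three bits of the DS field."""
    return ds_field >> 5


def stubbed_classifier(ds_field):
    """The 'return 0' hack: every packet gets user priority 0."""
    return 0


def queue_plan(classifier):
    """Map each marking to the access category the classifier would choose."""
    return {name: UP_TO_AC[classifier(ds)] for name, ds in MARKINGS.items()}


if __name__ == "__main__":
    print("default:", queue_plan(default_classifier))
    print("stubbed:", queue_plan(stubbed_classifier))
```

In this model CS6 traffic moves from VO to BE once the classifier is stubbed, so it waits behind whatever bulk traffic is already queued there.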

A possible problem with this "fix" is that the underlying hw queues are
still having some sort of problem in the driver, and doing a
multicast flood (say, mdns-scan or uftp) while doing other traffic might
trigger it again.

The Linux model of 4 exposed queues should long ago have been
extended to include a 5th queue, observable and manageable,
for multicast.
Post by Avery Pennarun
Obviously just completely disabling QoS multiple queues is not the
right long term fix, but as Dave pointed out, it wasn't clear the code
there was exactly what we want anyway, so I won't feel too guilty
about it.
Well, after spending 4+ months trying to get to the bottom of this
myself...

... I just also ripped 802.11e out of cerowrt in this same file for more
testing.

802.11e in a packet aggregated world is both conceptually and
operationally broken, when the real resource limit is txops,
and cramming as many packets (sanely) into a txop is more the
right thing.

I've been showing that 4 hardware queues induce more latency
than 1 for a long time now. I am inspired (before I commit to
ripping out 802.11e entirely) to produce some graphs showing how
horrible flooding those queues is, compared to having one queue.

Now, getting away from doing QoS at this level opens up an opportunity
in that there are still 4 independently programmable tx queues (from
hostapd) and they could be used instead as a poor man's "mu-mimo" to
drive multiple stations more sanely.
Post by Avery Pennarun
Meanwhile, hopefully someone with deeper knowledge than me
can figure out what's with the IVs.
+10
Post by Avery Pennarun
Thanks!
Avery
--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Adrian Chadd
2014-05-16 20:10:20 UTC
Permalink
Hi,

When I get the firmware source from QCA and Ben, I'll dig into the IV
shit in more detail.

As for the 11e and txop stuff - absolutely. I've been wanting to add
some txop-time aware clue to the freebsd packet scheduler and rate
control code. Aggregation and 11e work great, but you can't just
simply keep packing max-length aggregates in when your bulk-download
station is on MCS0. That's just plain dumb. Linux and FreeBSD don't
take that into account.
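
Some back-of-the-envelope arithmetic shows why this matters. The sketch below counts payload bits only (no preamble, delimiters, padding, or block-ack exchange), uses the single-stream HT20 long-GI rates of 6.5 Mbit/s for MCS0 and 65 Mbit/s for MCS7, and picks a 4 ms airtime budget purely as an illustrative number, not as a value taken from any driver.

```python
# Sketch only: payload-bits-only airtime arithmetic for an A-MPDU.

MAX_AMPDU_BYTES = 65535      # largest A-MPDU length in 802.11n
MCS0_HT20_MBPS = 6.5         # single stream, 20 MHz, long guard interval
MCS7_HT20_MBPS = 65.0        # single stream, 20 MHz, long guard interval
BUDGET_US = 4000             # hypothetical per-transmission airtime budget


def airtime_us(payload_bytes, rate_mbps):
    """Microseconds needed to send payload_bytes at rate_mbps (1 Mbit/s = 1 bit/us)."""
    return payload_bytes * 8 / rate_mbps


def bytes_within(budget_us, rate_mbps):
    """Largest payload, capped at the A-MPDU limit, that fits in budget_us."""
    return min(MAX_AMPDU_BYTES, int(budget_us * rate_mbps // 8))


if __name__ == "__main__":
    for name, rate in (("MCS0", MCS0_HT20_MBPS), ("MCS7", MCS7_HT20_MBPS)):
        full_ms = airtime_us(MAX_AMPDU_BYTES, rate) / 1000
        fit = bytes_within(BUDGET_US, rate)
        print(f"{name}: full aggregate {full_ms:.1f} ms, fits {fit} bytes in budget")
```

A full-length aggregate at MCS0 works out to roughly 80 ms of airtime in this model, so a scheduler that thinks in time rather than bytes would cap the MCS0 station at a few kilobytes per transmission.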

We can use AMPDU and AMSDU to better interface with the timing and
transmission constraints of doing things over the open air. We're just
not.

Now that I'm not working at QCA I do plan on adding that stuff to
FreeBSD's atheros driver and hopefully work with the ath9k team to get
that into the rate control code there too. As for the firmware, I'm
happy to also add that to firmware - but not until the firmware is
open sourced.


-a
Adrian Chadd
2014-05-19 21:04:23 UTC
Permalink
Hi,

Bumping this up.

Please don't let us forget. Once I get access to the firmware source
from Ben and QCA I'll do a hell of a lot more digging.

Thanks,



-a
Dave Taht
2014-05-19 21:07:46 UTC
Permalink
From what I see in reports on the ath9k list and from OpenWrt,
I tend to think this problem is endemic, and not driver-specific.
--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Adrian Chadd
2014-05-19 21:09:22 UTC
Permalink
Post by Dave Taht
from what I see from reports on the ath9k list, and openwrt
I tend to think this problem is endemic, and not driver specific.
I thought the IV for ath10k was handled in the firmware, rather than
set up from the host. I'll have to go digging.



-a
Adrian Chadd
2014-05-14 20:20:40 UTC
Permalink
Hi,

When you quote frame offsets in a pcap, can you next time use the frame number?

I think you're referring to 20253, right? That looks like the right frame.



-a
Post by Avery Pennarun
Post by Ben Greear
If it's getting on the air, then I think the only way to figure out
what is wrong is to decode the packets and see if they are encrypted
properly or not. I think there is a way to get wireshark to decode
pkts by feeding it the proper keys, but I have not ever actually tried
doing that.
http://apenwarr.ca/tmp/ath10k-utorrent-dropout-v2-reduced.pcapng.gz
[...]
Adrian Chadd
2014-05-14 20:46:56 UTC
Permalink
ok, so I've taken a 30 second look.

The CCMP IVs used for TID 3 look creepy. They start at 0x300000000000
and go up. It's much higher than the rest of the IVs used for
transmitting on the other TIDs.
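
For reference when reading the capture by hand, here is a sketch of pulling the 48-bit packet number out of the 8-byte CCMP header that follows the 802.11 MAC header. The octet order (PN0, PN1, reserved, key ID octet, PN2..PN5) is the standard CCMP header layout; the function names and the sample header bytes are made up for this example and are not taken from the pcap.

```python
# Sketch only: extracts the CCMP packet number (PN) from a raw 8-byte header.

def ccmp_pn(header):
    """Return the 48-bit PN from an 8-byte CCMP header."""
    if len(header) != 8:
        raise ValueError("CCMP header must be exactly 8 bytes")
    pn0, pn1, _reserved, _key_octet, pn2, pn3, pn4, pn5 = header
    # PN0 is the least significant octet, PN5 the most significant.
    return (pn5 << 40) | (pn4 << 32) | (pn3 << 24) | (pn2 << 16) | (pn1 << 8) | pn0


def ccmp_key_id(header):
    """Return the 2-bit key ID from the top bits of the fourth octet."""
    return header[3] >> 6


if __name__ == "__main__":
    # Hypothetical header: ExtIV bit set (0x20), key ID 0, only PN5 non-zero.
    sample = bytes([0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x30])
    print(hex(ccmp_pn(sample)), ccmp_key_id(sample))
```

A PN of 0x300000000000 corresponds to a most significant octet (PN5) of 0x30 with every other octet zero.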

It'd be nice to see TID 3 traffic -before- it negotiates the ADDBA,
just to see what the heck is going on there.

What _should_ be happening!

* transmitting to a station should have exactly one CCMP IV, shared
across all frames being transmitted. It should be allocated in order,
regardless of the TID
* on the receiving side, the last CCMP IV seen is tracked for
each TID, and checked to avoid replay attacks.

It's choosing a totally different CCMP IV space for transmitting on
TID 3. I don't know why.
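
Here is a toy model of the receive-side replay check described above, to show how a jump in the transmit IV space on one TID could matter. The `ReplayChecker` class is invented for illustration. The `per_tid=False` mode models a purely hypothetical receiver that keeps one counter for all TIDs; nothing in this thread establishes that MacOS or any other client behaves that way.

```python
# Sketch only: a toy replay checker, not any real driver's implementation.

class ReplayChecker:
    """Accept a frame only if its PN is greater than the last PN seen."""

    def __init__(self, per_tid=True):
        self.per_tid = per_tid
        self.last_pn = {}      # counter slot -> highest PN accepted so far

    def accept(self, tid, pn):
        # Hypothetical shared-counter mode uses one slot for every TID.
        slot = tid if self.per_tid else 0
        if pn <= self.last_pn.get(slot, 0):
            return False       # treated as a replay and dropped
        self.last_pn[slot] = pn
        return True


# (tid, pn) pairs: normal traffic on TID 0, one frame on TID 3 with a PN from
# a much higher range, then more TID 0 traffic continuing the original range.
FRAMES = [(0, 100), (0, 101), (3, 0x300000000000), (0, 102), (0, 103)]


def run(per_tid):
    checker = ReplayChecker(per_tid=per_tid)
    return [checker.accept(tid, pn) for tid, pn in FRAMES]


if __name__ == "__main__":
    print("per-TID counters:", run(True))
    print("shared counter:  ", run(False))
```

With per-TID counters every frame in this sequence is accepted. With the hypothetical shared counter, everything after the TID 3 frame is dropped as a replay until the sender's PN catches up.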

(And I still don't have firmware source, so I can't tell you yet. grr.)

Ben - I'm punting this one to you!




-a
Avery Pennarun
2014-05-14 21:45:39 UTC
Permalink
Post by Adrian Chadd
It'd be nice to see TID 3 traffic -before- it negotiates the ADDBA,
just to see what the heck is going on there.
I'd have to re-check the big trace you're looking at, but at least in
my new one from today, it actually does the ADDBA before *any* traffic is
sent on TID=3.

The IV being different from what it should be definitely sounds like
something that would explain the odd behaviour of wireshark being able
to decode it, but MacOS rejecting it. I guess MacOS could be
rejecting unexpected IVs to avoid replay attacks.
Kalle Valo
2014-05-27 09:53:37 UTC
Permalink
Hi Avery,
Post by Avery Pennarun
Post by Ben Greear
Can you reproduce with any other type of station
device?
I'll try. It's a little tricky since it seems to involve running
Chrome, etc, rather than my usual command-line tests for which I have
a wider variety of clients.
Did you have a chance to check with other clients? It would be good to
know if this is a macbook-specific interoperability problem or something
else.
--
Kalle Valo
Ben Greear
2016-06-17 00:01:55 UTC
Permalink
So, to resurrect an ancient thread....

I have a user (who wants to buy and sell lots and lots of QCA chipset NICs!)
of my 10.1 firmware hitting this problem. They have a way to reproduce
it readily, though for one reason or another, I cannot reproduce it locally using what
should be an identical setup.

Their problem cannot be reproduced in a stock 10.2 firmware (firmware-3.bin, I think) on a newer kernel,
while that newer kernel does hit the problem with my 10.1 firmware. So, it is
definitely something in the firmware. Their system cannot use 10.2 firmware
at the moment because it is based on an older kernel.

So, I am curious if someone would like to give me a hint as to what was the actual
firmware fix for this issue. I don't need code snippets since that stuff is
private...but just a method name and/or file or something and I should be able
to track it down and backport just that fix. I've tried comparing obvious code
spots in 10.1 and 10.2 firmware, but nothing obvious jumps out.

Thanks,
Ben
--
Ben Greear <***@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Adrian Chadd
2014-05-12 14:53:14 UTC
Permalink
Hi,

So the keys aren't bidirectional. There'll be a set of transmit and
receive keys per station, as well as the multicast key.

If you're seeing traffic in other TIDs for a given station, then it's
likely a block-ack tracking issue.

If you only see multicast traffic to the station, but no unicast
traffic, then it could be all the HWQ's are stuck; but it also could
be a key exchange issue. Normally on FreeBSD I'd just do a monitor
mode trace and look at whether I was receiving anything; before I then
dug down to see if the encryption sequence was fine (ie, no CCMP
replays, the key is correct, etc.)

As for ath10k, it's been a while since I was deep in the firmware.
Ideally you'd like to get TX completion notifications from the chip up
to the firmware as that'll at least tell you the frame is making it
out to the air and being ACKed (and is thus not going to be a hung
hardware queue or TID.) I don't know if TX completion notifications
are completely working though; I'll check once I have firmware source
again.



-a