Discussion:
Question on 10.4 firmware and fetch-indication logic.
Ben Greear
2016-11-01 17:21:37 UTC
I am testing on modified 4.7 kernel and modified firmware with QCA9984 NIC
and lots of virtual station vdevs.

The issue I am looking at currently is that I am seeing floods of these messages
in some cases:

Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7

From this code in htt_rx.c:

static void ath10k_htt_rx_tx_fetch_ind(struct ath10k *ar, struct sk_buff *skb)
...

        /* It is okay to release the lock and use txq because RCU read
         * lock is held.
         */

        if (unlikely(!txq)) {
                if (net_ratelimit())
                        ath10k_warn(ar, "fetch-ind: failed to lookup txq for peer_id %hu tid %hhu\n",
                                    peer_id, tid);
                continue;
        }
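
For context, the txq lookup that comes back NULL here is roughly the following
(paraphrased from memory of the 4.7-era mac.c, so treat the names and details as
approximate rather than exact):

static struct ieee80211_txq *ath10k_mac_txq_lookup(struct ath10k *ar,
                                                   u16 peer_id, u8 tid)
{
        struct ath10k_peer *peer;

        lockdep_assert_held(&ar->data_lock);

        /* Once the peer has been unmapped this returns NULL, which is
         * what produces the warning above.
         */
        peer = ath10k_peer_find_by_id(ar, peer_id);
        if (!peer)
                return NULL;

        if (peer->sta)
                return peer->sta->txq[tid];
        else if (peer->vif)
                return peer->vif->txq;

        return NULL;
}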


I am getting these after the vdev in question (and its peers) have been removed. I guess these
must be stale buffers that are finally transmitted or cleaned up by the firmware after the vdev
has been deleted?

I am curious if anyone else sees something similar, and if this is expected behaviour.

Thanks,
Ben
--
Ben Greear <***@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Michal Kazior
2016-11-01 17:56:27 UTC
Post by Ben Greear
I am testing on modified 4.7 kernel and modified firmware with QCA9984 NIC
and lots of virtual station vdevs.
The issue I am looking at currently is that I am seeing floods of these messages
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
static void ath10k_htt_rx_tx_fetch_ind(struct ath10k *ar, struct sk_buff *skb)
...
        /* It is okay to release the lock and use txq because RCU read
         * lock is held.
         */
        if (unlikely(!txq)) {
                if (net_ratelimit())
                        ath10k_warn(ar, "fetch-ind: failed to lookup txq for peer_id %hu tid %hhu\n",
                                    peer_id, tid);
                continue;
        }
I am getting these after the vdev in question (and its peers) have been removed. I guess these
must be stale buffers that are finally transmitted or cleaned up by the firmware after the vdev
has been deleted?
I am curious if anyone else sees something similar, and if this is expected behaviour.
Hmm, WMI and HTT do use independent CE ring buffers, but peer_ids are
unmapped in response to HTT events, so this should be properly serialized
by the firmware itself.

Did you happen to not remove the peers prior to deleting the vdev? Perhaps
that is what triggers the !txq condition.

Perhaps it would make sense to flush HTT rx (i.e. put up a barrier) after
stopping the vdev.


Michał
Ben Greear
2016-11-01 18:04:41 UTC
Post by Michal Kazior
Post by Ben Greear
I am testing on modified 4.7 kernel and modified firmware with QCA9984 NIC
and lots of virtual station vdevs.
The issue I am looking at currently is that I am seeing floods of these messages
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
Nov 01 09:43:38 ath-9984 kernel: ath10k_pci 0000:05:00.0: fetch-ind: failed to lookup txq for peer_id 56 tid 7
static void ath10k_htt_rx_tx_fetch_ind(struct ath10k *ar, struct sk_buff *skb)
...
        /* It is okay to release the lock and use txq because RCU read
         * lock is held.
         */
        if (unlikely(!txq)) {
                if (net_ratelimit())
                        ath10k_warn(ar, "fetch-ind: failed to lookup txq for peer_id %hu tid %hhu\n",
                                    peer_id, tid);
                continue;
        }
I am getting these after the vdev in question (and its peers) have been removed. I guess these
must be stale buffers that are finally transmitted or cleaned up by the firmware after the vdev
has been deleted?
I am curious if anyone else sees something similar, and if this is expected behaviour.
Hmm, WMI and HTT do use independent CE ring buffers, but peer_ids are
unmapped in response to HTT events, so this should be properly serialized
by the firmware itself.
Did you happen to not remove the peers prior to deleting the vdev? Perhaps
that is what triggers the !txq condition.
Perhaps it would make sense to flush HTT rx (i.e. put up a barrier) after
stopping the vdev.
From what I can tell, on peer removal the firmware will flush the tids and will
delay the low-level peer object deletion until the tids are fully flushed. Based on
logging, the peer deletion was not deferred in the case I looked at, so at peer
removal time there were no frames in the tid tx queue.

The firmware then deletes AST keys and such, and that logic generates peer removal
messages (one per AST key in my case, which may be a bug, but is probably harmless
and should not cause this as far as I can tell).

Then, some time later, after I get peer removal events in the driver, I see
the fetch-ind warnings.

I see this very often, so it is not just a rare race.

I have also modified the firmware fairly extensively to allow disabling the peer
caching, which is integrated into the tx scheduling and similar logic, so I could
have made mistakes there.

I do not know the code around the fetch-ind logic well: is this how the firmware
tells the driver that it has fully transmitted a frame and is reporting tx status?
Can it be anything else?

Thanks,
Ben
--
Ben Greear <***@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Adrian Chadd
2016-11-01 18:12:23 UTC
Post by Ben Greear
I do not know the code around the fetch-ind logic well: is this how the firmware
tells the driver that it has fully transmitted a frame and is reporting tx status?
Can it be anything else?
Isn't this the newish stuff added so per-peer queues can be managed and drained?




-adrian
Michal Kazior
2016-11-02 00:20:02 UTC
Post by Adrian Chadd
Post by Ben Greear
I do not know the code around the fetch-ind logic well: is this how the firmware
tells the driver that it has fully transmitted a frame and is reporting tx status?
Can it be anything else?
Isn't this the newish stuff added so per-peer queues can be managed and drained?
Yes. 10.4 can support an operational mode where it doesn't expect the host to
push frames at its own discretion; instead it issues fetch-indication events to
ask the host to submit frames from a given station/tid within given packet/byte
constraints.
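
Roughly, the host side looks like this (an illustrative sketch only: the helper
below is made up for clarity, the real per-record loop lives inline in
ath10k_htt_rx_tx_fetch_ind(), and the exact budget accounting is approximate):

/* Illustrative helper, not the actual driver code: each fetch-ind
 * record names a (peer_id, tid) and grants a packet/byte budget, and
 * the host pushes frames from that station/tid until the budget is
 * used up or the queue runs dry.
 */
static void ath10k_htt_tx_fetch_one_record(struct ath10k *ar, u16 peer_id,
                                           u8 tid, u16 max_num_msdus,
                                           u32 max_num_bytes)
{
        struct ieee80211_txq *txq;
        u16 num_msdus = 0;
        u32 num_bytes = 0;
        int ret;

        spin_lock_bh(&ar->data_lock);
        txq = ath10k_mac_txq_lookup(ar, peer_id, tid);
        spin_unlock_bh(&ar->data_lock);

        if (!txq)
                return;         /* this is where the warning fires */

        while (num_msdus < max_num_msdus && num_bytes < max_num_bytes) {
                /* ath10k_mac_tx_push_txq() returns the pushed frame
                 * length or a negative error when the queue is empty.
                 */
                ret = ath10k_mac_tx_push_txq(ar->hw, txq);
                if (ret < 0)
                        break;

                num_msdus++;
                num_bytes += ret;
        }
}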

After looking at the code, I guess this could happen because tx_fetch_ind_q is
processed in a separate tasklet. You'd probably need to flush the tasklet/event
queues to prevent this problem from happening; something like the sketch below.
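
Something along these lines is what I have in mind (a rough, untested sketch; the
helper name is made up and I'm assuming the current layout where fetch indications
are queued on htt->tx_fetch_ind_q and drained from the txrx tasklet):

/* Rough sketch only: drain pending fetch indications before the peers
 * they reference are torn down.
 */
static void ath10k_htt_flush_tx_fetch_ind(struct ath10k *ar)
{
        struct ath10k_htt *htt = &ar->htt;

        /* Let a scheduled txrx tasklet finish so it cannot pick up
         * records for peer_ids we are about to unmap.
         */
        tasklet_kill(&htt->txrx_compl_task);

        /* Discard any fetch indications still queued; their peer_ids
         * would no longer resolve to a txq anyway.
         */
        skb_queue_purge(&htt->tx_fetch_ind_q);
}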


Michał
