Discussion:
[RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
Ming Lei
2013-06-09 15:18:27 UTC
Permalink
Hi,

The patchset supports to run giveback of URB in tasklet context, so that
DMA unmapping/mapping on transfer buffer and compelte() callback can be
run with interrupt enabled, then time of HCD interrupt handler can be
saved much. Also this approach may simplify HCD since HCD lock needn't
be released any more before calling usb_hcd_giveback_urb().

The patchset enables the mechanism on EHCI HCD.

Patch 3/4 improves interrupt qh unlink on EHCI HCD when the mechanism
is introduced.

In the commit log of patch 4/4, detailed test result on three machines
(ARM A9/A15 dual core, X86) are provided about transfer performance and
ehci irq handling time. From the result, no transfer performance loss
is found and ehci irq handling time can drop much with the patchset.

drivers/usb/core/hcd.c | 170 +++++++++++++++++++++++++++++++------
drivers/usb/host/ehci-fsl.c | 2 +-
drivers/usb/host/ehci-grlib.c | 2 +-
drivers/usb/host/ehci-hcd.c | 18 ++--
drivers/usb/host/ehci-hub.c | 1 +
drivers/usb/host/ehci-mv.c | 2 +-
drivers/usb/host/ehci-octeon.c | 2 +-
drivers/usb/host/ehci-pmcmsp.c | 2 +-
drivers/usb/host/ehci-ppc-of.c | 2 +-
drivers/usb/host/ehci-ps3.c | 2 +-
drivers/usb/host/ehci-q.c | 6 +-
drivers/usb/host/ehci-sched.c | 60 ++++++++++++-
drivers/usb/host/ehci-sead3.c | 2 +-
drivers/usb/host/ehci-sh.c | 2 +-
drivers/usb/host/ehci-tegra.c | 2 +-
drivers/usb/host/ehci-tilegx.c | 2 +-
drivers/usb/host/ehci-timer.c | 43 ++++++++++
drivers/usb/host/ehci-w90x900.c | 2 +-
drivers/usb/host/ehci-xilinx-of.c | 2 +-
drivers/usb/host/ehci.h | 3 +
include/linux/usb/hcd.h | 16 ++++
21 files changed, 295 insertions(+), 48 deletions(-)


Thanks,
--
Ming Lei

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-09 15:18:28 UTC
Permalink
This post might be inappropriate. Click to display it.
Alan Stern
2013-06-09 15:58:18 UTC
Permalink
Post by Ming Lei
This patch implements the mechanism of giveback of URB in
tasklet context, so that hardware interrupt handling time for
usb host controller can be saved much, and HCD interrupt handling
can be simplified.
Why do you want to minimize the hardware interrupt handling time? The
total interrupt handling time will actually be increased by your
change; the only advantage is that much of the work will not be in
hardirq context.

Also, I suspect that this will make HCD interrupt handling _more_
complicated, not less.
Post by Ming Lei
1), on some arch(such as ARM), DMA mapping/unmapping is a bit
time-consuming, for example: when accessing usb mass storage
via EHCI on pandaboard, the common length of transfer buffer is 120KB,
the time consumed on DMA unmapping may reach hundreds of microseconds;
even on A15 based box, the time is still about scores of microseconds
DMA mapping/unmapping will remain just as time-consuming as before.
This patch won't change that.
Post by Ming Lei
2), on some arch, reading DMA coherent memoery is very time-consuming,
the most common example is usb video class driver[1]
Reading DMA-coherent memory will remain just as time-consuming as
before.
Post by Ming Lei
3), driver's complete() callback may do much things which is driver
specific, so the time is consumed unnecessarily in hardware irq context.
With this patch, the time is consumed in softirq context. What is the
advantage?
Post by Ming Lei
4), running driver's complete() callback in hardware irq context causes
that host controller driver has to release its lock in interrupt handler,
so reacquiring the lock after return may busy wait a while and increase
interrupt handling time. More seriously, releasing the HCD lock makes
HCD becoming quite complicated to deal with introduced races.
There is no choice; the lock _has_ to be released before giving back
completed URBs. Therefore the races are unavoidable, as is the
busy-wait.
Post by Ming Lei
So the patch proposes to run giveback of URB in tasklet context, then
time consumed in HCD irq handling doesn't depend on drivers' complete and
DMA mapping/unmapping any more, also we can simplify HCD since the HCD
lock isn't needed to be released during irq handling.
I'm not convinced. In particular, I'm not convinced that this is
better than using a threaded interrupt handler.
Post by Ming Lei
1), for drivers, they don't care if the complete() is called in hard irq
context or softirq context
True.
Post by Ming Lei
2), the biggest change is the situation in which usb_submit_urb() is called
in complete() callback, so the introduced tasklet schedule delay might be a
- control/bulk asynchronous transfer isn't sensitive to schedule
delay
Mass-storage device access (which uses bulk async transfers) certainly
_is_ sensitive to scheduling delays.
Post by Ming Lei
- the patch schedules giveback of periodic URBs using
tasklet_hi_schedule, so the introduced delay should be very
small
- use percpu tasklset for each HCD to schedule giveback of URB,
which are running as much as parallel
That might be an advantage; it depends on the work load. But if the
tasklet ends up running on a different CPU from the task that submitted
the URB originally, it would be less of an advantage.
Post by Ming Lei
- for ISOC transfer, generally, drivers submit several URBs
concurrently to avoid interrupt delay, so it is OK with the
little schedule delay.
This varies. Some drivers use very low overheads for isochronous
transfers. A delay of even one millisecond might be too long for them.
Post by Ming Lei
- for interrupt transfer, generally, drivers only submit one URB
at the same time, but interrupt transfer is often used in event
report, polling, ... situations, and a little delay should be OK.
Agreed.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-10 08:12:34 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
This patch implements the mechanism of giveback of URB in
tasklet context, so that hardware interrupt handling time for
usb host controller can be saved much, and HCD interrupt handling
can be simplified.
Why do you want to minimize the hardware interrupt handling time? The
total interrupt handling time will actually be increased by your
change; the only advantage is that much of the work will not be in
hardirq context.
Disabling interrupts too long will cause system to respond slowly, for
example timer interrupt may be handled very lately. That is why tasklet/
softirq is introduced to split driver's interrupt handler into two parts: the
urgent thing is done quickly in hard irq context with IRQs disabled, and
the remaining is handled in softirq/tasklet part with IRQs enabled.
Post by Alan Stern
Also, I suspect that this will make HCD interrupt handling _more_
complicated, not less.
If HCD's lock wasn't released before calling usb_hcd_giveback_urb(),
HCD interrupt handling would have been a bit simpler.
Post by Alan Stern
Post by Ming Lei
1), on some arch(such as ARM), DMA mapping/unmapping is a bit
time-consuming, for example: when accessing usb mass storage
via EHCI on pandaboard, the common length of transfer buffer is 120KB,
the time consumed on DMA unmapping may reach hundreds of microseconds;
even on A15 based box, the time is still about scores of microseconds
DMA mapping/unmapping will remain just as time-consuming as before.
This patch won't change that.
Yes, it is platform dependent, and I have tried to improve it, but not succeed.

But, is there any good reason to do time-consuming DMA mapping/
unmapping with all IRQs disabled to make system response slowly?
Post by Alan Stern
Post by Ming Lei
2), on some arch, reading DMA coherent memoery is very time-consuming,
the most common example is usb video class driver[1]
Reading DMA-coherent memory will remain just as time-consuming as
before.
Same with above.
Post by Alan Stern
Post by Ming Lei
3), driver's complete() callback may do much things which is driver
specific, so the time is consumed unnecessarily in hardware irq context.
With this patch, the time is consumed in softirq context. What is the
advantage?
IRQs are enabled in softirq context so that system can respond events
which might require to be handled very quickly.
Post by Alan Stern
Post by Ming Lei
4), running driver's complete() callback in hardware irq context causes
that host controller driver has to release its lock in interrupt handler,
so reacquiring the lock after return may busy wait a while and increase
interrupt handling time. More seriously, releasing the HCD lock makes
HCD becoming quite complicated to deal with introduced races.
There is no choice; the lock _has_ to be released before giving back
completed URBs. Therefore the races are unavoidable, as is the
busy-wait.
Could you explain it in a bit detail why there is no choice?

IMO, the patchset can make it possible.
Post by Alan Stern
Post by Ming Lei
So the patch proposes to run giveback of URB in tasklet context, then
time consumed in HCD irq handling doesn't depend on drivers' complete and
DMA mapping/unmapping any more, also we can simplify HCD since the HCD
lock isn't needed to be released during irq handling.
I'm not convinced. In particular, I'm not convinced that this is
better than using a threaded interrupt handler.
Threaded interrupt handler should be OK but with more context switch
cost, so URB giveback may be handled a bit lately and performance
might be affected. Also ISOC transfer for video/audio applicaton should
be handled with as less latency as possible to provide good user
experience, thread handler may be delayed a bit too long to meet the
demand.

Also there is no any sleep in usb_hcd_giveback_urb(), so tasklet
should be better.
Post by Alan Stern
Post by Ming Lei
1), for drivers, they don't care if the complete() is called in hard irq
context or softirq context
True.
Post by Ming Lei
2), the biggest change is the situation in which usb_submit_urb() is called
in complete() callback, so the introduced tasklet schedule delay might be a
- control/bulk asynchronous transfer isn't sensitive to schedule
delay
Mass-storage device access (which uses bulk async transfers) certainly
_is_ sensitive to scheduling delays.
Post by Ming Lei
- the patch schedules giveback of periodic URBs using
tasklet_hi_schedule, so the introduced delay should be very
small
- use percpu tasklset for each HCD to schedule giveback of URB,
which are running as much as parallel
That might be an advantage; it depends on the work load. But if the
tasklet ends up running on a different CPU from the task that submitted
the URB originally, it would be less of an advantage.
Post by Ming Lei
- for ISOC transfer, generally, drivers submit several URBs
concurrently to avoid interrupt delay, so it is OK with the
little schedule delay.
This varies. Some drivers use very low overheads for isochronous
transfers. A delay of even one millisecond might be too long for them.
Post by Ming Lei
- for interrupt transfer, generally, drivers only submit one URB
at the same time, but interrupt transfer is often used in event
report, polling, ... situations, and a little delay should be OK.
Agreed.
Alan Stern
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-10 08:43:07 UTC
Permalink
Post by Ming Lei
2), the biggest change is the situation in which usb_submit_urb() is called
in complete() callback, so the introduced tasklet schedule delay might be a
- control/bulk asynchronous transfer isn't sensitive to schedule
delay
That is debatable.Missing a frame boundary is expensive because the increased
latency then translates into lower throughput.

Regards
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-10 09:23:46 UTC
Permalink
Post by Oliver Neukum
Post by Ming Lei
2), the biggest change is the situation in which usb_submit_urb() is called
in complete() callback, so the introduced tasklet schedule delay might be a
- control/bulk asynchronous transfer isn't sensitive to schedule
delay
That is debatable.Missing a frame boundary is expensive because the increased
latency then translates into lower throughput.
Suppose so, considered that bulk transfer will do large data block transfer, and
the extra frame or uFrame doesn't matter over the whole transfer time.

Also the tasklet function will be scheduled once the hard interrupt handler
completes, and the delay is often several microseconds or smaller, which
has a very low probability to miss frame/uframe boundary.

Even with submitting URBs in hardware interrupt handler, there is still the
interrupt handling delay, isn't there? (So disabling interrupt too
long is really
very bad, :-))

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-10 09:31:15 UTC
Permalink
Post by Ming Lei
Post by Oliver Neukum
Post by Ming Lei
2), the biggest change is the situation in which usb_submit_urb() is called
in complete() callback, so the introduced tasklet schedule delay might be a
- control/bulk asynchronous transfer isn't sensitive to schedule
delay
That is debatable.Missing a frame boundary is expensive because the increased
latency then translates into lower throughput.
Suppose so, considered that bulk transfer will do large data block transfer, and
the extra frame or uFrame doesn't matter over the whole transfer time.
That is not true for all use cases. Networking looks vulnerable.

Regards
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-10 09:51:14 UTC
Permalink
Post by Oliver Neukum
Post by Ming Lei
Post by Oliver Neukum
Post by Ming Lei
2), the biggest change is the situation in which usb_submit_urb() is called
in complete() callback, so the introduced tasklet schedule delay might be a
- control/bulk asynchronous transfer isn't sensitive to schedule
delay
That is debatable.Missing a frame boundary is expensive because the increased
latency then translates into lower throughput.
Suppose so, considered that bulk transfer will do large data block transfer, and
the extra frame or uFrame doesn't matter over the whole transfer time.
That is not true for all use cases. Networking looks vulnerable.
That is debatable.Missing a frame boundary is expensive because the increased
latency then translates into lower throughput.
Missing uframe/frame boundary doesn't cause lower throughput since network
devices use bulk transfer, which is scheduled in hw aync queue and there should
always have pending transfers in the queue when submitting bulk URB to the
queue.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-09 15:18:29 UTC
Permalink
If HCD_BH is set for HC driver's flags, URB giveback will be
done in tasklet context instead of interrupt context, so the
ehci->lock needn't to be released any more before calling
usb_hcd_giveback_urb().

Cc: Alan Stern <stern-nwvwT67g6+6dFdvTe/***@public.gmane.org>
Signed-off-by: Ming Lei <ming.lei-Z7WLFzj8eWMS+***@public.gmane.org>
---
drivers/usb/host/ehci-q.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index d34b399..0387a81 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -283,9 +283,11 @@ __acquires(ehci->lock)

/* complete() can reenter this HCD */
usb_hcd_unlink_urb_from_ep(ehci_to_hcd(ehci), urb);
- spin_unlock (&ehci->lock);
+ if (!hcd_giveback_urb_in_bh(ehci_to_hcd(ehci)))
+ spin_unlock(&ehci->lock);
usb_hcd_giveback_urb(ehci_to_hcd(ehci), urb, status);
- spin_lock (&ehci->lock);
+ if (!hcd_giveback_urb_in_bh(ehci_to_hcd(ehci)))
+ spin_lock(&ehci->lock);
}

static int qh_schedule (struct ehci_hcd *ehci, struct ehci_qh *qh);
--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-09 16:06:24 UTC
Permalink
Post by Ming Lei
If HCD_BH is set for HC driver's flags, URB giveback will be
done in tasklet context instead of interrupt context, so the
ehci->lock needn't to be released any more before calling
usb_hcd_giveback_urb().
Ah, this is the reason why you don't need to release the private lock.
I'm not sure that this will save much time overall.

With the existing code, the main reason for lock contention would be if
(for example) CPU 2 submitted or cancelled an URB while the giveback
was occurring on CPU 1. Then CPU 1 would be forced to wait while CPU 2
finished its submission or cancellation.

With this patch, it's the other way around. CPU 2 would be forced to
wait while CPU 1 does all the rest of the work in its interrupt
handler. The total time spent holding the lock will be the same; it
will just be distributed differently among the CPUs. Hence it is not
at all clear that this change will save any time.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-10 09:10:56 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
If HCD_BH is set for HC driver's flags, URB giveback will be
done in tasklet context instead of interrupt context, so the
ehci->lock needn't to be released any more before calling
usb_hcd_giveback_urb().
Ah, this is the reason why you don't need to release the private lock.
I'm not sure that this will save much time overall.
I did't mention this 3/4 patch can save much time. In fact, without this
patch, handling URB giveback still can work. But releasing the
private lock can make hard irq handler become random long,
since reacquiring the lock may busy wait quite a while until other
CPUs released the lock.
Post by Alan Stern
With the existing code, the main reason for lock contention would be if
(for example) CPU 2 submitted or cancelled an URB while the giveback
was occurring on CPU 1. Then CPU 1 would be forced to wait while CPU 2
finished its submission or cancellation.
With this patch, it's the other way around. CPU 2 would be forced to
wait while CPU 1 does all the rest of the work in its interrupt
handler. The total time spent holding the lock will be the same; it
will just be distributed differently among the CPUs. Hence it is not
at all clear that this change will save any time.
Generally the interrupt handler without URB giveback consumes not
much time, and the time consumed is about 10~20us on fast machine,
and <40us on slow machine averagely per my tests in 4/4 commit log.

You may ague the situation would become worse if there are many
URBs to be handled in ehci_irq() on CPU1, but I think it should be
very very rare to happen attributed to this patchset: HCD hard interrupt
handler takes much less time than general HCD irq interval(for example,
in EHCI, generally there won't have many URBs to be completed in one
uFrame, and the URBs completed in next uFrame will interrupt CPUs
125us later)

Also we may introduce another lock/flags to let CPU1 handle CPU2's
interrupts and let CPU2 return IRQ_HANDLED immediately, but I am
not sure if it is necessary.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-09 15:18:30 UTC
Permalink
Given interrupt URB will be resubmitted from tasklet context which
is scheduled by ehci hardware interrupt handler, and commonly only
one interrupt URB is scheduled on qh, so the qh may be unlinked
immediately once qh_completions() returns from ehci_irq(), then
the intr URB to be resubmitted in complete() can only be scheduled
and linked to hardware until the qh unlink is completed.

This patch applied the QH_STATE_UNLINK_WAIT(not used by interrupt qh)
state to improve this above situation, and the qh will wait for 5
milliseconds before being unlinked from hardware, if one URB is submitted
during the period, the qh is move out of unlink wait state and the
interrupt transfer can be scheduled immediately, not like before: the
transfer is only linked to hardware until previous unlink is completed.

Only enable the improvement in case that HCD supports to run
giveback of URB in tasklet context.

Cc: Alan Stern <stern-nwvwT67g6+6dFdvTe/***@public.gmane.org>
Signed-off-by: Ming Lei <ming.lei-Z7WLFzj8eWMS+***@public.gmane.org>
---
drivers/usb/host/ehci-hcd.c | 16 ++++++++---
drivers/usb/host/ehci-hub.c | 1 +
drivers/usb/host/ehci-sched.c | 60 ++++++++++++++++++++++++++++++++++++++---
drivers/usb/host/ehci-timer.c | 43 +++++++++++++++++++++++++++++
drivers/usb/host/ehci.h | 3 +++
5 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 246e124..0ab82de 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -304,7 +304,9 @@ static void end_unlink_async(struct ehci_hcd *ehci);
static void unlink_empty_async(struct ehci_hcd *ehci);
static void unlink_empty_async_suspended(struct ehci_hcd *ehci);
static void ehci_work(struct ehci_hcd *ehci);
-static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh);
+static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh,
+ bool wait);
+static void start_do_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh);
static void end_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh);

#include "ehci-timer.c"
@@ -484,6 +486,7 @@ static int ehci_init(struct usb_hcd *hcd)
ehci->periodic_size = DEFAULT_I_TDPS;
INIT_LIST_HEAD(&ehci->async_unlink);
INIT_LIST_HEAD(&ehci->async_idle);
+ INIT_LIST_HEAD(&ehci->intr_unlink_wait);
INIT_LIST_HEAD(&ehci->intr_unlink);
INIT_LIST_HEAD(&ehci->intr_qh_list);
INIT_LIST_HEAD(&ehci->cached_itd_list);
@@ -908,15 +911,20 @@ static int ehci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
switch (qh->qh_state) {
case QH_STATE_LINKED:
if (usb_pipetype(urb->pipe) == PIPE_INTERRUPT)
- start_unlink_intr(ehci, qh);
+ start_unlink_intr(ehci, qh, false);
else
start_unlink_async(ehci, qh);
break;
case QH_STATE_COMPLETING:
qh->dequeue_during_giveback = 1;
break;
- case QH_STATE_UNLINK:
case QH_STATE_UNLINK_WAIT:
+ if (usb_pipetype(urb->pipe) == PIPE_INTERRUPT) {
+ cancel_unlink_wait_intr(ehci, qh);
+ start_unlink_intr(ehci, qh, false);
+ }
+ break;
+ case QH_STATE_UNLINK:
/* already started */
break;
case QH_STATE_IDLE:
@@ -1042,7 +1050,7 @@ ehci_endpoint_reset(struct usb_hcd *hcd, struct usb_host_endpoint *ep)
if (eptype == USB_ENDPOINT_XFER_BULK)
start_unlink_async(ehci, qh);
else
- start_unlink_intr(ehci, qh);
+ start_unlink_intr(ehci, qh, false);
}
}
spin_unlock_irqrestore(&ehci->lock, flags);
diff --git a/drivers/usb/host/ehci-hub.c b/drivers/usb/host/ehci-hub.c
index b2f6450..4104c66 100644
--- a/drivers/usb/host/ehci-hub.c
+++ b/drivers/usb/host/ehci-hub.c
@@ -345,6 +345,7 @@ static int ehci_bus_suspend (struct usb_hcd *hcd)

end_unlink_async(ehci);
unlink_empty_async_suspended(ehci);
+ ehci_handle_start_intr_unlinks(ehci);
ehci_handle_intr_unlinks(ehci);
end_free_itds(ehci);

diff --git a/drivers/usb/host/ehci-sched.c b/drivers/usb/host/ehci-sched.c
index acff5b8..d132c6f 100644
--- a/drivers/usb/host/ehci-sched.c
+++ b/drivers/usb/host/ehci-sched.c
@@ -601,10 +601,10 @@ static void qh_unlink_periodic(struct ehci_hcd *ehci, struct ehci_qh *qh)
list_del(&qh->intr_node);
}

-static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
+static void start_do_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
{
- /* If the QH isn't linked then there's nothing we can do. */
- if (qh->qh_state != QH_STATE_LINKED)
+ /* If the QH isn't unlink wait then there's nothing we can do. */
+ if (qh->qh_state != QH_STATE_UNLINK_WAIT)
return;

qh_unlink_periodic (ehci, qh);
@@ -632,6 +632,55 @@ static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
}
}

+static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh,
+ bool wait)
+{
+ /* If the QH isn't linked then there's nothing we can do. */
+ if (qh->qh_state != QH_STATE_LINKED)
+ return;
+
+ /*
+ * It is common only one intr URB is scheduled on one qh, and given
+ * complete() is run in tasklet context, introduce QH_STATE_UNLINK_WAIT
+ * to avoid unlink qh too early.
+ */
+ qh->qh_state = QH_STATE_UNLINK_WAIT;
+
+ /* bypass the optimization without URB giveback in tasklet */
+ if (!hcd_giveback_urb_in_bh(ehci_to_hcd(ehci)) || !wait) {
+ start_do_unlink_intr(ehci, qh);
+ return;
+ }
+
+ qh->unlink_cycle = ehci->intr_unlink_wait_cycle;
+
+ /* New entries go at the end of the intr_unlink_wait list */
+ list_add_tail(&qh->unlink_node, &ehci->intr_unlink_wait);
+
+ if (ehci->rh_state < EHCI_RH_RUNNING)
+ ehci_handle_start_intr_unlinks(ehci);
+ else if (ehci->intr_unlink_wait.next == &qh->unlink_node) {
+ ehci_enable_event(ehci, EHCI_HRTIMER_START_UNLINK_INTR, true);
+ ++ehci->intr_unlink_wait_cycle;
+ }
+}
+
+/* must be called with holding ehci->lock */
+static void cancel_unlink_wait_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
+{
+ /* If the QH isn't waited for unlink then do nothing */
+ if (qh->qh_state != QH_STATE_UNLINK_WAIT)
+ return;
+
+ /* restored to linked state */
+ list_del(&qh->unlink_node);
+ qh->qh_state = QH_STATE_LINKED;
+
+ /* avoid unnecessary CPU wakeup */
+ if (list_empty(&ehci->intr_unlink_wait))
+ ehci_disable_event(ehci, EHCI_HRTIMER_START_UNLINK_INTR);
+}
+
static void end_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
{
struct ehci_qh_hw *hw = qh->hw;
@@ -876,6 +925,9 @@ static int intr_submit (
goto done;
}

+ /* put back qh from unlink wait list */
+ cancel_unlink_wait_intr(ehci, qh);
+
/* then queue the urb's tds to the qh */
qh = qh_append_tds(ehci, urb, qtd_list, epnum, &urb->ep->hcpriv);
BUG_ON (qh == NULL);
@@ -921,7 +973,7 @@ static void scan_intr(struct ehci_hcd *ehci)
temp = qh_completions(ehci, qh);
if (unlikely(temp || (list_empty(&qh->qtd_list) &&
qh->qh_state == QH_STATE_LINKED)))
- start_unlink_intr(ehci, qh);
+ start_unlink_intr(ehci, qh, true);
}
}
}
diff --git a/drivers/usb/host/ehci-timer.c b/drivers/usb/host/ehci-timer.c
index 11e5b32..c2320f3 100644
--- a/drivers/usb/host/ehci-timer.c
+++ b/drivers/usb/host/ehci-timer.c
@@ -72,6 +72,7 @@ static unsigned event_delays_ns[] = {
1 * NSEC_PER_MSEC, /* EHCI_HRTIMER_POLL_DEAD */
1125 * NSEC_PER_USEC, /* EHCI_HRTIMER_UNLINK_INTR */
2 * NSEC_PER_MSEC, /* EHCI_HRTIMER_FREE_ITDS */
+ 5 * NSEC_PER_MSEC, /* EHCI_HRTIMER_START_UNLINK_INTR */
6 * NSEC_PER_MSEC, /* EHCI_HRTIMER_ASYNC_UNLINKS */
10 * NSEC_PER_MSEC, /* EHCI_HRTIMER_IAA_WATCHDOG */
10 * NSEC_PER_MSEC, /* EHCI_HRTIMER_DISABLE_PERIODIC */
@@ -98,6 +99,17 @@ static void ehci_enable_event(struct ehci_hcd *ehci, unsigned event,
}
}

+/* Warning: don't call this function from hrtimer handler context */
+static void ehci_disable_event(struct ehci_hcd *ehci, unsigned event)
+{
+ if (!hcd_giveback_urb_in_bh(ehci_to_hcd(ehci)))
+ return;
+
+ ehci->enabled_hrtimer_events &= ~(1 << event);
+ if (!ehci->enabled_hrtimer_events)
+ hrtimer_cancel(&ehci->hrtimer);
+}
+

/* Poll the STS_ASS status bit; see when it agrees with CMD_ASE */
static void ehci_poll_ASS(struct ehci_hcd *ehci)
@@ -215,6 +227,36 @@ static void ehci_handle_controller_death(struct ehci_hcd *ehci)
/* Not in process context, so don't try to reset the controller */
}

+/* start to unlink interrupt QHs */
+static void ehci_handle_start_intr_unlinks(struct ehci_hcd *ehci)
+{
+ bool stopped = (ehci->rh_state < EHCI_RH_RUNNING);
+
+ /*
+ * Process all the QHs on the intr_unlink list that were added
+ * before the current unlink cycle began. The list is in
+ * temporal order, so stop when we reach the first entry in the
+ * current cycle. But if the root hub isn't running then
+ * process all the QHs on the list.
+ */
+ while (!list_empty(&ehci->intr_unlink_wait)) {
+ struct ehci_qh *qh;
+
+ qh = list_first_entry(&ehci->intr_unlink_wait,
+ struct ehci_qh, unlink_node);
+ if (!stopped &&
+ qh->unlink_cycle == ehci->intr_unlink_wait_cycle)
+ break;
+ list_del(&qh->unlink_node);
+ start_do_unlink_intr(ehci, qh);
+ }
+
+ /* Handle remaining entries later */
+ if (!list_empty(&ehci->intr_unlink_wait)) {
+ ehci_enable_event(ehci, EHCI_HRTIMER_START_UNLINK_INTR, true);
+ ++ehci->intr_unlink_wait_cycle;
+ }
+}

/* Handle unlinked interrupt QHs once they are gone from the hardware */
static void ehci_handle_intr_unlinks(struct ehci_hcd *ehci)
@@ -363,6 +405,7 @@ static void (*event_handlers[])(struct ehci_hcd *) = {
ehci_handle_controller_death, /* EHCI_HRTIMER_POLL_DEAD */
ehci_handle_intr_unlinks, /* EHCI_HRTIMER_UNLINK_INTR */
end_free_itds, /* EHCI_HRTIMER_FREE_ITDS */
+ ehci_handle_start_intr_unlinks, /* EHCI_HRTIMER_START_UNLINK_INTR */
unlink_empty_async, /* EHCI_HRTIMER_ASYNC_UNLINKS */
ehci_iaa_watchdog, /* EHCI_HRTIMER_IAA_WATCHDOG */
ehci_disable_PSE, /* EHCI_HRTIMER_DISABLE_PERIODIC */
diff --git a/drivers/usb/host/ehci.h b/drivers/usb/host/ehci.h
index 7c978b2..1d2c164 100644
--- a/drivers/usb/host/ehci.h
+++ b/drivers/usb/host/ehci.h
@@ -88,6 +88,7 @@ enum ehci_hrtimer_event {
EHCI_HRTIMER_POLL_DEAD, /* Wait for dead controller to stop */
EHCI_HRTIMER_UNLINK_INTR, /* Wait for interrupt QH unlink */
EHCI_HRTIMER_FREE_ITDS, /* Wait for unused iTDs and siTDs */
+ EHCI_HRTIMER_START_UNLINK_INTR, /* wait for new intr schedule */
EHCI_HRTIMER_ASYNC_UNLINKS, /* Unlink empty async QHs */
EHCI_HRTIMER_IAA_WATCHDOG, /* Handle lost IAA interrupts */
EHCI_HRTIMER_DISABLE_PERIODIC, /* Wait to disable periodic sched */
@@ -143,6 +144,8 @@ struct ehci_hcd { /* one per controller */
unsigned i_thresh; /* uframes HC might cache */

union ehci_shadow *pshadow; /* mirror hw periodic table */
+ struct list_head intr_unlink_wait;
+ unsigned intr_unlink_wait_cycle;
struct list_head intr_unlink;
unsigned intr_unlink_cycle;
unsigned now_frame; /* frame from HC hardware */
--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-09 15:18:31 UTC
Permalink
Both 4 transfers can work well on EHCI HCD after switching to run
URB giveback in tasklet context, so mark all HCD drivers to support
it.
From below test results on 3 machines(2 ARM and one x86), time
consumed by EHCI interrupt handler droped much without performance
loss.


1 test description
1.1 mass storage performance test:
- run below command 10 times and compute the average performance

dd if=/dev/sdN iflag=direct of=/dev/null bs=200M count=1

- two usb mass storage device:
A: sandisk extreme USB 3.0 16G(used in test case 1 & case 2)
B: kingston DataTraveler G2 4GB(only used in test case 2)

1.2 uvc function test:
- run one simple capture program in the below link

http://kernel.ubuntu.com/~ming/up/capture.c

- capture format 640*480 and results in High Bandwidth mode on the
uvc device: Z-Star 0x0ac8/0x3450

- on T410(x86) laptop, also use guvcview to watch video capture/playback

1.3 about test2 and test4
- both two devices involved are tested concurrently by above test items

1.4 how to compute irq time(the time consumed by ehci_irq)
- use trace points of irq:irq_handler_entry and irq:irq_handler_exit

1.5 kernel
3.10.0-rc3-next-20130528

1.6 test machines
Pandaboard A1: ARM CortexA9 dural core
Arndale board: ARM CortexA15 dural core
T410: i5 CPU 2.67GHz quad core

2 test result
2.1 test case1: single mass storage device performance test
--------------------------------------------------------------------
upstream | patched
perf(MB/s)+irq time(us) | perf(MB/s)+irq time(us)
--------------------------------------------------------------------
Pandaboard A1: 25.690(avg:144,max:818) | 25.620(avg:14, max:77)
Arndale board: 29.700(avg:29, max:123) | 29.700(avg:9, max:48)
T410: 34.420(avg:20, max:199*)| 34.440(avg:12, max:157)
---------------------------------------------------------------------

2.2 test case2: two mass storage devices' performance test
--------------------------------------------------------------------
upstream | patched
perf(MB/s)+irq time(us) | perf(MB/s)+irq time(us)
--------------------------------------------------------------------
Pandaboard A1: 15.830/15.530(avg:158,max:1139) | 16.490/16.180(avg:15, max:135)
Arndale board: 17.370/16.260(avg:30 max:222) | 17.450/16.240(avg:10, max:92)
T410: 21.300/19.900(avg:18 max:243) | 21.230/19.920(avg:12, max:152)
---------------------------------------------------------------------

2.3 test case3: one uvc streaming test
- uvc device works well(on x86, luvcview can be used too and has
same result with uvc capture)
--------------------------------------------------------------------
upstream | patched
irq time(us) | irq time(us)
--------------------------------------------------------------------
Pandaboard A1: (avg:455, max:941) | (avg:33, max:45)
Arndale board: (avg:316, max:616) | (avg:19, max:27)
T410: (avg:39, max:164) | (avg:10, max:142)
---------------------------------------------------------------------

2.4 test case4: one uvc streaming plus one mass storage device test
--------------------------------------------------------------------
upstream | patched
perf(MB/s)+irq time(us) | perf(MB/s)+irq time(us)
--------------------------------------------------------------------
Pandaboard A1: 20.340(avg:258,max:1575)| 20.230(avg:24, max:100)
Arndale board: 23.570(avg:120,max:683) | 23.560(avg:14, max:61)
T410: 28.260(avg:27, max:178) | 28.230(avg:13, max:155)
---------------------------------------------------------------------

* On T410, sometimes read ehci status register in ehci_irq takes more
than 100us, and the problem has been reported on the link:

http://marc.info/?t=137065867300001&r=1&w=2


Cc: Alan Stern <stern-nwvwT67g6+6dFdvTe/***@public.gmane.org>
Signed-off-by: Ming Lei <ming.lei-Z7WLFzj8eWMS+***@public.gmane.org>
---
drivers/usb/host/ehci-fsl.c | 2 +-
drivers/usb/host/ehci-grlib.c | 2 +-
drivers/usb/host/ehci-hcd.c | 2 +-
drivers/usb/host/ehci-mv.c | 2 +-
drivers/usb/host/ehci-octeon.c | 2 +-
drivers/usb/host/ehci-pmcmsp.c | 2 +-
drivers/usb/host/ehci-ppc-of.c | 2 +-
drivers/usb/host/ehci-ps3.c | 2 +-
drivers/usb/host/ehci-sead3.c | 2 +-
drivers/usb/host/ehci-sh.c | 2 +-
drivers/usb/host/ehci-tegra.c | 2 +-
drivers/usb/host/ehci-tilegx.c | 2 +-
drivers/usb/host/ehci-w90x900.c | 2 +-
drivers/usb/host/ehci-xilinx-of.c | 2 +-
14 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/usb/host/ehci-fsl.c b/drivers/usb/host/ehci-fsl.c
index bd831ec..330274a 100644
--- a/drivers/usb/host/ehci-fsl.c
+++ b/drivers/usb/host/ehci-fsl.c
@@ -669,7 +669,7 @@ static const struct hc_driver ehci_fsl_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_USB2 | HCD_MEMORY,
+ .flags = HCD_USB2 | HCD_MEMORY | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-grlib.c b/drivers/usb/host/ehci-grlib.c
index 5d75de9..c6048ee 100644
--- a/drivers/usb/host/ehci-grlib.c
+++ b/drivers/usb/host/ehci-grlib.c
@@ -43,7 +43,7 @@ static const struct hc_driver ehci_grlib_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 0ab82de..d856c5e 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -1174,7 +1174,7 @@ static const struct hc_driver ehci_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-mv.c b/drivers/usb/host/ehci-mv.c
index 915c2db..ce18a36 100644
--- a/drivers/usb/host/ehci-mv.c
+++ b/drivers/usb/host/ehci-mv.c
@@ -96,7 +96,7 @@ static const struct hc_driver mv_ehci_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-octeon.c b/drivers/usb/host/ehci-octeon.c
index 45cc001..ab0397e 100644
--- a/drivers/usb/host/ehci-octeon.c
+++ b/drivers/usb/host/ehci-octeon.c
@@ -51,7 +51,7 @@ static const struct hc_driver ehci_octeon_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-pmcmsp.c b/drivers/usb/host/ehci-pmcmsp.c
index 363890e..3ab32a2 100644
--- a/drivers/usb/host/ehci-pmcmsp.c
+++ b/drivers/usb/host/ehci-pmcmsp.c
@@ -286,7 +286,7 @@ static const struct hc_driver ehci_msp_hc_driver = {
#else
.irq = ehci_irq,
#endif
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-ppc-of.c b/drivers/usb/host/ehci-ppc-of.c
index 56dc732..da95a31 100644
--- a/drivers/usb/host/ehci-ppc-of.c
+++ b/drivers/usb/host/ehci-ppc-of.c
@@ -28,7 +28,7 @@ static const struct hc_driver ehci_ppc_of_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
index fd98377..8188542 100644
--- a/drivers/usb/host/ehci-ps3.c
+++ b/drivers/usb/host/ehci-ps3.c
@@ -71,7 +71,7 @@ static const struct hc_driver ps3_ehci_hc_driver = {
.product_desc = "PS3 EHCI Host Controller",
.hcd_priv_size = sizeof(struct ehci_hcd),
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,
.reset = ps3_ehci_hc_reset,
.start = ehci_run,
.stop = ehci_stop,
diff --git a/drivers/usb/host/ehci-sead3.c b/drivers/usb/host/ehci-sead3.c
index b2de52d..8a73449 100644
--- a/drivers/usb/host/ehci-sead3.c
+++ b/drivers/usb/host/ehci-sead3.c
@@ -55,7 +55,7 @@ const struct hc_driver ehci_sead3_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-sh.c b/drivers/usb/host/ehci-sh.c
index c4c0ee9..1691f8e 100644
--- a/drivers/usb/host/ehci-sh.c
+++ b/drivers/usb/host/ehci-sh.c
@@ -36,7 +36,7 @@ static const struct hc_driver ehci_sh_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_USB2 | HCD_MEMORY,
+ .flags = HCD_USB2 | HCD_MEMORY | HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-tegra.c b/drivers/usb/host/ehci-tegra.c
index 59d111b..c23db21 100644
--- a/drivers/usb/host/ehci-tegra.c
+++ b/drivers/usb/host/ehci-tegra.c
@@ -377,7 +377,7 @@ static const struct hc_driver tegra_ehci_hc_driver = {
.description = hcd_name,
.product_desc = "Tegra EHCI Host Controller",
.hcd_priv_size = sizeof(struct ehci_hcd),
- .flags = HCD_USB2 | HCD_MEMORY,
+ .flags = HCD_USB2 | HCD_MEMORY | HCD_BH,

/* standard ehci functions */
.irq = ehci_irq,
diff --git a/drivers/usb/host/ehci-tilegx.c b/drivers/usb/host/ehci-tilegx.c
index d72b292..204d3b6 100644
--- a/drivers/usb/host/ehci-tilegx.c
+++ b/drivers/usb/host/ehci-tilegx.c
@@ -61,7 +61,7 @@ static const struct hc_driver ehci_tilegx_hc_driver = {
* Generic hardware linkage.
*/
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* Basic lifecycle operations.
diff --git a/drivers/usb/host/ehci-w90x900.c b/drivers/usb/host/ehci-w90x900.c
index 59e0e24..1c370df 100644
--- a/drivers/usb/host/ehci-w90x900.c
+++ b/drivers/usb/host/ehci-w90x900.c
@@ -108,7 +108,7 @@ static const struct hc_driver ehci_w90x900_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_USB2|HCD_MEMORY,
+ .flags = HCD_USB2|HCD_MEMORY|HCD_BH,

/*
* basic lifecycle operations
diff --git a/drivers/usb/host/ehci-xilinx-of.c b/drivers/usb/host/ehci-xilinx-of.c
index d845e3b..bb96a68 100644
--- a/drivers/usb/host/ehci-xilinx-of.c
+++ b/drivers/usb/host/ehci-xilinx-of.c
@@ -79,7 +79,7 @@ static const struct hc_driver ehci_xilinx_of_hc_driver = {
* generic hardware linkage
*/
.irq = ehci_irq,
- .flags = HCD_MEMORY | HCD_USB2,
+ .flags = HCD_MEMORY | HCD_USB2 | HCD_BH,

/*
* basic lifecycle operations
--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-09 15:48:09 UTC
Permalink
Post by Ming Lei
Hi,
The patchset supports to run giveback of URB in tasklet context, so that
DMA unmapping/mapping on transfer buffer and compelte() callback can be
run with interrupt enabled, then time of HCD interrupt handler can be
saved much. Also this approach may simplify HCD since HCD lock needn't
be released any more before calling usb_hcd_giveback_urb().
That's a lot of material. However, you haven't specifically said
_what_ you want to accomplish. It sounds like you're trying to
minimize the amount of time spent in hardirq context, but I can't be
sure that is really what you want.

If it is, this is not a good way to do it. A better way to minimize
the time spent in hardirq context is to use a threaded interrupt
handler. But why do you want to minimize this time in the first place?
The CPU will be still have to use at least as much time as before; the
only difference is that it will be in softirq or in a high-priority
thread instead of in hardirq.

Also, I don't see how you can avoid releasing the HCD's private lock
before calling usb_hcd_giveback_urb(). When the completion handler
tries to resubmit the URB, the HCD's enqueue routine will try to
reacquire the private lock. If it didn't release the lock earlier, it
will hang.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-10 06:05:43 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
Hi,
The patchset supports to run giveback of URB in tasklet context, so that
DMA unmapping/mapping on transfer buffer and compelte() callback can be
run with interrupt enabled, then time of HCD interrupt handler can be
saved much. Also this approach may simplify HCD since HCD lock needn't
be released any more before calling usb_hcd_giveback_urb().
That's a lot of material. However, you haven't specifically said
_what_ you want to accomplish. It sounds like you're trying to
All IRQs are disabled(locally) in hard interrupt context, but they are
enabled in softirq context, this patchset can decrease the time a lot.
Post by Alan Stern
minimize the amount of time spent in hardirq context, but I can't be
sure that is really what you want.
If it is, this is not a good way to do it. A better way to minimize
the time spent in hardirq context is to use a threaded interrupt
handler. But why do you want to minimize this time in the first place?
Process context switch may have more cost than switching to softirq,
then USB performance may be effected, that is why I choose to use
tasklet. Also now there is no any sleep in URB giveback, so it isn't
necessary to introduce process to handle the giveback or completion.
Post by Alan Stern
The CPU will be still have to use at least as much time as before; the
only difference is that it will be in softirq or in a high-priority
thread instead of in hardirq.
Yes, as I said above, but we can allow other IRQs to be served in
handling URB giveback in softirq context, that is the value of softirq/tasklet.
Of course, system may have more important things to do during the time.
Generally speaking, for one driver, hard interrupt handler should always
consume less time as possible.
Post by Alan Stern
Also, I don't see how you can avoid releasing the HCD's private lock
before calling usb_hcd_giveback_urb(). When the completion handler
Please see patch 1/4, :-)
Post by Alan Stern
tries to resubmit the URB, the HCD's enqueue routine will try to
reacquire the private lock. If it didn't release the lock earlier, it
Yes, but with the patchset, resubmitting is called in taskelet context,
instead of hardirq context, so there isn't any recursion any more.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-10 14:03:00 UTC
Permalink
[Thomas and Steve, please see the question below.]
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
Hi,
The patchset supports to run giveback of URB in tasklet context, so that
DMA unmapping/mapping on transfer buffer and compelte() callback can be
run with interrupt enabled, then time of HCD interrupt handler can be
saved much. Also this approach may simplify HCD since HCD lock needn't
be released any more before calling usb_hcd_giveback_urb().
That's a lot of material. However, you haven't specifically said
_what_ you want to accomplish. It sounds like you're trying to
All IRQs are disabled(locally) in hard interrupt context, but they are
enabled in softirq context, this patchset can decrease the time a lot.
That isn't clear. It is documented that USB completion handlers are
called with local interrupts disabled. An IRQ arriving before the
tasklet starts up might therefore be serviced during the short interval
before the tasklet disables local interrupts, but if that happens it
would mean that the completion handler would be delayed.

In effect, you are prioritizing other IRQs higher than USB completion
handlers. Nevertheless, the total time spent with interrupts disabled
will remain the same.

(There's one other side effect that perhaps you haven't considered.
When multiple URBs are addressed to the same endpoint, their completion
handlers are called in order of URB completion, which is the same as
the order of URB submission unless an URB is cancelled. By delegating
the completion handlers to tasklets, and particularly by using per-CPU
tasklets, you may violate this behavior.)
Post by Ming Lei
Post by Alan Stern
minimize the amount of time spent in hardirq context, but I can't be
sure that is really what you want.
If it is, this is not a good way to do it. A better way to minimize
the time spent in hardirq context is to use a threaded interrupt
handler. But why do you want to minimize this time in the first place?
Process context switch may have more cost than switching to softirq,
then USB performance may be effected, that is why I choose to use
tasklet. Also now there is no any sleep in URB giveback, so it isn't
necessary to introduce process to handle the giveback or completion.
Thomas and Steve, can you comment on this? I do not have a good
understanding of the relative benefits/drawbacks of tasklets vs.
threaded interrupt handlers.
Post by Ming Lei
Post by Alan Stern
The CPU will be still have to use at least as much time as before; the
only difference is that it will be in softirq or in a high-priority
thread instead of in hardirq.
Yes, as I said above, but we can allow other IRQs to be served in
handling URB giveback in softirq context, that is the value of softirq/tasklet.
Of course, system may have more important things to do during the time.
Generally speaking, for one driver, hard interrupt handler should always
consume less time as possible.
As I understand it, the tradeoff between hardirq and softirq is
generally unimportant. It does matter a lot in realtime settings --
but the RT kernel effectively makes _every_ interrupt handler threaded,
so it won't benefit from this change.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-10 14:12:58 UTC
Permalink
Post by Alan Stern
[Thomas and Steve, please see the question below.]
That isn't clear. It is documented that USB completion handlers are
called with local interrupts disabled. An IRQ arriving before the
tasklet starts up might therefore be serviced during the short interval
before the tasklet disables local interrupts, but if that happens it
would mean that the completion handler would be delayed.
That is what tasklets do by definition, isn't it?
Post by Alan Stern
In effect, you are prioritizing other IRQs higher than USB completion
handlers. Nevertheless, the total time spent with interrupts disabled
will remain the same.
It pobably slightly increases. You have colder caches twice.
And potentially you swich CPUs.
Post by Alan Stern
(There's one other side effect that perhaps you haven't considered.
When multiple URBs are addressed to the same endpoint, their completion
handlers are called in order of URB completion, which is the same as
the order of URB submission unless an URB is cancelled. By delegating
the completion handlers to tasklets, and particularly by using per-CPU
tasklets, you may violate this behavior.)
This is quite serious. It mustn't happen.

Regards
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-10 15:33:55 UTC
Permalink
Post by Oliver Neukum
Post by Alan Stern
[Thomas and Steve, please see the question below.]
That isn't clear. It is documented that USB completion handlers are
called with local interrupts disabled. An IRQ arriving before the
tasklet starts up might therefore be serviced during the short interval
before the tasklet disables local interrupts, but if that happens it
would mean that the completion handler would be delayed.
That is what tasklets do by definition, isn't it?
Yes. Although tasklets in general have no reason to leave interrupts
disabled.
Post by Oliver Neukum
Post by Alan Stern
In effect, you are prioritizing other IRQs higher than USB completion
handlers. Nevertheless, the total time spent with interrupts disabled
will remain the same.
It pobably slightly increases. You have colder caches twice.
And potentially you swich CPUs.
Actually, the situation is a little better in one respect, which was
pointed out earlier but not emphasized. The DMA unmapping can be
deferred to the tasklet, and it can run with interrupts enabled before
the completion hanlder is called. Since the unmapping sometimes takes
a while to run, this is an advantage.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-10 15:37:30 UTC
Permalink
Post by Alan Stern
[Thomas and Steve, please see the question below.]
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
Hi,
The patchset supports to run giveback of URB in tasklet context, so that
DMA unmapping/mapping on transfer buffer and compelte() callback can be
run with interrupt enabled, then time of HCD interrupt handler can be
saved much. Also this approach may simplify HCD since HCD lock needn't
be released any more before calling usb_hcd_giveback_urb().
That's a lot of material. However, you haven't specifically said
_what_ you want to accomplish. It sounds like you're trying to
All IRQs are disabled(locally) in hard interrupt context, but they are
enabled in softirq context, this patchset can decrease the time a lot.
That isn't clear. It is documented that USB completion handlers are
called with local interrupts disabled. An IRQ arriving before the
Is there any good reason to do URB giveback with interrupt disabled?

IMO, disabling IRQs isn't necessary for completing URB since the
complete() of URBs for same endpoint is reentrant.
Post by Alan Stern
tasklet starts up might therefore be serviced during the short interval
before the tasklet disables local interrupts, but if that happens it
Tasklet doesn't disable local interrupts.
Post by Alan Stern
would mean that the completion handler would be delayed.
Yes, the tasklet schedule delay is introduced to URB completion, but
the delay doesn't have much side effect about completing URB.

For async transfer, generally, it should have no effect since there should
have URBs queued on the qh queue before submitting URB.

For periodic transfer, the patch uses high priority tasklet to schedule it.
Post by Alan Stern
In effect, you are prioritizing other IRQs higher than USB completion
handlers.
Both other IRQs and HCD's IRQ are prioritizing the URB completion.
Post by Alan Stern
Nevertheless, the total time spent with interrupts disabled
will remain the same.
No, the total time spent with interrupts disabled does not remain same,
since no local IRQs are disabled during URB giveback anymore.
Post by Alan Stern
(There's one other side effect that perhaps you haven't considered.
When multiple URBs are addressed to the same endpoint, their completion
handlers are called in order of URB completion, which is the same as
the order of URB submission unless an URB is cancelled. By delegating
the completion handlers to tasklets, and particularly by using per-CPU
tasklets, you may violate this behavior.)
Even though URB giveback is handled by hard interrupt handler, it still can't
guarantee that 100%, please see below:

- interrupt for URB A comes in CPU 0, and handled, then HCD private
lock is released, and usb_hcd_giveback_urb() is called to complete URB A

- interrupt for URB B comes in CPU 1 just before CPU0 releases HCD private
lock, so CPU1 holds the private lock when CPU 0 releases the lock and
handles the HCD interrupt.

So now just like two CPUs are racing, if CPU0 wins, URB A is completed
first, otherwise URB B is completed first.

With introducing completing URB in percpu tasklet, the situation is
just very similar.

On ehci hcd, it can make sure that by always letting CPU0 complete all URBs
(use ehci->scanning/ehci->need_rescan flag), but the patch can change percpu
tasklet to tasklet to meet the demand.
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
minimize the amount of time spent in hardirq context, but I can't be
sure that is really what you want.
If it is, this is not a good way to do it. A better way to minimize
the time spent in hardirq context is to use a threaded interrupt
handler. But why do you want to minimize this time in the first place?
Process context switch may have more cost than switching to softirq,
then USB performance may be effected, that is why I choose to use
tasklet. Also now there is no any sleep in URB giveback, so it isn't
necessary to introduce process to handle the giveback or completion.
Thomas and Steve, can you comment on this? I do not have a good
understanding of the relative benefits/drawbacks of tasklets vs.
threaded interrupt handlers.
Post by Ming Lei
Post by Alan Stern
The CPU will be still have to use at least as much time as before; the
only difference is that it will be in softirq or in a high-priority
thread instead of in hardirq.
Yes, as I said above, but we can allow other IRQs to be served in
handling URB giveback in softirq context, that is the value of softirq/tasklet.
Of course, system may have more important things to do during the time.
Generally speaking, for one driver, hard interrupt handler should always
consume less time as possible.
As I understand it, the tradeoff between hardirq and softirq is
generally unimportant. It does matter a lot in realtime settings --
I don't think so, and obviously softirq allows IRQs to be served quickly,
otherwise why is softirq and tasklet introduced? and why so many drivers
are using tasklet/softirq?
Post by Alan Stern
but the RT kernel effectively makes _every_ interrupt handler threaded,
so it won't benefit from this change.
We still need mainline kernel to benefit the change, :-)

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-10 17:36:11 UTC
Permalink
Post by Ming Lei
Post by Alan Stern
That isn't clear. It is documented that USB completion handlers are
called with local interrupts disabled. An IRQ arriving before the
Is there any good reason to do URB giveback with interrupt disabled?
That's a good question. Offhand I can't think of any drivers that rely
on this -- although there certainly are places where callback routines
use spin_lock() rather than spin_lock_irq() or spin_lock_irqsave(),
because they know that interrupts are already disabled.

However, changing a documented API is not something to be done lightly.
You would have to audit _every_ USB driver to make sure no trouble
would arise.
Post by Ming Lei
IMO, disabling IRQs isn't necessary for completing URB since the
complete() of URBs for same endpoint is reentrant.
Whether it is _necessary_ isn't the point. It is _documented_, and
therefore it cannot be changed easily.
Post by Ming Lei
Post by Alan Stern
tasklet starts up might therefore be serviced during the short interval
before the tasklet disables local interrupts, but if that happens it
Tasklet doesn't disable local interrupts.
It is documented that interrupts will be disabled while the completion
handler runs. Therefore the tasklet _must_ disable local interrupts.
Post by Ming Lei
Post by Alan Stern
would mean that the completion handler would be delayed.
Yes, the tasklet schedule delay is introduced to URB completion, but
the delay doesn't have much side effect about completing URB.
What about delays that occur because a separate interrupt is serviced
before the URB's completion handler is called? With the current code
that can't happen, but with your scheme it can.
Post by Ming Lei
For async transfer, generally, it should have no effect since there should
have URBs queued on the qh queue before submitting URB.
That's not how the Bulk-Only mass-storage protocol works. There are
times when the protocol _requires_ bulk packets not to be submitted
until a particular URB has completed.

Since mass-storage is an important use case for USB, its requirements
should not be ignored.
Post by Ming Lei
Post by Alan Stern
In effect, you are prioritizing other IRQs higher than USB completion
handlers.
Both other IRQs and HCD's IRQ are prioritizing the URB completion.
I don't understand that sentence. "Prioritizing" means "assigning a
priority to". IRQs and HCDs don't assign priorities to anything;
priorities are assigned by human programmers.
Post by Ming Lei
Post by Alan Stern
Nevertheless, the total time spent with interrupts disabled
will remain the same.
No, the total time spent with interrupts disabled does not remain same,
since no local IRQs are disabled during URB giveback anymore.
That is a bug. Local IRQs must be disabled during URB giveback.
Post by Ming Lei
Post by Alan Stern
(There's one other side effect that perhaps you haven't considered.
When multiple URBs are addressed to the same endpoint, their completion
handlers are called in order of URB completion, which is the same as
the order of URB submission unless an URB is cancelled. By delegating
the completion handlers to tasklets, and particularly by using per-CPU
tasklets, you may violate this behavior.)
Even though URB giveback is handled by hard interrupt handler, it still can't
- interrupt for URB A comes in CPU 0, and handled, then HCD private
lock is released, and usb_hcd_giveback_urb() is called to complete URB A
- interrupt for URB B comes in CPU 1 just before CPU0 releases HCD private
lock, so CPU1 holds the private lock when CPU 0 releases the lock and
handles the HCD interrupt.
So now just like two CPUs are racing, if CPU0 wins, URB A is completed
first, otherwise URB B is completed first.
Although you didn't say so, I assume you mean that A and B are URBs for
the same endpoint. This means they use the same host controller and
hence they use the same IRQ line.

When an interrupt is handled, it is not possible for a second interrupt
to arrive on the same IRQ line before the handler for the first
interrupt returns. The kernel disables the IRQ line when the first
interrupt occurs, and keeps it disabled until all the handlers are
finished. Therefore the scenario you described cannot occur.
Post by Ming Lei
With introducing completing URB in percpu tasklet, the situation is
just very similar.
No, it isn't, for the reason given above.
Post by Ming Lei
On ehci hcd, it can make sure that by always letting CPU0 complete all URBs
(use ehci->scanning/ehci->need_rescan flag), but the patch can change percpu
tasklet to tasklet to meet the demand.
Or perhaps you could assign each endpoint to a particular tasklet.
Post by Ming Lei
Post by Alan Stern
As I understand it, the tradeoff between hardirq and softirq is
generally unimportant. It does matter a lot in realtime settings --
I don't think so, and obviously softirq allows IRQs to be served quickly,
otherwise why is softirq and tasklet introduced? and why so many drivers
are using tasklet/softirq?
This is part of what I would like to hear Thomas and Steve comment on.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Steven Rostedt
2013-06-10 18:52:08 UTC
Permalink
Post by Ming Lei
Post by Ming Lei
I don't think so, and obviously softirq allows IRQs to be served quickly,
otherwise why is softirq and tasklet introduced?
Because of history. In the old days there were top halves (hard irq) and
bottom halves (like a softirq). The thing about a bottom half was, no
two could run at the same time. That is, if one CPU was running a bottom
half, another CPU would have to wait for that one to complete before it
could run any bottom half. The killer here is that it didn't matter if
it was the same bottom half or not! Microsoft/Mindcraft was kind enough
to point out this inefficiency with some benchmarks showing how horrible
Linux ran on a 4 way box. Thanks to them, softirqs and tasklets were
born :-)

A softirq does the rest of the interrupt work with interrupts enabled.

A tasklet is run from softirq context, but the difference between a
softirq and tasklet is that the same tasklet can not run on two
different CPUs at the same time. It acts similar to a task, as a task
can not run on two different CPUs at the same time either, hence the
name "tasklet". But tasklets are better than bottom halves because it
allows for different tasklets to run on different CPUs.
Post by Ming Lei
and why so many drivers
Post by Ming Lei
are using tasklet/softirq?
Because it's easy to set up and device driver authors don't know any
better ;-). Note, a lot of drivers are now using work queues today,
which run as a task.

Yes, there's a little more overhead with a task than for a softirq, but
the problem with softirqs and tasklets is that they can't be preempted,
and they are more important than all tasks on the system. If you have a
task that is critical, it must yield for your softirq. Almost!

That is, even if you have a softirq, it may not run in irq context at
all. If you get too many softirqs at a time (one comes while another one
is running), it will defer the processing to the ksoftirq thread. This
not only runs as a task, but also runs as SCHED_OTHER, and must yield to
other tasks like everyone else too.

Thus, adding it as a softirq does not guarantee that it will be
processed quickly. It just means that most of the time it will prevent
anything else from happening while your "most important handler in the
world" is running.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-10 20:47:09 UTC
Permalink
Post by Steven Rostedt
Post by Ming Lei
and why so many drivers
Post by Ming Lei
are using tasklet/softirq?
Because it's easy to set up and device driver authors don't know any
better ;-). Note, a lot of drivers are now using work queues today,
which run as a task.
Yes, there's a little more overhead with a task than for a softirq, but
the problem with softirqs and tasklets is that they can't be preempted,
and they are more important than all tasks on the system. If you have a
task that is critical, it must yield for your softirq. Almost!
That is, even if you have a softirq, it may not run in irq context at
all. If you get too many softirqs at a time (one comes while another one
is running), it will defer the processing to the ksoftirq thread. This
not only runs as a task, but also runs as SCHED_OTHER, and must yield to
other tasks like everyone else too.
Thus, adding it as a softirq does not guarantee that it will be
processed quickly. It just means that most of the time it will prevent
anything else from happening while your "most important handler in the
world" is running.
From this, it sounds like you generally advise using threaded interrupt
handlers rather than tasklets/softirqs.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Steven Rostedt
2013-06-10 20:54:40 UTC
Permalink
Post by Alan Stern
Post by Steven Rostedt
Post by Ming Lei
and why so many drivers
Post by Ming Lei
are using tasklet/softirq?
Because it's easy to set up and device driver authors don't know any
better ;-). Note, a lot of drivers are now using work queues today,
which run as a task.
Yes, there's a little more overhead with a task than for a softirq, but
the problem with softirqs and tasklets is that they can't be preempted,
and they are more important than all tasks on the system. If you have a
task that is critical, it must yield for your softirq. Almost!
That is, even if you have a softirq, it may not run in irq context at
all. If you get too many softirqs at a time (one comes while another one
is running), it will defer the processing to the ksoftirq thread. This
not only runs as a task, but also runs as SCHED_OTHER, and must yield to
other tasks like everyone else too.
Thus, adding it as a softirq does not guarantee that it will be
processed quickly. It just means that most of the time it will prevent
anything else from happening while your "most important handler in the
world" is running.
From this, it sounds like you generally advise using threaded interrupt
handlers rather than tasklets/softirqs.
Yes, there's plenty of benefits for them, and I highly doubt that any
USB device would even notice the difference between a softirq and a
thread for response time latencies.

-- Steve



--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-11 08:40:03 UTC
Permalink
Post by Steven Rostedt
Post by Alan Stern
Post by Steven Rostedt
Post by Ming Lei
and why so many drivers
Post by Ming Lei
are using tasklet/softirq?
Because it's easy to set up and device driver authors don't know any
better ;-). Note, a lot of drivers are now using work queues today,
which run as a task.
Yes, there's a little more overhead with a task than for a softirq, but
the problem with softirqs and tasklets is that they can't be preempted,
and they are more important than all tasks on the system. If you have a
task that is critical, it must yield for your softirq. Almost!
That is, even if you have a softirq, it may not run in irq context at
all. If you get too many softirqs at a time (one comes while another one
is running), it will defer the processing to the ksoftirq thread. This
not only runs as a task, but also runs as SCHED_OTHER, and must yield to
other tasks like everyone else too.
Thus, adding it as a softirq does not guarantee that it will be
processed quickly. It just means that most of the time it will prevent
anything else from happening while your "most important handler in the
world" is running.
From this, it sounds like you generally advise using threaded interrupt
handlers rather than tasklets/softirqs.
Yes, there's plenty of benefits for them, and I highly doubt that any
USB device would even notice the difference between a softirq and a
thread for response time latencies.
Steven, thanks for your input.

IMO, about the problem, the most important point between threaded interrupt
handler and tasklet is the latency.

For USB video/audio application, the data need to be handled very quickly.
Also new transfer need to be scheduled without much latency. I am wondering
if interrupt thread handler can respond quickly enough if there is high load on
CPUs.

For USB mass storage driver, it still need to be handled quickly.

If tasklet is taken, tasklet_schedule() is always called from hard interrupt
context, so most of time the latency should be better than interrupt thread
handler latency since tasklet handler is called just after irq handler exits in
the situation. Also we can avoid to do tasklet_schedule when the tasklet
is being handled.

I will compare, collect and post out some latency data later for the sake of
choosing which one to handle URB giveback.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-10 20:51:57 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
Tasklet doesn't disable local interrupts.
It is documented that interrupts will be disabled while the completion
handler runs. Therefore the tasklet _must_ disable local interrupts.
You know, it may be that you can get most of the advantages you want by
enabling local interrupts around the call to unmap_urb_for_dma() in
usb_hcd_giveback_urb().

This may be a little dangerous, though, because it is possible for an
URB to be given back at the time it is submitted. Drivers may not
expect interrupts to get enabled temporarily when they call
usb_submit_urb().

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-11 06:19:33 UTC
Permalink
Post by Alan Stern
Post by Alan Stern
Post by Ming Lei
Tasklet doesn't disable local interrupts.
It is documented that interrupts will be disabled while the completion
handler runs. Therefore the tasklet _must_ disable local interrupts.
You know, it may be that you can get most of the advantages you want by
enabling local interrupts around the call to unmap_urb_for_dma() in
usb_hcd_giveback_urb().
No, please don't enable IRQs inside interrupt handler, and you will
get warning dump.

Except for unmap_urb_for_dma() in usb_hcd_giveback_urb(), I hope
map_urb_for_dma() inside complete() can benefit from the change too,
since map_urb_for_dma() is same time-consuming with unmap_urb_for_dma().

Also I hope complete() can be run with interrupt enabled since
driver's complete() may take long time on some ARCH, as I mentioned
in my last mail. And it isn't good to disable interrupt only for one interface
driver, looks I still don't get real good explanation about disabling
IRQs here, :-)
Post by Alan Stern
This may be a little dangerous, though, because it is possible for an
URB to be given back at the time it is submitted. Drivers may not
expect interrupts to get enabled temporarily when they call
usb_submit_urb().
Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-11 05:40:20 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
That isn't clear. It is documented that USB completion handlers are
called with local interrupts disabled. An IRQ arriving before the
Is there any good reason to do URB giveback with interrupt disabled?
That's a good question. Offhand I can't think of any drivers that rely
on this -- although there certainly are places where callback routines
use spin_lock() rather than spin_lock_irq() or spin_lock_irqsave(),
because they know that interrupts are already disabled.
Complete() may take long time, for example in UVC's driver, URB's
complete() may take more than 3 ms, as reported in below link:

http://marc.info/?t=136438111600010&r=1&w=2

Also complete() is provided by interface driver, and it does many driver
specific works, so it isn't good to disable interrupt only for one driver.

If complete() callback runs in one tasklet context, spin_lock() inside
complete() is enough, just like hard irq, the tasklet itself is disabled
during complete(), if the percpu tasklet is converted to single tasklet.
So no problem when the lock is to protect shared resources between
complete().

When the lock is to protect shared resources between complete() and
non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
context, which is enough to prevent tasklet from being run on the CPU,
so no problem for this situation too.

When all HCDs support to run URB giveback in tasklet context, we can
change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
in other places, the spin_lock_irq*() can be changed to spin_lock_bh().
Post by Alan Stern
However, changing a documented API is not something to be done lightly.
You would have to audit _every_ USB driver to make sure no trouble
would arise.
OK, I will write patch to request for changing the API if the improvement
on the situation isn't objected.

In fact, at least now, the change on API is only about document change, and
no drivers' change is required, isn't it? We can describe that URB->complete()
might run in hard irq or softirq context, depends on HCD_BH flag of
hcd->driver->flags.
Post by Alan Stern
Post by Ming Lei
IMO, disabling IRQs isn't necessary for completing URB since the
complete() of URBs for same endpoint is reentrant.
Whether it is _necessary_ isn't the point. It is _documented_, and
therefore it cannot be changed easily.
Same with above.
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
tasklet starts up might therefore be serviced during the short interval
before the tasklet disables local interrupts, but if that happens it
Tasklet doesn't disable local interrupts.
It is documented that interrupts will be disabled while the completion
handler runs. Therefore the tasklet _must_ disable local interrupts.
Same with above, we can change the document which itself shouldn't
have been the only reason for blocking the change, :-)
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
would mean that the completion handler would be delayed.
Yes, the tasklet schedule delay is introduced to URB completion, but
the delay doesn't have much side effect about completing URB.
What about delays that occur because a separate interrupt is serviced
before the URB's completion handler is called? With the current code
that can't happen, but with your scheme it can.
Even with current code, one HCD's interrupt handling still may delay
the handling for another HCD interrupt handling for quite long, that is
the motivation to decrease the interrupt handling time of USB HCD, ;-)
And finally, USB HCDs themselves will benefit from the change, won't they?

Also it shouldn't be a problem since most of interrupt handlers
takes very less time.
Post by Alan Stern
Post by Ming Lei
For async transfer, generally, it should have no effect since there should
have URBs queued on the qh queue before submitting URB.
That's not how the Bulk-Only mass-storage protocol works. There are
times when the protocol _requires_ bulk packets not to be submitted
until a particular URB has completed.
Yes, I agree. But mass-storage driver itself is very fragile, it will perform
badly if CPU has a higher load. And it is better to submit DATA/CSW
transfer in previous transfer's complete(), IMO.

Compared with "usb-storage" task's schedule latency, the tasklet
schedule delay should be lower at most of situations since the tasklet
is scheduled inside irq handler.
Post by Alan Stern
Since mass-storage is an important use case for USB, its requirements
should not be ignored.
That is why I did many tests on mass storage case, and from the results
in commit log of patch 4/4, looks no performance loss is involved with
the change on usb storage.
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
In effect, you are prioritizing other IRQs higher than USB completion
handlers.
Both other IRQs and HCD's IRQ are prioritizing the URB completion.
I don't understand that sentence. "Prioritizing" means "assigning a
priority to". IRQs and HCDs don't assign priorities to anything;
priorities are assigned by human programmers.
Sorry, prioritizing should have been "prioritized".
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
Nevertheless, the total time spent with interrupts disabled
will remain the same.
No, the total time spent with interrupts disabled does not remain same,
since no local IRQs are disabled during URB giveback anymore.
That is a bug. Local IRQs must be disabled during URB giveback.
I guess it is still the document API. If so, please see my above explanation,
and I will request to change it.
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
(There's one other side effect that perhaps you haven't considered.
When multiple URBs are addressed to the same endpoint, their completion
handlers are called in order of URB completion, which is the same as
the order of URB submission unless an URB is cancelled. By delegating
the completion handlers to tasklets, and particularly by using per-CPU
tasklets, you may violate this behavior.)
Even though URB giveback is handled by hard interrupt handler, it still can't
- interrupt for URB A comes in CPU 0, and handled, then HCD private
lock is released, and usb_hcd_giveback_urb() is called to complete URB A
- interrupt for URB B comes in CPU 1 just before CPU0 releases HCD private
lock, so CPU1 holds the private lock when CPU 0 releases the lock and
handles the HCD interrupt.
So now just like two CPUs are racing, if CPU0 wins, URB A is completed
first, otherwise URB B is completed first.
Although you didn't say so, I assume you mean that A and B are URBs for
the same endpoint. This means they use the same host controller and
Exactly.
Post by Alan Stern
hence they use the same IRQ line.
When an interrupt is handled, it is not possible for a second interrupt
to arrive on the same IRQ line before the handler for the first
interrupt returns. The kernel disables the IRQ line when the first
interrupt occurs, and keeps it disabled until all the handlers are
finished. Therefore the scenario you described cannot occur.
OK, thanks for pointing it out.

So I decided to replace the percpu tasklet with one single tasklet,
which can avoid the problem.

Also the tasklet running in CPU0 can handle the work which should
have been done by the same tasket scheduled in CPU1, so we can
avoid busy-waitting in CPU1 which may be introduced by tasklet_schedule()
in CPU1.
Post by Alan Stern
Post by Ming Lei
With introducing completing URB in percpu tasklet, the situation is
just very similar.
No, it isn't, for the reason given above.
Right, but the single tasklet may avoid the problem.
Post by Alan Stern
Post by Ming Lei
On ehci hcd, it can make sure that by always letting CPU0 complete all URBs
(use ehci->scanning/ehci->need_rescan flag), but the patch can change percpu
tasklet to tasklet to meet the demand.
Or perhaps you could assign each endpoint to a particular tasklet.
That still need changing percpu tasklet to single tasklet, since per-endpoint
tasklet can't be distributed on each CPU.
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
As I understand it, the tradeoff between hardirq and softirq is
generally unimportant. It does matter a lot in realtime settings --
I don't think so, and obviously softirq allows IRQs to be served quickly,
otherwise why is softirq and tasklet introduced? and why so many drivers
are using tasklet/softirq?
This is part of what I would like to hear Thomas and Steve comment on.
Me too, :-)

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-11 07:18:42 UTC
Permalink
Post by Ming Lei
If complete() callback runs in one tasklet context, spin_lock() inside
complete() is enough, just like hard irq, the tasklet itself is disabled
during complete(), if the percpu tasklet is converted to single tasklet.
So no problem when the lock is to protect shared resources between
complete().
We also have exclusion between complete() and other contexts, i.e. timers.
Post by Ming Lei
When the lock is to protect shared resources between complete() and
non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
context, which is enough to prevent tasklet from being run on the CPU,
so no problem for this situation too.
When all HCDs support to run URB giveback in tasklet context, we can
change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
in other places, the spin_lock_irq*() can be changed to spin_lock_bh().
Even now we cannot guarantee that all calls to complete() are in irq.
There is the case of HCD hotunplug and other cases, like timeouts.
They will have to be verified.

Regards
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-11 08:14:25 UTC
Permalink
Post by Oliver Neukum
Post by Ming Lei
If complete() callback runs in one tasklet context, spin_lock() inside
complete() is enough, just like hard irq, the tasklet itself is disabled
during complete(), if the percpu tasklet is converted to single tasklet.
So no problem when the lock is to protect shared resources between
complete().
We also have exclusion between complete() and other contexts, i.e. timers.
The patch also converts the complete() called in timers to tasklet context too.

So could you let me know if that eases your concern? If not, could you explain
your concern about other contexts in a bit detail?
Post by Oliver Neukum
Post by Ming Lei
When the lock is to protect shared resources between complete() and
non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
context, which is enough to prevent tasklet from being run on the CPU,
so no problem for this situation too.
When all HCDs support to run URB giveback in tasklet context, we can
change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
in other places, the spin_lock_irq*() can be changed to spin_lock_bh().
Even now we cannot guarantee that all calls to complete() are in irq.
There is the case of HCD hotunplug and other cases, like timeouts.
They will have to be verified.
All URBs are completed via usb_hcd_giveback_urb(), so there should
be no differences between these cases and the common one about
introducing the patchset.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-11 08:49:23 UTC
Permalink
Post by Ming Lei
Post by Oliver Neukum
Post by Ming Lei
If complete() callback runs in one tasklet context, spin_lock() inside
complete() is enough, just like hard irq, the tasklet itself is disabled
during complete(), if the percpu tasklet is converted to single tasklet.
So no problem when the lock is to protect shared resources between
complete().
We also have exclusion between complete() and other contexts, i.e. timers.
The patch also converts the complete() called in timers to tasklet context too.
So could you let me know if that eases your concern? If not, could you explain
your concern about other contexts in a bit detail?
The driver itself may have submitted a timer and race against it.
What locking do you need in complete() and a timer to lock against
each other?
Post by Ming Lei
Post by Oliver Neukum
Even now we cannot guarantee that all calls to complete() are in irq.
There is the case of HCD hotunplug and other cases, like timeouts.
They will have to be verified.
All URBs are completed via usb_hcd_giveback_urb(), so there should
be no differences between these cases and the common one about
introducing the patchset.
But it makes no sense to go to a tasklet when you are already in task context.
In those cases you need to do something, essentially blocking the tasklet.

Regards
Oliver


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-11 09:27:28 UTC
Permalink
Post by Oliver Neukum
Post by Ming Lei
Post by Oliver Neukum
Post by Ming Lei
If complete() callback runs in one tasklet context, spin_lock() inside
complete() is enough, just like hard irq, the tasklet itself is disabled
during complete(), if the percpu tasklet is converted to single tasklet.
So no problem when the lock is to protect shared resources between
complete().
We also have exclusion between complete() and other contexts, i.e. timers.
The patch also converts the complete() called in timers to tasklet context too.
So could you let me know if that eases your concern? If not, could you explain
your concern about other contexts in a bit detail?
The driver itself may have submitted a timer and race against it.
What locking do you need in complete() and a timer to lock against
each other?
Good catch.

The problem will come if only spin_lock() is called inside complete(),
I will check main USB drivers in tree to see if there is such use case.
Post by Oliver Neukum
Post by Ming Lei
Post by Oliver Neukum
Even now we cannot guarantee that all calls to complete() are in irq.
There is the case of HCD hotunplug and other cases, like timeouts.
They will have to be verified.
All URBs are completed via usb_hcd_giveback_urb(), so there should
be no differences between these cases and the common one about
introducing the patchset.
But it makes no sense to go to a tasklet when you are already in task context.
In those cases you need to do something, essentially blocking the tasklet.
At least now, always doing complete() in tasklet handler can simplify
implementation since these cases aren't in hot path.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-11 10:51:41 UTC
Permalink
Post by Ming Lei
Post by Oliver Neukum
The driver itself may have submitted a timer and race against it.
What locking do you need in complete() and a timer to lock against
each other?
Good catch.
The problem will come if only spin_lock() is called inside complete(),
I will check main USB drivers in tree to see if there is such use case.
All network drivers race against timeout. I think they just unlink the URB,
but there's a lot of them.
Post by Ming Lei
Post by Oliver Neukum
But it makes no sense to go to a tasklet when you are already in task context.
In those cases you need to do something, essentially blocking the tasklet.
At least now, always doing complete() in tasklet handler can simplify
implementation since these cases aren't in hot path.
Well, I am afraid this is not simply the case. These cases are partially
synchronous. For example you need to make sure all calls to complete()
are finished before you disconnect a HCD itself. The same applies to a device
being disconnected.

It the same area, what happens if an URB is unlinked between the irq handler
and the tasklet?

Regards
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-11 11:19:29 UTC
Permalink
Post by Oliver Neukum
Post by Ming Lei
Post by Oliver Neukum
The driver itself may have submitted a timer and race against it.
What locking do you need in complete() and a timer to lock against
each other?
Good catch.
The problem will come if only spin_lock() is called inside complete(),
I will check main USB drivers in tree to see if there is such use case.
All network drivers race against timeout. I think they just unlink the URB,
but there's a lot of them.
usbnet is fine since no spin_lock() is used in complete() and the same lock
is acquired in timer handler.

As far as I think of, the problem only exists in the below situation:

complete():
spin_lock(&A)
...
spin_unlock(&A)

timer handler():
spin_lock(&A)
....
spin_unlock(&A)

And we need to replace spin_lock() in complete() with spin_lock_irqsave()
if the change is going to be introduced.
Post by Oliver Neukum
Post by Ming Lei
Post by Oliver Neukum
But it makes no sense to go to a tasklet when you are already in task context.
In those cases you need to do something, essentially blocking the tasklet.
At least now, always doing complete() in tasklet handler can simplify
implementation since these cases aren't in hot path.
Well, I am afraid this is not simply the case. These cases are partially
synchronous. For example you need to make sure all calls to complete()
are finished before you disconnect a HCD itself. The same applies to a device
being disconnected.
That is fine since the coming giveback() in tasklet context will drop the URB's
reference count(urb->use_count) finally if the scheduled tasklet can't
be dropped.
(looks tasklet schedule can make sure that)
Post by Oliver Neukum
It the same area, what happens if an URB is unlinked between the irq handler
and the tasklet?
The unlink will return failure since the URB isn't in queue of ep->urb_list.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-11 19:10:03 UTC
Permalink
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
Is there any good reason to do URB giveback with interrupt disabled?
That's a good question. Offhand I can't think of any drivers that rely
on this -- although there certainly are places where callback routines
use spin_lock() rather than spin_lock_irq() or spin_lock_irqsave(),
because they know that interrupts are already disabled.
Complete() may take long time, for example in UVC's driver, URB's
http://marc.info/?t=136438111600010&r=1&w=2
Also complete() is provided by interface driver, and it does many driver
specific works, so it isn't good to disable interrupt only for one driver.
You probably mean "it isn't good to disable interrupts for the sake of
only one driver". As things stand now, we disable interrupts for all
callbacks in all drivers.
Post by Ming Lei
If complete() callback runs in one tasklet context, spin_lock() inside
complete() is enough, just like hard irq, the tasklet itself is disabled
during complete(), if the percpu tasklet is converted to single tasklet.
So no problem when the lock is to protect shared resources between
complete().
When the lock is to protect shared resources between complete() and
non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
context, which is enough to prevent tasklet from being run on the CPU,
so no problem for this situation too.
When all HCDs support to run URB giveback in tasklet context, we can
change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
in other places, the spin_lock_irq*() can be changed to spin_lock_bh().
I don't follow your reasoning. Consider the following situation, where
the same spinlock is used in both a URB completion handler and an
interrupt handler:

URB completes
A tasklet calls the completion handler, with interrupts enabled
The completion handler does
spin_lock(&lock);
An interrupt occurs
The interrupt handler does
spin_lock(&lock); // Deadlock!

In order to prevent this from happening, you would have to change the
spin_lock() call in the completion handler to spin_lock_irqsave().
Furthermore, you will have to audit every USB driver to make sure that
all the completion handlers get fixed.
Post by Ming Lei
Post by Alan Stern
However, changing a documented API is not something to be done lightly.
You would have to audit _every_ USB driver to make sure no trouble
would arise.
OK, I will write patch to request for changing the API if the improvement
on the situation isn't objected.
I don't mind. But don't be surprised if you end up breaking some
drivers.
Post by Ming Lei
In fact, at least now, the change on API is only about document change, and
no drivers' change is required, isn't it? We can describe that URB->complete()
might run in hard irq or softirq context, depends on HCD_BH flag of
hcd->driver->flags.
The drivers _do_ need to be changed, as explained above. And that
explanation was only one failure scenario. There may be others.
Post by Ming Lei
Even with current code, one HCD's interrupt handling still may delay
the handling for another HCD interrupt handling for quite long, that is
the motivation to decrease the interrupt handling time of USB HCD, ;-)
And finally, USB HCDs themselves will benefit from the change, won't they?
How will they benefit? Merely from not releasing their private locks?
That involves a fairly small amount of code.
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
For async transfer, generally, it should have no effect since there should
have URBs queued on the qh queue before submitting URB.
That's not how the Bulk-Only mass-storage protocol works. There are
times when the protocol _requires_ bulk packets not to be submitted
until a particular URB has completed.
Yes, I agree. But mass-storage driver itself is very fragile, it will perform
badly if CPU has a higher load.
Why does that make it fragile? _Every_ driver and program will behave
worse when the system is under a heavy load.
Post by Ming Lei
And it is better to submit DATA/CSW
transfer in previous transfer's complete(), IMO.
Actually, it would be even better to submit DATA _before_ the CBW
completes, and to submit the CSW before submitting DATA-OUT.
Unfortunately, the design of the SG library makes it impossible to
submit the CSW during DATA-IN completion.
Post by Ming Lei
Compared with "usb-storage" task's schedule latency, the tasklet
schedule delay should be lower at most of situations since the tasklet
is scheduled inside irq handler.
True. But the two latencies add.
Post by Ming Lei
So I decided to replace the percpu tasklet with one single tasklet,
which can avoid the problem.
Hmmm. What happens when there is more than one host controller? The
tasklet will give back only one URB at a time, and it won't run on more
than one CPU at a time. The existing drivers allow URBs to be given
back on multiple CPUs simultaneously.
Post by Ming Lei
Also the tasklet running in CPU0 can handle the work which should
have been done by the same tasket scheduled in CPU1, so we can
avoid busy-waitting in CPU1 which may be introduced by tasklet_schedule()
in CPU1.
Or you could have a separate tasklet for each host controller.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-12 02:43:32 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
Is there any good reason to do URB giveback with interrupt disabled?
That's a good question. Offhand I can't think of any drivers that rely
on this -- although there certainly are places where callback routines
use spin_lock() rather than spin_lock_irq() or spin_lock_irqsave(),
because they know that interrupts are already disabled.
Complete() may take long time, for example in UVC's driver, URB's
http://marc.info/?t=136438111600010&r=1&w=2
Also complete() is provided by interface driver, and it does many driver
specific works, so it isn't good to disable interrupt only for one driver.
You probably mean "it isn't good to disable interrupts for the sake of
only one driver". As things stand now, we disable interrupts for all
callbacks in all drivers.
Post by Ming Lei
If complete() callback runs in one tasklet context, spin_lock() inside
complete() is enough, just like hard irq, the tasklet itself is disabled
during complete(), if the percpu tasklet is converted to single tasklet.
So no problem when the lock is to protect shared resources between
complete().
When the lock is to protect shared resources between complete() and
non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
context, which is enough to prevent tasklet from being run on the CPU,
so no problem for this situation too.
When all HCDs support to run URB giveback in tasklet context, we can
change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
in other places, the spin_lock_irq*() can be changed to spin_lock_bh().
I don't follow your reasoning. Consider the following situation, where
the same spinlock is used in both a URB completion handler and an
URB completes
A tasklet calls the completion handler, with interrupts enabled
The completion handler does
spin_lock(&lock);
An interrupt occurs
The interrupt handler does
spin_lock(&lock); // Deadlock!
If you mean the interrupt handler is HCD irq handler of the device, no
such problem since all complete() will run in tasklet context.

If not, I am wondering why one USB driver need register another hard
interrupt handler? Could you share such examples? I understand the
case only exists with timer handler as discussed with Oliver, don't I?
Post by Alan Stern
In order to prevent this from happening, you would have to change the
spin_lock() call in the completion handler to spin_lock_irqsave().
Furthermore, you will have to audit every USB driver to make sure that
all the completion handlers get fixed.
I have written one script to search usb drivers which may use timers and
spin_lock() at the same time, about 30 drivers are found, and looks we
can fix them.
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
However, changing a documented API is not something to be done lightly.
You would have to audit _every_ USB driver to make sure no trouble
would arise.
OK, I will write patch to request for changing the API if the improvement
on the situation isn't objected.
I don't mind. But don't be surprised if you end up breaking some
drivers.
Sure, we should change these drivers before changing the API.
Post by Alan Stern
Post by Ming Lei
In fact, at least now, the change on API is only about document change, and
no drivers' change is required, isn't it? We can describe that URB->complete()
might run in hard irq or softirq context, depends on HCD_BH flag of
hcd->driver->flags.
The drivers _do_ need to be changed, as explained above. And that
explanation was only one failure scenario. There may be others.
OK, I'd like to know what the others are? :-)
Post by Alan Stern
Post by Ming Lei
Even with current code, one HCD's interrupt handling still may delay
the handling for another HCD interrupt handling for quite long, that is
the motivation to decrease the interrupt handling time of USB HCD, ;-)
And finally, USB HCDs themselves will benefit from the change, won't they?
How will they benefit? Merely from not releasing their private locks?
The interrupt of one HCD will be responded lately since another HCD's
interrupt handler takes very long time. With the change, the problem can
be avoided.
Post by Alan Stern
That involves a fairly small amount of code.
Maybe no much code, but :

1) Inside interrupt handler no submitting and unlinking requests from
drivers can happen any more, so interrupt handler should have became
simple.

2), Also the race between irq handler and related timer handler needn't
to be considered because the private lock can't be released in irq handler.
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
For async transfer, generally, it should have no effect since there should
have URBs queued on the qh queue before submitting URB.
That's not how the Bulk-Only mass-storage protocol works. There are
times when the protocol _requires_ bulk packets not to be submitted
until a particular URB has completed.
Yes, I agree. But mass-storage driver itself is very fragile, it will perform
badly if CPU has a higher load.
Why does that make it fragile? _Every_ driver and program will behave
worse when the system is under a heavy load.
Tasklet and interrupt handler may be scheduled with less latency than
task scheduling latency under high load.
Post by Alan Stern
Post by Ming Lei
And it is better to submit DATA/CSW
transfer in previous transfer's complete(), IMO.
Actually, it would be even better to submit DATA _before_ the CBW
completes, and to submit the CSW before submitting DATA-OUT.
Unfortunately, the design of the SG library makes it impossible to
submit the CSW during DATA-IN completion.
Post by Ming Lei
Compared with "usb-storage" task's schedule latency, the tasklet
schedule delay should be lower at most of situations since the tasklet
is scheduled inside irq handler.
True. But the two latencies add.
I think it should be one tasklet latency added, but what is the other one?
Post by Alan Stern
Post by Ming Lei
So I decided to replace the percpu tasklet with one single tasklet,
which can avoid the problem.
Hmmm. What happens when there is more than one host controller? The
tasklet will give back only one URB at a time, and it won't run on more
than one CPU at a time. The existing drivers allow URBs to be given
back on multiple CPUs simultaneously.
Sorry for not describing it explicitly, I mean single tasklet is per HCD
since current percpu tasklet in this patchset is per HCD already.
Post by Alan Stern
Post by Ming Lei
Also the tasklet running in CPU0 can handle the work which should
have been done by the same tasket scheduled in CPU1, so we can
avoid busy-waitting in CPU1 which may be introduced by tasklet_schedule()
in CPU1.
Or you could have a separate tasklet for each host controller.
Yes, but I will compare tasklet with interrupt threaded handler first.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-12 06:41:16 UTC
Permalink
Post by Ming Lei
Post by Alan Stern
I don't follow your reasoning. Consider the following situation, where
the same spinlock is used in both a URB completion handler and an
URB completes
A tasklet calls the completion handler, with interrupts enabled
The completion handler does
spin_lock(&lock);
An interrupt occurs
The interrupt handler does
spin_lock(&lock); // Deadlock!
If you mean the interrupt handler is HCD irq handler of the device, no
such problem since all complete() will run in tasklet context.
If not, I am wondering why one USB driver need register another hard
interrupt handler? Could you share such examples? I understand the
case only exists with timer handler as discussed with Oliver, don't I?
In fact, timer funtion is still run in softirq context, so the deadlock won't
be caused with acquiring same lock(spin_lock) in both timer function
and complete() from tasklet.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Thomas Gleixner
2013-06-12 07:45:13 UTC
Permalink
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
Also the tasklet running in CPU0 can handle the work which should
have been done by the same tasket scheduled in CPU1, so we can
avoid busy-waitting in CPU1 which may be introduced by tasklet_schedule()
in CPU1.
Or you could have a separate tasklet for each host controller.
Yes, but I will compare tasklet with interrupt threaded handler first.
Yes, please. I read through the thread and I really recommend that you
get rid of the tasklet. tasklets are a complete disaster by design and
we really should make efforts to get rid of the last users instead of
trying to work around their semantical shortcomings.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-13 02:25:10 UTC
Permalink
Post by Thomas Gleixner
Yes, please. I read through the thread and I really recommend that you
get rid of the tasklet. tasklets are a complete disaster by design and
we really should make efforts to get rid of the last users instead of
trying to work around their semantical shortcomings.
The attached patch001 supports tasklet in HCD and patch002 implements
interrupt threaded handler, then if USB_HCD_THREADED_IRQ
is set, interrupt threaded handler is used to do URB giveback, otherwise
tasklet is used to do that.

The other 3 patches(2/4, 3/4, 4/4) aren't changed now, if anyone want to try it.
(the patch 2 is only for comparing performance difference, and welcome to
review/comment it so that we can get accurate test result to make decision)

Follows the mass storage test result(average over 10 times test) with
tasklet /interrupt threaded handler/hard interrupt handler under same
environment of lenovo T410(x86), which means the test is switched by
reinserting module of usbcore or ehci-hcd without changing other things
in the machine.

- do below tests 10 times and figure out the average speed

dd if=/dev/sdN of=/dev/null iflag=direct bs=200M 1

device: sandisk extreme USB 3.0 16G, host: Lenovo T410 EHCI

- using interrupt threaded handler(default)
33.440 MB/sec

- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec

- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s


So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.

I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)

So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)

Also another problem with irq threaded handler is that there is no sort of
tasklet_schedule() interface to wakeup the thread handler manually, so
I have to use work to schedule some URB giveback from drivers(root hub
transfer, unlink), even though that isn't a big deal but will cause code a bit
much/complicated, :-)

Thanks,
--
Ming Lei
Alan Stern
2013-06-13 14:54:13 UTC
Permalink
Post by Ming Lei
- using interrupt threaded handler(default)
33.440 MB/sec
- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec
- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s
So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.
I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)
So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Post by Ming Lei
Also another problem with irq threaded handler is that there is no sort of
tasklet_schedule() interface to wakeup the thread handler manually, so
I have to use work to schedule some URB giveback from drivers(root hub
transfer, unlink), even though that isn't a big deal but will cause code a bit
much/complicated, :-)
Yes, I was going to bring that up.

Thomas, sometimes we need the IRQ handler thread to do some work even
though an interrupt hasn't occurred. Is there an API for this, or can
one be added?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Greg Kroah-Hartman
2013-06-13 18:47:03 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
- using interrupt threaded handler(default)
33.440 MB/sec
- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec
- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s
So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.
I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)
So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Yes, but for "old-style" usb-storage, is this really a big deal? We are
still easily hitting the "line-speed" of USB for usb-storage with simple
machines, the bottlenecks that I'm seeing are in the devices themselves,
and then in the USB wire speed.

Once hardware comes out that uses USB streams, and we get device support
for the UAS protocol, then we might have a need to change things, but at
this point in time, for the "old" driver, I think we are fine.

Unless someone has a workload / benchmark that shows otherwise?

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-13 19:41:17 UTC
Permalink
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Ming Lei
- using interrupt threaded handler(default)
33.440 MB/sec
- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec
- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s
So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.
I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)
So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Yes, but for "old-style" usb-storage, is this really a big deal? We are
still easily hitting the "line-speed" of USB for usb-storage with simple
machines, the bottlenecks that I'm seeing are in the devices themselves,
and then in the USB wire speed.
What about with USB-3 storage devices? Many of them still use the
bulk-only transport instead of UAS. They may push the limits up.
Post by Greg Kroah-Hartman
Once hardware comes out that uses USB streams, and we get device support
for the UAS protocol, then we might have a need to change things, but at
this point in time, for the "old" driver, I think we are fine.
Unless someone has a workload / benchmark that shows otherwise?
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.

The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers. Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much. (Interrupt latency is hard to measure, however.)

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Steven Rostedt
2013-06-13 20:08:32 UTC
Permalink
Post by Alan Stern
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
That's because it was written like a top half and not a full blown
interrupt. I just looked at the patch, and saw this:

+#ifndef USB_HCD_THREADED_IRQ
if (sched) {
if (async)
tasklet_schedule(&bh->bh);
else
tasklet_hi_schedule(&bh->bh);
}
+#else
+ if (sched)
+ schedule_work(&hcd->periodic_bh->work);
+#endif

What is this? The work isn't done by an interrupt thread, but by work queues!

The point of the interrupt thread is that you do *all* the work that
needs to be done when an interrupt comes in. You don't need to delay the
work.

If you just treat a threaded interrupt like a real interrupt and push
off work to something else, then yes, it will degrade performance.

If you go the threaded interrupt route, you need to rethink the
paradigm. There's no reason that the interrupt handler needs to be fast
like it needs to be in true interrupt context. The handler can now use
mutexes, and other full features that currently only threads benefit
from. It should improve locking issues, and can serialize things if
needed.

All this patch did was to switch the main irq to a thread and make a
bottom half into a work queue.

Why couldn't you just do:

if (sched)
usb_giveback_urb_bh(bh);

?

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-13 21:09:43 UTC
Permalink
Post by Steven Rostedt
Post by Alan Stern
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
That's because it was written like a top half and not a full blown
+#ifndef USB_HCD_THREADED_IRQ
if (sched) {
if (async)
tasklet_schedule(&bh->bh);
else
tasklet_hi_schedule(&bh->bh);
}
+#else
+ if (sched)
+ schedule_work(&hcd->periodic_bh->work);
+#endif
What is this? The work isn't done by an interrupt thread, but by work queues!
You don't understand the patch. Most of the time, sched will be 0
here and hence the work queue won't be involved.

Yes, part of the work is done by a work queue rather than the interrupt
thread. But it is an unimportant part, the part that involves
transfers to root hubs or transfers that were cancelled. These things
can complete without any interrupt occurring, so they can't be handled
by the interrupt thread. However, they are the abnormal case; the
transfers we care about are not to root hubs and they do complete
normally.
Post by Steven Rostedt
The point of the interrupt thread is that you do *all* the work that
needs to be done when an interrupt comes in. You don't need to delay the
work.
You've got it backward. The patch doesn't leave part of the work
undone when an interrupt occurs. Rather it's the other way around --
sometimes work needs to be done when there isn't any interrupt. This
could happen in a timer callback, or it could happen as a direct result
of a function call.

Since there doesn't seem to be any way to invoke the interrupt thread
in the absence of an interrupt, Ming pushed the job off to a work
queue.
Post by Steven Rostedt
If you just treat a threaded interrupt like a real interrupt and push
off work to something else, then yes, it will degrade performance.
If you go the threaded interrupt route, you need to rethink the
paradigm. There's no reason that the interrupt handler needs to be fast
like it needs to be in true interrupt context. The handler can now use
mutexes, and other full features that currently only threads benefit
from. It should improve locking issues, and can serialize things if
needed.
All this patch did was to switch the main irq to a thread and make a
bottom half into a work queue.
In case it's not clear, the code you quoted above is part of the
interrupt handler, not part of the thread.
Post by Steven Rostedt
if (sched)
usb_giveback_urb_bh(bh);
?
Because usb_giveback_urb_bh() is supposed to run in the context of the
tasklet or interrupt thread or work queue, not in the context of the
interrupt handler.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Steven Rostedt
2013-06-13 22:24:20 UTC
Permalink
Post by Alan Stern
Post by Steven Rostedt
Post by Alan Stern
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
That's because it was written like a top half and not a full blown
+#ifndef USB_HCD_THREADED_IRQ
if (sched) {
if (async)
tasklet_schedule(&bh->bh);
else
tasklet_hi_schedule(&bh->bh);
}
+#else
+ if (sched)
+ schedule_work(&hcd->periodic_bh->work);
+#endif
What is this? The work isn't done by an interrupt thread, but by work queues!
You don't understand the patch.
Hey, I'll admit that I don't understand how USB works ;-)

I also only looked at the second patch without applying it. Thus, a lot
of the changes were out of context for me.
Post by Alan Stern
Most of the time, sched will be 0
here and hence the work queue won't be involved.
OK, I only compared that current tasklets were being commented out, and
that's pretty much all I had to go on.
Post by Alan Stern
Yes, part of the work is done by a work queue rather than the interrupt
thread. But it is an unimportant part, the part that involves
transfers to root hubs or transfers that were cancelled. These things
can complete without any interrupt occurring, so they can't be handled
by the interrupt thread. However, they are the abnormal case; the
transfers we care about are not to root hubs and they do complete
normally.
Post by Steven Rostedt
The point of the interrupt thread is that you do *all* the work that
needs to be done when an interrupt comes in. You don't need to delay the
work.
You've got it backward. The patch doesn't leave part of the work
undone when an interrupt occurs. Rather it's the other way around --
sometimes work needs to be done when there isn't any interrupt. This
could happen in a timer callback, or it could happen as a direct result
of a function call.
Since there doesn't seem to be any way to invoke the interrupt thread
in the absence of an interrupt, Ming pushed the job off to a work
queue.
Post by Steven Rostedt
If you just treat a threaded interrupt like a real interrupt and push
off work to something else, then yes, it will degrade performance.
If you go the threaded interrupt route, you need to rethink the
paradigm. There's no reason that the interrupt handler needs to be fast
like it needs to be in true interrupt context. The handler can now use
mutexes, and other full features that currently only threads benefit
from. It should improve locking issues, and can serialize things if
needed.
All this patch did was to switch the main irq to a thread and make a
bottom half into a work queue.
In case it's not clear, the code you quoted above is part of the
interrupt handler, not part of the thread.
Got it.
Post by Alan Stern
Post by Steven Rostedt
if (sched)
usb_giveback_urb_bh(bh);
?
Because usb_giveback_urb_bh() is supposed to run in the context of the
tasklet or interrupt thread or work queue, not in the context of the
interrupt handler.
I only took a quick look at the second patch. I'm now looking at both
patches applied to the code. I didn't realize this was called from the
top half.

Usually the top half for threaded interrupts is used just to quite the
interrupt line. Either by acknowledging the interrupt or by disabling
the device from sending more interrupts till the bottom half (thread)
can run. This looks to be doing a bit more than that.

I'll look a bit deeper at the patch, but this still doesn't look like a
typical threaded interrupt usage.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-13 23:08:27 UTC
Permalink
Post by Steven Rostedt
I only took a quick look at the second patch. I'm now looking at both
patches applied to the code. I didn't realize this was called from the
top half.
Usually the top half for threaded interrupts is used just to quite the
interrupt line. Either by acknowledging the interrupt or by disabling
the device from sending more interrupts till the bottom half (thread)
can run. This looks to be doing a bit more than that.
Yes, it does. Ming left all the host controller processing in the top
half. The bottom half merely handles the completion callbacks.

One of the things we discussed during this email thread was the
possibility of moving _all_ the work into the threaded handler.
That's not as easy to do, though; it requires significant modification
of the host controller driver. And each controller driver would need
its own modifications, whereas with Ming's approach only the core needs
to be changed.
Post by Steven Rostedt
I'll look a bit deeper at the patch, but this still doesn't look like a
typical threaded interrupt usage.
I'll agree with that; it isn't typical. Ming claims that the work
remaining in the top half often takes less than 20 us on average in his
tests. Depending on the particular device, the work in the bottom half
can be very quick (little more than waking up a thread in the case of
usb-storage) or quite slow (perhaps on the order of a ms or more for
the UVC video driver on some ARM platforms).

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-14 01:27:47 UTC
Permalink
Post by Alan Stern
Post by Steven Rostedt
Post by Alan Stern
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
That's because it was written like a top half and not a full blown
+#ifndef USB_HCD_THREADED_IRQ
if (sched) {
if (async)
tasklet_schedule(&bh->bh);
else
tasklet_hi_schedule(&bh->bh);
}
+#else
+ if (sched)
+ schedule_work(&hcd->periodic_bh->work);
+#endif
What is this? The work isn't done by an interrupt thread, but by work queues!
You don't understand the patch. Most of the time, sched will be 0
here and hence the work queue won't be involved.
Yes, part of the work is done by a work queue rather than the interrupt
thread. But it is an unimportant part, the part that involves
transfers to root hubs or transfers that were cancelled. These things
can complete without any interrupt occurring, so they can't be handled
by the interrupt thread. However, they are the abnormal case; the
transfers we care about are not to root hubs and they do complete
normally.
Post by Steven Rostedt
The point of the interrupt thread is that you do *all* the work that
needs to be done when an interrupt comes in. You don't need to delay the
work.
You've got it backward. The patch doesn't leave part of the work
undone when an interrupt occurs. Rather it's the other way around --
sometimes work needs to be done when there isn't any interrupt. This
could happen in a timer callback, or it could happen as a direct result
of a function call.
Since there doesn't seem to be any way to invoke the interrupt thread
in the absence of an interrupt, Ming pushed the job off to a work
queue.
Post by Steven Rostedt
If you just treat a threaded interrupt like a real interrupt and push
off work to something else, then yes, it will degrade performance.
If you go the threaded interrupt route, you need to rethink the
paradigm. There's no reason that the interrupt handler needs to be fast
like it needs to be in true interrupt context. The handler can now use
mutexes, and other full features that currently only threads benefit
from. It should improve locking issues, and can serialize things if
needed.
All this patch did was to switch the main irq to a thread and make a
bottom half into a work queue.
In case it's not clear, the code you quoted above is part of the
interrupt handler, not part of the thread.
Post by Steven Rostedt
if (sched)
usb_giveback_urb_bh(bh);
?
Because usb_giveback_urb_bh() is supposed to run in the context of the
tasklet or interrupt thread or work queue, not in the context of the
interrupt handler.
Exactly, the code should have been the below shape:

#ifndef USB_HCD_THREADED_IRQ
......
#else
if (!in_irq()) {
bh = hcd->periodic_bh;
sched = 1;
}
#endif

Then it will be quite clear.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Greg Kroah-Hartman
2013-06-14 00:35:03 UTC
Permalink
Post by Alan Stern
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Ming Lei
- using interrupt threaded handler(default)
33.440 MB/sec
- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec
- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s
So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.
I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)
So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Yes, but for "old-style" usb-storage, is this really a big deal? We are
still easily hitting the "line-speed" of USB for usb-storage with simple
machines, the bottlenecks that I'm seeing are in the devices themselves,
and then in the USB wire speed.
What about with USB-3 storage devices? Many of them still use the
bulk-only transport instead of UAS. They may push the limits up.
Are they really? Have we seen that happen yet? With the number's I've
seen published, we are easily serving up enough data to keep the pipe
full, but that all depends on your CPU / host controller.
Post by Alan Stern
Post by Greg Kroah-Hartman
Once hardware comes out that uses USB streams, and we get device support
for the UAS protocol, then we might have a need to change things, but at
this point in time, for the "old" driver, I think we are fine.
Unless someone has a workload / benchmark that shows otherwise?
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers. Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much. (Interrupt latency is hard to measure, however.)
Yes, I know that people keep wanting to worry about latency issues, and
the best answer for them has always been, "don't use USB." :)

You suffer throughput issues with predicitable latency dependancies, so
we need to be careful we don't slow down the 99% of the systems out
there that do not care about this at all.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-14 01:53:52 UTC
Permalink
On Fri, Jun 14, 2013 at 8:35 AM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Ming Lei
- using interrupt threaded handler(default)
33.440 MB/sec
- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec
- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s
So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.
I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)
So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Yes, but for "old-style" usb-storage, is this really a big deal? We are
still easily hitting the "line-speed" of USB for usb-storage with simple
machines, the bottlenecks that I'm seeing are in the devices themselves,
and then in the USB wire speed.
What about with USB-3 storage devices? Many of them still use the
bulk-only transport instead of UAS. They may push the limits up.
Are they really? Have we seen that happen yet? With the number's I've
Yes, the device I am testing is bulk-only, no uas support , and it is very
popular in market.
Post by Greg Kroah-Hartman
seen published, we are easily serving up enough data to keep the pipe
full, but that all depends on your CPU / host controller.
Post by Alan Stern
Post by Greg Kroah-Hartman
Once hardware comes out that uses USB streams, and we get device support
for the UAS protocol, then we might have a need to change things, but at
this point in time, for the "old" driver, I think we are fine.
Unless someone has a workload / benchmark that shows otherwise?
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers. Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much. (Interrupt latency is hard to measure, however.)
Yes, I know that people keep wanting to worry about latency issues, and
the best answer for them has always been, "don't use USB." :)
I think we can do it better, why don't do it? :-)
Post by Greg Kroah-Hartman
You suffer throughput issues with predicitable latency dependancies, so
This patchset don't cause throughout degradation but decrease latency much,
also has other advantages.
Post by Greg Kroah-Hartman
we need to be careful we don't slow down the 99% of the systems out
there that do not care about this at all.
Considered great amount of ARM devices in market, I think we need to
consider the problem on these devices, :-)

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Greg Kroah-Hartman
2013-06-14 06:05:19 UTC
Permalink
Post by Ming Lei
On Fri, Jun 14, 2013 at 8:35 AM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Ming Lei
- using interrupt threaded handler(default)
33.440 MB/sec
- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec
- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s
So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.
I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)
So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Yes, but for "old-style" usb-storage, is this really a big deal? We are
still easily hitting the "line-speed" of USB for usb-storage with simple
machines, the bottlenecks that I'm seeing are in the devices themselves,
and then in the USB wire speed.
What about with USB-3 storage devices? Many of them still use the
bulk-only transport instead of UAS. They may push the limits up.
Are they really? Have we seen that happen yet? With the number's I've
Yes, the device I am testing is bulk-only, no uas support , and it is very
popular in market.
Post by Greg Kroah-Hartman
seen published, we are easily serving up enough data to keep the pipe
full, but that all depends on your CPU / host controller.
Post by Alan Stern
Post by Greg Kroah-Hartman
Once hardware comes out that uses USB streams, and we get device support
for the UAS protocol, then we might have a need to change things, but at
this point in time, for the "old" driver, I think we are fine.
Unless someone has a workload / benchmark that shows otherwise?
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers. Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much. (Interrupt latency is hard to measure, however.)
Yes, I know that people keep wanting to worry about latency issues, and
the best answer for them has always been, "don't use USB." :)
I think we can do it better, why don't do it? :-)
Because of other issues, that have been brought up here already.

But if you can do it without affecting others, that's fine.
Post by Ming Lei
Post by Greg Kroah-Hartman
You suffer throughput issues with predicitable latency dependancies, so
This patchset don't cause throughout degradation but decrease latency much,
also has other advantages.
Like what?
Post by Ming Lei
Post by Greg Kroah-Hartman
we need to be careful we don't slow down the 99% of the systems out
there that do not care about this at all.
Considered great amount of ARM devices in market, I think we need to
consider the problem on these devices, :-)
Is it a problem on those devices? I think they have host controller
issues that are way bigger problems than this device driver, right?

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-14 10:05:34 UTC
Permalink
On Fri, Jun 14, 2013 at 2:05 PM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Post by Ming Lei
On Fri, Jun 14, 2013 at 8:35 AM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Ming Lei
- using interrupt threaded handler(default)
33.440 MB/sec
- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec
- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s
So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.
I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)
So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Yes, but for "old-style" usb-storage, is this really a big deal? We are
still easily hitting the "line-speed" of USB for usb-storage with simple
machines, the bottlenecks that I'm seeing are in the devices themselves,
and then in the USB wire speed.
What about with USB-3 storage devices? Many of them still use the
bulk-only transport instead of UAS. They may push the limits up.
Are they really? Have we seen that happen yet? With the number's I've
Yes, the device I am testing is bulk-only, no uas support , and it is very
popular in market.
Post by Greg Kroah-Hartman
seen published, we are easily serving up enough data to keep the pipe
full, but that all depends on your CPU / host controller.
Post by Alan Stern
Post by Greg Kroah-Hartman
Once hardware comes out that uses USB streams, and we get device support
for the UAS protocol, then we might have a need to change things, but at
this point in time, for the "old" driver, I think we are fine.
Unless someone has a workload / benchmark that shows otherwise?
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers. Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much. (Interrupt latency is hard to measure, however.)
Yes, I know that people keep wanting to worry about latency issues, and
the best answer for them has always been, "don't use USB." :)
I think we can do it better, why don't do it? :-)
Because of other issues, that have been brought up here already.
But if you can do it without affecting others, that's fine.
It won't affect others. As discussed in the thread, we can do the change
with following steps:

1), move URB giveback from hard irq handler to tasklet, but call complete()
with local IRQs disabled;

2), cleanup current drivers which call 'spin_lock' in compele() by
replacing spin_lock() with spin_lock_irqrestore(), and change the API of
complete() callback by claiming it called with local IRQs enabled, but bh is
disabled.

3), enable local IRQs before calling complete().

There is no good reason for complete() to run with interrupt disabled, as
discussed, so the API change and driver cleanup still makes sense.

In fact, only the below two cases requires the 2nd step's change:

- drivers->complete() acquire a subsystem wide lock which may be called
in another hard irq context, and the subsystem wide lock is acquired by
spin_lock() in complete()

- drivers->complete() hold a private lock with spin_lock() but may export
APIs to let other drivers acquire the private lock in its interrupt handler.

Looks both two cases aren't very common.
Post by Greg Kroah-Hartman
Post by Ming Lei
Post by Greg Kroah-Hartman
You suffer throughput issues with predicitable latency dependancies, so
This patchset don't cause throughout degradation but decrease latency much,
also has other advantages.
Like what?
For example, the HCD private lock needn't to be released inside irq handler,
and HCD interrupt handling may be simplified a bit:

1) Inside interrupt handler no submitting and unlinking requests from
drivers can happen any more, so interrupt handler should have became
simple.

2) Also the race between irq handler and related timer handler needn't
to be considered because the private lock can't be released in irq handler.
Post by Greg Kroah-Hartman
Post by Ming Lei
Post by Greg Kroah-Hartman
we need to be careful we don't slow down the 99% of the systems out
there that do not care about this at all.
Considered great amount of ARM devices in market, I think we need to
consider the problem on these devices, :-)
Is it a problem on those devices? I think they have host controller
issues that are way bigger problems than this device driver, right?
Disabling IRQs for long time(e.g, several milliseconds) is always bad
since the CPU can't respond any events during the period, so USB
application may affect whole system response.

Generally the problem is arch related, not only about the host controller
(which can't have very big fifo to hold lots of packets)

- DMA mapping/unmapping(only cache clean/invalidate is involved) is
time-consuming, especially when the driver's transfer buffer is very big
(eg. mass storage, some usbnet devices, ...)

- Accessing(often reading) DMA coherent buffer is very slow(someone
complained[1] in linux-usb list that memcpy() in uvc complete() may take more
than 3ms, then later packets won't be moved to buffer from hw fifo in irq
handler until the memcpy is completed, and cause packet loss)

[1], http://marc.info/?l=linux-usb&m=136438101304211&w=2

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-14 01:37:15 UTC
Permalink
Post by Alan Stern
Post by Greg Kroah-Hartman
Post by Alan Stern
Post by Ming Lei
- using interrupt threaded handler(default)
33.440 MB/sec
- using tasklet(#undef USB_HCD_THREADED_IRQ)
34.29 MB/sec
- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
34.260 MB/s
So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.
I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)
So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Yes, but for "old-style" usb-storage, is this really a big deal? We are
still easily hitting the "line-speed" of USB for usb-storage with simple
machines, the bottlenecks that I'm seeing are in the devices themselves,
and then in the USB wire speed.
What about with USB-3 storage devices? Many of them still use the
bulk-only transport instead of UAS. They may push the limits up.
Exactly, my test device(sandisk extreme USB 3.0 16G, 0781:5580) is very
popular, which is faster than most USB 3.0 pendrive in market, but the device
is bulk-only, and no UAS support, so I guess most of the USB 3.0 pendrive in
market still may not support UAS.
Post by Alan Stern
Post by Greg Kroah-Hartman
Once hardware comes out that uses USB streams, and we get device support
for the UAS protocol, then we might have a need to change things, but at
this point in time, for the "old" driver, I think we are fine.
Unless someone has a workload / benchmark that shows otherwise?
The test results above show a 2.4% degradation for threaded interrupts
as compared to tasklets. That's in addition to the bottlenecks caused
by the device; no doubt it would be worse for a faster device. This
result calls into question the benefits of threaded interrupts.
If I enable HCD_BH in xhci driver and enable requst_threaded_irq in xhci driver,
the degradation becomes >10%, see below test on the same device connected to
xhci-hcd:

[***@board]$ps -ax | grep xhci
Warning: bad ps syntax, perhaps a bogus '-'? See http://procps.sf.net/faq.html
4896 pts/1 S+ 0:00 grep --color=auto xhci
[***@board]$sudo ./ds-msg /dev/sdb 400M 1 4
No. 0, time 121 MB
No. 1, time 122 MB
No. 2, time 124 MB
No. 3, time 122 MB
count=4, total=489 ms, average=122.250 MB
[***@board]$
[***@board]$ps -ax | grep xhci
Warning: bad ps syntax, perhaps a bogus '-'? See http://procps.sf.net/faq.html
6037 ? S 0:00 [irq/42-xhci_hcd]
6038 ? S 0:00 [irq/43-xhci_hcd]
6039 ? S 0:00 [irq/44-xhci_hcd]
6040 ? S 0:00 [irq/45-xhci_hcd]
6041 ? S 0:00 [irq/46-xhci_hcd]
6304 pts/1 S+ 0:00 grep --color=auto xhci
[***@board]$
[***@board]$
[***@board]$sudo ./ds-msg /dev/sdb 400M 1 4
No. 0, time 107 MB
No. 1, time 108 MB
No. 2, time 108 MB
No. 3, time 109 MB
count=4, total=432 ms, average=108.000 MB

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-14 03:24:58 UTC
Permalink
Post by Alan Stern
The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers. Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much. (Interrupt latency is hard to measure, however.)
With the two trace points of irq_handler_entry and irq_handler_exit, the
interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
One simple script can figure out the average/maximum latency for one irq
handler, like I did in 4/4.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-14 14:56:09 UTC
Permalink
Post by Ming Lei
Post by Alan Stern
The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers. Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much. (Interrupt latency is hard to measure, however.)
With the two trace points of irq_handler_entry and irq_handler_exit, the
interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
One simple script can figure out the average/maximum latency for one irq
handler, like I did in 4/4.
But that doesn't measure the time between when the IRQ request is
issued and when irq_handler_entry runs.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-14 15:15:49 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers. Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much. (Interrupt latency is hard to measure, however.)
With the two trace points of irq_handler_entry and irq_handler_exit, the
interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
One simple script can figure out the average/maximum latency for one irq
handler, like I did in 4/4.
But that doesn't measure the time between when the IRQ request is
issued and when irq_handler_entry runs.
That might be hard to measure, also I am wondering if the time can be
measured only by software, but these patches only focus on the time
between HCD irq entry and irq exit.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-14 20:23:20 UTC
Permalink
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
With the two trace points of irq_handler_entry and irq_handler_exit, the
interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
One simple script can figure out the average/maximum latency for one irq
handler, like I did in 4/4.
But that doesn't measure the time between when the IRQ request is
issued and when irq_handler_entry runs.
That might be hard to measure, also I am wondering if the time can be
measured only by software, but these patches only focus on the time
between HCD irq entry and irq exit.
Not entirely. On a UP system, leaving interrupts disabled for a long
time (which is what we do now) increases the delay between when the IRQ
is raised and when it is serviced. On an SMP system, a long-running
interrupt handler will delay servicing a different device that shares
the same IRQ line.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Thomas Gleixner
2013-06-14 21:09:34 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
With the two trace points of irq_handler_entry and irq_handler_exit, the
interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
One simple script can figure out the average/maximum latency for one irq
handler, like I did in 4/4.
But that doesn't measure the time between when the IRQ request is
issued and when irq_handler_entry runs.
That might be hard to measure, also I am wondering if the time can be
measured only by software, but these patches only focus on the time
between HCD irq entry and irq exit.
Not entirely. On a UP system, leaving interrupts disabled for a long
time (which is what we do now) increases the delay between when the IRQ
is raised and when it is serviced. On an SMP system, a long-running
interrupt handler will delay servicing a different device that shares
the same IRQ line.
And on UP it delays ALL other interrupts. I've seen 500us+ caused by
the USB interrupt handlers...

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-15 01:49:20 UTC
Permalink
Post by Alan Stern
Not entirely. On a UP system, leaving interrupts disabled for a long
time (which is what we do now) increases the delay between when the IRQ
is raised and when it is serviced. On an SMP system, a long-running
Yes, I mean the HCD IRQ handling time is already too long, and it
isn't affected by the time between raising and servicing the IRQ.
Post by Alan Stern
interrupt handler will delay servicing a different device that shares
the same IRQ line.
Not always so.

Currently, ARM can only set one irq line to be served by one of CPU
at the same time for power saving, which still results in above situation.
In fact, without irq-balance, on ARM, all IRQs are handled by CPU0 only.
Post by Alan Stern
And on UP it delays ALL other interrupts. I've seen 500us+ caused by
the USB interrupt handlers...
On SMP the above case may be worse than UP, when the same completion
things(from hw view) happen on one of CPU, the releasing & reacquiring HCD
private lock in interrupt handler may cause longer time.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-14 02:03:45 UTC
Permalink
Post by Alan Stern
Maybe we should simply copy what the networking people do. They are
very concerned about performance and latency; whatever technique they
use should be good for USB too.
Most of net devices don't use interrupt threaded handler, looks we need to
copy that first, :-)

$git grep -n request_threaded_irq drivers/net/ | wc -l
9

$git grep -n request_threaded_irq drivers/net/wireless | wc -l
5

Also 5 of 9 using interrupt threaded handler are wireless network devices, which
are a bit slow, and only one ethernet driver uses it.

So I plan to use tasklet first, then we can switch to interrupt threaded handler
in the future if mass storage devices are OK with it.

If no one objects to use tasklet, I will post out the v1 patch for review later.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-12 14:35:44 UTC
Permalink
Post by Ming Lei
If not, I am wondering why one USB driver need register another hard
interrupt handler? Could you share such examples? I understand the
case only exists with timer handler as discussed with Oliver, don't I?
Here's another possibility. A USB driver has a private lock, and it
acquires this lock in its completion handler. But it also acquires
this lock in other routines, and those other routines may be called by
other drivers or subsystems in interrupt context.
Post by Ming Lei
Post by Alan Stern
The drivers _do_ need to be changed, as explained above. And that
explanation was only one failure scenario. There may be others.
OK, I'd like to know what the others are? :-)
Me too. :-)
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
And finally, USB HCDs themselves will benefit from the change, won't they?
How will they benefit? Merely from not releasing their private locks?
The interrupt of one HCD will be responded lately since another HCD's
interrupt handler takes very long time. With the change, the problem can
be avoided.
For reasonably sophisticated host controllers like EHCI, the delay in
responding to an interrupt doesn't matter much. But for simpler host
controllers (for example, those using PIO instead of DMA) it matters
more, so those other HCDs will benefit more from the conversion.
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
Compared with "usb-storage" task's schedule latency, the tasklet
schedule delay should be lower at most of situations since the tasklet
is scheduled inside irq handler.
True. But the two latencies add.
I think it should be one tasklet latency added, but what is the other one?
I meant that the latency in tasklet scheduling has to be added to the
latency in scheduling the usb-storage thread.
Post by Ming Lei
Post by Alan Stern
Or you could have a separate tasklet for each host controller.
Yes, but I will compare tasklet with interrupt threaded handler first.
Switching to a threaded interrupt handler offers more possibilities.
Instead of moving just the givebacks to the thread, we could put almost
all the processing there. (At least, we can for ehci-hcd. Other HCDs
may not be able to do this.)

At this point we are talking about multiple changes:

Moving the givebacks to a tasklet or threaded handler.

Changing the USB completion handlers so that they can be called
with interrupts enabled.

Allowing the tasklet/thread to run with interrupts enabled.

Moving more of the HCD processing into the tasklet/thread.

Maybe other things too. In principle, the first and second items can
be done simultaneously.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-12 15:10:37 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
If not, I am wondering why one USB driver need register another hard
interrupt handler? Could you share such examples? I understand the
case only exists with timer handler as discussed with Oliver, don't I?
Here's another possibility. A USB driver has a private lock, and it
acquires this lock in its completion handler. But it also acquires
this lock in other routines, and those other routines may be called by
other drivers or subsystems in interrupt context.
That is fatal. But again mechanically using _irqsave in complete()
does the job. No other place can be affected.

Regards
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-13 03:41:58 UTC
Permalink
Post by Alan Stern
Post by Ming Lei
If not, I am wondering why one USB driver need register another hard
interrupt handler? Could you share such examples? I understand the
case only exists with timer handler as discussed with Oliver, don't I?
Here's another possibility. A USB driver has a private lock, and it
acquires this lock in its completion handler. But it also acquires
this lock in other routines, and those other routines may be called by
other drivers or subsystems in interrupt context.
OK, so the problem is more complicated than I imaged at start, :-(
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
The drivers _do_ need to be changed, as explained above. And that
explanation was only one failure scenario. There may be others.
OK, I'd like to know what the others are? :-)
Me too. :-)
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
And finally, USB HCDs themselves will benefit from the change, won't they?
How will they benefit? Merely from not releasing their private locks?
The interrupt of one HCD will be responded lately since another HCD's
interrupt handler takes very long time. With the change, the problem can
be avoided.
For reasonably sophisticated host controllers like EHCI, the delay in
responding to an interrupt doesn't matter much. But for simpler host
EHCI might still benefit from the change: suppose uvc video is playing on
one EHCI bus, and mass storage's performance over another EHCI may
decrease because of the IRQ latency introduced.
Post by Alan Stern
controllers (for example, those using PIO instead of DMA) it matters
more, so those other HCDs will benefit more from the conversion.
Indeed, musb and ehci/ohci are coexisted on OMAP3/4/5.
Post by Alan Stern
Post by Ming Lei
Post by Alan Stern
Post by Ming Lei
Compared with "usb-storage" task's schedule latency, the tasklet
schedule delay should be lower at most of situations since the tasklet
is scheduled inside irq handler.
True. But the two latencies add.
I think it should be one tasklet latency added, but what is the other one?
I meant that the latency in tasklet scheduling has to be added to the
latency in scheduling the usb-storage thread.
Post by Ming Lei
Post by Alan Stern
Or you could have a separate tasklet for each host controller.
Yes, but I will compare tasklet with interrupt threaded handler first.
Switching to a threaded interrupt handler offers more possibilities.
Instead of moving just the givebacks to the thread, we could put almost
all the processing there. (At least, we can for ehci-hcd. Other HCDs
may not be able to do this.)
Moving the givebacks to a tasklet or threaded handler.
Changing the USB completion handlers so that they can be called
with interrupts enabled.
Allowing the tasklet/thread to run with interrupts enabled.
Moving more of the HCD processing into the tasklet/thread.
Maybe other things too. In principle, the first and second items can
be done simultaneously.
Very good summery, we can push the 1st change with disabling local IRQ
when calling complete(), so that at least DMA unmapping can be done
with IRQ enabled, and at the same time do the API change, which
should be a bit slow but the mechanical way proposed by Oliver may
be OK.

The 3rd item is easy once the 1st two items are completed.

For the 4th one, it might be a long term thing, since refactoring
HCD interrupt handler is a bit complicated. Also, when the 1st
three items are completed, hard interrupt handler takes less time,
often less than 20us at average about EHCI, so I am wondering
if the 4th is worthy.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Stern
2013-06-13 15:05:25 UTC
Permalink
Post by Ming Lei
Post by Alan Stern
For reasonably sophisticated host controllers like EHCI, the delay in
responding to an interrupt doesn't matter much. But for simpler host
EHCI might still benefit from the change: suppose uvc video is playing on
one EHCI bus, and mass storage's performance over another EHCI may
decrease because of the IRQ latency introduced.
If multiple CPUs are involved, then we could see an improvement in the
case where the two EHCI controllers share the same IRQ line. With
different IRQs, though, everything would run in parallel on different
CPUs, so there is no advantage.
Post by Ming Lei
Post by Alan Stern
Moving the givebacks to a tasklet or threaded handler.
Changing the USB completion handlers so that they can be called
with interrupts enabled.
Allowing the tasklet/thread to run with interrupts enabled.
Moving more of the HCD processing into the tasklet/thread.
Maybe other things too. In principle, the first and second items can
be done simultaneously.
Very good summery, we can push the 1st change with disabling local IRQ
when calling complete(), so that at least DMA unmapping can be done
with IRQ enabled, and at the same time do the API change, which
should be a bit slow but the mechanical way proposed by Oliver may
be OK.
The 3rd item is easy once the 1st two items are completed.
For the 4th one, it might be a long term thing, since refactoring
HCD interrupt handler is a bit complicated. Also, when the 1st
three items are completed, hard interrupt handler takes less time,
often less than 20us at average about EHCI, so I am wondering
if the 4th is worthy.
Yes, I don't know about that. If the time spent in the hardirq
processing is reasonably short then moving it to the threaded handler
won't help very much.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-12 09:11:06 UTC
Permalink
Post by Alan Stern
In order to prevent this from happening, you would have to change the
spin_lock() call in the completion handler to spin_lock_irqsave().
Furthermore, you will have to audit every USB driver to make sure that
all the completion handlers get fixed.
Yes. However, it can be done mechanically. And we know only
the handlers for complete need to be fixed.

Regards
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-12 10:11:34 UTC
Permalink
Post by Oliver Neukum
Post by Alan Stern
In order to prevent this from happening, you would have to change the
spin_lock() call in the completion handler to spin_lock_irqsave().
Furthermore, you will have to audit every USB driver to make sure that
all the completion handlers get fixed.
Yes. However, it can be done mechanically. And we know only
the handlers for complete need to be fixed.
I am wondering if the change is needed since timer function is still
run in softirq context instead of hard irq.

Looks Alan concerned that one USB interface driver may have another
hard interrupt handler involved. Is there such kind of USB driver/device
in tree?

Thanks,
--
Ming
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oliver Neukum
2013-06-12 10:19:28 UTC
Permalink
Post by Ming Lei
Post by Oliver Neukum
Post by Alan Stern
In order to prevent this from happening, you would have to change the
spin_lock() call in the completion handler to spin_lock_irqsave().
Furthermore, you will have to audit every USB driver to make sure that
all the completion handlers get fixed.
Yes. However, it can be done mechanically. And we know only
the handlers for complete need to be fixed.
I am wondering if the change is needed since timer function is still
run in softirq context instead of hard irq.
Looks Alan concerned that one USB interface driver may have another
hard interrupt handler involved. Is there such kind of USB driver/device
in tree?
No. Suppose a USB network driver.
The complete() handler is written on the assumption that interrupts are off.
So it takes a spinlock from the network subsystem. It does so with spin_lock()

Other network drivers also take the lock. And they may take it from an IRQ handler.
If such an IRQ interrupts the tasklet complete() is running in, the CPU will deadlock.

The danger is not interrupt handlers in the same driver but IRQ handlers of _other_
drivers (PCI, ...) a lock is shared with.

You need to go through all USB drivers and change every spin_lock() that goes
for a lock that is exported to a spin_lock_irqsave()

Regards
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ming Lei
2013-06-12 11:33:03 UTC
Permalink
Post by Oliver Neukum
Post by Ming Lei
Post by Oliver Neukum
Post by Alan Stern
In order to prevent this from happening, you would have to change the
spin_lock() call in the completion handler to spin_lock_irqsave().
Furthermore, you will have to audit every USB driver to make sure that
all the completion handlers get fixed.
Yes. However, it can be done mechanically. And we know only
the handlers for complete need to be fixed.
I am wondering if the change is needed since timer function is still
run in softirq context instead of hard irq.
Looks Alan concerned that one USB interface driver may have another
hard interrupt handler involved. Is there such kind of USB driver/device
in tree?
No. Suppose a USB network driver.
The complete() handler is written on the assumption that interrupts are off.
So it takes a spinlock from the network subsystem. It does so with spin_lock()
Looks I misses the case, but IMO, it might not be very common, generally
subsystem provides API for drivers to do something(handle rx/report tx) inside
complete(), and seldom exports subsystem-wide lock directly to drivers.
API has no context info, and should have called spin_lock_irqrestore().
Post by Oliver Neukum
Other network drivers also take the lock. And they may take it from an IRQ handler.
If such an IRQ interrupts the tasklet complete() is running in, the CPU will deadlock.
Looks no usbnet drivers hold subsystem(network) locks in its complete().
Both the locks held are per device/per skb list. In usbnet_bh(), there
is one, eg.
netif_rx(), which is a API and disables local IRQ.
Post by Oliver Neukum
The danger is not interrupt handlers in the same driver but IRQ handlers of _other_
drivers (PCI, ...) a lock is shared with.
Right, so generally drivers which spin_lock(subsystem lock) in complete()
might be affected.
Post by Oliver Neukum
You need to go through all USB drivers and change every spin_lock() that goes
for a lock that is exported to a spin_lock_irqsave()
Yes, that is the safest way.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Loading...