Discussion:
Network latency regressions from 2.6.22 to 2.6.29
Christoph Lameter
2009-04-16 16:10:01 UTC
Permalink
The following are results of latency measurements using udpping
(available from http://gentwo.org/ll). It shows that significant latencies
were added since 2.6.27. I surely wish we could get back to times below 90
microseconds.

The tests were done over 1G ethernet using
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754
Gigabit Ethernet PCI Express (rev 02)

Results:

2.6.22 2.6.23 2.6.24 2.6.25 2.6.26 2.6.27 2.6.28 2.6.29
40 Bytes 89.50 90.75 89.61 91.51 91.89 99.17 99.80 99.34
400 Bytes 98.58 101.44 97.85 99.61 100.36 117.96 118.10 126.79
1400 Bytes 152.76 153.75 153.85 156.22 156.66 163.92 165.54 166.04

Compared to 2.6.22 2.6.23 2.6.24 2.6.25 2.6.26 2.6.27 2.6.28 2.6.29
40 Bytes -1.4% -0.1% -2.2% -2.6% -9.8% -10.3% -9.9%
400 Bytes -2.8% 0.7% -1.0% -1.8% -16.4% -16.5% -22.2%
1400 Bytes -0.6% -0.7% -2.2% -2.5% -6.8% -7.7% -8.0%
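The "Compared to" rows follow directly from the raw latencies above; a quick sketch of the derivation (plain Python, values copied from the tables, not part of udpping):

```python
# Regression of each kernel relative to 2.6.22, as in the "Compared to" table:
# (baseline - new) / new, negative when the newer kernel is slower.
base = {40: 89.50, 400: 98.58, 1400: 152.76}       # 2.6.22 latencies (us)
new_2629 = {40: 99.34, 400: 126.79, 1400: 166.04}  # 2.6.29 latencies (us)

for size in (40, 400, 1400):
    pct = (base[size] - new_2629[size]) / new_2629[size] * 100
    print(f"{size} bytes: {pct:.1f}%")  # -9.9%, -22.2%, -8.0%
```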

I presented these numbers with some nice graphs at the Linux Collab Summit
last week.

See
http://www.kernel.org/pub/linux/kernel/people/christoph/collab-spring-2009/Collab-summit-2009-sf.pdf

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Rick Jones
2009-04-16 17:21:10 UTC
Permalink
Post by Christoph Lameter
The following are results of latency measurements using udpping
(available from http://gentwo.org/ll). It shows that significant latencies
were added since 2.6.27. I surely wish we could get back to times below 90
microseconds.
The tests were done over 1G ethernet using
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754
Gigabit Ethernet PCI Express (rev 02)
2.6.22 2.6.23 2.6.24 2.6.25 2.6.26 2.6.27 2.6.28 2.6.29
40 Bytes 89.50 90.75 89.61 91.51 91.89 99.17 99.80 99.34
400 Bytes 98.58 101.44 97.85 99.61 100.36 117.96 118.10 126.79
1400 Bytes 152.76 153.75 153.85 156.22 156.66 163.92 165.54 166.04
Compared to 2.6.22 2.6.23 2.6.24 2.6.25 2.6.26 2.6.27 2.6.28 2.6.29
40 Bytes -1.4% -0.1% -2.2% -2.6% -9.8% -10.3% -9.9%
400 Bytes -2.8% 0.7% -1.0% -1.8% -16.4% -16.5% -22.2%
1400 Bytes -0.6% -0.7% -2.2% -2.5% -6.8% -7.7% -8.0%
I presented these numbers with some nice graphs at the Linux Collab Summit
last week.
See
http://www.kernel.org/pub/linux/kernel/people/christoph/collab-spring-2009/Collab-summit-2009-sf.pdf
Does udpping have a concept of service demand a la netperf? That could help show
how much was code bloat vs say some tweak to interrupt coalescing parameters in
the NIC/driver.

[***@bl870c1 netperf2_trunk]# netperf -T 0 -c -C -t UDP_RR -H bl870c2.west -v 2
UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to bl870c2.west
(10.208.0.210) port 0 AF_INET : histogram : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
Send Recv Size Size Time Rate local remote local remote
bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr

126976 126976 1 1 10.00 7550.46 2.33 2.41 24.721 25.551
126976 126976

Histogram of request/response times.
UNIT_USEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
TEN_USEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 1
HUNDRED_USEC : 0: 75508: 0: 0: 0: 0: 0: 0: 0: 0
UNIT_MSEC : 0: 2: 0: 0: 0: 0: 0: 0: 0: 0
TEN_MSEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
HUNDRED_MSEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
UNIT_SEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
TEN_SEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
100_SECS: 0
HIST_TOTAL: 75511

I guess one of these days I have to tweak netperf to be able to get latencies
when doing bursts on the connection...

[***@bl870c1 netperf2_trunk]# ethtool -C eth0 rx-frames 1
[do the same on the other end]
[***@bl870c1 netperf2_trunk]# netperf -T 0 -c -C -t UDP_RR -H bl870c2.west -v 2
UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to bl870c2.west
(10.208.0.210) port 0 AF_INET : histogram : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
Send Recv Size Size Time Rate local remote local remote
bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr

126976 126976 1 1 10.00 11126.15 2.60 3.43 18.711 24.633
126976 126976

Histogram of request/response times.
UNIT_USEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
TEN_USEC : 0: 0: 0: 0: 0: 0: 0: 16: 71799: 38583
HUNDRED_USEC : 0: 856: 8: 0: 0: 0: 0: 0: 0: 0
UNIT_MSEC : 0: 1: 0: 0: 0: 0: 1: 0: 0: 0
TEN_MSEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
HUNDRED_MSEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
UNIT_SEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
TEN_SEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
100_SECS: 0
HIST_TOTAL: 111264
Christoph Lameter
2009-04-16 19:06:59 UTC
Permalink
Post by Rick Jones
Does udpping have a concept of service demand a la netperf? That could help
show how much was code bloat vs say some tweak to interrupt coalescing
parameters in the NIC/driver.
No. What does "service demand" mean? The ping pong tests are very simple
back and forths without any streaming or overlay.
Eric Dumazet
2009-04-16 19:29:02 UTC
Permalink
Post by Rick Jones
Does udpping have a concept of service demand a la netperf? That could help
show how much was code bloat vs say some tweak to interrupt coalescing
parameters in the NIC/driver.
Post by Christoph Lameter
No. What does "service demand" mean? The ping pong tests are very simple
back and forths without any streaming or overlay.
--
It would be good to measure a ping flood, because this one doesn't
need a wakeup of a user process on the remote side.


$ ping -f 192.168.20.112 -c 10000
PING 192.168.20.112 (192.168.20.112) 56(84) bytes of data.

--- 192.168.20.112 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 3337ms
rtt min/avg/max/mdev = 0.072/0.184/20.011/1.389 ms, pipe 2, ipg/ewma 0.333/0.078 ms

and check if various kernel versions already give you different values (for avg)
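Extracting the avg from ping's summary line can be scripted; a minimal sketch (the regex and variable names are just illustrative, not part of any tool mentioned in this thread):

```python
import re

# Parse the rtt summary line that ping prints, e.g. the one shown above.
summary = ("rtt min/avg/max/mdev = 0.072/0.184/20.011/1.389 ms, "
           "pipe 2, ipg/ewma 0.333/0.078 ms")

m = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", summary)
rtt_min, rtt_avg, rtt_max, rtt_mdev = map(float, m.groups())
print(f"avg RTT: {rtt_avg} ms ({rtt_avg * 1000:.0f} us)")
```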

When you perform your tests, do you upgrade the kernels of both machines?

Christoph Lameter
2009-04-16 19:33:06 UTC
Permalink
Post by Eric Dumazet
When you perform your test, you upgrade kernels of both machines ?
Yes. Both machines always run the same kernel version.

David Miller
2009-04-16 22:47:01 UTC
Permalink
From: Christoph Lameter <***@linux.com>
Date: Thu, 16 Apr 2009 15:33:06 -0400 (EDT)
Post by Christoph Lameter
Post by Eric Dumazet
When you perform your test, you upgrade kernels of both machines ?
Yes. Both machines always run the same kernel version.
It's good to change multiple variables at a time between
tests...

More seriously, it's going to be hard to isolate anything
unless you concern yourself with changes to one thing
on one system at a time.
Christoph Lameter
2009-04-17 13:46:11 UTC
Permalink
Post by David Miller
More seriously, it's going to be hard to isolate anything
unless you concern yourself with changes to one thing
on one system at a time.
Only the kernel version was changed on both systems. Changing only a
single system would have meant considering complex interactions between
the network layers of varying versions which would make
the problem even more difficult to track down.

I changed one thing: The kernel version.
Ilpo Järvinen
2009-04-17 21:43:09 UTC
Permalink
Post by Christoph Lameter
Post by David Miller
More seriously, it's going to be hard to isolate anything
unless you concern yourself with changes to one thing
on one system at a time.
Only the kernel version was changed on both systems. Changing only a
single system would have meant considering complex interactions between
the network layers of varying versions which would make
the problem even more difficult to track down.
I changed one thing: The kernel version.
No, we're not talking about n^2 results here, but claiming that changing
only a single network endpoint makes things _more complex_ is quite bogus.
Like it or not, those network layers you mention are still separated by a
pretty standardized medium whose encoding and timing is certainly quite
independent of the kernel version :-).

I'd suggest you take the latest and v2.6.22 and run the remaining two
mixed setups (I hope your scripts are not hard-wired to identical
versions... :-)).
--
i.
Rick Jones
2009-04-16 20:05:01 UTC
Permalink
Post by Christoph Lameter
Post by Rick Jones
Does udpping have a concept of service demand a la netperf? That could help
show how much was code bloat vs say some tweak to interrupt coalescing
parameters in the NIC/driver.
No. What does service on demand mean? The ping pong tests are very simple
back and forths without any streaming or overlay.
It is a measure of efficiency - quantity of CPU consumed per unit of work. For
example, from my previous email:

UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to bl870c2.west
(10.208.0.210) port 0 AF_INET : histogram : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
Send Recv Size Size Time Rate local remote local remote
bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr

126976 126976 1 1 10.00 7550.46 2.33 2.41 24.721 25.551
126976 126976

The transaction rate was 7550 (invert for the latency) and service demand was
24.721 (give or take :) microseconds of CPU time consumed per transaction on the
one side and 25.551 on the other (identical systems and kernels). If we make the
handwaving assumption that virtually all the CPU consumption on either side is in
the latency path, and calculate the overall latency we have:

overall: 132.44 microseconds per transaction
CPU time: 50.27
other: 82.17

With "other" being such a large component, it is a tip-off (not a slam dunk, but
a big clue) that there was a sub-standard interrupt avoidance mechanism at work.
Even if we calculate the transmission time on the wire for the request and
response - 1 byte payload, 8 bytes UDP header, 20 bytes IPv4 header, 14 bytes
Ethernet header - 344 bits, or 688 for both request and response (does
full-duplex GbE still enforce the 60 byte minimum? I forget) we have:

wiretime: 0.69

and even if DMA time were twice that, there is still 75+ microseconds
unaccounted. Smells like a timer running in a NIC. And/or some painfully slow
firmware on the NIC.
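Rick's back-of-the-envelope decomposition can be reproduced numerically; a sketch of the same arithmetic (it adopts his handwaving assumption that all measured CPU time sits in the latency path, and ignores preamble/IFG and minimum-frame padding on the wire):

```python
# Decompose per-transaction latency from the netperf UDP_RR output above.
trans_per_sec = 7550.46
sdem_local, sdem_remote = 24.721, 25.551  # service demand: us of CPU per transaction

overall = 1e6 / trans_per_sec             # us per transaction (invert the rate)
cpu_time = sdem_local + sdem_remote
other = overall - cpu_time                # latency not explained by CPU time

# Wire time: 1B payload + 8B UDP + 20B IP + 14B Ethernet = 43 bytes = 344 bits,
# both directions, at 1 Gbit/s.
wire = 2 * 43 * 8 / 1e9 * 1e6             # us

print(f"overall {overall:.2f} us, cpu {cpu_time:.2f} us, "
      f"other {other:.2f} us, wire {wire:.2f} us")
```

The large "other" term, orders of magnitude beyond the ~0.7 us of wire time, is what points at interrupt coalescing timers rather than code cost.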

If the latency were constrained almost entirely by the CPU consumption in the
case above, the transaction rate should have been more like 19000 transactions
per second. And with those two systems, with a different, 10G NIC installed (not
that 10G speed is required for single-byte _RR), I've seen 20,000
transactions/second.

That is with netperf/netserver running on the same CPU as taking the NIC
interrupt(s). When running on a different core from the interrupt(s) the
cache-to-cache traffic dropped the transaction rate by 30% (a 2.6.18esque kernel,
but I've seen similar stuff elsewhere).

So, when looking for latency regressions, there can be a lot of variables.

rick jones
Ben Hutchings
2009-04-16 18:07:59 UTC
Permalink
Post by Christoph Lameter
The following are results of latency measurements using udpping
(available from http://gentwo.org/ll). It shows that significant latencies
were added since 2.6.27. I surely wish we could get back to times below 90
microseconds.
[...]

This "90 microseconds" figure is specific to a particular driver and
hardware. Have you verified that it applies to others? The variation
you've reported is tiny compared to the improvements that can be made or
lost by hardware tuning.

Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

Christoph Lameter
2009-04-16 19:02:36 UTC
Permalink
Post by Ben Hutchings
Post by Christoph Lameter
The following are results of latency measurements using udpping
(available from http://gentwo.org/ll). It shows that significant latencies
were added since 2.6.27. I surely wish we could get back to times below 90
microseconds.
[...]
This "90 microseconds" figure is specific to a particular driver and
hardware. Have you verified that it applies to others? The variation
you've reported is tiny compared to the improvements that can be made or
lost by hardware tuning.
The RX delay can influence this of course and so can the driver. But the
measurements are with the same RX delay and the same driver. Variations
are up to 20%, which is not tiny, and it's only due to different kernel
versions.

Ben Hutchings
2009-04-16 21:19:18 UTC
Permalink
Post by Christoph Lameter
Post by Ben Hutchings
Post by Christoph Lameter
The following are results of latency measurements using udpping
(available from http://gentwo.org/ll). It shows that significant latencies
were added since 2.6.27. I surely wish we could get back to times below 90
microseconds.
[...]
This "90 microseconds" figure is specific to a particular driver and
hardware. Have you verified that it applies to others? The variation
you've reported is tiny compared to the improvements that can be made or
lost by hardware tuning.
The RX delay can influence this of course and so can the driver. But the
measurements are with the same RX delay and the same driver. Variations
are up to 20% which is not tiny and its only due to different kernel
versions.
I just ran netperf UDP_RR against sfc (the out-of-tree version, so that
driver changes should not be a factor) on "my" test rig (a couple of
servers from 2005; don't quote these figures). The transaction rates
were:

2.6.22: 38584.58
2.6.27: 35312.01
2.6.29.1: 38006.50

So for this hardware and driver, 2.6.27 has slightly _higher_ latency.
This is why I'm asking whether you tested with multiple drivers, so you
are not measuring changes in tg3 (or whichever driver it was).

Note also that 1G Ethernet is hardly representative of HPC hardware.

Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

David Miller
2009-04-16 22:47:52 UTC
Permalink
From: Christoph Lameter <***@linux.com>
Date: Thu, 16 Apr 2009 15:02:36 -0400 (EDT)
Post by Christoph Lameter
Post by Ben Hutchings
Post by Christoph Lameter
The following are results of latency measurements using udpping
(available from http://gentwo.org/ll). It shows that significant latencies
were added since 2.6.27. I surely wish we could get back to times below 90
microseconds.
[...]
This "90 microseconds" figure is specific to a particular driver and
hardware. Have you verified that it applies to others? The variation
you've reported is tiny compared to the improvements that can be made or
lost by hardware tuning.
The RX delay can influence this of course and so can the driver. But the
measurements are with the same RX delay and the same driver. Variations
are up to 20% which is not tiny and its only due to different kernel
versions.
The driver is changing, and that can also be enough.
Eric Dumazet
2009-04-16 19:43:53 UTC
Permalink
Post by Christoph Lameter
The following are results of latency measurements using udpping
(available from http://gentwo.org/ll). It shows that significant latencies
were added since 2.6.27. I surely wish we could get back to times below 90
microseconds.

The tests were done over 1G ethernet using
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754
Gigabit Ethernet PCI Express (rev 02)

2.6.22 2.6.23 2.6.24 2.6.25 2.6.26 2.6.27 2.6.28 2.6.29
40 Bytes 89.50 90.75 89.61 91.51 91.89 99.17 99.80 99.34
400 Bytes 98.58 101.44 97.85 99.61 100.36 117.96 118.10 126.79
1400 Bytes 152.76 153.75 153.85 156.22 156.66 163.92 165.54 166.04

Compared to 2.6.22 2.6.23 2.6.24 2.6.25 2.6.26 2.6.27 2.6.28 2.6.29
40 Bytes -1.4% -0.1% -2.2% -2.6% -9.8% -10.3% -9.9%
400 Bytes -2.8% 0.7% -1.0% -1.8% -16.4% -16.5% -22.2%
1400 Bytes -0.6% -0.7% -2.2% -2.5% -6.8% -7.7% -8.0%

I presented these numbers with some nice graphs at the Linux Collab Summit
last week.

See
http://www.kernel.org/pub/linux/kernel/people/christoph/collab-spring-2009/Collab-summit-2009-sf.pdf
Is it a bnx2 or tg3 driver?

The copybreak feature could explain a regression for small packets, since
an extra copy is done.

It would be sane to always set, on both machines:
ethtool -C eth0 rx-usecs 0 rx-frames 1
prior to your tests...


Eric Dumazet
2009-04-16 19:50:54 UTC
Permalink
Post by Eric Dumazet
Post by Christoph Lameter
The following are results of latency measurements using udpping
(available from http://gentwo.org/ll). It shows that significant latencies
were added since 2.6.27. I surely wish we could get back to times below 90
microseconds.
The tests were done over 1G ethernet using
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754
Gigabit Ethernet PCI Express (rev 02)
2.6.22 2.6.23 2.6.24 2.6.25 2.6.26 2.6.27 2.6.28 2.6.29
40 Bytes 89.50 90.75 89.61 91.51 91.89 99.17 99.80 99.34
400 Bytes 98.58 101.44 97.85 99.61 100.36 117.96 118.10 126.79
1400 Bytes 152.76 153.75 153.85 156.22 156.66 163.92 165.54 166.04
Compared to 2.6.22 2.6.23 2.6.24 2.6.25 2.6.26 2.6.27 2.6.28 2.6.29
40 Bytes -1.4% -0.1% -2.2% -2.6% -9.8% -10.3% -9.9%
400 Bytes -2.8% 0.7% -1.0% -1.8% -16.4% -16.5% -22.2%
1400 Bytes -0.6% -0.7% -2.2% -2.5% -6.8% -7.7% -8.0%
I presented these numbers with some nice graphs at the Linux Collab Summit
last week.
See
http://www.kernel.org/pub/linux/kernel/people/christoph/collab-spring-2009/Collab-summit-2009-sf.pdf
Post by Eric Dumazet
Is it a bnx2 or tg3 driver?

copybreak feature could explain a regression for small packets, since
an extra copy is done.

ethtool -C eth0 rx-usecs 0 rx-frames 1
prior to your tests...
Also you might check /proc/interrupts, because recent linux versions are
distributing interrupts on all cpus instead of one on previous versions

Old kernel :
# grep eth1 /proc/interrupts
58: 25754498 0 0 0 0 0 0 0 PCI-MSI eth1

New kernel :
# grep eth1 /proc/interrupts
221: 16476622 16478581 16478007 16478793 16478237 16472191 16474083 16476374 PCI-MSI-edge eth1

(and /proc/irq/221/smp_affinity is not taken into account on some NICs)
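A quick way to check whether a NIC's interrupts are spread over the CPUs is to parse /proc/interrupts; a small sketch (the helper and its use of a captured line are illustrative only, not part of any tool in this thread):

```python
def irq_spread(proc_interrupts: str, dev: str):
    """Return the per-CPU interrupt counts for the line naming `dev`."""
    for line in proc_interrupts.splitlines():
        if line.rstrip().endswith(dev):
            fields = line.split()
            # fields[0] is "NNN:"; the counts follow until the first
            # non-numeric field (the interrupt chip name).
            counts = []
            for f in fields[1:]:
                if not f.isdigit():
                    break
                counts.append(int(f))
            return counts
    return None

sample = "221:  16476622 16478581 16478007 16478793  PCI-MSI-edge  eth1"
print(irq_spread(sample, "eth1"))  # nonzero counts on every CPU -> no pinning
```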

Christoph Lameter
2009-04-16 20:01:16 UTC
Permalink
Also you might check /proc/interrupts, because recent linux versions are
distributing interrupts on all cpus instead of one on previous version
Argh. I just checked it and the systems used for the test had no irq
pinning. We are rerunning the series.

David Miller
2009-04-16 23:00:02 UTC
Permalink
From: Christoph Lameter <***@linux.com>
Date: Thu, 16 Apr 2009 16:01:16 -0400 (EDT)
Post by Christoph Lameter
Also you might check /proc/interrupts, because recent linux versions are
distributing interrupts on all cpus instead of one on previous version
Argh. I just checked it and the systems used for the test had no irq
pinning. We are rerunning the series.
Good catch Eric :-)

We really need to find a good way to fix that IRQ affinity issue.
Christoph Lameter
2009-04-17 16:42:40 UTC
Permalink
Post by David Miller
We really need to find a good way to fix that IRQ affinity issue.
Here are the results with setting interrupt affinity. We are still losing
5 usecs from .22 to .30-rc2. We will investigate more in depth, time
permitting.

UDP ping pong 40 bytes

Kernel Test 1 Test 2 Test 3 Test 4 Avg
2.6.22 83.1125 83.4 83.03 82.95 83.07
2.6.23 83.33 83.51 83.18 83.42 83.21
2.6.24 82.7775 82.8 82.77 82.77 82.77
2.6.25 85.875 85.94 85.75 86.06 85.75
2.6.26 87.4025 87.53 87.25 87.3 87.53
2.6.27 87.81 87.81 87.77 87.82 87.84
2.6.28 87.4275 87.77 87.24 87.28 87.42
2.6.29 88.4325 88.48 88.42 88.44 88.39
2.6.30-rc2 88.4925 88.49 88.44 88.56 88.48
Eric Dumazet
2009-04-18 08:18:22 UTC
Permalink
Post by David Miller
We really need to find a good way to fix that IRQ affinity issue.
Here are the results with setting interrupt affinity. We are still losing
5 usecs from .22 to .30-rc2. We will investigate more in depth time
permitting.

UDP ping pong 40 bytes

Kernel Test 1 Test 2 Test 3 Test 4 Avg
2.6.22 83.1125 83.4 83.03 82.95 83.07
2.6.23 83.33 83.51 83.18 83.42 83.21
2.6.24 82.7775 82.8 82.77 82.77 82.77
2.6.25 85.875 85.94 85.75 86.06 85.75
2.6.26 87.4025 87.53 87.25 87.3 87.53
2.6.27 87.81 87.81 87.77 87.82 87.84
2.6.28 87.4275 87.77 87.24 87.28 87.42
2.6.29 88.4325 88.48 88.42 88.44 88.39
2.6.30-rc2 88.4925 88.49 88.44 88.56 88.48

You are using UDP, and UDP got memory accounting. This

Eric Dumazet
2009-04-18 08:20:28 UTC
Permalink
Post by Eric Dumazet
Post by David Miller
We really need to find a good way to fix that IRQ affinity issue.
Here are the results with setting interrupt affinity. We are still losing
5 usecs from .22 to .30-rc2. We will investigate more in depth time
permitting.
UDP ping pong 40 bytes
Kernel Test 1 Test 2 Test 3 Test 4 Avg
2.6.22 83.1125 83.4 83.03 82.95 83.07
2.6.23 83.33 83.51 83.18 83.42 83.21
2.6.24 82.7775 82.8 82.77 82.77 82.77
2.6.25 85.875 85.94 85.75 86.06 85.75
2.6.26 87.4025 87.53 87.25 87.3 87.53
2.6.27 87.81 87.81 87.77 87.82 87.84
2.6.28 87.4275 87.77 87.24 87.28 87.42
2.6.29 88.4325 88.48 88.42 88.44 88.39
2.6.30-rc2 88.4925 88.49 88.44 88.56 88.48
You are using UDP, and UDP got memory accounting. This
This could explain a slowdown, but not 2.5 us per packet
(I take half of your 5 us, since you are measuring two-way latencies)


Eric Dumazet
2009-04-18 19:43:59 UTC
Permalink
Post by David Miller
We really need to find a good way to fix that IRQ affinity issue.
Yep... this is annoying, but disable_msi=1 can help
Here are the results with setting interrupt affinity. We are still losing
5 usecs from .22 to .30-rc2. We will investigate more in depth time
permitting.

UDP ping pong 40 bytes
Kernel Test 1 Test 2 Test 3 Test 4 Avg
2.6.22 83.1125 83.4 83.03 82.95 83.07
2.6.23 83.33 83.51 83.18 83.42 83.21
2.6.24 82.7775 82.8 82.77 82.77 82.77
2.6.25 85.875 85.94 85.75 86.06 85.75
2.6.26 87.4025 87.53 87.25 87.3 87.53
2.6.27 87.81 87.81 87.77 87.82 87.84
2.6.28 87.4275 87.77 87.24 87.28 87.42
2.6.29 88.4325 88.48 88.42 88.44 88.39
2.6.30-rc2 88.4925 88.49 88.44 88.56 88.48
I did my tests with 3 machines, trying to reproduce your results, and failed.

So it's good news for us (we can go back to watching TV :) )

192.168.20.110 : linux-2.6 latest git tree, receiver "newkernel"
192.168.20.120 : linux-2.6.22, receiver "oldkernel"
192.168.20.112 : 2.6.28.7 as the sender machine (we don't really care)

I only took traces on receivers to rule out different coalescing delays
of the NIC that could depend on firmware or version

bnx2 driver statically included in vmlinux.
(on newkernel, "disable_msi" is set to 1, because this is the only way my
BCM5708S can have interrupt affinities set by the admin and used by the
hardware on 2.6.28+)

Result for both new and old kernel :

3 us of latency to answer an ICMP ping request, if irq affinity is correct.

This 3 us includes the overhead of timestamping and the pcap copy of packets
since I have my tcpdump launched.

In my case I want NIC IRQs on CPU0 : "echo 1 >/proc/irq/16/smp_affinity"


(trace taken on the receiver side)
ICMP Ping for 2.6.30-rc2 : 3us, 3us, 3us, 3us
15:11:52.856072 IP 192.168.20.112 > 192.168.20.110: ICMP echo request, id 7957, seq 3330, length 64
15:11:52.856075 IP 192.168.20.110 > 192.168.20.112: ICMP echo reply, id 7957, seq 3330, length 64

15:11:52.856168 IP 192.168.20.112 > 192.168.20.110: ICMP echo request, id 7957, seq 3331, length 64
15:11:52.856171 IP 192.168.20.110 > 192.168.20.112: ICMP echo reply, id 7957, seq 3331, length 64

15:11:52.856264 IP 192.168.20.112 > 192.168.20.110: ICMP echo request, id 7957, seq 3332, length 64
15:11:52.856267 IP 192.168.20.110 > 192.168.20.112: ICMP echo reply, id 7957, seq 3332, length 64

15:11:52.856359 IP 192.168.20.112 > 192.168.20.110: ICMP echo request, id 7957, seq 3333, length 64
15:11:52.856362 IP 192.168.20.110 > 192.168.20.112: ICMP echo reply, id 7957, seq 3333, length 64


ICMP Ping for 2.6.22: 3us, 3us, 3us, 3us : exactly the same latencies as the current kernel

20:35:06.375011 IP 192.168.20.112 > 192.168.20.120: icmp 64: echo request seq 110
20:35:06.375014 IP 192.168.20.120 > 192.168.20.112: icmp 64: echo reply seq 110

20:35:06.375106 IP 192.168.20.112 > 192.168.20.120: icmp 64: echo request seq 111
20:35:06.375109 IP 192.168.20.120 > 192.168.20.112: icmp 64: echo reply seq 111

20:35:06.375201 IP 192.168.20.112 > 192.168.20.120: icmp 64: echo request seq 112
20:35:06.375204 IP 192.168.20.120 > 192.168.20.112: icmp 64: echo reply seq 112

20:35:06.375296 IP 192.168.20.112 > 192.168.20.120: icmp 64: echo request seq 113
20:35:06.375299 IP 192.168.20.120 > 192.168.20.112: icmp 64: echo reply seq 113
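The per-exchange latencies quoted in this post are just differences between consecutive tcpdump timestamps; a small sketch of that computation (timestamps copied from the trace above; the helper name is mine, not part of udpping or tcpdump):

```python
from datetime import datetime

def us_between(t_request: str, t_reply: str) -> int:
    """Microseconds between two tcpdump HH:MM:SS.ffffff timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(t_reply, fmt) - datetime.strptime(t_request, fmt)
    return round(delta.total_seconds() * 1e6)

# First request/reply pair of the 2.6.30-rc2 ICMP trace:
print(us_between("15:11:52.856072", "15:11:52.856075"))  # 3 us
```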



Answering a ping involves the regular network stack, with softirq and all...
But we know for sure that the reply comes from the cpu that handled the rx, so
no false sharing between cpus, and that no user land process was involved, only softirqs.


Now with your udpping I get very variable times : 21 us, 14 us, 27 us, 11 us on the following random samples

(trace taken on receiver side again)
15:18:50.256238 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
15:18:50.256259 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

15:18:50.256371 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
15:18:50.256385 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

15:18:50.256498 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
15:18:50.256525 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

15:18:50.256638 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
15:18:50.256649 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

Then if I force the udpping receiver to run on CPU2 (same physical CPU receiving NIC interrupts,
but a different core) :

I get : 13us, 11us, 10us, 10us, which is not bad.

21:07:48.823741 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:07:48.823754 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

21:07:48.823852 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:07:48.823863 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

21:07:48.823962 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:07:48.823972 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

21:07:48.824087 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:07:48.824097 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

If run on another physical cpu : I get 15us, 12us, 12us, 12us

21:09:10.667120 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:09:10.667135 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300
21:09:10.667234 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:09:10.667246 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300
21:09:10.667345 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:09:10.667357 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300
21:09:10.667471 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:09:10.667483 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300


If I force udpping onto the same cpu as the IRQs (CPU0) I get : 7us, 8us, 7us, 7us
And yes, this is very good and expected IMHO :)

21:04:23.780421 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:04:23.780428 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

21:04:23.780539 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:04:23.780547 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

21:04:23.780663 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:04:23.780670 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300

21:04:23.780783 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300
21:04:23.780790 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300


Now running the receiver on 2.6.22

udpping: 6us, 15us, 14us, 13us

16:34:34.266069 IP 192.168.20.112.9001 > 192.168.20.120.9000: UDP, length 300
16:34:34.266075 IP 192.168.20.120.9000 > 192.168.20.112.9001: UDP, length 300

16:34:34.266093 IP 192.168.20.112.9001 > 192.168.20.120.9000: UDP, length 300
16:34:34.266108 IP 192.168.20.120.9000 > 192.168.20.112.9001: UDP, length 300

16:34:34.266209 IP 192.168.20.112.9001 > 192.168.20.120.9000: UDP, length 300
16:34:34.266223 IP 192.168.20.120.9000 > 192.168.20.112.9001: UDP, length 300

16:34:34.266323 IP 192.168.20.112.9001 > 192.168.20.120.9000: UDP, length 300
16:34:34.266335 IP 192.168.20.120.9000 > 192.168.20.112.9001: UDP, length 300

Same CPU for user process and IRQs:

udpping: 7us, 9us, 8us, 10us. Very good, as we could expect!

20:39:07.106698 IP 192.168.20.112.9001 > 192.168.20.120.9000: UDP, length 300
20:39:07.106705 IP 192.168.20.120.9000 > 192.168.20.112.9001: UDP, length 300

20:39:07.106808 IP 192.168.20.112.9001 > 192.168.20.120.9000: UDP, length 300
20:39:07.106819 IP 192.168.20.120.9000 > 192.168.20.112.9001: UDP, length 300

20:39:07.106918 IP 192.168.20.112.9001 > 192.168.20.120.9000: UDP, length 300
20:39:07.106926 IP 192.168.20.120.9000 > 192.168.20.112.9001: UDP, length 300

20:39:07.107046 IP 192.168.20.112.9001 > 192.168.20.120.9000: UDP, length 300
20:39:07.107056 IP 192.168.20.120.9000 > 192.168.20.112.9001: UDP, length 300

But overall, I get slightly better results with 2.6.30-rc2: 7 us.


I did an oprofile session on 2.6.30-rc2 for the 'best case' (7us of latency),
tracing only kernel events.

(Unfortunately, IRQ16 is shared by eth0, eth1 and uhci_hcd:usb1.)


CPU: Core 2, speed 3000.19 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10000
samples cum. samples % cum. % symbol name
252423 252423 19.4717 19.4717 mwait_idle
99195 351618 7.6518 27.1235 apic_timer_interrupt
64108 415726 4.9452 32.0687 bnx2_interrupt
62195 477921 4.7977 36.8664 find_busiest_group
60022 537943 4.6300 41.4964 scheduler_tick
40772 578715 3.1451 44.6415 read_tsc
38550 617265 2.9737 47.6152 uhci_irq
36074 653339 2.7827 50.3980 sched_clock_cpu
25319 678658 1.9531 52.3510 rb_get_reader_page
25114 703772 1.9373 54.2883 native_sched_clock
23753 727525 1.8323 56.1206 getnstimeofday
22103 749628 1.7050 57.8256 ktime_get
19924 769552 1.5369 59.3625 hrtimer_interrupt
18999 788551 1.4656 60.8281 ktime_get_ts
17910 806461 1.3816 62.2096 tick_check_oneshot_broadcast
17664 824125 1.3626 63.5722 ring_buffer_consume
13354 837479 1.0301 64.6023 irq_entries_start
12885 850364 0.9939 65.5963 sched_clock_tick
12270 862634 0.9465 66.5428 tick_sched_timer
11232 873866 0.8664 67.4092 __schedule
10607 884473 0.8182 68.2274 softlockup_tick
9446 893919 0.7287 68.9561 raise_softirq
8483 902402 0.6544 69.6104 run_timer_softirq
8295 910697 0.6399 70.2503 __do_softirq
8208 918905 0.6332 70.8835 copy_to_user
7929 926834 0.6116 71.4951 run_rebalance_domains
7478 934312 0.5768 72.0720 bnx2_poll_work
6872 941184 0.5301 72.6021 cpu_idle
6267 947451 0.4834 73.0855 rcu_pending
5953 953404 0.4592 73.5447 rb_advance_reader
5781 959185 0.4459 73.9906 handle_fasteoi_irq
5762 964947 0.4445 74.4351 cpu_clock
5615 970562 0.4331 74.8682 udp_sendmsg
5544 976106 0.4277 75.2959 __skb_recv_datagram
5493 981599 0.4237 75.7196 set_normalized_timespec
5437 987036 0.4194 76.1390 update_wall_time
5282 992318 0.4074 76.5465 smp_apic_timer_interrupt
5097 997415 0.3932 76.9397 ip_append_data
5034 1002449 0.3883 77.3280 find_next_bit


My vmstat 1 on the 2.6.30-rc2 receiver during the test:

0 0 0 4047312 12944 40156 0 0 0 0 64231 26101 0 1 99 0
0 1 0 4047312 12948 40156 0 0 0 32 64297 25762 0 1 99 0
0 0 0 4047304 12952 40156 0 0 0 8 64650 25510 0 1 99 0
0 0 0 4047312 12960 40152 0 0 0 344 64533 25434 0 1 98 1
0 0 0 4047312 12960 40156 0 0 0 0 64279 25628 0 1 99 0
0 0 0 4047312 12960 40156 0 0 0 4 64606 25356 0 1 99 0
0 0 0 4047312 12960 40156 0 0 0 0 64253 25732 0 1 99 0
0 1 0 4047304 12968 40152 0 0 0 40 64735 25211 0 1 99 0
0 0 0 4047312 12968 40156 0 0 0 356 64638 25443 0 1 99 0
0 0 0 4047312 12968 40156 0 0 0 0 64269 25608 0 1 99 0
0 0 0 4047312 12968 40156 0 0 0 0 64562 25120 0 1 99 0
1 0 0 4047312 12968 40156 0 0 0 4 64668 24811 0 1 99 0


My three machines have same hardware :
HP ProLiant BL460c G1
2 quad core Intel(R) Xeon(R) CPU E5450 @ 3.00GHz FSB 1333MHz
4GB ram (667 MHz)
Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 12)
32 bit kernel


In conclusion, something might be wrong in your methodology, or your hardware, or...

Please Christoph, would you mind posting more information on your config?

Any chance you have CONFIG_PREEMPT enabled, or any expensive kernel debug option?

Christoph Lameter
2009-04-20 17:29:54 UTC
Permalink
I did my tests with 3 machines, trying to reproduce your results and failed.
Ow.... Where are the results? Seems that you only included the results of
traces? The numbers would be much higher with udpping since it measures
the full round trip time from user space of one machine to the other and
back. The code segment that you are tracing may not be relevant.
I only took traces on receivers to rule out different coalescing delays
of NIC that could depend on firmware or version
Hmmm... Okay there could be a firmware influence....
bnx2 driver statically included in vmlinux.
(on the new kernel, "disable_msi" is set to 1, because this is the only way my BCM5708S can have admin-set interrupt affinities honored by the hardware on 2.6.28+)
Here bnx2 is a module and MSI is enabled after it became available
(2.6.27+). Affinities seem to be effective after resetting the driver.
If I force udpping onto the same CPU as the IRQs (CPU0) I get: 7us, 8us, 7us, 7us
And yes, this is very good and expected IMHO :)
Sounds very good. If I just knew what you are measuring.
In conclusion, something might be wrong in your methodology, or your hardware, or...
Cannot really say since the results are not published? We can retry this
by making sure that udpping runs on the same processor as the IRQ.
Please Christoph, would you mind posting more information on your config ?
Sure.

Dell PowerEdge 1950
http://www1.ap.dell.com/content/products/productdetails.aspx/pedge_1950?c=au&cs=aubsd1&l=en&s=bsd
Any chance you have CONFIG_PREEMT enabled or any expensive kernel debug thing ?
Nope. Even if so: Would that not be offset by running the same kernel
version?

CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_X86_LOCAL_APIC=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_UNUSED_SYMBOLS=y
# CONFIG_DEBUG_FS is not set
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_FORCED_INLINING is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_4KSTACKS is not set
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y


Eric Dumazet
2009-04-20 17:57:09 UTC
Permalink
I did my tests with 3 machines, trying to reproduce your results and failed.

Ow.... Where are the results? Seems that you only included the results of
traces? The numbers would be much higher with udpping since it measures
the full round trip time from user space of one machine to the other and
back. The code segment that you are tracing may not be relevant.

I only took traces on receivers to rule out different coalescing delays
of NIC that could depend on firmware or version

Hmmm... Okay there could be a firmware influence....

bnx2 driver statically included in vmlinux.
(on the new kernel, "disable_msi" is set to 1, because this is the only way my BCM5708S can have admin-set interrupt affinities honored by the hardware on 2.6.28+)

Here bnx2 is a module and MSI is enabled after it became available
(2.6.27+). Affinities seem to be effective after resetting the driver.

Hmm, I am not sure I can do that, I would lose my ssh connection...

If I force udpping onto the same CPU as the IRQs (CPU0) I get: 7us, 8us, 7us, 7us
And yes, this is very good and expected IMHO :)

Sounds very good. If I just knew what you are measuring.
Rephrasing my email: I was measuring latencies on the receiving machine,
by using tcpdump and subtracting 'request_time' from 'answer_time'.
The thing is that the timestamps don't account for hardware delays.
(We note the time when the RX interrupt delivers the packet, and the time right before giving the frame to the hardware.)


21:04:23.780421 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300 (request)
21:04:23.780428 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300 (answer)

Here, [21:04:23.780428 - 21:04:23.780421] = 7 us

So my results are extensively published :)
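That subtraction is mechanical enough to script; a small sketch of recovering the receiver-side latency from a tcpdump request/answer timestamp pair (same-day wall-clock timestamps assumed, as in the traces above):

```python
from datetime import datetime

def latency_us(request_ts, answer_ts, fmt="%H:%M:%S.%f"):
    """Answer timestamp minus request timestamp from tcpdump, in microseconds."""
    t0 = datetime.strptime(request_ts, fmt)
    t1 = datetime.strptime(answer_ts, fmt)
    d = t1 - t0
    return d.days * 86_400_000_000 + d.seconds * 1_000_000 + d.microseconds

# The request/answer pair quoted above:
print(latency_us("21:04:23.780421", "21:04:23.780428"))  # -> 7
```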

In conclusion, something might be wrong in your methodology, or your hardware, or...

Cannot really say since the results are not published? We can retry this
by making sure that udpping runs on the same processor as the IRQ.

Please Christoph, would you mind posting more information on your config?

Sure.

Dell PowerEdge 1950
http://www1.ap.dell.com/content/products/productdetails.aspx/pedge_1950?c=au&cs=aubsd1&l=en&s=bsd

Any chance you have CONFIG_PREEMPT enabled, or any expensive kernel debug option?

Nope. Even if so: Would that not be offset by running the same kernel
version?
I don't know; it was a random thought of mine, since in the past I had bad results
when this one was enabled. That might have changed.
=20
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
OK, I had "# CONFIG_SCHED_SMT is not set"
I'll try with this option set
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_X86_LOCAL_APIC=y
Are you running a 32-bit or 64-bit kernel?

Thanks


Christoph Lameter
2009-04-20 18:13:23 UTC
Permalink
Post by Christoph Lameter
Sounds very good. If I just knew what you are measuring.
Rephrasing my email: I was measuring latencies on the receiving machine,
by using tcpdump and subtracting 'request_time' from 'answer_time'.
The thing is that the timestamps don't account for hardware delays.
(We note the time when the RX interrupt delivers the packet, and the time right before giving the frame to the hardware.)
21:04:23.780421 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300 (request)
21:04:23.780428 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300 (answer)
Here, [21:04:23.780428 - 21:04:23.780421] = 7 us
So my results are extensively published :)
But they are not comparable with my results. There could be other effects
in the system call API etc that have caused this regression. Plus tcpdump
causes additional copies of the packet to be delivered to user space.
Post by Christoph Lameter
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
OK, I had "# CONFIG_SCHED_SMT is not set"
I'll try with this option set
Should not be relevant since the processor has no hyperthreading.
Are you running a 32 or 64 bits kernel ?
Test was done using a 64 bit kernel.
Eric Dumazet
2009-04-20 18:46:34 UTC
Permalink

Post by Christoph Lameter
Sounds very good. If I just knew what you are measuring.
Rephrasing my email: I was measuring latencies on the receiving machine,
by using tcpdump and subtracting 'request_time' from 'answer_time'.
The thing is that the timestamps don't account for hardware delays.
(We note the time when the RX interrupt delivers the packet, and the time right before giving the frame to the hardware.)
21:04:23.780421 IP 192.168.20.112.9001 > 192.168.20.110.9000: UDP, length 300 (request)
21:04:23.780428 IP 192.168.20.110.9000 > 192.168.20.112.9001: UDP, length 300 (answer)
Here, [21:04:23.780428 - 21:04:23.780421] = 7 us
So my results are extensively published :)

But they are not comparable with my results. There could be other effects
in the system call API etc that have caused this regression. Plus tcpdump
causes additional copies of the packet to be delivered to user space.
Yep, this was all mentioned in my mail.
I wanted to compare latencies on the receiver only, ruling out hardware, and ruling out the sender (no need to reboot it).
These latencies are higher than the ones without tcpdump, since more copies are involved with tcpdump.

About system call API effects, they are included in my tests.
Since:

t0 : time we receive the packet from the NIC
     -> wakeup of the user process, scheduler...
     User process returns from recvfrom(), copy from system to user space
     User process does the sendto(), copy from user to system space

t1 : -> calling dev_start_xmit()
     packet given to the NIC driver (idle during the tests, so it should really send the packet asap)

     User process calls recvfrom() again and blocks (this one is not accounted in latency, as in your test)
t2 : NIC driver acknowledges the TX

delta = t1 - t0

One thing that could hurt is the TX done interrupt, but this happens a few us after "t1", so it
doesn't hurt your workload, since the next frame is received at least 100us after the last answer...

(cpu is idle 99%)
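The user-space round trip that udpping times can be sketched with plain sockets. This loopback toy is not the actual udpping code from gentwo.org, but it exercises the same sendto()/recvfrom() system-call path on both sides:

```python
import socket
import threading
import time

def echo(sock, rounds):
    # Responder: bounce each datagram straight back, like the udpping peer.
    for _ in range(rounds):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)

def round_trip_us(payload=b"x" * 40, rounds=200):
    """Median user-space round-trip time over loopback, in microseconds."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind(("127.0.0.1", 0))
    threading.Thread(target=echo, args=(srv, rounds), daemon=True).start()
    cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    samples = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        cli.sendto(payload, srv.getsockname())
        cli.recvfrom(2048)
        samples.append((time.perf_counter() - t0) * 1e6)
    return sorted(samples)[rounds // 2]

if __name__ == "__main__":
    print("median loopback RTT: %.1f us" % round_trip_us())
```

Over loopback this measures only the software path (t0 to t1 above, twice), with no NIC or wire time involved.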

Point is that even with tcpdump running, latencies are very good on 2.6.30-rc2, and were very good
with 2.6.22. I see no significant increase/decrease...

Post by Christoph Lameter
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
OK, I had "# CONFIG_SCHED_SMT is not set"
I'll try with this option set

Should not be relevant since the processor has no hyperthreading.

Are you running a 32-bit or 64-bit kernel?

Test was done using a 64 bit kernel.
Ok, I'll try 64bit too :)

1 us is the time to access about 10 falsely shared cache lines.
64-bit arches store fewer pointers/longs per cache line,
so a 64-bit kernel could be slower on this kind of workload in the general case (if several cpus play the game).


Christoph Lameter
2009-04-20 19:16:49 UTC
Permalink
Point is that even with tcpdump running, latencies are very good on 2.6.30-rc2, and were very good
with 2.6.22. I see no significant increase/decrease...
Well, okay, that applies to your testing methodology. But the claim that
the regression I reported does not exist has not been proven, since you
ran a different test.
1 us is time to access about 10 false shared cache lines.
That depends on the size of the system and the number of processors
contending for the cache line.
64 bit arches store less pointers/long per cache line.
So a 64 bit kernel could be slower on this kind of workload in the general case (if several cpus play the game)
Right. But in practice I have also seen slight performance increases due
to the increased availability of memory and the avoidance of various 32-bit
hacks (like highmem). Plus several recent subsystems seem to be
optimized for 64-bit, e.g. Infiniband.

I'd still like to see udpping results on your testing rig to get another
datapoint. If the udpping results do not show regressions in your
tests, then there is likely a config issue at the core of the regression
that I am seeing here.
Eric Dumazet
2009-04-20 20:07:00 UTC
Permalink

Point is that even with tcpdump running, latencies are very good on 2.6.30-rc2, and were very good
with 2.6.22. I see no significant increase/decrease...

Well okay that applies to your testing methodology but the statement that
you have shown that the regression that I reported does not exist is not
proven since you ran a different test.

I ran half the test. The receiver is OK, and this is the latency we all expect, as
a service provider.

Now you can focus on the sender side.

For example, your program uses a kernel service to gather time with nanosecond precision.

Maybe there is a problem with it, I don't know...

Your test has so many variables that it is hard to guess which part has a problem.

Maybe this is what you wanted to show after all, and you are not really interested
in discovering what is happening. Oh well, just kidding.

I am not trying to say you are right or wrong, Christoph, just trying to
check if linux really got a regression in various past releases. So far,
I have not found any strange results on the UDP path, once IRQ affinities
are fixed of course.

1 us is the time to access about 10 falsely shared cache lines.

That depends on the size of the system and the number of processors
contending for the cache line.

64-bit arches store fewer pointers/longs per cache line,
so a 64-bit kernel could be slower on this kind of workload in the general case (if several cpus play the game).

Right. But in practice I have also seen slight performance increases due
to the increased availability of memory and the avoidance of various 32-bit
hacks (like highmem). Plus several recent subsystems seem to be
optimized for 64-bit, e.g. Infiniband.

Maybe, but with udpping's 40-byte messages, I am not sure it can make a difference.

I'd still like to see udpping results on your testing rig to get another
datapoint. If the udpping results are not showing regressions on your
tests then there is likely a config issue at the core of the regression
that I am seeing here.

No changes in udpping but noise.

Also, my machines use bonding and vlans, so I probably have a little bit of overhead
(bonding uses an rwlock, which is not very SMP friendly...).


Christoph Lameter
2009-04-20 21:14:53 UTC
Permalink
Post by Eric Dumazet
No changes in udpping but noise.
No changes in the results of udpping that you would not consider noise?

The time syscalls are virtual syscalls which are quite cheap.

Eric Dumazet
2009-04-20 21:52:30 UTC
Permalink

Post by Eric Dumazet
No changes in udpping but noise.

No changes in the results of udpping that you would not consider noise?

The only changes were when every cpu played, so it really is not meaningful.

The time syscalls are virtual syscalls which are quite cheap.

Not on 32 bits afaik, at least with the glibc I use; strace told me:

clock_gettime(CLOCK_REALTIME, {1240262824, 400770067}) = 0
sendto(3, "\314\335\356\377\300\250\24p#)\0\0\23\324\207\264\377\377\377\377\0\1\2\3\4\5\6\7\10\t\0\0\3"..., 300, 0, {sa_family=AF_INET, sin_port=htons(9000), sin_addr=inet_addr("192.168.20.112")}, 16) = 300
recvfrom(3, "\314\335\356\377\300\250\24p#)\0\0\23\324\207\264\377\377\377\377\0\1\2\3\4\5\6\7\10\t\0\0\3"..., 300, 0, NULL, NULL) = 300
clock_gettime(CLOCK_REALTIME, {1240262824, 401108324}) = 0
clock_gettime(CLOCK_REALTIME, {1240262824, 401177464}) = 0

The cost is about 384 ns on this 32-bit 3GHz machine (without strace), while it's 51 ns on 64-bit.
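That per-call cost can be eyeballed from user space. A rough sketch (in Python the interpreter overhead dominates, so unlike a tight C loop around clock_gettime() this only gives an upper bound):

```python
import time

def clock_gettime_cost_ns(iters=100_000):
    """Very rough upper bound on the per-call cost of
    clock_gettime(CLOCK_REALTIME), in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iters):
        time.clock_gettime(time.CLOCK_REALTIME)
    return (time.perf_counter_ns() - start) // iters

if __name__ == "__main__":
    print(clock_gettime_cost_ns(), "ns/call (including interpreter overhead)")
```

On a 32-bit kernel each call is a real syscall; with the 64-bit vsyscall/vDSO path it stays in user space, which is where the 384 ns vs 51 ns gap comes from.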

Anyway, I'll try to boot 64-bit kernels on those 32-bit distro machines in the following days...


Christoph Lameter
2009-04-21 14:00:15 UTC
Permalink
Post by Eric Dumazet
Not on 32bits afaik, with glibc I use at least, strace told me.
Right the vsyscall optimizations are not for 32 bit.
Christoph Lameter
2009-04-21 19:36:32 UTC
Permalink
Switching off MSI increases the latency another step:


MSI off
UDP ping pong 40 bytes Test 1 Test 2 Test 3 Test 4
2.6.27 91.1725 91.31 91.17 91.12 91.09
2.6.28 91.5625 91.71 91.45 91.67 91.42
2.6.29 91.7 91.71 91.67 91.72 91.7
2.6.30-rc2 91.835 91.83 91.78 91.82 91.91

MSI on

UDP ping pong 40 bytes Test 1 Test 2 Test 3 Test 4
2.6.22 83.1125 83.4 83.03 82.95 83.07
2.6.23 83.33 83.51 83.18 83.42 83.21
2.6.24 82.7775 82.8 82.77 82.77 82.77
2.6.25 85.875 85.94 85.75 86.06 85.75
2.6.26 87.4025 87.53 87.25 87.3 87.53
2.6.27 87.81 87.81 87.77 87.82 87.84
2.6.28 87.4275 87.77 87.24 87.28 87.42
2.6.29 88.4325 88.48 88.42 88.44 88.39
2.6.30-rc2 88.4925 88.49 88.44 88.56 88.48
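Expressed the same way as the percentages in the original report, the MSI-on averages above give roughly:

```python
def regression_pct(baseline_us, current_us):
    """Latency increase relative to a baseline, as a percentage."""
    return (current_us - baseline_us) / baseline_us * 100.0

# Averages from the MSI-on table above (2.6.22 -> 2.6.29):
print(round(regression_pct(83.1125, 88.4325), 1))  # -> 6.4
```

So on this rig the 40-byte regression is about 6.4% rather than the ~10% in the first report, and switching MSI off adds another few microseconds on top.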

The next test series will always have the same kernel on one side, etc.


Evgeniy Polyakov
2009-04-20 19:44:10 UTC
Permalink
Hi.
Post by Christoph Lameter
Post by Eric Dumazet
I only took traces on receivers to rule out different coalescing delays
of NIC that could depend on firmware or version
Hmmm... Okay there could be a firmware influence....
Can you replace the driver in one or another kernel?
Could those be scheduler interactions (I know I used the wrong words)? Maybe
test with sched_rt priority?
--
Evgeniy Polyakov
Christoph Lameter
2009-04-16 19:55:02 UTC
Permalink
Post by Eric Dumazet
Is it a bnx2 or tg3 driver ?
bnx2.
Post by Eric Dumazet
ethtool -C eth0 rx-usecs 0 rx-frames 1
prior to your tests...
rx-usecs is set to 18 by default by the driver and rx-frames is set to 6.
We saw some funky behavior with rx-usecs 0 so we left the defaults.

But what does it matter? Just changing the kernel should not cause
increases in latency.

Michael Chan
2009-04-16 21:57:17 UTC
Permalink
Post by Christoph Lameter
Post by Eric Dumazet
Is it a bnx2 or tg3 driver ?
bnx2.
Post by Eric Dumazet
ethtool -C eth0 rx-usecs 0 rx-frames 1
prior to your tests...
rx-usecs is set to 18 by default by the driver and rx-frames is set to 6.
We saw some funky behavior with rx-usecs 0 so we left the defaults.
We haven't changed these defaults since the beginning, but firmware
updates between 2.6.22 and 2.6.29 can affect latency.

Yes, Eric is right, using rx-frames 1 will give the best latency number.
(rx-usecs won't matter if rx-frames is set to 1).
Post by Christoph Lameter
But what does it matter? Just changing the kernel should not cause
increases in latency.
It will not matter if you use the same out-of-tree driver when switching
kernels. If you use in-tree driver, the firmware may have changed
between kernel versions.



Christoph Lameter
2009-04-17 13:47:42 UTC
Permalink
Post by Michael Chan
It will not matter if you use the same out-of-tree driver when switching
kernels. If you use in-tree driver, the firmware may have changed
between kernel versions.
The in-tree driver was used.

David Miller
2009-04-16 22:59:25 UTC
Permalink
From: Christoph Lameter <***@linux.com>
Date: Thu, 16 Apr 2009 15:55:02 -0400 (EDT)
Post by Christoph Lameter
But what does it matter? Just changing the kernel should not cause
increases in latency.
There's firmware running on those cards doing all kinds of
fun things, the firmware gets changed in driver updates,
therefore...