Discussion:
Question on "net: allocate skbs on local node"
Eric Dumazet
2011-04-07 04:58:47 UTC
On Thursday, April 7, 2011 at 10:16 +0800, Wei Gu wrote:
> Hi Eric,
> Testing with the ixgbe Linux 2.6.38 driver:
> We get a slightly better throughput figure with this driver, but it does
> not seem to scale at all; only one CPU core (24) is ever stressed.
> Looking at the perf report for ksoftirqd/24, the most costly function
> is still "_raw_spin_unlock_irqrestore" and the IRQ/s rate is huge, which
> somehow conflicts with the design of NAPI. On Linux 2.6.32, while the
> CPU was stressed, the IRQ rate decreased as NAPI ran mostly in polling
> mode. I don't know why on 2.6.38 the IRQ rate keeps increasing.


CC netdev and the Intel guys, since they said it should not happen (TM)

If you don't use DCA (make sure the ioatdma module is not loaded), how come
alloc_iova() is called at all?

If you use DCA, how come it's called, since the same CPU serves a given
interrupt?



> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ=y
>
>    PerfTop:  512417 irqs/sec  kernel:91.3%  exact: 0.0%  [1000Hz cpu-clock-msecs], (all, 64 CPUs)
> ------------------------------------------------------------------------------
> -   0.82%  ksoftirqd/24  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>    - _raw_spin_unlock_irqrestore
>       - 44.27% alloc_iova
>            intel_alloc_iova
>            __intel_map_single
>            intel_map_page
>          - ixgbe_init_interrupt_scheme
>             - 59.97% ixgbe_alloc_rx_buffers
>                  [...]
>             - 40.03% ixgbe_change_mtu
>                  [...]
>       + 35.85% find_iova
>       + 19.44% add_unmap
>
> Thanks
> WeiGu


Eric Dumazet
2011-04-07 05:16:52 UTC
On Thursday, April 7, 2011 at 06:58 +0200, Eric Dumazet wrote:
> [...]
>
> If you don't use DCA (make sure the ioatdma module is not loaded), how
> come alloc_iova() is called at all?
>
> If you use DCA, how come it's called, since the same CPU serves a given
> interrupt?

But then, maybe you forgot to set the CPU affinity of your IRQs?

A high-performance routing setup is tricky, since you probably want to
disable many features that are ON by default: most machines act as an
end host.



Eric Dumazet
2011-04-07 06:16:52 UTC
On Thursday, April 7, 2011 at 07:16 +0200, Eric Dumazet wrote:
> [...]
>
> But then, maybe you forgot to set the CPU affinity of your IRQs?
>
> A high-performance routing setup is tricky, since you probably want to
> disable many features that are ON by default: most machines act as an
> end host.

Please don't send me any more private mails; I do think the issue you
have is in your setup, not in any particular optimization done in the
network stack.


Copy of your private mail:

> On 2.6.38, I got a lot of "rx_missed_errors" on the NIC, which means the
> rx loop was really too busy to get packets from the receive ring. Usually
> in this case it shouldn't exit softirq and should keep polling in order
> to decrease the interrupts.
>
> On 2.6.32, I can Rx and Tx 2.3 Mpps with no packets lost (errors on the
> NIC), but on 2.6.38 I can only reach 50 kpps with a lot of
> "rx_missed_errors", and all the bound CPU cores are 100% in SI. I
> don't think there were any optimizations for it.

I hope you understand there is something wrong with your setup?

50,000 pps on a 64-CPU machine is a bad joke.

We can reach 10,000,000+ on a 16-CPU one.



Wei Gu
2011-04-07 07:22:50 UTC
Hi guys,
As I discussed with Eric, I get very low performance on the Linux 2.6.38
kernel with the Intel ixgbe-3.2.10 driver.
I tested different rx ring sizes on the Intel 10G NIC, by setting
ethtool -G rx 4096.
I get the lowest performance (~50 Kpps Rx&Tx) by setting rx == 4096.
Once I decrease rx to 512 (the default), I can get at most 250 Kpps
Rx&Tx on 1 NIC.
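
For reference, the ring resizing described above maps onto commands like
these (a sketch; eth10 assumed as the NIC under test):

ethtool -g eth10          # show current and maximum rx/tx ring sizes
ethtool -G eth10 rx 4096  # enlarge the rx ring (the slow case above)
ethtool -G eth10 rx 512   # back to the default (the faster case)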

I was running this test on an HP DL580 with 4 CPU sockets and a full
memory configuration:
modprobe ixgbe RSS=8,8,8,8,8,8,8,8 FdirMode=0,0,0,0,0,0,0,0 Node=0,0,1,1,2,2,3,3
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 65525 MB
node 0 free: 63053 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 63388 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 65536 MB
node 2 free: 63344 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 65535 MB
node 3 free: 63376 MB

Then I bound eth10's rx and tx IRQs to cores "24 25 26 27 28 29 30 31",
one by one, which means 1 rx and 1 tx queue share 1 core; an example of
one such binding is sketched below.
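
One such per-queue binding can be done like this (a sketch; IRQ naming
such as "eth10-rx-0" is an assumption, the exact names depend on the
driver):

# Pin eth10's first rx queue IRQ to core 24 (mask 1<<24 = 0x01000000)
IRQ=$(grep eth10-rx-0 /proc/interrupts | cut -d: -f1 | tr -d ' ')
echo 01000000 > /proc/irq/$IRQ/smp_affinity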


I did the same test on the 2.6.32 kernel; I can get >2.5M tx&rx with the
same setup on RHEL6 (2.6.32) Linux. But I never reached 10,000,000 rx&tx
on a single NIC :)

I also tested the ixgbe driver shipped with 2.6.38; it has the same
problem.

This is a perf record with the Linux-shipped ixgbe driver; it shows a
very high irq/s rate, and the softirq is busy in alloc_iova.


   PerfTop:  512417 irqs/sec  kernel:91.3%  exact: 0.0%  [1000Hz cpu-clock-msecs], (all, 64 CPUs)
------------------------------------------------------------------------------
-   0.82%  ksoftirqd/24  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 44.27% alloc_iova
           intel_alloc_iova
           __intel_map_single
           intel_map_page
         - ixgbe_init_interrupt_scheme
            - 59.97% ixgbe_alloc_rx_buffers
                 ixgbe_clean_rx_irq
                 0xffffffffa033a5
                 net_rx_action
                 __do_softirq
               + call_softirq
            - 40.03% ixgbe_change_mtu
                 ixgbe_change_mtu
                 dev_hard_start_xmit
                 sch_direct_xmit
                 dev_queue_xmit
                 vlan_dev_hard_start_xmit
                 hook_func
                 nf_iterate
                 nf_hook_slow
                 NF_HOOK.clone.1
                 ip_rcv
                 __netif_receive_skb
                 __netif_receive_skb
                 netif_receive_skb
                 napi_skb_finish
                 napi_gro_receive
                 ixgbe_clean_rx_irq
                 0xffffffffa033a5
                 net_rx_action
                 __do_softirq
               + call_softirq
      + 35.85% find_iova
      + 19.44% add_unmap


Thanks
WeiGu


-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Thursday, April 07, 2011 2:17 PM
To: Wei Gu
Cc: netdev; Alexander Duyck; Jeff Kirsher
Subject: RE: Question on "net: allocate skbs on local node"

[...]

I hope you understand there is something wrong with your setup?

50,000 pps on a 64-CPU machine is a bad joke.

We can reach 10,000,000+ on a 16-CPU one.



Eric Dumazet
2011-04-07 08:07:30 UTC
On Thursday, April 7, 2011 at 15:22 +0800, Wei Gu wrote:
> Hi guys,
> As I discussed with Eric, I get very low performance on the Linux 2.6.38
> kernel with the Intel ixgbe-3.2.10 driver.
> [...]

What about using the driver as provided in 2.6.38?

No custom module parameters; only play with IRQ affinities.

Say you have 64 queues but want only 8 CPUs (24 -> 31) receiving traffic:

for i in `seq 0 7`
do
	echo 01000000 >/proc/irq/*/eth1-fp-$i/../smp_affinity
done

for i in `seq 8 15`
do
	echo 02000000 >/proc/irq/*/eth1-fp-$i/../smp_affinity
done

...

for i in `seq 56 63`
do
	echo 80000000 >/proc/irq/*/eth1-fp-$i/../smp_affinity
done
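
A more generic variant of the same idea (a sketch; it assumes the queue
IRQs appear in /proc/interrupts with the device name in them, which
varies between drivers):

DEV=eth1; FIRSTCPU=24; NCPUS=8
i=0
for IRQ in $(grep "$DEV-" /proc/interrupts | cut -d: -f1 | tr -d ' ')
do
	# round-robin the queue IRQs over CPUs FIRSTCPU..FIRSTCPU+NCPUS-1
	cpu=$((FIRSTCPU + i % NCPUS))
	printf '%x' $((1 << cpu)) > /proc/irq/$IRQ/smp_affinity
	i=$((i + 1))
done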


Why is ixgbe_change_mtu() seen in your profile?
It's damn expensive, since it must call ixgbe_reinit_locked().

Are you using custom code in the kernel?



Wei Gu
2011-04-07 08:39:20 UTC
I only insert a PREROUTING hook that makes a copy of the incoming packet,
swaps the L2/L3 headers, and sends it back on the same interface.

BTW, sometimes I notice that the perf tool does not map symbols
correctly; I don't know why.

I will try a fresh install of kernel 2.6.30 and do the test with the
shipped ixgbe driver again.


-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Thursday, April 07, 2011 4:08 PM
To: Wei Gu
Cc: netdev; Alexander Duyck; Jeff Kirsher
Subject: Re: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[...]

Why is ixgbe_change_mtu() seen in your profile?
It's damn expensive, since it must call ixgbe_reinit_locked().

Are you using custom code in the kernel?



Eric Dumazet
2011-04-07 09:06:08 UTC
On Thursday, April 7, 2011 at 16:39 +0800, Wei Gu wrote:
> I only insert a PREROUTING hook that makes a copy of the incoming
> packet, swaps the L2/L3 headers, and sends it back on the same
> interface.

Small packets or big ones?

You don't need to copy the packet; it's expensive.


> BTW, sometimes I notice that the perf tool does not map symbols
> correctly; I don't know why.

You might try building ixgbe statically into the kernel, not as a module.

> I will try a fresh install of kernel 2.6.30 and do the test with the
> shipped ixgbe driver again.

OK, thanks.





Wei Gu
2011-04-07 11:15:10 UTC
Hi,
I compiled the ixgbe driver into the kernel and ran the test again, and
also changed the copy to a clone in the fw hook.
This is the perf report while I was forwarding 150 Kpps.
The attached file includes basic info about my test system. Please let
me know if I did something wrong.

+  71.91%       swapper  [kernel.kallsyms]  [k] poll_idle
+  10.43%       swapper  [kernel.kallsyms]  [k] intel_idle
-   8.00%  ksoftirqd/24  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 42.25% alloc_iova
           intel_alloc_iova
           __intel_map_single
           intel_map_page
         - dma_map_single_attrs.clone.3
            + 59.89% ixgbe_alloc_rx_buffers
            - 40.11% ixgbe_xmit_frame_ring
                 ixgbe_xmit_frame
                 dev_hard_start_xmit
                 sch_direct_xmit
                 dev_queue_xmit
                 vlan_dev_hard_start_xmit
                 hook_func
                 nf_iterate
                 nf_hook_slow
                 NF_HOOK.clone.1
                 ip_rcv
                 __netif_receive_skb
                 __netif_receive_skb
                 netif_receive_skb
                 napi_skb_finish
                 napi_gro_receive
                 ixgbe_clean_rx_irq
                 ixgbe_clean_rxtx_many
                 net_rx_action
                 __do_softirq
               + call_softirq
      + 36.30% find_iova
      + 20.89% add_unmap
+   1.60%  kworker/24:1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+   0.80%       swapper  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+   0.66%         snmpd  [kernel.kallsyms]  [k] snmp_fold_field
+   0.53%  ksoftirqd/24  [kernel.kallsyms]  [k] clflush_cache_range


If I zoom in on this ksoftirqd/24:
+ 80.38%  ksoftirqd/24  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+  5.35%  ksoftirqd/24  [kernel.kallsyms]  [k] clflush_cache_range
+  1.49%  ksoftirqd/24  [kernel.kallsyms]  [k] __domain_mapping
+  0.84%  ksoftirqd/24  [kernel.kallsyms]  [k] kmem_cache_alloc
+  0.55%  ksoftirqd/24  [kernel.kallsyms]  [k] _raw_spin_lock
+  0.54%  ksoftirqd/24  [kernel.kallsyms]  [k] ixgbe_xmit_frame_ring
+  0.52%  ksoftirqd/24  [kernel.kallsyms]  [k] ixgbe_clean_rx_irq
+  0.50%  ksoftirqd/24  [kernel.kallsyms]  [k] domain_get_iommu
+  0.49%  ksoftirqd/24  [kernel.kallsyms]  [k] dma_map_single_attrs.clone.3
+  0.48%  ksoftirqd/24  [kernel.kallsyms]  [k] kmem_cache_free

Perf top:

-------------------------------------------------------------------------------
   PerfTop:  10615 irqs/sec  kernel:99.7%  exact: 0.0%  [1000Hz cpu-clock-msecs], (all, 64 CPUs)
-------------------------------------------------------------------------------

 samples   pcnt  function                      DSO
 _______  _____  ____________________________  _________________________________

11786.00  54.9%  intel_idle                    [kernel.kallsyms]
 7180.00  33.4%  _raw_spin_unlock_irqrestore   [kernel.kallsyms]
  469.00   2.2%  clflush_cache_range           [kernel.kallsyms]
  138.00   0.6%  __domain_mapping              [kernel.kallsyms]
   81.00   0.4%  dso__find_symbol              /root/rpmbuild/BUILD/kernel-2.6.38.el6/linux-2.6.38.x86_64/tools/perf/perf
   73.00   0.3%  _raw_spin_lock                [kernel.kallsyms]
   72.00   0.3%  dso__load_sym.clone.0         /root/rpmbuild/BUILD/kernel-2.6.38.el6/linux-2.6.38.x86_64/tools/perf/perf
   68.00   0.3%  kmem_cache_alloc              [kernel.kallsyms]
   53.00   0.2%  symbol_filter                 /root/rpmbuild/BUILD/kernel-2.6.38.el6/linux-2.6.38.x86_64/tools/perf/perf
   51.00   0.2%  domain_get_iommu              [kernel.kallsyms]
   44.00   0.2%  ixgbe_clean_rx_irq            [kernel.kallsyms]
   42.00   0.2%  kmem_cache_free               [kernel.kallsyms]
   42.00   0.2%  ixgbe_xmit_frame_ring         [kernel.kallsyms]
   41.00   0.2%  ixgbe_clean_tx_irq            [kernel.kallsyms]
   40.00   0.2%  dma_map_single_attrs.clone.3  [kernel.kallsyms]


Top:

Tasks: 425 total, 2 running, 423 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 96.0%id, 0.0%wa, 0.0%hi, 3.9%si, 0.0%st
Mem: 264733684k total, 6374016k used, 258359668k free, 43720k buffers
Swap: 4194300k total, 0k used, 4194300k free, 137308k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
79 root 20 0 0 0 0 R 38.8 0.0 29:22.85 24 ksoftirqd/24
233 root 20 0 0 0 0 S 7.6 0.0 4:06.60 24 kworker/24:1
1538 root 20 0 0 0 0 S 0.3 0.0 0:00.78 33 kworker/33:3
2271 root 20 0 200m 5564 1460 S 0.3 0.0 0:03.31 2 snmpd
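
For reference, profiles like the ones above can be collected with
something along these lines (a sketch, not the exact invocation used in
this test):

perf record -a -g sleep 10   # system-wide profile with call graphs for ~10s
perf report                  # browse it, e.g. zoom into ksoftirqd/24
perf top                     # live top-style view across all CPUs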


Thanks
WeiGu

-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Thursday, April 07, 2011 5:06 PM
To: Wei Gu
Cc: netdev; Alexander Duyck; Jeff Kirsher
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

On Thursday, April 7, 2011 at 16:39 +0800, Wei Gu wrote:
> I only insert a PREROUTING hook that makes a copy of the incoming
> packet, swaps the L2/L3 headers, and sends it back on the same
> interface.

Small packets or big ones?

You don't need to copy the packet; it's expensive.


> BTW, sometimes I notice that the perf tool does not map symbols
> correctly; I don't know why.

You might try building ixgbe statically into the kernel, not as a module.

> I will try a fresh install of kernel 2.6.30 and do the test with the
> shipped ixgbe driver again.

OK, thanks.
Eric Dumazet
2011-04-07 11:46:51 UTC
On Thursday, April 7, 2011 at 19:15 +0800, Wei Gu wrote:
> Hi,
> I compiled the ixgbe driver into the kernel and ran the test again, and
> also changed the copy to a clone in the fw hook.
> This is the perf report while I was forwarding 150 Kpps.
> [...]

OK, please send your .config file



Eric Dumazet
2011-04-07 13:41:18 UTC
On Thursday, April 7, 2011 at 13:46 +0200, Eric Dumazet wrote:
> OK, please send your .config file

I suspect you have

CONFIG_DMAR=y



Alexander Duyck
2011-04-07 15:58:59 UTC
On 4/7/2011 4:46 AM, Eric Dumazet wrote:
> On Thursday, April 7, 2011 at 19:15 +0800, Wei Gu wrote:
>> Hi,
>> I compiled the ixgbe driver into the kernel and ran the test again,
>> and also changed the copy to a clone in the fw hook.
>> This is the perf report while I was forwarding 150 Kpps.
>> The attached file includes basic info about my test system. Please let
>> me know if I did something wrong.
>>
>> +  71.91%       swapper  [kernel.kallsyms]  [k] poll_idle
>> +  10.43%       swapper  [kernel.kallsyms]  [k] intel_idle
>> -   8.00%  ksoftirqd/24  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>>    - _raw_spin_unlock_irqrestore
>>       - 42.25% alloc_iova
>>            intel_alloc_iova
>>            __intel_map_single
>>            intel_map_page

I'm almost certain this is the issue here. I am pretty sure the
intel_map_page call indicates that you are running with the Intel IOMMU
enabled. As Eric suggested, you can either rebuild your kernel with
"CONFIG_DMAR=N", or pass the kernel the parameter "intel_iommu=off" in
order to disable it, so that it will instead just use SWIOTLB.
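
A quick way to check for, and disable, the IOMMU at boot (a sketch; the
bootloader config path is an assumption that varies by distro):

# See whether DMAR/IOMMU is active on the running kernel
dmesg | grep -i -e DMAR -e IOMMU
cat /proc/cmdline

# Disable it by appending intel_iommu=off to the kernel command line in
# the bootloader config, e.g. (GRUB legacy, path assumed):
#   kernel /vmlinuz-2.6.38 ro root=/dev/sda1 intel_iommu=off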

Thanks,

Alex

Eric Dumazet
2011-04-07 16:03:38 UTC
On Thursday, April 7, 2011 at 08:58 -0700, Alexander Duyck wrote:
> On 4/7/2011 4:46 AM, Eric Dumazet wrote:
> > [...]
>
> I'm almost certain this is the issue here. I am pretty sure the
> intel_map_page call indicates that you are running with the Intel IOMMU
> enabled. As Eric suggested, you can either rebuild your kernel with
> "CONFIG_DMAR=N", or pass the kernel the parameter "intel_iommu=off" in
> order to disable it, so that it will instead just use SWIOTLB.

What's the purpose of intel_iommu?

Could this be sped up if ixgbe used a per-queue IOMMU context instead of
a per-device one, so that we don't hit a single spinlock?


Alexander Duyck
2011-04-07 16:20:53 UTC
On 4/7/2011 9:03 AM, Eric Dumazet wrote:
> [...]
>
> What's the purpose of intel_iommu?
>
> Could this be sped up if ixgbe used a per-queue IOMMU context instead
> of a per-device one, so that we don't hit a single spinlock?

The intel_iommu is meant to be a security feature. Primarily it is used
in virtualization, where it allows KVM or Xen to directly assign a device
without having to worry about the guest getting access to the host's
physical memory by submitting invalid DMA requests.

If virtualization isn't in use I would recommend turning it off, as it
can have a negative impact, especially on small-packet performance, due
to the extra locking overhead required for DMA map and unmap calls.

Thanks,

Alex

Eric Dumazet
2011-04-07 16:37:07 UTC
On Thursday, April 7, 2011 at 09:20 -0700, Alexander Duyck wrote:

> The intel_iommu is meant to be a security feature. Primarily it is used
> in virtualization, where it allows KVM or Xen to directly assign a
> device without having to worry about the guest getting access to the
> host's physical memory by submitting invalid DMA requests.

I see

> If virtualization isn't in use I would recommend turning it off, as it
> can have a negative impact, especially on small-packet performance, due
> to the extra locking overhead required for DMA map and unmap calls.

Sure, but then, if this thing is ON, shouldn't we copy small packets into
freshly allocated skbs and reuse the old ones in the driver rx handler,
to avoid the expensive DMA calls?

[ The thing called copybreak ]

I understand the tx path cost cannot be avoided, so it would not help
Wei's use case.
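
For context, some Intel drivers already expose copybreak as a tunable;
e.g. the e1000e driver has a "copybreak" module parameter (the check
below is a sketch; ixgbe has no equivalent knob as far as this thread
goes):

modinfo e1000e | grep -i copybreak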

Thanks !


Wei Gu
2011-04-08 08:59:49 UTC
Hi,
In my configuration of Linux 2.6.38 we do have CONFIG_DMAR enabled:
CONFIG_DMAR=y
CONFIG_DMAR_DEFAULT_ON=y
CONFIG_DMAR_FLOPPY_WA=y
CONFIG_INTR_REMAP=y

But DMAR is also enabled by default in the RHEL6 2.6.32 kernel, where we
don't see this problem. The only difference I can see in the config is
"CONFIG_DMAR_DEFAULT_ON":
CONFIG_DMAR=y
# CONFIG_DMAR_DEFAULT_ON is not set
CONFIG_DMAR_FLOPPY_WA=y
CONFIG_INTR_REMAP=y

I could try the same configuration as 2.6.32 for the 2.6.38 kernel by
unsetting CONFIG_DMAR_DEFAULT_ON, and run the test again; hopefully we
can get the test result next Monday :)
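
A quick way to diff the relevant options between the two kernels (a
sketch; the /boot/config-* paths are assumptions depending on how the
kernels were installed):

for cfg in /boot/config-2.6.32* /boot/config-2.6.38*
do
	echo "== $cfg"
	grep -E 'CONFIG_DMAR|CONFIG_INTR_REMAP' "$cfg"
done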

Thanks
WeiGu

-----Original Message-----
From: Alexander Duyck [mailto:***@intel.com]
Sent: Friday, April 08, 2011 12:21 AM
To: Eric Dumazet
Cc: Wei Gu; netdev; Kirsher, Jeffrey T
Subject: Re: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[...]

Eric Dumazet
2011-04-08 09:07:30 UTC
On Friday, April 8, 2011 at 16:59 +0800, Wei Gu wrote:
> Hi,
> In my configuration of Linux 2.6.38 we do have CONFIG_DMAR enabled:
> CONFIG_DMAR=y
> CONFIG_DMAR_DEFAULT_ON=y
> [...]
>
> I could try the same configuration as 2.6.32 for the 2.6.38 kernel by
> unsetting CONFIG_DMAR_DEFAULT_ON, and run the test again; hopefully we
> can get the test result next Monday :)

Just say no to CONFIG_DMAR; end of problems, no need to try various
knobs and find the solution next month ;)

Unless you have a reason to use it.



Wei Gu
2011-04-08 09:15:51 UTC
Yeah, you are right; I will try CONFIG_DMAR off right now :)

If you guys know the relation between CONFIG_DMAR and
CONFIG_DMAR_DEFAULT_ON, please let me know :)

Thanks
WeiGu

-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Friday, April 08, 2011 5:08 PM
To: Wei Gu
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[...]

Just say no to CONFIG_DMAR; end of problems, no need to try various
knobs and find the solution next month ;)

Unless you have a reason to use it.



Eric Dumazet
2011-04-08 09:49:18 UTC
On Friday, April 8, 2011 at 17:15 +0800, Wei Gu wrote:
> Yeah, you are right; I will try CONFIG_DMAR off right now :)
>
> If you guys know the relation between CONFIG_DMAR and
> CONFIG_DMAR_DEFAULT_ON, please let me know :)


CONFIG_DMAR_DEFAULT_ON is ON by default since linux-2.6.29 and commit

commit f6be37fdc62d0c0214bc49815d1180ebfbd716e2
Author: Kyle McMartin <***@redhat.com>
Date:   Thu Feb 26 12:57:56 2009 -0500

    x86: enable DMAR by default

    Now that the obvious bugs have been worked out, specifically
    the iwlagn issue, and the write buffer errata, DMAR should be safe
    to turn back on by default. (We've had it on since those patches were
    first written a few weeks ago, without any noticeable bug reports
    (most have been due to the dma-api debug patchset.))

    Signed-off-by: Kyle McMartin <***@redhat.com>
    Acked-by: David Woodhouse <***@intel.com>
    Signed-off-by: Ingo Molnar <***@elte.hu>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9c39095..bc2fbad 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1803,7 +1803,7 @@ config DMAR
 	  remapping devices.
 
 config DMAR_DEFAULT_ON
-	def_bool n
+	def_bool y
 	prompt "Enable DMA Remapping Devices by default"
 	depends on DMAR
 	help

git describe --contains f6be37fd
v2.6.29-rc7~24^2


But the .config used to build your 2.6.32 kernel had it set to OFF.



Wei Gu
2011-04-08 09:59:46 UTC
Yeah, I guess the Red Hat guys made this config, since I chose a
no-virtualization setup of RHEL6.

If DMAR_DEFAULT_ON is set from 2.6.29 on, then I guess there will be a
big performance degradation, at least for the Intel 10GE NIC.

I don't know if this kind of performance degradation is a fault, or
whether it just meets the design expectation?

@Alexander, I guess you know much about this?

Thanks
WeiGu

-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Friday, April 08, 2011 5:49 PM
To: Wei Gu
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[...]

CONFIG_DMAR_DEFAULT_ON is ON by default since linux-2.6.29 and commit
f6be37fdc62d0c0214bc49815d1180ebfbd716e2 ("x86: enable DMAR by default").

But the .config used to build your 2.6.32 kernel had it set to OFF.



Wei Gu
2011-04-08 09:41:25 UTC
Hi,
I just tried with CONFIG_DMAR off and CONFIG_DMAR_DEFAULT_ON not set.
I get pretty good performance (1.5 Mpps of 400-byte packets, Rx&Tx) with
the shipped ixgbe driver.

So I think this is the root cause of the low performance I got
previously.

Would it be better to put some notes in Intel-IOMMU.txt, or somewhere
like a Q&A, about "CONFIG_DMAR" in a no-virtualization setup? :)

I will keep testing the larger setup by enabling the remaining 3 NICs,
to see how much we can get out of this 4-socket machine.

BTW, even though we gained some performance with the shipped ixgbe
driver, it doesn't seem as scalable as the Intel official release 3.2.10
driver. I don't know if that is a problem?

Thanks
WeiGu

-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Friday, April 08, 2011 5:08 PM
To: Wei Gu
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[...]


Wei Gu
2011-04-08 12:19:41 UTC
Hi again,
I tried more testing with CONFIG_DMAR disabled, with the shipped 2.6.38
ixgbe driver and the Intel releases 3.2.10/3.1.15.
In all these tests we can get >1 Mpps of 400-byte packets, but it is not
stable at all; there are huge numbers of missed errors while the CPUs
are 100% idle:
ethtool -S eth10 | grep rx_missed_errors

rx_missed_errors: 76832040

SUM: 1102212 ETH8: 0 ETH10: 1102212 ETH6: 0 ETH4: 0
SUM:  521841 ETH8: 0 ETH10:  521841 ETH6: 0 ETH4: 0
SUM:  426776 ETH8: 0 ETH10:  426776 ETH6: 0 ETH4: 0
SUM:  927520 ETH8: 0 ETH10:  927520 ETH6: 0 ETH4: 0
SUM: 1171995 ETH8: 0 ETH10: 1171995 ETH6: 0 ETH4: 0
SUM:  855980 ETH8: 0 ETH10:  855980 ETH6: 0 ETH4: 0
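
Per-interval counts like the SUM lines above can be sampled with a small
loop along these lines (a sketch; only eth10 shown, interface names
assumed):

prev=$(ethtool -S eth10 | awk '/rx_missed_errors/ {print $2}')
while sleep 1
do
	cur=$(ethtool -S eth10 | awk '/rx_missed_errors/ {print $2}')
	echo "eth10 missed/s: $((cur - prev))"
	prev=$cur
done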


Do you know if there are other options in the kernel that would cause a
high rate of rx_missed_errors with low CPU usage? (No problem on 2.6.32
with the same test case.)

perf record:
+  69.74%  swapper  [kernel.kallsyms]  [k] poll_idle
+  11.62%  swapper  [kernel.kallsyms]  [k] intel_idle
+   0.80%  swapper  [ixgbe]            [k] ixgbe_poll
+   0.79%  perf     [ixgbe]            [k] ixgbe_poll
+   0.77%  perf     [kernel.kallsyms]  [k] skb_copy_bits
+   0.64%  swapper  [kernel.kallsyms]  [k] skb_copy_bits
+   0.48%  perf     [kernel.kallsyms]  [k] __kmalloc_node_track_caller
+   0.44%  swapper  [kernel.kallsyms]  [k] __kmalloc_node_track_caller
+   0.36%  swapper  [kernel.kallsyms]  [k] kmem_cache_alloc_node
+   0.35%  swapper  [kernel.kallsyms]  [k] kfree
+   0.35%  perf     [kernel.kallsyms]  [k] kmem_cache_alloc_node


-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Friday, April 08, 2011 5:08 PM
To: Wei Gu
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[...]



Eric Dumazet
2011-04-08 12:56:40 UTC
On Friday, April 8, 2011 at 20:19 +0800, Wei Gu wrote:
> Hi again,
> I tried more testing with CONFIG_DMAR disabled, with the shipped 2.6.38
> ixgbe driver and the Intel releases 3.2.10/3.1.15.
> In all these tests we can get >1 Mpps of 400-byte packets, but it is
> not stable at all; there are huge numbers of missed errors while the
> CPUs are 100% idle:
> [...]
>
> Do you know if there are other options in the kernel that would cause
> a high rate of rx_missed_errors with low CPU usage? (No problem on
> 2.6.32 with the same test case.)


Make sure enough CPUs serve interrupts _before_ even starting your
stress test.

Then, make sure traffic is distributed to many different queues.
If a single flow is used, it probably uses a single queue -> single CPU.

Say you have irq affinities set to ffffffffffffffff (all CPUs able to
serve IRQs X, Y, Z, T, ...).

Then you have a network burst (because you start your packet generator
at full rate), spread over many queues.

CPU0 takes the hard interrupt for queue 0, eth8, and queues NAPI mode.
CPU0 takes the hard interrupt for queue 0, eth10, and queues NAPI mode.
CPU0 takes the hard interrupt for queue 1, eth8, and queues NAPI mode.
CPU0 takes the hard interrupt for queue 1, eth10, and queues NAPI mode.
CPU0 takes the hard interrupt for queue 2, eth8, and queues NAPI mode.
CPU0 takes the hard interrupt for queue 2, eth10, and queues NAPI mode.
...
CPU0 takes the hard interrupt for queue X, eth8, and queues NAPI mode.
...

Then softirq can start, and only CPU0 is able to handle NAPI for all the
queued devices. You are stuck, with CPU0 never leaving ksoftirqd.

NAPI handling is always performed on the CPU that received the hardware
interrupt, until we exit NAPI (and re-arm interrupt delivery).
It cannot migrate to an "idle cpu".
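
One way to check this before a test run is to watch how the queue
interrupts distribute over CPUs (a sketch; eth10 naming assumed):

# Per-CPU interrupt counts; each eth10 queue row should grow on exactly
# one of the intended CPUs (24-31 here)
grep eth10 /proc/interrupts

# Confirm the affinity masks actually took effect
for IRQ in $(grep eth10 /proc/interrupts | cut -d: -f1 | tr -d ' ')
do
	echo -n "irq $IRQ -> "; cat /proc/irq/$IRQ/smp_affinity
done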


Wei Gu
2011-04-08 14:10:50 UTC
Hi,
Got your meaning.
But as I described before, I start eth10 with 8 rx queues and 8 tx
queues, and then I bind these 8 tx&rx queues each to CPU cores 24-31
(NUMA node 3), which I think should give the best performance in my case
(it's true on Linux 2.6.32): single queue -> single CPU.
Then I can describe the packet generator a little: I configure the IXIA
to continuously increase the destination IP address towards the test
server, so the packets are evenly distributed over the receive queues of
eth10. And according to the IXIA tools the transmit shape was really
good, without too many peaks.

What I observed on Linux 2.6.38 during the test: no ksoftirqd was
stressed (< 3% SI on each core (24-31)) while the packet loss happened,
so we are not really stressing the CPUs :). It looks like we are limited
by some memory bandwidth (DMA) in this release.

And with the same test case on 2.6.32, no such problem at all. It runs
pretty stably at > 2 Mpps without rx_missed_errors. There is no HW
limitation on this DL580.


BTW what are these "swapper" entries?
+  0.80%  swapper  [ixgbe]  [k] ixgbe_poll
+  0.79%  perf     [ixgbe]  [k] ixgbe_poll
Why is ixgbe_poll attributed to swapper/perf?

Thanks
WeiGu

-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Friday, April 08, 2011 8:57 PM
To: Wei Gu
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

Le vendredi 08 avril 2011 =E0 20:19 +0800, Wei Gu a =E9crit :
> Hi again,
> I tried more testing with by disable this CONFIG_DMAR with shipped
> 2.6.38 ixgbe and Intel released 3.2.10/3.1.15.
> All these test looks we can get >1Mpps 400bype packtes but not stable
> at all, there will huge number missing errors with 100% CPU IDLE:
> ethtool -S eth10 |grep rx_missed_errors
>
> rx_missed_errors: 76832040
>
> SUM: 1102212 ETH8: 0 ETH10: 1102212 ETH6: 0 ETH4: 0
> SUM: 521841 ETH8: 0 ETH10: 521841 ETH6: 0 ETH4: 0
> SUM: 426776 ETH8: 0 ETH10: 426776 ETH6: 0 ETH4: 0
> SUM: 927520 ETH8: 0 ETH10: 927520 ETH6: 0 ETH4: 0
> SUM: 1171995 ETH8: 0 ETH10: 1171995 ETH6: 0 ETH4: 0
> SUM: 855980 ETH8: 0 ETH10: 855980 ETH6: 0 ETH4: 0
>
>
> Do you know if there is other options in the kernel will cause high
> rate rx_missed_errors with low CPU usage. (No problem on 2.6.32 with
> same test case)
>
> perf record:
> + 69.74% swapper [kernel.kallsyms] [k] poll_id=
le
> + 11.62% swapper [kernel.kallsyms] [k] intel_i=
dle
> + 0.80% swapper [ixgbe] [k] ixgbe_p=
oll
> + 0.79% perf [ixgbe] [k] ixgbe_p=
oll
> + 0.77% perf [kernel.kallsyms] [k] skb_cop=
y_bits
> + 0.64% swapper [kernel.kallsyms] [k] skb_cop=
y_bits
> + 0.48% perf [kernel.kallsyms] [k] __kmall=
oc_node_track_caller
> + 0.44% swapper [kernel.kallsyms] [k] __kmall=
oc_node_track_caller
> + 0.36% swapper [kernel.kallsyms] [k] kmem_ca=
che_alloc_node
> + 0.35% swapper [kernel.kallsyms] [k] kfree
> + 0.35% perf [kernel.kallsyms] [k] kmem_ca=
che_alloc_node
>


[snip]


Stephen Hemminger
2011-04-08 14:49:02 UTC
Permalink
On Fri, 8 Apr 2011 22:10:50 +0800
Wei Gu <***@ericsson.com> wrote:

> [snip]

For performance, you need to assign each network interrupt to a single
CPU. There is no load balancing effect in the IRQ controller.

If you have a multi-socket system, then it is a good idea to make the
IRQs for the NICs be on the same socket as the bus interface.
Multi-socket systems are really NUMA, and putting an IRQ on a non-local
CPU has a measurable impact.
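Which socket a NIC is local to can be read straight out of sysfs; a
quick check (paths as exposed by current kernels) is:

  # NUMA node of the PCI function behind eth10 (-1 means unknown)
  cat /sys/class/net/eth10/device/numa_node
  # cpus the kernel considers local to that device
  cat /sys/class/net/eth10/device/local_cpulist

The per-queue affinity masks can then be built from that cpulist rather
than guessed.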



Wei Gu
2011-04-09 03:51:23 UTC
Permalink
Hi Stephen,
Thanks for your reply.
I am aware that a local bus / local CPU socket would gain +20%
performance; that is why I put the eth10 on cores 24-31 (socket 3) and
NUMA nodes 2/3.

Thanks
WeiGu

-----Original Message-----
From: Stephen Hemminger [mailto:***@vyatta.com]
Sent: Friday, April 08, 2011 10:49 PM
To: Wei Gu
Cc: Eric Dumazet; Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: Re: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]



Eric Dumazet
2011-04-08 15:07:03 UTC
Permalink
Le vendredi 08 avril 2011 à 22:10 +0800, Wei Gu a écrit :
> Hi,
> Got your meaning.
> But as I described before, I start the eth10 with 8 rx queues and 8 tx
> queues, and then I bind these 8 tx&rx queues each to CPU cores 24-31
> (NUMA3), which I think could gain the best performance in my case
> (It's true on Linux 2.6.32)
> single queue -> single CPU

Try with other cpus ? Maybe a mix.

Maybe your thinking is not good, and you chose cpus that were not
the best candidates. This was OK in 2.6.32 because you were lucky.

Using cpus from a single NUMA node is not very good, since only one
NUMA node is going to be used, and the other NUMA nodes are idle.


NUMA binding is tricky. Linux tries to use the local node, hoping that
all cpus are running and using local memory. In the end, global
throughput is better.

But if your workload uses cpus from one single node, then it means you
lose part of the memory bandwidth.
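Whether a single node's memory is carrying the whole load can be sampled
from the per-node allocator counters; a rough check, assuming the
numactl tools are installed, is:

  # numa_hit deltas growing on one node only suggest every
  # rx buffer is being allocated from that node
  numastat; sleep 10; numastat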


> Then I can describe a little bit with the packet generator: I config
> the IXIA to continuously increase the dest ip address towards the
> test server, so the packets are evenly distributed to each receiving
> queue of the eth10. And according to the IXIA tools the transmit
> shape was really good, not too many peaks
>
> What I observed on Linux 2.6.38 during the test: no ksoftirqd was
> stressed (< 3% SI on each core (24-31)) while the packet loss
> happens, so we are not really stressing the CPU :). It looks like we
> are limited on some memory bandwidth (DMA) on this release

That would mean you chose the wrong cpus to handle this load.


> And with the same test case on 2.6.32, no such problem at all. It
> runs pretty stable > 2Mpps without rx_missed_errors. There is no HW
> limitation on this DL580
>
>
> BTW, what are these "swapper" entries?
> + 0.80% swapper [ixgbe] [k] ixgbe_poll
> + 0.79% perf [ixgbe] [k] ixgbe_poll
> Why is ixgbe_poll under swapper/perf?
>

softirqs are run on behalf of the current interrupted thread, unless
you enter ksoftirqd if the load is high.

It can be the "idle task" or the "perf" task, or other ones...



Wei Gu
2011-04-09 03:27:03 UTC
Permalink
HI Eric,
If I try to bind the 8 tx&rx queues to different NUMA nodes (cores
3,7,11,15,19,23,27,31), it doesn't help on the rx_missed_errors anymore.

I still think the best performance would be binding the NIC to one CPU
socket with its local memory node.
I did a lot of combinations on the 2.6.32 kernel; binding the eth10 to
NODE2/3 could gain 20% more performance compared to NODE0/1.
So I guess CPU sockets 2&3 are local to the eth10.

Thanks
WeiGu

-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Friday, April 08, 2011 11:07 PM
To: Wei Gu
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]



Eric Dumazet
2011-04-09 06:36:38 UTC
Permalink
Le samedi 09 avril 2011 à 11:27 +0800, Wei Gu a écrit :
> HI Eric,
> If I try to bind the 8 tx&rx queues to different NUMA nodes (cores
> 3,7,11,15,19,23,27,31), it doesn't help on the rx_missed_errors
> anymore.
>
> I still think the best performance would be binding the NIC to one
> CPU socket with its local memory node.
> I did a lot of combinations on the 2.6.32 kernel; binding the eth10
> to NODE2/3 could gain 20% more performance compared to NODE0/1.
> So I guess CPU sockets 2&3 are local to the eth10.
>

Ideally, you would need to split memory loads on several nodes, because
you have a workload on a single NIC, located on a given node Nx.


1) Let the buffers where the NIC performs DMA be on Nx,
so that DMA is fast.

2) And everything else on other nodes, so that cpus can steal some
memory bandwidth from other nodes, and free Nx memory bandwidth for NIC
use. (Processors only need to fetch the first cache line of packets to
perform the routing decision)

alloc_skb() would need to use memory from node Ny for "struct sk_buff",
and memory from node Nx for "skb->data" and skb frags
[ netdev_alloc_page() in ixgbe's case ]

In your case, you have 4 nodes, so Ny would be in a set of 3 nodes.

So commit 564824b0c52c34692d804b would need a little tweak in your
case [ where your cpus need to bring only one cache line from the packet
payload ]

Please try the following patch :



 include/linux/skbuff.h |   14 +-------------
 net/core/skbuff.c      |   19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d0ae90a..b43626d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1567,19 +1567,7 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return skb;
 }
 
-/**
- * __netdev_alloc_page - allocate a page for ps-rx on a specific device
- * @dev: network device to receive on
- * @gfp_mask: alloc_pages_node mask
- *
- * Allocate a new page. dev currently unused.
- *
- * %NULL is returned if there is no free memory.
- */
-static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
-{
-	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
-}
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
 
 /**
  * netdev_alloc_page - allocate a page for ps-rx on a specific device
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7ebeed0..877797e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -259,6 +259,25 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+/**
+ * __netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @dev: network device to receive on
+ * @gfp_mask: alloc_pages_node mask
+ *
+ * Allocate a new page. dev currently unused.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : NUMA_NO_NODE;
+	struct page *page;
+
+	page = alloc_pages_node(node, gfp_mask, 0);
+	return page;
+}
+EXPORT_SYMBOL(__netdev_alloc_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size)
 {


Wei Gu
2011-04-10 07:02:53 UTC
Permalink
Hi,
I haven't enabled packet header splitting, so I think no
netdev_alloc_page() will be called in my case.

BTW, I also did the same test from the 2.6.33 to 2.6.35 kernels; it
looks like the problem happens from 2.6.35, because 2.6.32/33/34 do not
see this problem at all, they all work pretty well.

I modified the .configs based on FC13/14, and also manually set the DMAR
default to off and chose SLAB as the memory allocator (same as RHEL6
2.6.32). For more detail about the config, please check the attached
file.

Thanks
WeiGu

-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Saturday, April 09, 2011 2:37 PM
To: Wei Gu
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]
Alexander Duyck
2011-04-11 14:50:03 UTC
Permalink
On 4/10/2011 12:02 AM, Wei Gu wrote:
> Hi, I haven't enabled packet header splitting, so I think no
> netdev_alloc_page() will be called in my case.
>
> BTW, I also did the same test from the 2.6.33 to 2.6.35 kernels; it
> looks like the problem happens from 2.6.35, because 2.6.32/33/34 do
> not see this problem at all, they all work pretty well.
>
> I modified the .configs based on FC13/14, and also manually set the
> DMAR default to off and chose SLAB as the memory allocator (same as
> RHEL6 2.6.32). For more detail about the config, please check the
> attached file.
>
> Thanks WeiGu

WeiGu,

In order to try and isolate this issue I was wondering if you could try
our out-of-tree ixgbe driver from e1000.sf.net? The idea would be to
test it on 2.6.34 and 2.6.35 in order to determine if the issue is due
to a change in the kernel or a change in the driver. If the performance
is the same on these two kernels with our out-of-tree driver then we
know the issue is a change somewhere in the ixgbe driver in the kernel.

Thanks,

Alex

Wei Gu
2011-04-11 15:00:49 UTC
Permalink
I will try the latest driver, 3.3.8, on e1000.sf.net :)

But I also tested ixgbe-3.2.10 on kernels 2.6.35.1 and 2.6.35.2. The
problem starts from 2.6.35.2; no problem at all with the 2.6.35.1
kernel.
I don't know if some patch between 2.6.35.1 and 2.6.35.2 had some impact
on this Intel 10GE ixgbe driver.

Thanks
WeiGu
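The standard way to pin a regression like this to a single change
between two stable releases is git bisect; a sketch, with the tags taken
from the versions named above:

  git bisect start
  git bisect bad  v2.6.35.2     # first kernel showing rx_missed_errors
  git bisect good v2.6.35.1     # last known-good kernel
  # build and boot the kernel git checks out, rerun the IXIA test, then:
  git bisect good               # or: git bisect bad

Since a stable point release only carries a handful of patches, simply
reading git log v2.6.35.1..v2.6.35.2 can be even quicker, which is what
happens below.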

-----Original Message-----
From: Alexander Duyck [mailto:***@intel.com]
Sent: Monday, April 11, 2011 10:50 PM
To: Wei Gu
Cc: Eric Dumazet; netdev; Kirsher, Jeffrey T
Subject: Re: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]

Wei Gu
2011-04-11 15:14:44 UTC
Permalink
I tried ixgbe-3.3.8 (insmod ixgbe.ko RSS=8,8,8,8,8,8,8,8
FdirMode=0,0,0,0,0,0,0,0 Node=0,0,1,1,2,2,3,3) from e1000.sf.net both on
2.6.35.1 and 2.6.35.2; same observation as with the 3.2.10 ixgbe driver.
On 2.6.35.2 it has high rx errors:
ethtool -S eth10 | grep error
rx_errors: 0
tx_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 2263088
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_csum_offload_errors: 0
fcoe_last_errors: 0


-----Original Message-----
From: Wei Gu
Sent: Monday, April 11, 2011 11:01 PM
To: 'Alexander Duyck'
Cc: Eric Dumazet; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]

Eric Dumazet
2011-04-11 15:42:57 UTC
Permalink
Le lundi 11 avril 2011 à 23:14 +0800, Wei Gu a écrit :
> [snip]

It would be nice if you post perf record / perf report results.

During your stress, do

perf record -a -g sleep 10
perf report

And post the "top offenders"

Thanks


Wei Gu
2011-04-12 01:22:59 UTC
Permalink
I was not stressing the NIC/CPU, since I only send 290Kpps of 400byte
packets towards eth10. The CPU load is almost 100% idle.

BTW, there are some problems with the perf tool on 2.6.35.2. I will try
to get you the top offenders if possible.

Thanks
WeiGu

-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Monday, April 11, 2011 11:43 PM
To: Wei Gu
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]


Wei Gu
2011-04-12 04:40:06 UTC
Permalink
Hi,
I found the problem was introduced by this revert patch: "2010-08-13
Peter Zijlstra   sched: Revert nohz_ratelimit() for now"

I tried removing this patch from 2.6.35.2 and then built the kernel
again; then the ixgbe driver looks to work fine.
I don't know why reverting nohz_ratelimit() causes the problem on the
ixgbe driver, since nohz_ratelimit was first introduced on 2010-03-11,
and before that the 2.6.32 kernel also doesn't have this problem with
the ixgbe driver.


Some log from git:
=======================================================================
2.6.35.2
2010-08-13  Peter Zijlstra   sched: Revert nohz_ratelimit() for now
2.6.35.1
2010-08-01  Linus Torvalds   Linux 2.6.35  v2.6.35
2010-06-17  Peter Zijlstra   nohz: Fix nohz ratelimit
2.6.35-rc3
2010-03-11  Mike Galbraith   sched: Rate-limit nohz

Thanks
WeiGu

-----Original Message-----
From: Wei Gu
Sent: Tuesday, April 12, 2011 9:23 AM
To: 'Eric Dumazet'
Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]


Eric Dumazet
2011-04-12 04:56:41 UTC
Permalink
Le mardi 12 avril 2011 à 12:40 +0800, Wei Gu a écrit :
> [snip]

Hmm...

Could you try to add "processor.max_cstate=1" to the boot parameters ?



Wei Gu
2011-04-12 05:18:44 UTC
Permalink
Hi,
It doesn't look any better by passing this param to the kernel:

kernel /vmlinuz-2.6.35.2 ro root=UUID=e96f9df8-c28a-4ea8-ac26-64fbf948bce2 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.iso88591 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=sv-latin1 crashkernel=auto pci=bfsort rhgb quiet console=tty0 console=ttyS0,115200 processor.max_cstate=1


-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Tuesday, April 12, 2011 12:57 PM
To: Wei Gu
Cc: Alexander Duyck; Peter Zijlstra; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]



Wei Gu
2011-04-14 05:42:10 UTC
Permalink
Hi guys,
Do you think it was a bug in the kernel from 2.6.35.2 with the Intel
10GE ixgbe driver?
If so, shall I file a bug on bugzilla, and in which category? Because
I'm not sure whether it is a driver problem or a sched problem.

Thanks
WeiGu

-----Original Message-----
From: Wei Gu
Sent: Tuesday, April 12, 2011 12:40 PM
To: 'Eric Dumazet'; 'Alexander Duyck'; 'Peter Zijlstra'
Cc: 'netdev'; 'Kirsher, Jeffrey T'
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]


Eric Dumazet
2011-04-14 06:07:31 UTC
Permalink
Le jeudi 14 avril 2011 à 13:42 +0800, Wei Gu a écrit :
> Hi guys,
> Do you think it was a bug in the kernel from 2.6.35.2 with the Intel
> 10GE ixgbe driver?
> If so, shall I file a bug on bugzilla, and in which category? Because
> I'm not sure whether it is a driver problem or a sched problem.

This makes no sense to me.

What is the maximum throughput you can get in pps before having packet
drops ?

Please try with a single flow (to hit one queue, one cpu)

Thanks

Eric Dumazet
2011-04-14 06:33:30 UTC
Permalink
Le jeudi 14 avril 2011 à 08:07 +0200, Eric Dumazet a écrit :
> [snip]

Also, please try to check if using smaller or bigger packets makes any
change in this max throughput


Wei Gu
2011-04-14 06:58:38 UTC
Permalink
I did the single flow test; it shows no rx error with 300kpps. When I
start multiple flows with the same 300Kpps traffic, then it looks really
bad, with high rx_missed_errors.

Multiple Flow:
SUM: 191925 ETH8: 0 ETH10: 191925 ETH6: 0 ETH4: 0
SUM: 214634 ETH8: 0 ETH10: 214634 ETH6: 0 ETH4: 0
SUM: 237600 ETH8: 0 ETH10: 237600 ETH6: 0 ETH4: 0
SUM: 198925 ETH8: 0 ETH10: 198925 ETH6: 0 ETH4: 0
SUM: 249290 ETH8: 0 ETH10: 249290 ETH6: 0 ETH4: 0

Single Flow:
SUM: 302018 ETH8: 0 ETH10: 302018 ETH6: 0 ETH4: 0
SUM: 301849 ETH8: 0 ETH10: 301849 ETH6: 0 ETH4: 0
SUM: 302163 ETH8: 0 ETH10: 302163 ETH6: 0 ETH4: 0

Thanks
WeiGu
-----Original Message-----
From: Eric Dumazet [mailto:***@gmail.com]
Sent: Thursday, April 14, 2011 2:34 PM
To: Wei Gu
Cc: Alexander Duyck; Peter Zijlstra; netdev; Kirsher, Jeffrey T
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

[snip]


Alexander Duyck
2011-04-14 16:42:25 UTC
Permalink
On 4/13/2011 11:58 PM, Wei Gu wrote:
> [snip]

The only issue I have found so far with the ixgbe driver is the fact
that rx_no_buffer_count is apparently always going to be 0 on 82599, and
that isn't so much a driver problem as a hardware limitation, as the HW
counter was removed in 82599. However, since the hardware was capable of
going faster on the other kernels, what this likely means is that the
rx_missed_errors are due to the driver not providing Rx buffers fast
enough.

I'm doing some more digging into this now. One thought that occurred to
me is that if the patch you mention is having some sort of effect, this
could be a sign of perhaps a kernel timer or scheduling problem.

Thanks,

Alex
Eric Dumazet
2011-04-14 16:45:24 UTC
Permalink
Le jeudi 14 avril 2011 à 09:42 -0700, Alexander Duyck a écrit :
> [snip]

We could try to instrument the delay between hardware IRQ and the napi
handler being called.
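One way to instrument that without patching the driver, assuming the irq
and napi tracepoints are available on this kernel and debugfs is
mounted, is ftrace:

  cd /sys/kernel/debug/tracing
  echo 'name ~ "eth10*"' > events/irq/irq_handler_entry/filter
  echo 1 > events/irq/irq_handler_entry/enable
  echo 1 > events/napi/napi_poll/enable
  echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
  # the gap between an eth10 irq_handler_entry timestamp and the
  # matching napi_poll line is the irq-to-napi latency
  head -n 100 trace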



Peter Zijlstra
2011-04-14 16:56:42 UTC
Permalink
On Thu, 2011-04-14 at 09:42 -0700, Alexander Duyck wrote:

> I'm doing some more digging into this now. One thought that occurred to
> me is that if the patch you mention is having some sort of effect this
> could be a sign of perhaps a kernel timer or scheduling problem.

Right, so the removal of the NO_HZ throttle will allow the CPU to go
into C states more often; this could result in longer wake-up times for
IRQs.

We reverted because:
 - it caused significant battery drain due to not going into C states
   often enough, and
 - it's a much better idea to implement these things in the idle
   governor since it already has the job of guesstimating the idle
   duration.

I really can't remember back far enough to even come up with a theory of
why kernels prior to merging the NO_HZ throttle would not exhibit this
problem.



Eric Dumazet
2011-04-14 16:57:01 UTC
Permalink
Le jeudi 14 avril 2011 à 18:56 +0200, Peter Zijlstra a écrit :
> [snip]

Normally, Wei Gu already asked to not use C states.

http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf

How can we/he check this ?
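One way to check from the running system (sysfs paths as they exist on
recent kernels; older ones may differ):

  # active cpuidle driver; empty means no C-state control from the kernel
  cat /sys/devices/system/cpu/cpuidle/current_driver
  # which C states core 24 may enter, and how often it has entered them
  grep . /sys/devices/system/cpu/cpu24/cpuidle/state*/name
  grep . /sys/devices/system/cpu/cpu24/cpuidle/state*/usage

If the deeper states show growing usage counts during the test, the BIOS
setting did not take effect.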



Eric Dumazet
2011-04-14 17:49:16 UTC
Permalink
Le jeudi 14 avril 2011 à 18:57 +0200, Eric Dumazet a écrit :
> [snip]

Anyway, this could explain a latency problem, not packet drops.

With NAPI, we should get few hardware irqs under load.

Once softirq has started, the scheduler is out of the equation.



Alexander Duyck
2011-04-14 19:08:59 UTC
Permalink
On 4/14/2011 10:49 AM, Eric Dumazet wrote:
> [snip]

The problem is on these newer systems it is becoming significantly
harder to get locked into the polling-only state. In many cases we will
just complete all of the RX work in a single poll and go back to
interrupts. This is especially true when traffic is spread out across
multiple queues and CPUs.

I'm thinking that maybe powertop results from before that patch and
after that patch should be pretty telling. It should tell us if C states
are active, and if so it will also tell us if we are being woken by
interrupts or if we are staying in the polling state.

Thanks,

Alex
Wei Gu
2011-04-15 02:10:54 UTC
Permalink
Hi Eric,
I tried another parameter, intel_idle.max_cstate=0, with 2.6.35.3.
This setting makes things a little better: fewer rx_errors and much more
stable throughput.
However, I haven't found this kernel parameter in the kernel
documentation, but surprisingly it still works somehow :) (perhaps
because the intel_idle driver ignores processor.max_cstate, so only its
own module parameter can limit it).

I will check with the HP guys to see whether I need extra BIOS
configuration to disable more ACPI-related features on the DL580.

Thanks
WeiGu

Peter Zijlstra
2011-04-15 08:57:37 UTC
Permalink
On Thu, 2011-04-14 at 18:57 +0200, Eric Dumazet wrote:
> Normally, Wei Gu has already configured the system not to use C states.
>
> http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf
>
> How can we/he check this ?

Not quite sure, I think powertop has something that measures C-state
usage, but if the BIOS is lying to us (pretty good bet since he's using
HP crap^W) there's nothing much we can do about that.
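
Independent of powertop, the cpuidle sysfs counters answer the same
question; a minimal userspace sketch (assuming CONFIG_CPU_IDLE and that
these sysfs paths exist on the running kernel):

#include <stdio.h>
#include <glob.h>

/*
 * Print how often each C-state on cpu0 has been entered, from
 * /sys/devices/system/cpu/cpu0/cpuidle/state<N>/{name,usage}.
 */
int main(void)
{
	glob_t g;
	size_t i;

	if (glob("/sys/devices/system/cpu/cpu0/cpuidle/state*", 0, NULL, &g))
		return 1;
	for (i = 0; i < g.gl_pathc; i++) {
		char path[256], name[64] = "?", usage[64] = "?";
		FILE *f;

		snprintf(path, sizeof(path), "%s/name", g.gl_pathv[i]);
		if ((f = fopen(path, "r"))) {
			fscanf(f, "%63s", name);
			fclose(f);
		}
		snprintf(path, sizeof(path), "%s/usage", g.gl_pathv[i]);
		if ((f = fopen(path, "r"))) {
			fscanf(f, "%63s", usage);
			fclose(f);
		}
		printf("%s: entered %s times\n", name, usage);
	}
	globfree(&g);
	return 0;
}

If the counters for the deeper states keep climbing under load, the
BIOS setting is not actually being honored.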

Another thing that could be causing pain in NO_HZ transitions is if
we're having to switch to timer broadcast mode when we go into NO_HZ,
I'm not sure if HP systems are affected by this, nor am I exactly sure
which DL580 he's got.


Wei Gu
2011-04-15 09:14:46 UTC
Permalink
Hi,

Is there something that I can provide in order to identify the problem?

Jesse Brandeburg
2011-04-18 21:12:27 UTC
Permalink
On Fri, Apr 15, 2011 at 2:14 AM, Wei Gu <***@ericsson.com> wrote:
> Is there something that I can provide in order to identify the problem?

For power-state concerns you may want to try running turbostat
(available in recent kernel trees; it also runs on older kernels) during
the workload in question. Capture the results via something like:
cd /home/jbrandeb/linux-2.6.38.2/tools/power/x86/turbostat
make
for i in `seq 1 10` ; do ./turbostat -v sleep 5 >> turbostat.txt 2>&1 ; done

Jesse
Wei Gu
2011-04-19 04:09:01 UTC
Permalink
Hi,
Here is the result of running turbostat on 2.6.35.3 while performing the
same load test on eth10 (8 tx/rx queues bound to cores 24-31), per your
instructions.
I added processor.max_cstate=1 to the boot parameters and also disabled
C-states in the BIOS, but it looks like the kernel does not honor them.

Thanks
WeiGu

Wei Gu
2011-04-21 02:57:03 UTC
Permalink
A quick question regarding skb->queue_mapping:
Do you know what sets the queue number for a received skb on the receive
path? I found it has a value in the received skb, but it seems to be
outside the range of rx/tx queues. For example, with only 8 rx and 8 tx
queues on this netdev, the queue_mapping in received skbs is in [1-8],
where I expected [0-7].

Thanks
WeiGu
Wei Gu
2011-04-21 03:25:49 UTC
Permalink
Okay, I see the magic in the code:
static inline void skb_record_rx_queue(struct sk_buff *skb, u16 rx_queue)
{
	/* Store the rx queue off by one so that 0 can mean "not recorded". */
	skb->queue_mapping = rx_queue + 1;
}

static inline u16 skb_get_rx_queue(const struct sk_buff *skb)
{
	/* Undo the +1 applied by skb_record_rx_queue(). */
	return skb->queue_mapping - 1;
}

static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
{
	return (skb->queue_mapping != 0);
}

Anyway, it seems strange that skb->queue_mapping means different things
for Tx and Rx. Would it be better to use the high 4 bits, or even 0xFFFF,
to indicate that no rx queue was recorded?
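
For what it's worth, the offset never escapes the accessors; a minimal
sketch of the intended consumer pattern (the example function is
hypothetical, the skb helpers are the ones quoted above):

/*
 * The +1 is purely an in-field encoding so that queue_mapping == 0 can
 * mean "no rx queue recorded"; readers always go through the accessors
 * and see values back in [0, nr_rx - 1].
 */
static u16 example_rx_to_tx_queue(const struct sk_buff *skb, u16 num_tx_queues)
{
	if (skb_rx_queue_recorded(skb)) {
		u16 rxq = skb_get_rx_queue(skb);	/* back in [0, nr_rx - 1] */

		return rxq % num_tx_queues;	/* simple rx->tx mapping */
	}
	return 0;	/* nothing recorded: fall back to queue 0 */
}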

Alexander Duyck
2011-04-08 16:22:01 UTC
Permalink
On 4/8/2011 5:19 AM, Wei Gu wrote:
> Hi again,
> I tried more testing by disabling CONFIG_DMAR, with the ixgbe shipped in 2.6.38 and with the Intel-released 3.2.10/3.1.15.
> All these tests show we can get >1Mpps of 400-byte packets, but it is not stable at all; there are huge numbers of missed errors with 100% CPU idle:
> ethtool -S eth10 |grep rx_missed_errors
>
> rx_missed_errors: 76832040
>
> SUM: 1102212 ETH8: 0 ETH10: 1102212 ETH6: 0 ETH4: 0
> SUM: 521841 ETH8: 0 ETH10: 521841 ETH6: 0 ETH4: 0
> SUM: 426776 ETH8: 0 ETH10: 426776 ETH6: 0 ETH4: 0
> SUM: 927520 ETH8: 0 ETH10: 927520 ETH6: 0 ETH4: 0
> SUM: 1171995 ETH8: 0 ETH10: 1171995 ETH6: 0 ETH4: 0
> SUM: 855980 ETH8: 0 ETH10: 855980 ETH6: 0 ETH4: 0
>
>
> Do you know if there are other options in the kernel that could cause a high rx_missed_errors rate with low CPU usage? (No problem on 2.6.32 with the same test case.)
>
> perf record:
> + 69.74% swapper [kernel.kallsyms] [k] poll_idle
> + 11.62% swapper [kernel.kallsyms] [k] intel_idle
> + 0.80% swapper [ixgbe] [k] ixgbe_poll
> + 0.79% perf [ixgbe] [k] ixgbe_poll
> + 0.77% perf [kernel.kallsyms] [k] skb_copy_bits
> + 0.64% swapper [kernel.kallsyms] [k] skb_copy_bits
> + 0.48% perf [kernel.kallsyms] [k] __kmalloc_node_track_caller
> + 0.44% swapper [kernel.kallsyms] [k] __kmalloc_node_track_caller
> + 0.36% swapper [kernel.kallsyms] [k] kmem_cache_alloc_node
> + 0.35% swapper [kernel.kallsyms] [k] kfree
> + 0.35% perf [kernel.kallsyms] [k] kmem_cache_alloc_node

I was wondering if you could dump all of your ethtool stats instead of
just the rx_missed_errors, as this will give us much more information to
work with.

I'm mainly interested in seeing if the rx_no_buffer_count is
incrementing as well. If it is not, then what you may be seeing is a
bus-bandwidth issue, depending on which slot you are in.

Also, if you could provide an lspci dump for the part, that would give
us some additional information on your PCIe bus configuration.

Thanks,

Alex
Wei Gu
2011-04-09 03:36:07 UTC
Permalink
Hi Alexander,
I agree that rx_missed_errors alone (with rx_no_buffer_count: 0) would
indicate a memory bandwidth issue. But the strange thing is that with
the same test configuration on Linux 2.6.32 there is no such problem at
all. So it is not a hardware setup problem; the only difference is the
kernel version. That's why I am coming back to you about the new Linux
2.6.38: could it affect memory bandwidth, BIOS behavior, or similar?

The following dump was taken while receiving 290Kpps of 400-byte packets
from IXIA and dropping them in the prerouting hook.
I bound eth10's 8 RX queues to CPU socket 3 (cores 24-31) on NUMA node 3.
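
Binding a queue's IRQ to a core is typically done by writing a hex CPU
mask to /proc/irq/<n>/smp_affinity; a minimal sketch with illustrative
values (the IRQ number and mask below are examples, not the actual MSI-X
vectors of this box):

#include <stdio.h>

/*
 * Write a hex CPU mask to /proc/irq/<irq>/smp_affinity. For CPUs below
 * 32 a single 32-bit word is enough; larger masks use the kernel's
 * comma-separated multi-word format.
 */
static int set_irq_affinity(int irq, unsigned int mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", mask);
	return fclose(f);
}

int main(void)
{
	/* Example: pin IRQ 50 to CPU 24 (bit 24 set). */
	return set_irq_affinity(50, 1u << 24) ? 1 : 0;
}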

ethtool -i eth10
driver: ixgbe
version: 3.2.10-NAPI
firmware-version: 0.9-3
bus-info: 0000:8d:00.0

ethtool -S eth10
NIC statistics:
rx_packets: 14222510
tx_packets: 109
rx_bytes: 5575223920
tx_bytes: 17790
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 0
collisions: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 15150244
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_pkts_nic: 14226186
tx_pkts_nic: 109
rx_bytes_nic: 11750580400
tx_bytes_nic: 18642
lsc_int: 2
tx_busy: 0
non_eop_descs: 0
broadcast: 0
rx_no_buffer_count: 0
tx_timeout_count: 0
tx_restart_queue: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
tx_flow_control_xon: 0
rx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_flow_control_xoff: 0
rx_csum_offload_errors: 0
low_latency_interrupt: 0
alloc_rx_page_failed: 0
alloc_rx_buff_failed: 0
lro_aggregated: 0
lro_flushed: 0
lro_recycled: 0
rx_no_dma_resources: 0
hw_rsc_aggregated: 0
hw_rsc_flushed: 0
rx_flm: 0
fdir_match: 0
fdir_miss: 0
fdir_overflow: 0
fcoe_bad_fccrc: 0
fcoe_last_errors: 0
rx_fcoe_dropped: 0
rx_fcoe_packets: 0
rx_fcoe_dwords: 0
tx_fcoe_packets: 0
tx_fcoe_dwords: 0
tx_queue_0_packets: 0
tx_queue_0_bytes: 0
tx_queue_1_packets: 10
tx_queue_1_bytes: 540
tx_queue_2_packets: 0
tx_queue_2_bytes: 0
tx_queue_3_packets: 0
tx_queue_3_bytes: 0
tx_queue_4_packets: 30
tx_queue_4_bytes: 2340
tx_queue_5_packets: 4
tx_queue_5_bytes: 1368
tx_queue_6_packets: 65
tx_queue_6_bytes: 13542
tx_queue_7_packets: 0
tx_queue_7_bytes: 0
rx_queue_0_packets: 1777898
rx_queue_0_bytes: 696936016
rx_queue_1_packets: 1777207
rx_queue_1_bytes: 696665144
rx_queue_2_packets: 1778379
rx_queue_2_bytes: 697124568
rx_queue_3_packets: 1777891
rx_queue_3_bytes: 696933272
rx_queue_4_packets: 1777050
rx_queue_4_bytes: 696603600
rx_queue_5_packets: 1777915
rx_queue_5_bytes: 696942680
rx_queue_6_packets: 1778737
rx_queue_6_bytes: 697264904
rx_queue_7_packets: 1778391
rx_queue_7_bytes: 697129272

lspci dump:

00:00.0 Host bridge: Intel Corporation 5520/5500/X58 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
00:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 2 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
00:04.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 4 (rev 22)
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
00:06.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 6 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
00:08.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 8 (rev 22)
00:09.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22)
00:0a.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 10 (rev 22)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 1
00:1c.4 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 5
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.3 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIB (ICH10) LPC Interface Controller
00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1
01:03.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
02:00.0 System peripheral: Hewlett-Packard Company iLO3 Slave instrumentation & System support (rev 04)
02:00.2 System peripheral: Hewlett-Packard Company iLO3 Management Processor Support and Messaging (rev 04)
02:00.4 USB Controller: Hewlett-Packard Company Proliant iLO2/iLO3 virtual USB controller (rev 01)
03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array G6 controllers (rev 01)
04:00.0 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.1 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.2 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.3 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
0b:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
0b:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
11:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
11:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
80:00.0 PCI bridge: Intel Corporation 5500 Non-Legacy I/O Hub PCI Express Root Port 0 (rev 22)
80:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
80:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 2 (rev 22)
80:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
80:04.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 4 (rev 22)
80:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
80:06.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 6 (rev 22)
80:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
80:08.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 8 (rev 22)
80:09.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22)
80:0a.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 10 (rev 22)
80:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 22)
80:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
80:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
8d:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
8d:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
90:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
90:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)

Alexander H Duyck
2011-04-09 04:40:57 UTC
Permalink
On Fri, 2011-04-08 at 20:36 -0700, Wei Gu wrote:
> Hi Alexander, I agree that rx_missed_errors alone (with
> rx_no_buffer_count: 0) would indicate a memory bandwidth issue. But the
> strange thing is that with the same test configuration on Linux 2.6.32
> there is no such problem at all. So it is not a hardware setup problem;
> the only difference is the kernel version. That's why I am coming back
> to you about the new Linux 2.6.38: could it affect memory bandwidth,
> BIOS behavior, or similar?

What were the numbers you were getting with 2.6.32? I would be
interested in seeing those numbers just to get an idea of how they
compare against the 2.6.38 kernel.

> The following dump was taken while receiving 290Kpps of 400-byte
> packets from IXIA and dropping them in the prerouting hook. I bound
> eth10's 8 RX queues to CPU socket 3 (cores 24-31) on NUMA node 3.

Just to confirm, this is with DMAR off? I saw an earlier email that said
you were getting a variable rate that was over 1Mpps, and I just want to
confirm this is with the same config.

> ethtool -i eth10
> driver: ixgbe
> version: 3.2.10-NAPI
> firmware-version: 0.9-3
> bus-info: 0000:8d:00.0
>
> ethtool -S eth10
> NIC statistics:
> rx_packets: 14222510
> tx_packets: 109
> rx_bytes: 5575223920
> tx_bytes: 17790
> rx_missed_errors: 15150244
> rx_no_buffer_count: 0

I trimmed down your stats here pretty significantly. This isn't an issue
with the driver not keeping up. The problem here is memory and/or bus
bandwidth. Based on the info you provided I am assuming you have a
quad-socket system. I'm curious how the memory is laid out. What is the
total memory size and memory per node, and do you have all of the memory
channels on each node populated? One common thing I've seen cause this
type of issue is an incorrect memory configuration.

Also, if you could send me an lspci -vvv for 8d:00.0 specifically, I
would appreciate it, as I would like to look over the PCIe config just
to make sure the slot is a x8 PCIe gen 2.

Thanks,

Alex

Wei Gu
2011-04-09 06:12:31 UTC
Permalink
Hi Alexander,
The total throughput with 400-byte UDP receive (terminated in the
prerouting hook) on 2.6.32 is over 1.5Mpps without packet loss.
I even tried forwarding the received packets back out on the same NIC: I
get >1.5Mpps Rx with the same amount of Tx and no rx_missed_errors at
all. With 68-byte packets I could even reach 5Mpps+ per NIC on the
2.6.32 kernel.

I was expecting to gain even higher performance with the new Linux
kernel on the same hardware configuration.

Yes, DMAR is off. I can get over 1Mpps, but as I said it is not stable
at all (high rx_missed_errors rate).

I'm sure the slot for eth10 is x8 Gen2:
[ixgbe: eth10: ixgbe_probe: (PCI Express:5.0Gb/s:Width x8) 00:1b:21:6b:45:cc]

For the memory configuration, I am using the same server as for the
2.6.32 tests. I have 64G * 4 memory in total, which gives full memory
bandwidth with the 4 CPU sockets, as recommended by an HP expert (8
DIMMs per processor, in slot cartridges).

Could anything in the Linux kernel affect this memory configuration?
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 65525 MB
node 0 free: 63226 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 63292 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 65536 MB
node 2 free: 63366 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 65535 MB
node 3 free: 63345 MB
node distances:
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 20
2: 20 20 10 20
3: 20 20 20 10

lspci -vvv:
8d:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 50
Region 0: Memory at f0200000 (64-bit, non-prefetchable) [size=512K]
Region 2: I/O ports at 8000 [size=32]
Region 4: Memory at f0284000 (64-bit, non-prefetchable) [size=16K]
[virtual] Expansion ROM at f0600000 [disabled] [size=512K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend+
LnkCap: Port #2, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 <32us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [100] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-6b-45-18
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 64, Total VFs: 64, Number of VFs: 64, Function Dependency Link: 00
VF offset: 128, stride: 2, Device ID: 10ed
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 0000000000000000 (64-bit, non-prefetchable)
Region 3: Memory at 0000000000000000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: ixgbe
Kernel modules: ixgbe

Thanks
WeiGu
