Discussion:
[PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
Alexander Duyck
2012-06-20 00:43:06 UTC
This patch is meant to help improve system performance when
netdev_alloc_frag is used in scenarios in which buffers are short-lived.
This is accomplished by allowing the page offset to be reset in the event
that the page count is 1. I also reordered the direction in which we give
out sections of the page so that we start at the end of the page and end at
the start. The main motivation is that I preferred to have the offset
represent the amount of the page remaining to be used.

My primary test case was using ixgbe in combination with TCP. With this
patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
thread of netperf receiving a TCP stream via ixgbe.

I also tested several scenarios in which the page reuse would not be
possible such as UDP flows and routing. In both of these scenarios I saw
no noticeable performance degradation compared to the kernel without this
patch.

Cc: Eric Dumazet <***@google.com>
Signed-off-by: Alexander Duyck <***@intel.com>
---

net/core/skbuff.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..eb3853c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
if (unlikely(!nc->page)) {
refill:
nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
- nc->offset = 0;
}
if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
+ unsigned int offset = PAGE_SIZE;
+
+ if (page_count(nc->page) != 1)
+ offset = nc->offset;
+
+ if (offset < fragsz) {
put_page(nc->page);
goto refill;
}
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
+
+ offset -= fragsz;
+ nc->offset = offset;
+
+ data = page_address(nc->page) + offset;
get_page(nc->page);
}
local_irq_restore(flags);
Alexander Duyck
2012-06-20 01:49:04 UTC
Post by Alexander Duyck
This patch is meant to help improve system performance when
netdev_alloc_frag is used in scenarios in which buffers are short lived.
This is accomplished by allowing the page offset to be reset in the event
that the page count is 1. I also reordered the direction in which we give
out sections of the page so that we start at the end of the page and end at
the start. The main motivation being that I preferred to have offset
represent the amount of page remaining to be used.
My primary test case was using ixgbe in combination with TCP. With this
patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
thread of netperf receiving a TCP stream via ixgbe.
I also tested several scenarios in which the page reuse would not be
possible such as UDP flows and routing. In both of these scenarios I saw
no noticeable performance degradation compared to the kernel without this
patch.
---
net/core/skbuff.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..eb3853c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
if (unlikely(!nc->page)) {
nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
- nc->offset = 0;
}
if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
+ unsigned int offset = PAGE_SIZE;
+
+ if (page_count(nc->page) != 1)
+ offset = nc->offset;
+
+ if (offset < fragsz) {
put_page(nc->page);
goto refill;
}
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
+
+ offset -= fragsz;
+ nc->offset = offset;
+
+ data = page_address(nc->page) + offset;
get_page(nc->page);
}
local_irq_restore(flags);
It looks like I forgot to add "--auto" to the command line when I sent
this out via stg mail, so I am just adding Eric to the CC list on this
reply. Sorry for the extra noise.

Thanks,

Alex
Eric Dumazet
2012-06-20 05:36:30 UTC
Post by Alexander Duyck
This patch is meant to help improve system performance when
netdev_alloc_frag is used in scenarios in which buffers are short lived.
This is accomplished by allowing the page offset to be reset in the event
that the page count is 1. I also reordered the direction in which we give
out sections of the page so that we start at the end of the page and end at
the start. The main motivation being that I preferred to have offset
represent the amount of page remaining to be used.
My primary test case was using ixgbe in combination with TCP. With this
patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
thread of netperf receiving a TCP stream via ixgbe.
I also tested several scenarios in which the page reuse would not be
possible such as UDP flows and routing. In both of these scenarios I saw
no noticeable performance degradation compared to the kernel without this
patch.
---
net/core/skbuff.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..eb3853c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
if (unlikely(!nc->page)) {
nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
- nc->offset = 0;
}
if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
+ unsigned int offset = PAGE_SIZE;
+
+ if (page_count(nc->page) != 1)
+ offset = nc->offset;
+
+ if (offset < fragsz) {
put_page(nc->page);
goto refill;
}
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
+
+ offset -= fragsz;
+ nc->offset = offset;
+
+ data = page_address(nc->page) + offset;
get_page(nc->page);
}
local_irq_restore(flags);
I tested this idea a month ago and got unconvincing results, because
the branch was taken half of the time.

The cases where the page can be reused are probably specific to ixgbe,
because it uses a different allocator for the frags themselves;
netdev_alloc_frag() is only used to allocate the skb head.

For typical nics, we allocate frags to populate the RX ring _way_ before
packet is received by the NIC.

Then, I played with using order-2 pages instead of order-0 ones if
PAGE_SIZE < 8192.

No clear win either, but you might try this too.
Eric Dumazet
2012-06-20 08:17:03 UTC
Post by Eric Dumazet
I tested this idea one month ago and got not convincing results, because
the branch was taken half of the time.
The cases where page can be reused is probably specific to ixgbe because
it uses a different allocator for the frags themselves.
netdev_alloc_frag() is only used to allocate the skb head.
For typical nics, we allocate frags to populate the RX ring _way_ before
packet is received by the NIC.
Then, I played with using order-2 pages instead of order-0 ones if
PAGE_SIZE < 8192.
No clear win either, but you might try this too.
By the way, a big cost in netdev_alloc_frag() is the irq masking/restore.
We probably could have a version for softirq users...
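Something along these lines, maybe (untested sketch, with an invented name
and its own per-cpu cache so it cannot race with the irq-safe one):

/* Untested sketch, not a patch: a frag allocator that may only be called
 * from softirq context.  It uses its own per-cpu cache, so it never races
 * with the irq-safe netdev_alloc_cache and can skip the
 * local_irq_save()/local_irq_restore() pair entirely.
 */
static DEFINE_PER_CPU(struct netdev_alloc_cache, softirq_alloc_cache);

void *softirq_alloc_frag(unsigned int fragsz)
{
	struct netdev_alloc_cache *nc = &__get_cpu_var(softirq_alloc_cache);
	void *data = NULL;

	if (unlikely(!nc->page)) {
refill:
		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
		nc->offset = 0;
	}
	if (likely(nc->page)) {
		if (nc->offset + fragsz > PAGE_SIZE) {
			put_page(nc->page);
			goto refill;
		}
		data = page_address(nc->page) + nc->offset;
		nc->offset += fragsz;
		get_page(nc->page);
	}
	return data;
}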

Another idea would be to use a pool of pages, instead of a single one,
if we want to play with the "clear the offset if the page count is one"
idea.

Strange, I ran benchmarks with order-2 allocations again and got good
results this time, but with the latest net-next; maybe things have changed
since the last time I did this.

(netdev_alloc_frag(), get_page_from_freelist() and put_page() less
prevalent in perf results)
Eric Dumazet
2012-06-20 08:44:07 UTC
Post by Eric Dumazet
Strange, I did again benchs with order-2 allocations and got good
results this time, but with latest net-next, maybe things have changed
since last time I did this.
(netdev_alloc_frag(), get_page_from_freelist() and put_page() less
prevalent in perf results)
Oh well, I now remember why I abandoned this: machines don't have
infinite memory after all...

(put_page() assumes an order-0 page)
David Miller
2012-06-20 09:04:01 UTC
From: Eric Dumazet <***@gmail.com>
Date: Wed, 20 Jun 2012 10:44:07 +0200
Post by Eric Dumazet
(put_page assumes order-0 page)
But it certainly can handle compound pages, I thought?
Eric Dumazet
2012-06-20 09:14:53 UTC
Post by David Miller
Date: Wed, 20 Jun 2012 10:44:07 +0200
Post by Eric Dumazet
(put_page assumes order-0 page)
But it certainly can handle compound pages, I thought?
Yes, I forgot __GFP_COMP ;)

Oh well, time for a coffee
Eric Dumazet
2012-06-20 13:21:54 UTC
Post by Eric Dumazet
Strange, I did again benchs with order-2 allocations and got good
results this time, but with latest net-next, maybe things have changed
since last time I did this.
(netdev_alloc_frag(), get_page_from_freelist() and put_page() less
prevalent in perf results)
In fact, since SLUB uses order-3 for kmalloc-2048, I felt lucky to try
this as well, and results are really good, on ixgbe at least.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..ffd2cba 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -299,6 +299,9 @@ struct netdev_alloc_cache {
};
static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);

+#define MAX_NETDEV_FRAGSIZE max_t(unsigned int, PAGE_SIZE, 32768)
+#define NETDEV_FRAG_ORDER get_order(MAX_NETDEV_FRAGSIZE)
+
/**
* netdev_alloc_frag - allocate a page fragment
* @fragsz: fragment size
@@ -316,11 +319,13 @@ void *netdev_alloc_frag(unsigned int fragsz)
nc = &__get_cpu_var(netdev_alloc_cache);
if (unlikely(!nc->page)) {
refill:
- nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
+ (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
+ NETDEV_FRAG_ORDER);
nc->offset = 0;
}
if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
+ if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
put_page(nc->page);
goto refill;
}
@@ -353,7 +358,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

- if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+ if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
void *data = netdev_alloc_frag(fragsz);

if (likely(data)) {
Alexander Duyck
2012-06-21 04:07:00 UTC
Post by Eric Dumazet
Post by Eric Dumazet
Strange, I did again benchs with order-2 allocations and got good
results this time, but with latest net-next, maybe things have changed
since last time I did this.
(netdev_alloc_frag(), get_page_from_freelist() and put_page() less
prevalent in perf results)
In fact, since SLUB uses order-3 for kmalloc-2048, I felt lucky to try
this as well, and results are really good, on ixgbe at least.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..ffd2cba 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -299,6 +299,9 @@ struct netdev_alloc_cache {
};
static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+#define MAX_NETDEV_FRAGSIZE max_t(unsigned int, PAGE_SIZE, 32768)
+#define NETDEV_FRAG_ORDER get_order(MAX_NETDEV_FRAGSIZE)
+
/**
* netdev_alloc_frag - allocate a page fragment
@@ -316,11 +319,13 @@ void *netdev_alloc_frag(unsigned int fragsz)
nc = &__get_cpu_var(netdev_alloc_cache);
if (unlikely(!nc->page)) {
- nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
+ (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
+ NETDEV_FRAG_ORDER);
nc->offset = 0;
}
I was wondering if you needed the check for NETDEV_FRAG_ORDER here.
From what I can tell, setting __GFP_COMP for an order-0 page has no
effect, since it only seems to get checked in prep_new_page(), and only
behind a check that the order is non-zero.
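i.e. in prep_new_page() the check is (roughly, from memory, so worth
double-checking):

	if (order && (gfp_flags & __GFP_COMP))
		prep_compound_page(page, order);

so for an order-0 page the flag is simply ignored.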

Thanks,

Alex
Eric Dumazet
2012-06-21 05:07:06 UTC
Post by Alexander Duyck
Post by Eric Dumazet
+ nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
+ (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
+ NETDEV_FRAG_ORDER);
I was wondering if you needed the check for NETDEV_FRAG_ORDER here.
From what I can tell setting __GFP_COMP for an order 0 page has no
effect since it only seems to get checked in prep_new_page and that is
after a check to verify if the page is order 0 or not.
Good point, it seems some net drivers could be changed to remove
useless tests.

I'll post some performance data as well.
Eric Dumazet
2012-06-22 12:33:50 UTC
Post by Eric Dumazet
Post by Alexander Duyck
Post by Eric Dumazet
+ nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
+ (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
+ NETDEV_FRAG_ORDER);
I was wondering if you needed the check for NETDEV_FRAG_ORDER here.
From what I can tell setting __GFP_COMP for an order 0 page has no
effect since it only seems to get checked in prep_new_page and that is
after a check to verify if the page is order 0 or not.
Good point, it seems some net drivers could be changed to remove
useless tests.
I'll post some performance data as well.
Here is the patch I tested here.

Using 32768-byte allocations is actually nice for MTU=9000 traffic,
since we can fit 3 frames per 32KB instead of only 2 frames (using the
kmalloc-16384 slab).

Also, I prefill page->_count with a high bias value, to avoid the
get_page() we did for each allocated frag.

In my profiles, the get_page() cost was dominant, because of false
sharing with skb consumers (as they might run on different cpus)

This way, when 32768 bytes are filled, we perform a single
atomic_sub_return() and can recycle the page if we find we are the last
user (this is what you did in your patch, when testing page->_count
being 1)

Note: if I used max(PAGE_SIZE, 32768) for MAX_NETDEV_FRAGSIZE,
gcc was not able to optimise get_order(MAX_NETDEV_FRAGSIZE), strange...


diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..d31efa2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -296,9 +296,18 @@ EXPORT_SYMBOL(build_skb);
struct netdev_alloc_cache {
struct page *page;
unsigned int offset;
+ unsigned int pagecnt_bias;
};
static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);

+#if PAGE_SIZE > 32768
+#define MAX_NETDEV_FRAGSIZE PAGE_SIZE
+#else
+#define MAX_NETDEV_FRAGSIZE 32768
+#endif
+
+#define NETDEV_PAGECNT_BIAS (MAX_NETDEV_FRAGSIZE / \
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
/**
* netdev_alloc_frag - allocate a page fragment
* @fragsz: fragment size
@@ -316,18 +325,25 @@ void *netdev_alloc_frag(unsigned int fragsz)
nc = &__get_cpu_var(netdev_alloc_cache);
if (unlikely(!nc->page)) {
refill:
- nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
+ get_order(MAX_NETDEV_FRAGSIZE));
+ if (unlikely(!nc->page))
+ goto end;
+recycle:
+ atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
+ nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
nc->offset = 0;
}
- if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
- put_page(nc->page);
- goto refill;
- }
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
- get_page(nc->page);
+ if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
+ if (!atomic_sub_return(nc->pagecnt_bias,
+ &nc->page->_count))
+ goto recycle;
+ goto refill;
}
+ data = page_address(nc->page) + nc->offset;
+ nc->offset += fragsz;
+ nc->pagecnt_bias--; /* avoid get_page()/get_page() false sharing */
+end:
local_irq_restore(flags);
return data;
}
@@ -353,7 +369,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

- if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+ if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
void *data = netdev_alloc_frag(fragsz);

if (likely(data)) {
Alexander Duyck
2012-06-23 00:17:26 UTC
Post by Eric Dumazet
Post by Eric Dumazet
Post by Alexander Duyck
Post by Eric Dumazet
+ nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
+ (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
+ NETDEV_FRAG_ORDER);
I was wondering if you needed the check for NETDEV_FRAG_ORDER here.
From what I can tell setting __GFP_COMP for an order 0 page has no
effect since it only seems to get checked in prep_new_page and that is
after a check to verify if the page is order 0 or not.
Good point, it seems some net drivers could be changed to remove
useless tests.
I'll post some performance data as well.
Here is the patch I tested here.
Using 32768 bytes allocations is actually nice for MTU=9000 traffic,
since we can fit 3 frames per 32KB instead of only 2 frames (using
kmalloc-16384 slab))
Also, I prefill page->_count with a high bias value, to avoid the
get_page() we did for each allocated frag.
In my profiles, the get_page() cost was dominant, because of false
sharing with skb consumers (as they might run on different cpus)
This way, when 32768 bytes are filled, we perform a single
atomic_sub_return() and can recycle the page if we find we are the last
user (this is what you did in your patch, when testing page->_count
being 1)
This is working really nicely. On my system, put_page() dropped to
somewhere near the bottom of the perf top runs I was doing. In addition,
netdev_alloc_frag() dropped from about 4% CPU to 2%.
Post by Eric Dumazet
Note : If I used max(PAGE_SIZE, 32678) for MAX_NETDEV_FRAGSIZE,
gcc was not able to optimise get_order(MAX_NETDEV_FRAGSIZE), strange...
The issue is probably the type checking in the max macro. You might
have better luck using max_t and specifying a type.
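i.e. the max_t() form your first patch in this thread already used:

	#define MAX_NETDEV_FRAGSIZE max_t(unsigned int, PAGE_SIZE, 32768)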
Post by Eric Dumazet
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..d31efa2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -296,9 +296,18 @@ EXPORT_SYMBOL(build_skb);
struct netdev_alloc_cache {
struct page *page;
unsigned int offset;
+ unsigned int pagecnt_bias;
};
static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+#if PAGE_SIZE > 32768
+#define MAX_NETDEV_FRAGSIZE PAGE_SIZE
+#else
+#define MAX_NETDEV_FRAGSIZE 32768
+#endif
+
+#define NETDEV_PAGECNT_BIAS (MAX_NETDEV_FRAGSIZE / \
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
/**
* netdev_alloc_frag - allocate a page fragment
I'm assuming the reason for using the size of skb_shared_info here is
that you don't expect any requests to be smaller than that? I suppose
that is reasonable, but is there any reason why this couldn't be a
smaller value such as SMP_CACHE_BYTES?
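In other words, something like this (purely hypothetical, just to make
the question concrete):

	#define NETDEV_PAGECNT_BIAS (MAX_NETDEV_FRAGSIZE / SMP_CACHE_BYTES)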
Post by Eric Dumazet
@@ -316,18 +325,25 @@ void *netdev_alloc_frag(unsigned int fragsz)
nc = &__get_cpu_var(netdev_alloc_cache);
if (unlikely(!nc->page)) {
- nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
+ get_order(MAX_NETDEV_FRAGSIZE));
+ if (unlikely(!nc->page))
+ goto end;
+ atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
+ nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
nc->offset = 0;
}
- if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
- put_page(nc->page);
- goto refill;
- }
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
- get_page(nc->page);
+ if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
+ if (!atomic_sub_return(nc->pagecnt_bias,
+ &nc->page->_count))
+ goto recycle;
+ goto refill;
}
+ data = page_address(nc->page) + nc->offset;
+ nc->offset += fragsz;
+ nc->pagecnt_bias--; /* avoid get_page()/get_page() false sharing */
local_irq_restore(flags);
return data;
}
@@ -353,7 +369,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
- if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+ if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
void *data = netdev_alloc_frag(fragsz);
if (likely(data)) {
One idea I had that I have been trying to figure out how to make work
would be to run the offset in the reverse direction, so that you start
it at MAX_NETDEV_FRAGSIZE and work your way back down to 0. The
advantage of that approach is that you then only have to perform one
check instead of two and you can drop a goto. The only problem I have
been running into is that it doesn't seem to perform as well as your
patch, but I am still working the details out.
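For what it's worth, the rough (untested) shape of what I am playing
with, reusing your MAX_NETDEV_FRAGSIZE / NETDEV_PAGECNT_BIAS defines, is
something like this; the offset counts down and is reset to
MAX_NETDEV_FRAGSIZE instead of 0 on refill/recycle:

void *netdev_alloc_frag(unsigned int fragsz)
{
	struct netdev_alloc_cache *nc;
	void *data = NULL;
	unsigned long flags;

	local_irq_save(flags);
	nc = &__get_cpu_var(netdev_alloc_cache);
	if (unlikely(!nc->page)) {
refill:
		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
				       get_order(MAX_NETDEV_FRAGSIZE));
		if (unlikely(!nc->page))
			goto end;
recycle:
		atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
		nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
		nc->offset = MAX_NETDEV_FRAGSIZE;
	}
	/* single comparison covers "page exhausted" for any fragsz */
	if (nc->offset < fragsz) {
		if (!atomic_sub_return(nc->pagecnt_bias, &nc->page->_count))
			goto recycle;
		goto refill;
	}
	nc->offset -= fragsz;
	data = page_address(nc->page) + nc->offset;
	nc->pagecnt_bias--;
end:
	local_irq_restore(flags);
	return data;
}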

Thanks,

Alex
Alexander Duyck
2012-06-29 23:04:39 UTC
Post by Eric Dumazet
Here is the patch I tested here.
Using 32768 bytes allocations is actually nice for MTU=9000 traffic,
since we can fit 3 frames per 32KB instead of only 2 frames (using
kmalloc-16384 slab))
Also, I prefill page->_count with a high bias value, to avoid the
get_page() we did for each allocated frag.
In my profiles, the get_page() cost was dominant, because of false
sharing with skb consumers (as they might run on different cpus)
This way, when 32768 bytes are filled, we perform a single
atomic_sub_return() and can recycle the page if we find we are the last
user (this is what you did in your patch, when testing page->_count
being 1)
Note : If I used max(PAGE_SIZE, 32678) for MAX_NETDEV_FRAGSIZE,
gcc was not able to optimise get_order(MAX_NETDEV_FRAGSIZE), strange...
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..d31efa2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -296,9 +296,18 @@ EXPORT_SYMBOL(build_skb);
struct netdev_alloc_cache {
struct page *page;
unsigned int offset;
+ unsigned int pagecnt_bias;
};
static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+#if PAGE_SIZE > 32768
+#define MAX_NETDEV_FRAGSIZE PAGE_SIZE
+#else
+#define MAX_NETDEV_FRAGSIZE 32768
+#endif
+
+#define NETDEV_PAGECNT_BIAS (MAX_NETDEV_FRAGSIZE / \
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
/**
* netdev_alloc_frag - allocate a page fragment
@@ -316,18 +325,25 @@ void *netdev_alloc_frag(unsigned int fragsz)
nc = &__get_cpu_var(netdev_alloc_cache);
if (unlikely(!nc->page)) {
- nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
+ get_order(MAX_NETDEV_FRAGSIZE));
+ if (unlikely(!nc->page))
+ goto end;
+ atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
+ nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
nc->offset = 0;
}
- if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
- put_page(nc->page);
- goto refill;
- }
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
- get_page(nc->page);
+ if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
+ if (!atomic_sub_return(nc->pagecnt_bias,
+ &nc->page->_count))
+ goto recycle;
+ goto refill;
}
+ data = page_address(nc->page) + nc->offset;
+ nc->offset += fragsz;
+ nc->pagecnt_bias--; /* avoid get_page()/get_page() false sharing */
local_irq_restore(flags);
return data;
}
@@ -353,7 +369,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
- if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+ if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
void *data = netdev_alloc_frag(fragsz);
if (likely(data)) {
I was wondering if there were any plans to clean this patch up and
submit it to net-next? If not, I can probably work on that since this
addressed the concerns I had in my original patch.

Thanks,

Alex
Eric Dumazet
2012-06-30 08:39:40 UTC
Post by Alexander Duyck
I was wondering if there were any plans to clean this patch up and
submit it to net-next? If not, I can probably work on that since this
addressed the concerns I had in my original patch.
I used this patch for a while on my machines, but I am working on
something allowing a fallback to order-0 allocations if memory gets
fragmented. This fallback should almost never happen, but we should have
it just in case.
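Roughly, the refill path would become something like this (sketch only,
helper name invented, not the final patch); the cache would also have to
remember the size it got, since it would no longer be a constant:

/* Sketch: try the big allocation first, and fall back to a plain
 * order-0 page if memory is too fragmented.  The high-order GFP_ATOMIC
 * attempt should neither retry nor warn.
 */
static struct page *netdev_frag_refill(unsigned int *frag_size)
{
	struct page *page;

	page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP |
			   __GFP_NOWARN | __GFP_NORETRY,
			   get_order(MAX_NETDEV_FRAGSIZE));
	if (page) {
		*frag_size = MAX_NETDEV_FRAGSIZE;
		return page;
	}
	page = alloc_page(GFP_ATOMIC | __GFP_COLD);
	if (page)
		*frag_size = PAGE_SIZE;
	return page;
}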

David Miller
2012-06-21 05:56:53 UTC
From: Eric Dumazet <***@gmail.com>
Date: Wed, 20 Jun 2012 10:17:03 +0200
Post by Eric Dumazet
By the way, big cost in netdev_alloc_frag() is the irq
masking/restore We probably could have a version for softirq
users...
That's an extremely disappointing way for us to be losing cycles.

Everyone pays this price purely because:

1) __netdev_alloc_skb() uses netdev_alloc_frag()

and:

2) #1 is invoked, either directly or indirectly, by tons
of slow non-NAPI drivers.

This got me looking into the plethora of interfaces we let drivers use
to allocate RX buffers. It's a big mess.

We have dev_alloc_skb() which essentially calls __netdev_alloc_skb()
with a NULL device argument. This is terrible because it means that
if we want to do something interesting on a per-device level we can't
rely upon the device being non-NULL in __netdev_alloc_skb().

I looked at the remaining dev_alloc_skb() users, and they are in places
that allocate packets in a module one level removed from the netdevice
level, for example ATM and Infiniband IPATH.

What these callers want is something more like:

static inline struct sk_buff *alloc_skb_and_reserve_pad(unsigned int length,
gfp_t gfp_mask)
{
struct sk_buff *skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
0, NUMA_NO_NODE);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
}

Then we won't have the NULL device case for __netdev_alloc_skb() any more.
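Callers such as ATM or IPATH would then simply do something like:

	skb = alloc_skb_and_reserve_pad(length, GFP_ATOMIC);

instead of dev_alloc_skb(length).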
Alexander Duyck
2012-06-20 16:30:55 UTC
Post by Eric Dumazet
Post by Alexander Duyck
This patch is meant to help improve system performance when
netdev_alloc_frag is used in scenarios in which buffers are short lived.
This is accomplished by allowing the page offset to be reset in the event
that the page count is 1. I also reordered the direction in which we give
out sections of the page so that we start at the end of the page and end at
the start. The main motivation being that I preferred to have offset
represent the amount of page remaining to be used.
My primary test case was using ixgbe in combination with TCP. With this
patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
thread of netperf receiving a TCP stream via ixgbe.
I also tested several scenarios in which the page reuse would not be
possible such as UDP flows and routing. In both of these scenarios I saw
no noticeable performance degradation compared to the kernel without this
patch.
---
net/core/skbuff.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..eb3853c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
if (unlikely(!nc->page)) {
nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
- nc->offset = 0;
}
if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
+ unsigned int offset = PAGE_SIZE;
+
+ if (page_count(nc->page) != 1)
+ offset = nc->offset;
+
+ if (offset < fragsz) {
put_page(nc->page);
goto refill;
}
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
+
+ offset -= fragsz;
+ nc->offset = offset;
+
+ data = page_address(nc->page) + offset;
get_page(nc->page);
}
local_irq_restore(flags);
I tested this idea one month ago and got not convincing results, because
the branch was taken half of the time.
The cases where page can be reused is probably specific to ixgbe because
it uses a different allocator for the frags themselves.
netdev_alloc_frag() is only used to allocate the skb head.
Actually it is pretty much anywhere a copy-break type setup exists. I
think ixgbe and a few other drivers have this type of setup where
netdev_alloc_skb is called and the data is just copied into the buffer.
My thought was that if I can improve this one case without hurting the
other cases, I should just go ahead and submit it, since it is a net win
performance-wise.

I think one of the biggest advantages of this for ixgbe is that it
allows the buffer to become cache-warm, so that writing the shared info
and copying the header contents become very cheap compared to accessing
a cache-cold page.
Post by Eric Dumazet
For typical nics, we allocate frags to populate the RX ring _way_ before
packet is received by the NIC.
Then, I played with using order-2 pages instead of order-0 ones if
PAGE_SIZE < 8192.
No clear win either, but you might try this too.
The biggest issue I see with an order-2 page is that it means the memory
is going to take much longer to cycle out of a shared page. As a result,
changes like the one I just came up with would likely have little to no
benefit, because we would run out of room in the frags list before we
could start reusing a fresh page.

Thanks,

Alex
Alexander Duyck
2012-06-20 17:14:57 UTC
Post by Alexander Duyck
Post by Eric Dumazet
Post by Alexander Duyck
This patch is meant to help improve system performance when
netdev_alloc_frag is used in scenarios in which buffers are short lived.
This is accomplished by allowing the page offset to be reset in the event
that the page count is 1. I also reordered the direction in which we give
out sections of the page so that we start at the end of the page and end at
the start. The main motivation being that I preferred to have offset
represent the amount of page remaining to be used.
My primary test case was using ixgbe in combination with TCP. With this
patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
thread of netperf receiving a TCP stream via ixgbe.
I also tested several scenarios in which the page reuse would not be
possible such as UDP flows and routing. In both of these scenarios I saw
no noticeable performance degradation compared to the kernel without this
patch.
---
net/core/skbuff.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..eb3853c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
if (unlikely(!nc->page)) {
nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
- nc->offset = 0;
}
if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
+ unsigned int offset = PAGE_SIZE;
+
+ if (page_count(nc->page) != 1)
+ offset = nc->offset;
+
+ if (offset < fragsz) {
put_page(nc->page);
goto refill;
}
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
+
+ offset -= fragsz;
+ nc->offset = offset;
+
+ data = page_address(nc->page) + offset;
get_page(nc->page);
}
local_irq_restore(flags);
I tested this idea one month ago and got not convincing results, because
the branch was taken half of the time.
The cases where page can be reused is probably specific to ixgbe because
it uses a different allocator for the frags themselves.
netdev_alloc_frag() is only used to allocate the skb head.
Actually it is pretty much anywhere a copy-break type setup exists. I
think ixgbe and a few other drivers have this type of setup where
netdev_alloc_skb is called and the data is just copied into the buffer.
My thought was if that I can improve this one case without hurting the
other cases I should just go ahead and submit it since it is a net win
performance wise.
I think one of the biggest advantages of this for ixgbe is that it
allows the buffer to become cache warm so that writing the shared info
and copying the header contents becomes very cheap compared to accessing
a cache cold page.
Post by Eric Dumazet
For typical nics, we allocate frags to populate the RX ring _way_ before
packet is received by the NIC.
Then, I played with using order-2 pages instead of order-0 ones if
PAGE_SIZE < 8192.
No clear win either, but you might try this too.
The biggest issue I see with an order-2 page is that it means the memory
is going to take much longer to cycle out of a shared page. As a result
changes like the one I just came up with would likely have little to no
benefit because we would run out of room in the frags list before we
could start reusing a fresh page.
Thanks,
Alex
Actually I think I just realized what the difference is. I was looking
at things with LRO disabled. With LRO enabled, our hardware RSC feature
kind of defeats the whole point of GRO or TCP coalescing anyway, since
it will stuff 16 fragments into a single packet before we even hand the
packet off to the stack.

Thanks,

Alex
Eric Dumazet
2012-06-20 18:41:33 UTC
Post by Alexander Duyck
Actually I think I just realized what the difference is. I was looking
at things with LRO disabled. With LRO enabled our hardware RSC feature
kind of defeats the whole point of the GRO or TCP coalescing anyway
since it will stuff 16 fragments into a single packet before we even
hand the packet off to the stack.
I noticed LRO is now 'off' by default on ixgbe (net-next tree); I am
pretty sure it was 'on' some months ago?
Alexander Duyck
2012-06-20 20:10:05 UTC
Post by Eric Dumazet
Post by Alexander Duyck
Actually I think I just realized what the difference is. I was looking
at things with LRO disabled. With LRO enabled our hardware RSC feature
kind of defeats the whole point of the GRO or TCP coalescing anyway
since it will stuff 16 fragments into a single packet before we even
hand the packet off to the stack.
I noticed LRO was now 'off' by default on ixgbe (net-next tree), I am
pretty sure it was 'on' some months ago ?
It should be on by default unless you are doing some routing. In that
case, as soon as the interface comes up, LRO is disabled by the stack.

Thanks,

Alex