Discussion:
[PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
Vladimir Davydov
2015-08-30 19:03:04 UTC
Hi,

Tejun reported that sometimes memcg/memory.high threshold seems to be
silently ignored if kmem accounting is enabled:

http://www.spinics.net/lists/linux-mm/msg93613.html

It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
first. As a result, if there are enough free pages, memcg reclaim will
not get invoked on kmem allocations, which leads to uncontrollable
growth of memory usage no matter what memory.high is set to.
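
The problematic pattern, sketched (not verbatim from either allocator;
the flag manipulation is real, the surrounding code is condensed):

	/* optimistic attempt: no page reclaim, hence no memcg reclaim */
	page = alloc_pages((flags | __GFP_NOWARN) & ~__GFP_WAIT, order);
	if (!page)
		/* only on failure retry with reclaim allowed */
		page = alloc_pages(flags, order);

As long as there are free pages, the first attempt succeeds, so the
memcg charge for it is issued w/o __GFP_WAIT and memory.high is never
enforced.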

This patch set attempts to fix this issue. For more details please see
comments to individual patches.

Thanks,

Vladimir Davydov (2):
mm/slab: skip memcg reclaim only if in atomic context
mm/slub: do not bypass memcg reclaim for high-order page allocation

mm/slab.c | 32 +++++++++++---------------------
mm/slub.c | 24 +++++++++++-------------
2 files changed, 22 insertions(+), 34 deletions(-)
--
2.1.4

Vladimir Davydov
2015-08-30 19:03:05 UTC
SLAB's implementation of kmem_cache_alloc() works as follows:
1. First, it tries to allocate from the preferred NUMA node without
issuing reclaim.
2. If step 1 fails, it tries all nodes in the order of preference,
again without invoking the reclaimer.
3. Only if steps 1 and 2 fail does it fall back on allocation from any
allowed node with reclaim enabled.

Before commit 4167e9b2cf10f ("mm: remove GFP_THISNODE"), the GFP_THISNODE
combination, which equaled __GFP_THISNODE|__GFP_NOWARN|__GFP_NORETRY on
NUMA-enabled builds, was used to avoid reclaim during steps 1 and 2. If
__alloc_pages_slowpath() saw this combination in the gfp flags, it
aborted immediately, even if the __GFP_WAIT flag was set. So there was no
need to clear the __GFP_WAIT flag while performing steps 1 and 2, and
hence we could invoke memcg reclaim when allocating a slab page if the
context allowed.
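
For reference, the early bail-out in __alloc_pages_slowpath() looked
roughly like this (condensed from the pre-4167e9b2cf10f page allocator):

	/*
	 * GFP_THISNODE (__GFP_THISNODE|__GFP_NOWARN|__GFP_NORETRY) should
	 * not cause reclaim since the subsystem (e.g. slab) may trigger
	 * reclaim itself using a larger set of nodes later.
	 */
	if (IS_ENABLED(CONFIG_NUMA) &&
	    (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
		goto nopage;	/* bail out even if __GFP_WAIT is set */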

Commit 4167e9b2cf10f removed the GFP_THISNODE combination. Instead of
OR-ing the gfp mask with GFP_THISNODE, the gfp_exact_node() helper should
now be used. The latter sets the __GFP_THISNODE and __GFP_NOWARN flags
and clears __GFP_WAIT on the current gfp mask. As a result, it
effectively prohibits invoking memcg reclaim during steps 1 and 2. This
breaks memcg/memory.high logic when kmem accounting is enabled. The
memory.high threshold is supposed to work as a soft limit, i.e. it does
not fail an allocation on breaching it, but it does force the caller to
invoke direct reclaim to compensate for the excess. Without the
__GFP_WAIT flag, direct reclaim is impossible, so the caller will go on
without being pushed back to the threshold.
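
To see why this matters, recall how memory.high is enforced. A condensed
sketch of the tail of try_charge() in mm/memcontrol.c of that time (not
the exact code):

	done_restock:
		/* ... */
		/* memory.high is only enforced if we may reclaim: */
		if (!(gfp_mask & __GFP_WAIT))
			goto done;	/* memory.high silently ignored */
		if (page_counter_read(&memcg->memory) > memcg->high)
			try_to_free_mem_cgroup_pages(memcg, nr_pages,
						     gfp_mask, true);

So a charge that never carries __GFP_WAIT never triggers memory.high
reclaim.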

To fix this issue, we get rid of the gfp_exact_node() helper and move
gfp flags filtering to kmem_getpages(), after memcg_charge_slab() has
been called.

To understand the patch, note that:
- In fallback_alloc() the only effect of using gfp_exact_node() is
preventing recursion fallback_alloc() -> ____cache_alloc_node() ->
fallback_alloc().
- Aside from fallback_alloc(), gfp_exact_node() is only used along with
cache_grow(). Moreover, the only place where cache_grow() is used
without it is fallback_alloc(), which, in contrast to other
cache_grow() users, preallocates a page and passes it to cache_grow()
so that the latter does not need to invoke kmem_getpages() by itself.

Reported-by: Tejun Heo <***@kernel.org>
Signed-off-by: Vladimir Davydov <***@parallels.com>
---
mm/slab.c | 32 +++++++++++---------------------
1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index d890750ec31e..9ee809d2ed8b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -857,11 +857,6 @@ static inline void *____cache_alloc_node(struct kmem_cache *cachep,
return NULL;
}

-static inline gfp_t gfp_exact_node(gfp_t flags)
-{
- return flags;
-}
-
#else /* CONFIG_NUMA */

static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
@@ -1028,15 +1023,6 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)

return __cache_free_alien(cachep, objp, node, page_node);
}
-
-/*
- * Construct gfp mask to allocate from a specific node but do not invoke reclaim
- * or warn about failures.
- */
-static inline gfp_t gfp_exact_node(gfp_t flags)
-{
- return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
-}
#endif

/*
@@ -1583,7 +1569,7 @@ slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid)
* would be relatively rare and ignorable.
*/
static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
- int nodeid)
+ int nodeid, bool fallback)
{
struct page *page;
int nr_pages;
@@ -1595,6 +1581,9 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
if (memcg_charge_slab(cachep, flags, cachep->gfporder))
return NULL;

+ if (!fallback)
+ flags = (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
+
page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
if (!page) {
memcg_uncharge_slab(cachep, cachep->gfporder);
@@ -2641,7 +2630,8 @@ static int cache_grow(struct kmem_cache *cachep,
* 'nodeid'.
*/
if (!page)
- page = kmem_getpages(cachep, local_flags, nodeid);
+ page = kmem_getpages(cachep, local_flags, nodeid,
+ !IS_ENABLED(CONFIG_NUMA));
if (!page)
goto failed;

@@ -2840,7 +2830,7 @@ alloc_done:
if (unlikely(!ac->avail)) {
int x;
force_grow:
- x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
+ x = cache_grow(cachep, flags, node, NULL);

/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
@@ -3034,7 +3024,7 @@ retry:
get_node(cache, nid) &&
get_node(cache, nid)->free_objects) {
obj = ____cache_alloc_node(cache,
- gfp_exact_node(flags), nid);
+ flags | __GFP_THISNODE, nid);
if (obj)
break;
}
@@ -3052,7 +3042,7 @@ retry:
if (local_flags & __GFP_WAIT)
local_irq_enable();
kmem_flagcheck(cache, flags);
- page = kmem_getpages(cache, local_flags, numa_mem_id());
+ page = kmem_getpages(cache, local_flags, numa_mem_id(), true);
if (local_flags & __GFP_WAIT)
local_irq_disable();
if (page) {
@@ -3062,7 +3052,7 @@ retry:
nid = page_to_nid(page);
if (cache_grow(cache, flags, nid, page)) {
obj = ____cache_alloc_node(cache,
- gfp_exact_node(flags), nid);
+ flags | __GFP_THISNODE, nid);
if (!obj)
/*
* Another processor may allocate the
@@ -3133,7 +3123,7 @@ retry:

must_grow:
spin_unlock(&n->list_lock);
- x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL);
+ x = cache_grow(cachep, flags, nodeid, NULL);
if (x)
goto retry;
--
2.1.4

Vladimir Davydov
2015-08-30 19:03:10 UTC
Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page
allocation") made allocate_slab() try to allocate high-order slab pages
without __GFP_WAIT in order to avoid invoking reclaim/compaction when we
can fall back on low-order pages. However, it broke memcg/memory.high
logic when kmem accounting is enabled. The memory.high threshold works
as a soft limit: an allocation does not fail if the threshold is
breached, but we call direct reclaim to compensate for the excess.
Without __GFP_WAIT we cannot invoke the reclaimer, and therefore we will
keep exceeding memory.high further and further until a normal __GFP_WAIT
allocation is issued.

Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to
memcg_charge_slab() even on high order page allocations w/o any
performance impact. So let us fix this problem by excluding __GFP_WAIT
only from alloc_pages() while still forwarding it to memcg_charge_slab()
if the context allows.

Reported-by: Tejun Heo <***@kernel.org>
Signed-off-by: Vladimir Davydov <***@parallels.com>
---
mm/slub.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e180f8dcd06d..416a332277cb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1333,6 +1333,14 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
if (memcg_charge_slab(s, flags, order))
return NULL;

+ /*
+ * Let the initial higher-order allocation fail under memory pressure
+ * so we fall-back to the minimum order allocation.
+ */
+ if (oo_order(oo) > oo_order(s->min))
+ flags = (flags | __GFP_NOWARN | __GFP_NOMEMALLOC) &
+ ~(__GFP_NOFAIL | __GFP_WAIT);
+
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
@@ -1348,7 +1356,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
struct page *page;
struct kmem_cache_order_objects oo = s->oo;
- gfp_t alloc_gfp;
void *start, *p;
int idx, order;

@@ -1359,23 +1366,14 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)

flags |= s->allocflags;

- /*
- * Let the initial higher-order allocation fail under memory pressure
- * so we fall-back to the minimum order allocation.
- */
- alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
- if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
- alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
-
- page = alloc_slab_page(s, alloc_gfp, node, oo);
+ page = alloc_slab_page(s, flags, node, oo);
if (unlikely(!page)) {
oo = s->min;
- alloc_gfp = flags;
/*
* Allocation may have failed due to fragmentation.
* Try a lower order alloc if possible
*/
- page = alloc_slab_page(s, alloc_gfp, node, oo);
+ page = alloc_slab_page(s, flags, node, oo);
if (unlikely(!page))
goto out;
stat(s, ORDER_FALLBACK);
@@ -1385,7 +1383,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
!(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
int pages = 1 << oo_order(oo);

- kmemcheck_alloc_shadow(page, oo_order(oo), alloc_gfp, node);
+ kmemcheck_alloc_shadow(page, oo_order(oo), flags, node);

/*
* Objects from caches that have a constructor don't get
--
2.1.4

Michal Hocko
2015-08-31 13:24:23 UTC
Post by Vladimir Davydov
Hi,
Tejun reported that sometimes memcg/memory.high threshold seems to be
http://www.spinics.net/lists/linux-mm/msg93613.html
It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
first. As a result, if there are enough free pages, memcg reclaim will
not get invoked on kmem allocations, which will lead to uncontrollable
growth of memory usage no matter what memory.high is set to.
Right but isn't that what the caller explicitly asked for? Why should we
ignore that for kmem accounting? It seems like a fix at a wrong layer to
me. Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun. I like the latter more because it allows us to better handle GFP_NOFS
requests as well and there are many sources of these from kmem paths.
Post by Vladimir Davydov
This patch set attempts to fix this issue. For more details please see
comments to individual patches.
Thanks,
mm/slab: skip memcg reclaim only if in atomic context
mm/slub: do not bypass memcg reclaim for high-order page allocation
mm/slab.c | 32 +++++++++++---------------------
mm/slub.c | 24 +++++++++++-------------
2 files changed, 22 insertions(+), 34 deletions(-)
--
2.1.4
--
Michal Hocko
SUSE Labs
Tejun Heo
2015-08-31 13:43:46 UTC
Hello,
Post by Michal Hocko
Right but isn't that what the caller explicitly asked for? Why should we
ignore that for kmem accounting? It seems like a fix at a wrong layer to
me. Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun. I like the latter more because it allows us to better handle GFP_NOFS
requests as well and there are many sources of these from kmem paths.
Yeah, this is beginning to look like we're trying to solve the problem
at the wrong layer. slab/slub or whatever else should be able to use
GFP_NOWAIT at whatever frequency they want for speculative
allocations.

Thanks.
--
tejun
Vladimir Davydov
2015-08-31 14:30:29 UTC
Post by Tejun Heo
Post by Michal Hocko
Right but isn't that what the caller explicitly asked for? Why should we
ignore that for kmem accounting? It seems like a fix at a wrong layer to
me. Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun. I like the latter more because it allows us to better handle GFP_NOFS
requests as well and there are many sources of these from kmem paths.
Yeah, this is beginning to look like we're trying to solve the problem
at the wrong layer. slab/slub or whatever else should be able to use
GFP_NOWAIT in whatever frequency they want for speculative
allocations.
slab/slub can issue alloc_pages() any time with any flags they want and
it won't be accounted to memcg, because kmem is accounted at slab/slub
layer, not in buddy.

Thanks,
Vladimir
Tejun Heo
2015-08-31 14:39:50 UTC
Post by Vladimir Davydov
slab/slub can issue alloc_pages() any time with any flags they want and
it won't be accounted to memcg, because kmem is accounted at slab/slub
layer, not in buddy.
Hmmm? I meant the eventual calling into try_charge w/ GFP_NOWAIT.
Speculative usage of GFP_NOWAIT is bound to increase and we don't want
to put on extra restrictions from memcg side. For memory.high,
punting to the return path is a pretty straightforward solution which
should make the problem go away almost entirely.

Thanks.
--
tejun
Vladimir Davydov
2015-08-31 15:18:37 UTC
Post by Tejun Heo
Post by Vladimir Davydov
slab/slub can issue alloc_pages() any time with any flags they want and
it won't be accounted to memcg, because kmem is accounted at slab/slub
layer, not in buddy.
Hmmm? I meant the eventual calling into try_charge w/ GFP_NOWAIT.
Speculative usage of GFP_NOWAIT is bound to increase and we don't want
to put on extra restrictions from memcg side.
We already put restrictions on slab/slub from the memcg side, because
kmem accounting is a part of slab/slub. They have to cooperate in order
to get things working. If slab/slub wants to make a speculative
allocation for some reason, it should just move memcg_charge out of this
speculative alloc section. This is what this patch set does.

We have to be cautious about placing memcg_charge in slab/slub. To
understand why, consider the SLAB case, which first tries to allocate
from all nodes in the order of preference w/o __GFP_WAIT and only if
that fails falls back on an allocation from any node w/ __GFP_WAIT. This
is its internal algorithm. If we blindly put memcg_charge into the
alloc_slab method, then, when we are near the memcg limit, we will go
over all NUMA nodes in vain and finally fall back to a __GFP_WAIT
allocation, which will get a slab from a random node. Not only do we do
more work than necessary due to walking over all NUMA nodes for nothing,
but we also break SLAB's internal logic! And you just can't fix it in
memcg, because memcg knows nothing about the internal logic of SLAB and
how it handles NUMA nodes.
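
For illustration, the node walk in question, heavily condensed from
SLAB's fallback_alloc():

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		nid = zone_to_nid(zone);
		/*
		 * Near the memcg limit a charge attempted here w/o
		 * __GFP_WAIT fails on every node, even if the nodes
		 * themselves have plenty of free pages ...
		 */
		obj = ____cache_alloc_node(cache, gfp_exact_node(flags), nid);
		if (obj)
			break;
	}
	/*
	 * ... so we end up here, where the __GFP_WAIT charge finally
	 * succeeds after reclaim and the slab comes from a random node.
	 */
	if (!obj)
		page = kmem_getpages(cache, local_flags, numa_mem_id());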

SLUB has a different problem. It tries to avoid high-order allocations
if there is a risk of invoking the costly memory compactor. This has
nothing to do with memcg, because memcg does not care if the charge is
for a high order page or not.
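
For reference, the optimistic attempt in allocate_slab() looks as
follows (condensed; this is the pre-patch code that the second patch
above removes):

	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
	if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;

	page = alloc_slab_page(s, alloc_gfp, node, oo);
	if (unlikely(!page))
		/* high order failed, fall back to the minimum order */
		page = alloc_slab_page(s, flags, node, s->min);

Clearing __GFP_WAIT here is only meant to keep the page allocator from
reclaiming/compacting for a high-order page; charging memcg w/o
__GFP_WAIT is an unintended side effect.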

Thanks,
Vladimir
Post by Tejun Heo
For memory.high,
punting to the return path is a pretty straightforward solution which
should make the problem go away almost entirely.
Tejun Heo
2015-08-31 15:48:06 UTC
Hello,
Post by Vladimir Davydov
We have to be cautious about placing memcg_charge in slab/slub. To
understand why, consider SLAB case, which first tries to allocate from
all nodes in the order of preference w/o __GFP_WAIT and only if it fails
falls back on an allocation from any node w/ __GFP_WAIT. This is its
internal algorithm. If we blindly put memcg_charge to alloc_slab method,
then, when we are near the memcg limit, we will go over all NUMA nodes
in vain, then finally fall back to __GFP_WAIT allocation, which will get
a slab from a random node. Not only do we do more work than necessary due
to walking over all NUMA nodes for nothing, but we also break SLAB
internal logic! And you just can't fix it in memcg, because memcg knows
nothing about the internal logic of SLAB, how it handles NUMA nodes.
SLUB has a different problem. It tries to avoid high-order allocations
if there is a risk of invoking costly memory compactor. It has nothing
to do with memcg, because memcg does not care if the charge is for a
high order page or not.
Maybe I'm missing something but aren't both issues caused by memcg
failing to provide headroom for NOWAIT allocations when the
consumption gets close to the max limit? Regardless of the specific
usage, !__GFP_WAIT means "give me memory if it can be spared w/o
inducing direct time-consuming maintenance work" and the contract
around it is that such requests will mostly succeed under nominal
conditions. Also, slab/slub might not stay as the only user of
try_charge(). I still think solving this from memcg side is the right
direction.

Thanks.
--
tejun
Vladimir Davydov
2015-08-31 16:51:54 UTC
Post by Tejun Heo
Post by Vladimir Davydov
We have to be cautious about placing memcg_charge in slab/slub. To
understand why, consider SLAB case, which first tries to allocate from
all nodes in the order of preference w/o __GFP_WAIT and only if it fails
falls back on an allocation from any node w/ __GFP_WAIT. This is its
internal algorithm. If we blindly put memcg_charge to alloc_slab method,
then, when we are near the memcg limit, we will go over all NUMA nodes
in vain, then finally fall back to __GFP_WAIT allocation, which will get
a slab from a random node. Not only do we do more work than necessary due
to walking over all NUMA nodes for nothing, but we also break SLAB
internal logic! And you just can't fix it in memcg, because memcg knows
nothing about the internal logic of SLAB, how it handles NUMA nodes.
SLUB has a different problem. It tries to avoid high-order allocations
if there is a risk of invoking costly memory compactor. It has nothing
to do with memcg, because memcg does not care if the charge is for a
high order page or not.
Maybe I'm missing something but aren't both issues caused by memcg
failing to provide headroom for NOWAIT allocations when the
consumption gets close to the max limit?
That's correct.
Post by Tejun Heo
Regardless of the specific usage, !__GFP_WAIT means "give me memory if
it can be spared w/o inducing direct time-consuming maintenance work"
and the contract around it is that such requests will mostly succeed
under nominal conditions. Also, slab/slub might not stay as the only
user of try_charge().
Indeed, there might be other users trying GFP_NOWAIT before falling back
to GFP_KERNEL, but they are not doing that constantly and hence cause no
problems. If SLAB/SLUB plays such tricks, the problem becomes massive:
under certain conditions *every* try_charge may be invoked w/o
__GFP_WAIT, resulting in memory.high being breached and memory.max being
hit.

Generally speaking, handing over reclaim responsibility to task_work
won't help, because there might be cases when a process spends quite a
lot of time in kernel invoking lots of GFP_KERNEL allocations before
returning to userspace. Without fixing slab/slub, such a process will
charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
memory.max. If there are no other active processes in the cgroup, the
cgroup can stay with memory.high excess for a relatively long time
(suppose the process was throttled in kernel), possibly hurting the rest
of the system. What is worse, if the process happens to invoke a real
GFP_NOWAIT allocation when it's about to hit the limit, it will fail.

If we want to allow the slab/slub implementation to invoke try_charge
wherever it wants, we need to introduce an asynchronous thread doing
reclaim when a memcg is approaching its limit (or teach kswapd to do
that). That's a way to go, but what's the point in complicating things
prematurely while it seems we can fix the problem by using a technique
similar to the one behind memory.high?

Nevertheless, even if we introduced such a thread, it'd be just insane
to allow slab/slub to blindly insert try_charge. Let me repeat the
examples of SLAB/SLUB sub-optimal behavior caused by thoughtless usage
of try_charge I gave above:

- memcg knows nothing about NUMA nodes, so what's the point in failing
!__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
- memcg knows nothing about high order pages, so what's the point in
failing !__GFP_WAIT allocations used by SLUB to try to allocate a
high order page?

Thanks,
Vladimir
Post by Tejun Heo
I still think solving this from memcg side is the right direction.
Tejun Heo
2015-08-31 17:03:19 UTC
Hello,

On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote:
..
Post by Vladimir Davydov
If we want to allow slab/slub implementation to invoke try_charge
wherever it wants, we need to introduce an asynchronous thread doing
reclaim when a memcg is approaching its limit (or teach kswapd to do that).
In the long term, I think this is the way to go.
Post by Vladimir Davydov
That's a way to go, but what's the point in complicating things
prematurely while it seems we can fix the problem by using a technique
similar to the one behind memory.high?
Cuz we're now scattering workarounds to multiple places and I'm sure
we'll add more try_charge() users (e.g. we want to fold in tcp memcg
under the same knobs) and we'll have to worry about the same problem
all over again and will inevitably miss some cases leading to subtle
failures.
Post by Vladimir Davydov
Nevertheless, even if we introduced such a thread, it'd be just insane
to allow slab/slub to blindly insert try_charge. Let me repeat the examples
of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
- memcg knows nothing about NUMA nodes, so what's the point in failing
!__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
- memcg knows nothing about high order pages, so what's the point in
failing !__GFP_WAIT allocations used by SLUB to try to allocate a
high order page?
Both are optimistic speculative actions and as long as memcg can
guarantee that those requests will succeed under normal circumstances,
as the system-wide mm does, it isn't a problem.

In general, we want to make sure inside-cgroup behaviors are as close to
system-wide behaviors as possible, scoped but equivalent in kind.
Doing things differently, while inevitable in certain cases, is likely
to get messy in the long term.

Thanks.
--
tejun
Vladimir Davydov
2015-08-31 19:26:37 UTC
Post by Tejun Heo
...
Post by Vladimir Davydov
If we want to allow slab/slub implementation to invoke try_charge
wherever it wants, we need to introduce an asynchronous thread doing
reclaim when a memcg is approaching its limit (or teach kswapd to do that).
In the long term, I think this is the way to go.
Quite probably, or we can use task_work, or direct reclaim instead. It's
not that obvious to me yet which one is the best.
Post by Tejun Heo
Post by Vladimir Davydov
That's a way to go, but what's the point in complicating things
prematurely while it seems we can fix the problem by using a technique
similar to the one behind memory.high?
Cuz we're now scattering workarounds to multiple places and I'm sure
we'll add more try_charge() users (e.g. we want to fold in tcp memcg
under the same knobs) and we'll have to worry about the same problem
all over again and will inevitably miss some cases leading to subtle
failures.
I don't think we will need to insert try_charge_kmem anywhere else,
because all kmem users allocate memory either using kmalloc and friends
or using alloc_pages. kmalloc is accounted. For those who prefer
alloc_pages, there is the alloc_kmem_pages helper.
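
For reference, the helper wraps the memcg charge around the page
allocation roughly like this (condensed from mm/page_alloc.c of that
time):

	struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
	{
		struct page *page;
		struct mem_cgroup *memcg = NULL;

		if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
			return NULL;
		page = alloc_pages(gfp_mask, order);
		memcg_kmem_commit_charge(page, memcg, order);
		return page;
	}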
Post by Tejun Heo
Post by Vladimir Davydov
Nevertheless, even if we introduced such a thread, it'd be just insane
to allow slab/slub to blindly insert try_charge. Let me repeat the examples
of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
- memcg knows nothing about NUMA nodes, so what's the point in failing
!__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
- memcg knows nothing about high order pages, so what's the point in
failing !__GFP_WAIT allocations used by SLUB to try to allocate a
high order page?
Both are optimistic speculative actions and as long as memcg can
guarantee that those requests will succeed under normal circumstances,
as the system-wide mm does, it isn't a problem.
In general, we want to make sure inside-cgroup behaviors are as close to
system-wide behaviors as possible, scoped but equivalent in kind.
Doing things differently, while inevitable in certain cases, is likely
to get messy in the long term.
I totally agree that we should strive to make a kmem user feel roughly
the same in memcg as if it were running on a host with an equal amount
of RAM. There are two ways to achieve that:

1. Make the API functions, i.e. kmalloc and friends, behave inside
memcg roughly the same way as they do in the root cgroup.
2. Make the internal memcg functions, i.e. try_charge and friends,
behave roughly the same way as alloc_pages.

I find way 1 more flexible, because we don't have to blindly follow
heuristics used on global memory reclaim and therefore have more
opportunities to achieve the same goal.

Thanks,
Vladimir
Christoph Lameter
2015-08-31 20:22:30 UTC
Post by Vladimir Davydov
I totally agree that we should strive to make a kmem user feel roughly
the same in memcg as if it were running on a host with an equal amount of
1. Make the API functions, i.e. kmalloc and friends, behave inside
memcg roughly the same way as they do in the root cgroup.
2. Make the internal memcg functions, i.e. try_charge and friends,
behave roughly the same way as alloc_pages.
I find way 1 more flexible, because we don't have to blindly follow
heuristics used on global memory reclaim and therefore have more
opportunities to achieve the same goal.
The heuristics need to integrate well whether it's in a cgroup or not.
In general, make the use of cgroups as transparent as possible to the
rest of the code.

Vladimir Davydov
2015-09-01 09:25:46 UTC
Post by Christoph Lameter
Post by Vladimir Davydov
I totally agree that we should strive to make a kmem user feel roughly
the same in memcg as if it were running on a host with an equal amount of
1. Make the API functions, i.e. kmalloc and friends, behave inside
memcg roughly the same way as they do in the root cgroup.
2. Make the internal memcg functions, i.e. try_charge and friends,
behave roughly the same way as alloc_pages.
I find way 1 more flexible, because we don't have to blindly follow
heuristics used on global memory reclaim and therefore have more
opportunities to achieve the same goal.
The heuristics need to integrate well whether it's in a cgroup or not.
In general, make the use of cgroups as transparent as possible to the
rest of the code.
Half of the kmem accounting implementation resides in SLAB/SLUB. We
can't just make the use of cgroups transparent there. For the rest of
the code, which uses kmalloc, cgroups are transparent.

Indeed, we can make memcg_charge_slab behave exactly like alloc_pages,
we can even put it into alloc_pages (where it used to be), but why do
that if the only user of memcg_charge_slab is the SLAB/SLUB core?

I think we'd have more space to manoeuvre if we just taught SLAB/SLUB to
use memcg_charge_slab wisely (as it used to until recently), because
memcg charge/reclaim is quite different from global alloc/reclaim:

- it isn't aware of NUMA nodes, so trying to charge w/o __GFP_WAIT
while inspecting nodes, like in the SLAB case, is meaningless

- it isn't aware of high order page allocations, so trying to charge
w/o __GFP_WAIT while optimistically trying to get a high order page,
like in the SLUB case, is meaningless too

- it can always let a high prio allocation go unaccounted, so IMO there
is no point in introducing emergency reserves (__GFP_MEMALLOC
handling)

- it can always charge a GFP_NOWAIT allocation even if it exceeds the
limit, issuing direct reclaim when a GFP_KERNEL allocation comes or
from a task work, because there is no risk of depleting memory
reserves; so it isn't obvious to me whether we really need an async
thread handling memcg reclaim like kswapd (see the sketch below)
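
A sketch of what the last point could look like (purely hypothetical, no
such code exists; memcg->reclaim_work is a made-up worker):

	/* hypothetical: never fail a NOWAIT charge, catch up later */
	if (!(gfp_mask & __GFP_WAIT)) {
		page_counter_charge(&memcg->memory, nr_pages); /* may overrun */
		schedule_work(&memcg->reclaim_work); /* deferred reclaim */
		return 0;
	}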

Thanks,
Vladimir
Vladimir Davydov
2015-08-31 14:21:14 UTC
Post by Michal Hocko
Post by Vladimir Davydov
Tejun reported that sometimes memcg/memory.high threshold seems to be
http://www.spinics.net/lists/linux-mm/msg93613.html
It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
first. As a result, if there are enough free pages, memcg reclaim will
not get invoked on kmem allocations, which will lead to uncontrollable
growth of memory usage no matter what memory.high is set to.
Right but isn't that what the caller explicitly asked for?
No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
might ignore that and charge memcg w/o __GFP_WAIT.
Post by Michal Hocko
Why should we ignore that for kmem accounting? It seems like a fix at
a wrong layer to me.
Let's forget about memory.high for a minute.

1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We issue memcg reclaim and go to fallback
alloc then, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.

2. SLUB. Someone calls kmalloc and there are enough free high order
pages. If there is no memcg limit, we will allocate a high order
slab page, which is in accordance with SLUB internal logic. With
memcg limit set, we are likely to fail to charge high order page
(because we currently try to charge high order pages w/o __GFP_WAIT)
and fallback on a low order page. The latter is unexpected and
unjustified.

That being said, this is the fix at the right layer.
Post by Michal Hocko
Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun.
The mechanism proposed by Tejun won't help us to avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.

To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between the limit and the usage is getting too small. It may be
done from a workqueue or from task_work, but currently I don't see any
reason to complicate things and not just start reclaim directly, just
like memory.high does.

I mean, currently you can protect against GFP_NOWAIT failures by setting
memory.high to be 1-2 MB lower than memory.max, and this *will* work,
because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they
will alternate with normal GFP_KERNEL allocations sooner or later. It
does not mean we should encourage users to set memory.high to protect
against such failures, because, as pointed out by Tejun, the logic
behind memory.high is currently opaque and can change, but we can
introduce memcg-internal watermarks that would work exactly as
memory.high does and hence help us against GFP_NOWAIT/GFP_NOFS failures.
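
Such a watermark could be as simple as the following (purely
hypothetical; "headroom" would be a few MB of internal slack):

	/* hypothetical: on a __GFP_WAIT charge, reclaim early so that
	 * subsequent NOWAIT/NOFS charges find room below the hard limit */
	if ((gfp_mask & __GFP_WAIT) &&
	    page_counter_read(&memcg->memory) >
	    memcg->memory.limit - headroom)
		try_to_free_mem_cgroup_pages(memcg, nr_pages,
					     gfp_mask, true);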

Thanks,
Vladimir
Post by Michal Hocko
I like the latter more because it allows us to better handle GFP_NOFS
requests as well and there are many sources of these from kmem paths.
Tejun Heo
2015-08-31 14:46:13 UTC
Hello, Vladimir.

On Mon, Aug 31, 2015 at 05:20:49PM +0300, Vladimir Davydov wrote:
..
Post by Vladimir Davydov
That being said, this is the fix at the right layer.
While this *might* be a necessary workaround for the hard limit case
right now, this is by no means the fix at the right layer. The
expectation is that mm keeps a reasonable amount of memory available
for allocations which can't block. These allocations may fail from
time to time depending on luck and under extreme memory pressure but
the caller should be able to depend on it as a speculative allocation
mechanism which doesn't fail willy-nilly.

Hardlimit breaking GFP_NOWAIT behavior is a bug on memcg side, not
slab or slub.

Thanks.
--
tejun
Vladimir Davydov
2015-08-31 15:24:58 UTC
Post by Tejun Heo
Hello, Vladimir.
...
Post by Vladimir Davydov
That being said, this is the fix at the right layer.
While this *might* be a necessary workaround for the hard limit case
right now, this is by no means the fix at the right layer. The
expectation is that mm keeps a reasonable amount of memory available
for allocations which can't block. These allocations may fail from
time to time depending on luck and under extreme memory pressure but
the caller should be able to depend on it as a speculative allocation
mechanism which doesn't fail willy-nilly.
Hardlimit breaking GFP_NOWAIT behavior is a bug on memcg side, not
slab or slub.
I never denied that there is a GFP_NOWAIT/GFP_NOFS problem in memcg. I
even proposed ways to cope with it in one of the previous e-mails.

Nevertheless, we just can't allow slab/slub internals to call
memcg_charge whenever they want, as I pointed out in a parallel thread.

Thanks,
Vladimir
Michal Hocko
2015-09-01 12:36:22 UTC
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
Tejun reported that sometimes memcg/memory.high threshold seems to be
http://www.spinics.net/lists/linux-mm/msg93613.html
It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
first. As a result, if there are enough free pages, memcg reclaim will
not get invoked on kmem allocations, which will lead to uncontrollable
growth of memory usage no matter what memory.high is set to.
Right but isn't that what the caller explicitly asked for?
No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
might ignore that and charge memcg w/o __GFP_WAIT.
I was referring to the slab allocator as the caller. Sorry for not being
clear about that.
Post by Vladimir Davydov
Post by Michal Hocko
Why should we ignore that for kmem accounting? It seems like a fix at
a wrong layer to me.
Let's forget about memory.high for a minute.
1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We issue memcg reclaim and go to fallback
alloc then, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.
I am not very familiar with the SLAB internals, but how is this
different from the global case? If the preferred node is full then a
__GFP_THISNODE request will make it fail early even without giving
GFP_NOWAIT additional access to atomic memory reserves. The fact that
the memcg case fails earlier is perfectly expected because the
restriction is tighter than in the global case.

How the fallback is implemented and whether trying another node before
reclaiming from the preferred one is reasonable, I dunno. This is for
SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
for memcg-enabled setups subtly different. And that is bad.
Post by Vladimir Davydov
2. SLUB. Someone calls kmalloc and there are enough free high order
pages. If there is no memcg limit, we will allocate a high order
slab page, which is in accordance with SLUB internal logic. With
memcg limit set, we are likely to fail to charge high order page
(because we currently try to charge high order pages w/o __GFP_WAIT)
and fallback on a low order page. The latter is unexpected and
unjustified.
And this case is very similar, and I even argue that it shows more
brokenness with your patch. The SLUB allocator has _explicitly_ asked
for an allocation _without_ reclaim because that would be unnecessarily
too costly and there is another, less expensive fallback. But memcg
would be ignoring this with your patch AFAIU and break the optimization.
There are other cases like that. E.g. THP pages are allocated without
__GFP_WAIT when defrag is disabled.
Post by Vladimir Davydov
That being said, this is the fix at the right layer.
Post by Michal Hocko
Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun.
The mechanism proposed by Tejun won't help us to avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.
Why would that be a problem? The _hard_ limit is reached and reclaim
cannot make any progress. An allocation failure is to be expected.
GFP_NOWAIT will fail normally and GFP_NOFS will attempt to reclaim
before failing.
Post by Vladimir Davydov
To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between limit and usage is getting too small. It may be done
from a workqueue or from task_work, but currently I don't see any reason
to complicate things and not just start reclaim directly, just like
memory.high does.
Yes we can do better than we do right now. But that doesn't mean we
should put hacks all over the place and lie about the allocation
context.
Post by Vladimir Davydov
I mean, currently you can protect against GFP_NOWAIT failures by setting
memory.high to be 1-2 MB lower than memory.max and this *will* work,
because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they
will alternate with normal GFP_KERNEL allocations sooner or later. It
does not mean we should encourage users to set memory.high to protect
against such failures, because, as pointed out by Tejun, logic behind
memory.high is currently opaque and can change, but we can introduce
memcg-internal watermarks that would work exactly as memory.high and
hence help us against GFP_NOWAIT/GFP_NOFS failures.
I am not against something like watermarks and doing more pro-active
reclaim but this is far from easy to do - which is one of the reasons we
do not have it yet. The idea from Tejun about the return-to-userspace
reclaim is nice in that regard that it happens from a well defined
context and helps to keep memory.high behavior much saner.
--
Michal Hocko
SUSE Labs
Vladimir Davydov
2015-09-01 13:40:51 UTC
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
Tejun reported that sometimes memcg/memory.high threshold seems to be
http://www.spinics.net/lists/linux-mm/msg93613.html
It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
first. As a result, if there are enough free pages, memcg reclaim will
not get invoked on kmem allocations, which will lead to uncontrollable
growth of memory usage no matter what memory.high is set to.
Right but isn't that what the caller explicitly asked for?
No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
might ignore that and charge memcg w/o __GFP_WAIT.
I was referring to the slab allocator as the caller. Sorry for not being
clear about that.
Post by Vladimir Davydov
Post by Michal Hocko
Why should we ignore that for kmem accounting? It seems like a fix at
a wrong layer to me.
Let's forget about memory.high for a minute.
1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We issue memcg reclaim and go to fallback
alloc then, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.
I am not familiar with the SLAB internals much but how is it different
from the global case. If the preferred node is full then __GFP_THISNODE
request will make it fail early even without giving GFP_NOWAIT
additional access to atomic memory reserves. The fact that memcg case
fails earlier is perfectly expected because the restriction is tighter
than the global case.
memcg restrictions are orthogonal to NUMA: failing an allocation from a
particular node does not mean failing the memcg charge, and vice versa.
Post by Michal Hocko
How the fallback is implemented and whether trying other node before
reclaiming from the preferred one is reasonable I dunno. This is for
SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
for memcg enabled setups subtly different. And that is bad.
Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
inspecting if a NUMA node has free pages makes SLAB behave subtly
differently: SLAB will walk over all NUMA nodes for nothing instead of
invoking memcg reclaim once a free page is found.

You are talking about memcg/kmem accounting as if it were done in the
buddy allocator, on top of which the slab layer is built knowing nothing
about memcg accounting on the lower layer. That's not true and that
simply can't be true. Kmem accounting is implemented at the slab layer.
Memcg provides its memcg_charge_slab/uncharge methods solely for the
slab core, so it's OK to have some calling conventions between them.
What we are really obliged to do is preserve the behavior of slab's
external API, i.e. kmalloc and friends.
Post by Michal Hocko
Post by Vladimir Davydov
2. SLUB. Someone calls kmalloc and there are enough free high order
pages. If there is no memcg limit, we will allocate a high order
slab page, which is in accordance with SLUB internal logic. With
memcg limit set, we are likely to fail to charge high order page
(because we currently try to charge high order pages w/o __GFP_WAIT)
and fallback on a low order page. The latter is unexpected and
unjustified.
And this case very similar and I even argue that it shows more
brokenness with your patch. The SLUB allocator has _explicitly_ asked
for an allocation _without_ reclaim because that would be unnecessarily
too costly and there is other less expensive fallback. But memcg would
You are ignoring the fact that, in contrast to alloc_pages, for memcg
there is practically no difference between charging a 4-order page or a
1-order page. OTOH, using 1-order pages where we could go with 4-order
pages increases page fragmentation at the global level. This subtly
breaks an internal SLUB optimization. Once again, kmem accounting is not
something staying aside from the slab core, it's a part of the slab core.
Post by Michal Hocko
be ignoring this with your patch AFAIU and break the optimization. There
are other cases like that. E.g. THP pages are allocated without GFP_WAIT
when defrag is disabled.
It might be wrong. If we can't find a contiguous 2Mb page, we should
probably give up instead of calling the compactor. For memcg it might be
better to reclaim some space for a 2Mb page right now and map a 2Mb page
instead of reclaiming space for 512 4Kb pages a moment later, because in
the memcg case there is absolutely no difference between reclaiming 2Mb
for a huge page and 2Mb for 512 4Kb pages.
Post by Michal Hocko
Post by Vladimir Davydov
That being said, this is the fix at the right layer.
Post by Michal Hocko
Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun.
The mechanism proposed by Tejun won't help us to avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.
Why would that be a problem? The _hard_ limit is reached and reclaim
cannot make any progress. An allocation failure is to be expected.
GFP_NOWAIT will fail normally and GFP_NOFS will attempt to reclaim
before failing.
Quoting my e-mail to Tejun explaining why using task_work won't help if
we don't fix SLAB/SLUB:

: Generally speaking, handing over reclaim responsibility to task_work
: won't help, because there might be cases when a process spends quite a
: lot of time in kernel invoking lots of GFP_KERNEL allocations before
: returning to userspace. Without fixing slab/slub, such a process will
: charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
: memory.max. If there are no other active processes in the cgroup, the
: cgroup can stay with memory.high excess for a relatively long time
: (suppose the process was throttled in kernel), possibly hurting the rest
: of the system. What is worse, if the process happens to invoke a real
: GFP_NOWAIT allocation when it's about to hit the limit, it will fail.

For a kmalloc user that's completely unexpected.
Post by Michal Hocko
Post by Vladimir Davydov
To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between limit and usage is getting too small. It may be done
from a workqueue or from task_work, but currently I don't see any reason
to complicate things and not just start reclaim directly, just like
memory.high does.
Yes we can do better than we do right now. But that doesn't mean we
should put hacks all over the place and lie about the allocation
context.
What do you mean by saying "all over the place"? It's a fix for the
kmem implementation, to be more exact for the part of it residing in the
slab core. Everyone else, except a couple of kmem users issuing
alloc_page directly like threadinfo, will use kmalloc and know nothing
about what's going on there and how all this accounting stuff is handled
- they will just use plain old convenient kmalloc, which works exactly
as it does in the root cgroup.
Post by Michal Hocko
Post by Vladimir Davydov
I mean, currently you can protect against GFP_NOWAIT failures by setting
memory.high to be 1-2 MB lower than memory.max and this *will* work,
because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they
will alternate with normal GFP_KERNEL allocations sooner or later. It
does not mean we should encourage users to set memory.high to protect
against such failures, because, as pointed out by Tejun, logic behind
memory.high is currently opaque and can change, but we can introduce
memcg-internal watermarks that would work exactly as memory.high and
hence help us against GFP_NOWAIT/GFP_NOFS failures.
I am not against something like watermarks and doing more pro-active
reclaim but this is far from easy to do - which is one of the reasons we
do not have it yet. The idea from Tejun about the return-to-userspace
reclaim is nice in that regard that it happens from a well defined
context and helps to keep memory.high behavior much saner.
I don't say what Tejun proposed is crap. It might be a very good
lightweight alternative to per-memcg kswapd. However, w/o fixing
SLAB/SLUB it's useless.

Thanks,
Vladimir
Michal Hocko
2015-09-01 15:01:36 UTC
{...}
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We issue memcg reclaim and go to fallback
alloc then, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.
I am not familiar with the SLAB internals much but how is it different
from the global case. If the preferred node is full then __GFP_THISNODE
request will make it fail early even without giving GFP_NOWAIT
additional access to atomic memory reserves. The fact that memcg case
fails earlier is perfectly expected because the restriction is tighter
than the global case.
memcg restrictions are orthogonal to NUMA: failing an allocation from a
particular node does not mean failing memcg charge and vice versa.
Sure, memcg doesn't care about NUMA; it just puts an additional
constraint on top of all the existing ones. The point I've tried to make
is that the logic is currently the same: both the page allocator (with
the node restriction) and memcg (with the cumulative amount restriction)
behave consistently. Neither of them tries to reclaim in order to
achieve its goals. How conservative memcg is about allowing GFP_NOWAIT
allocations is a separate issue, and all those details belong to memcg
proper, the same as the allocation strategy for these allocations
belongs to the page allocator.
Post by Vladimir Davydov
Post by Michal Hocko
How the fallback is implemented and whether trying other node before
reclaiming from the preferred one is reasonable I dunno. This is for
SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
for memcg enabled setups subtly different. And that is bad.
Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
inspecting if a NUMA node has free pages makes SLAB behaviour subtly
differently: SLAB will walk over all NUMA nodes for nothing instead of
invoking memcg reclaim once a free page is found.
So you are saying that the SLAB kmem accounting in this particular path
is suboptimal because the fallback mode doesn't retry the local node
with reclaim enabled before falling back to other nodes?
I would consider it quite surprising as well, even for the global case,
because __GFP_THISNODE doesn't wake up kswapd to make room on that node.
Post by Vladimir Davydov
You are talking about memcg/kmem accounting as if it were done in the
buddy allocator on top of which the slab layer is built knowing nothing
about memcg accounting on the lower layer. That's not true and that
simply can't be true. Kmem accounting is implemented at the slab layer.
Memcg provides its memcg_charge_slab/uncharge methods solely for
slab core, so it's OK to have some calling conventions between them.
What we are really obliged to do is to preserve behavior of slab's
external API, i.e. kmalloc and friends.
I guess I understand what you are saying here, but it sounds like
special casing which tries to be clever, because the current code has to
understand both the lower level allocator and the kmem charge paths to
decide how to juggle them. This is imho bad and hard to maintain long
term.
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
2. SLUB. Someone calls kmalloc and there are enough free high order
pages. If there is no memcg limit, we will allocate a high order
slab page, which is in accordance with SLUB internal logic. With
memcg limit set, we are likely to fail to charge high order page
(because we currently try to charge high order pages w/o __GFP_WAIT)
and fallback on a low order page. The latter is unexpected and
unjustified.
And this case is very similar and I even argue that it shows more
brokenness with your patch. The SLUB allocator has _explicitly_ asked
for an allocation _without_ reclaim because that would be unnecessarily
too costly and there is another, less expensive fallback. But memcg would
You are ignoring the fact that, in contrast to alloc_pages, for memcg
there is practically no difference between charging a 4-order page or a
1-order page.
But this is an implementation detail which might change anytime in the
future.
Post by Vladimir Davydov
OTOH, using 1-order pages where we could go with 4-order
pages increases page fragmentation at the global level. This subtly
breaks internal SLUB optimization. Once again, kmem accounting is not
something staying aside from slab core, it's a part of slab core.
This is certainly true and it is what you get when you put an additional
constraint on top of an existing one. You simply cannot get both great
performance _and_ a local memory restriction.
Post by Vladimir Davydov
Post by Michal Hocko
be ignoring this with your patch AFAIU and break the optimization. There
are other cases like that. E.g. THP pages are allocated without __GFP_WAIT
when defrag is disabled.
It might be wrong. If we can't find a contiguous 2Mb page, we should
probably give up instead of calling compactor. For memcg it might be
better to reclaim some space for 2Mb page right now and map a 2Mb page
instead of reclaiming space for 512 4Kb pages a moment later, because in
memcg case there is absolutely no difference between reclaiming 2Mb for
a huge page and 2Mb for 512 4Kb pages.
Or maybe the whole reclaim just doesn't pay off because the TLB savings
will never compensate for the reclaim. The defrag knob basically says
that we shouldn't try to opportunistically prepare room for the THP
page.
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
That being said, this is the fix at the right layer.
Post by Michal Hocko
Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun.
The mechanism proposed by Tejun won't help us to avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.
Why would that be a problem? The _hard_ limit is reached and reclaim
cannot make any progress. An allocation failure is to be expected.
GFP_NOWAIT will fail normally and GFP_NOFS will attempt to reclaim
before failing.
Quoting my e-mail to Tejun explaining why using task_work won't help if
slab/slub is not fixed:
: Generally speaking, handing over reclaim responsibility to task_work
: won't help, because there might be cases when a process spends quite a
: lot of time in kernel invoking lots of GFP_KERNEL allocations before
: returning to userspace. Without fixing slab/slub, such a process will
: charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
: memory.max. If there are no other active processes in the cgroup, the
: cgroup can stay with memory.high excess for a relatively long time
: (suppose the process was throttled in kernel), possibly hurting the rest
: of the system. What is worse, if the process happens to invoke a real
: GFP_NOWAIT allocation when it's about to hit the limit, it will fail.
For a kmalloc user that's completely unexpected.
We have global reclaim, which handles global memory pressure. And
until the hard limit is enforced I do not see what the huge problem is
here. Sure, the high limit can be exceeded, but that is to be expected.
Same as failing allocations for the hard limit enforcement.

Maybe moving the whole high limit reclaim to the delayed context is not what
we will end up with; we may reduce this to only GFP_NOWAIT or other weak
reclaim contexts. This is to be discussed of course.
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between limit and usage is getting too small. It may be done
from a workqueue or from task_work, but currently I don't see any reason
why we should complicate things and not just start reclaim directly, just like
memory.high does.
Yes we can do better than we do right now. But that doesn't mean we
should put hacks all over the place and lie about the allocation
context.
What do you mean by saying "all over the place"? It's a fix for kmem
implementation, to be more exact for the part of it residing in the slab
core.
I meant into two slab allocators currently because of the implementation
details which are spread into three different places - page allocator,
memcg charging code and the respective slab allocator specific details.
Post by Vladimir Davydov
Everyone else, except a couple of kmem users issuing alloc_page
directly like threadinfo, will use kmalloc and know nothing about what's going
on there and how all this accounting stuff is handled - they will just
use plain old convenient kmalloc, which works exactly as it does in the
root cgroup.
If we ever grow more users and charge more kernel memory then they might
be making similar assumptions and tweaking the allocation/charge context and we
would end up in a bigger mess. It makes much more sense to have
allocation and charge context consistent.
--
Michal Hocko
SUSE Labs
Vladimir Davydov
2015-09-01 16:56:45 UTC
Permalink
Post by Michal Hocko
[...]
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is a memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We issue memcg reclaim and go to fallback
alloc then, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.
I am not much familiar with the SLAB internals, but how is this different
from the global case? If the preferred node is full then a __GFP_THISNODE
request will make it fail early even without giving GFP_NOWAIT
additional access to atomic memory reserves. The fact that memcg case
fails earlier is perfectly expected because the restriction is tighter
than the global case.
memcg restrictions are orthogonal to NUMA: failing an allocation from a
particular node does not mean failing memcg charge and vice versa.
Sure, memcg doesn't care about NUMA; it just puts an additional constraint
on top of all existing ones. The point I've tried to make is that the
page allocator (with the node restriction) and memcg (with the cumulative
amount restriction) are currently behaving consistently: neither of them
tries to reclaim in order to achieve its goals. How conservative memcg is
about allowing GFP_NOWAIT allocations is a separate issue, and all those
details belong to memcg proper, same as the allocation strategy for these
allocations belongs to the page allocator.
Post by Vladimir Davydov
Post by Michal Hocko
How the fallback is implemented and whether trying other node before
reclaiming from the preferred one is reasonable I dunno. This is for
SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
for memcg enabled setups subtly different. And that is bad.
Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
inspecting whether a NUMA node has free pages makes SLAB behave subtly
differently: SLAB will walk over all NUMA nodes for nothing instead of
invoking memcg reclaim once a free page is found.
So you are saying that the SLAB kmem accounting in this particular path
is suboptimal because the fallback mode doesn't retry local node with
the reclaim enabled before falling back to other nodes?
I'm just pointing out some subtle behavior changes in slab you were
opposed to.
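
To make that concrete - a condensed, illustrative sketch of that node
walk (memcg_charge_slab() is the real inter-subsystem hook; the rest is
simplified, not the actual mm/slab.c code):

/*
 * Each node is probed with __GFP_THISNODE and without __GFP_WAIT, and
 * the memcg charge is attempted with the same weakened flags, so a
 * memcg near its limit fails on every node even though free pages
 * exist.
 */
static struct page *probe_all_nodes_sketch(struct kmem_cache *cachep,
					   gfp_t flags, int order)
{
	int nid;

	for_each_online_node(nid) {
		struct page *page = alloc_pages_node(nid,
				(flags | __GFP_THISNODE) & ~__GFP_WAIT,
				order);
		if (!page)
			continue;
		/* Near the limit this fails although the node had free
		 * pages, so the whole walk was for nothing. */
		if (memcg_charge_slab(cachep, flags & ~__GFP_WAIT, order)) {
			__free_pages(page, order);
			continue;
		}
		return page;
	}
	return NULL;	/* caller falls back, possibly to a remote node */
}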
Post by Michal Hocko
I would consider it quite surprising as well even for the global case
because __GFP_THISNODE doesn't wake up kswapd to make room on that node.
Post by Vladimir Davydov
You are talking about memcg/kmem accounting as if it were done in the
buddy allocator on top of which the slab layer is built knowing nothing
about memcg accounting on the lower layer. That's not true and that
simply can't be true. Kmem accounting is implemented at the slab layer.
Memcg provides its memcg_charge_slab/uncharge methods solely for
slab core, so it's OK to have some calling conventions between them.
What we are really obliged to do is to preserve behavior of slab's
external API, i.e. kmalloc and friends.
I guess I understand what you are saying here but it sounds like special
casing which tries to be clever because the current code understands
both the lower level allocator and kmem charge paths to decide how to
What do you mean by saying "it understands the lower level allocator"?
AFAIK we have memcg callbacks only in special places, like page fault
handler or kmalloc.
Post by Michal Hocko
juggle with them. This is imho bad and hard to maintain long term.
We already juggle. Just grep where and how we insert
mem_cgroup_try_charge.
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
2. SLUB. Someone calls kmalloc and there are enough free high-order
pages. If there is no memcg limit, we will allocate a high-order
slab page, which is in accordance with SLUB internal logic. With a
memcg limit set, we are likely to fail to charge a high-order page
(because we currently try to charge high-order pages w/o __GFP_WAIT)
and fall back on a low-order page. The latter is unexpected and
unjustified.
And this case is very similar, and I would even argue that it shows more
brokenness with your patch. The SLUB allocator has _explicitly_ asked
for an allocation _without_ reclaim because that would be unnecessarily
costly and there is another, less expensive fallback. But memcg would
You are ignoring the fact that, in contrast to alloc_pages, for memcg
there is practically no difference between charging a 4-order page or a
1-order page.
But this is an implementation detail which might change anytime in
the future.
The fact that memcg reclaim does not invoke the compactor is indeed an
implementation detail, but how can it change?
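
To illustrate why order barely matters on the charge side - a minimal
sketch with an assumed simple counter (not the kernel's actual
page_counter API; all names here are made up):

/*
 * The charge is just "1 << order" pages against a counter; there is no
 * contiguity requirement, so order-1 and order-4 succeed or fail alike.
 */
struct counter_sketch {
	atomic_long_t usage;	/* pages currently charged */
	long limit;		/* limit, in pages */
};

static int try_charge_sketch(struct counter_sketch *c, int order)
{
	long nr_pages = 1L << order;	/* order 1 -> 2 pages, order 4 -> 16 */

	if (atomic_long_add_return(nr_pages, &c->usage) > c->limit) {
		atomic_long_sub(nr_pages, &c->usage);
		return -ENOMEM;		/* same failure mode for any order */
	}
	return 0;
}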
Post by Michal Hocko
Post by Vladimir Davydov
OTOH, using 1-order pages where we could go with 4-order
pages increases page fragmentation at the global level. This subtly
breaks internal SLUB optimization. Once again, kmem accounting is not
something staying aside from slab core, it's a part of slab core.
This is certainly true and it is what you get when you put an additional
constraint on top of an existing one. You simply cannot get both the
great performance _and_ a local memory restriction.
So what? We shouldn't even try?
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
be ignoring this with your patch AFAIU and break the optimization. There
are other cases like that. E.g. THP pages are allocated without GFP_WAIT
when defrag is disabled.
It might be wrong. If we can't find a contiguous 2MB page, we should
probably give up instead of calling the compactor. For memcg it might be
better to reclaim some space for a 2MB page right now and map a 2MB page
instead of reclaiming space for 512 4KB pages a moment later, because in
the memcg case there is absolutely no difference between reclaiming 2MB for
a huge page and 2MB for 512 4KB pages.
Or maybe the whole reclaim just doesn't pay off because the TLB savings
will never compensate for the reclaim. The defrag knob basically says
that we shouldn't try to opportunistically prepare room for the THP
page.
And why is it called "defrag" then?
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
That being said, this is the fix at the right layer.
Post by Michal Hocko
Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun.
The mechanism proposed by Tejun won't help us to avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.
Why would that be a problem? The _hard_ limit is reached and reclaim
cannot make any progress. An allocation failure is to be expected.
GFP_NOWAIT will fail normally and GFP_NOFS will attempt to reclaim
before failing.
Quoting my e-mail to Tejun explaining why using task_work won't help if
slab/slub is not fixed:
: Generally speaking, handing over reclaim responsibility to task_work
: won't help, because there might be cases when a process spends quite a
: lot of time in kernel invoking lots of GFP_KERNEL allocations before
: returning to userspace. Without fixing slab/slub, such a process will
: charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
: memory.max. If there are no other active processes in the cgroup, the
: cgroup can stay with memory.high excess for a relatively long time
: (suppose the process was throttled in kernel), possibly hurting the rest
: of the system. What is worse, if the process happens to invoke a real
: GFP_NOWAIT allocation when it's about to hit the limit, it will fail.
For a kmalloc user that's completely unexpected.
We have global reclaim, which handles global memory pressure. And
until the hard limit is enforced I do not see what the huge problem is
here. Sure, the high limit can be exceeded, but that is to be expected.
What exactly is to be expected? Is it OK if memory.high is just ignored?
Post by Michal Hocko
Same as failing allocations for the hard limit enforcement.
If a kmem allocation fails, your app is likely to fail too. Nobody
expects read/write to fail with ENOMEM when there seems to be enough
reclaimable memory. If we try to fix the GFP_NOWAIT problem only by
using task_work reclaim, it won't be a complete fix, because a failure
may still occur as I described above.
Post by Michal Hocko
Maybe moving the whole high limit reclaim to the delayed context is not what
we will end up with; we may reduce this to only GFP_NOWAIT or other weak
reclaim contexts. This is to be discussed of course.
Yeah, but w/o fixing kmalloc it may happen that *every* allocation will
be GFP_NOWAIT. It'd complicate the implementation.
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between limit and usage is getting too small. It may be done
from a workqueue or from task_work, but currently I don't see any reason
why we should complicate things and not just start reclaim directly, just like
memory.high does.
Yes we can do better than we do right now. But that doesn't mean we
should put hacks all over the place and lie about the allocation
context.
What do you mean by saying "all over the place"? It's a fix for kmem
implementation, to be more exact for the part of it residing in the slab
core.
I meant into two slab allocators currently because of the implementation
details which are spread into three different places - page allocator,
memcg charging code and the respective slab allocator specific details.
If we remove kmem accounting, we will still have implementation details
spread over page allocator, reclaimer, rmap, memcg. Slab is not the
worst part of it IMO. Anyway, kmem accounting can't be implemented
solely in memcg.
Post by Michal Hocko
Post by Vladimir Davydov
Everyone else, except a couple of kmem users issuing alloc_page
directly like threadinfo, will use kmalloc and know nothing about what's going
on there and how all this accounting stuff is handled - they will just
use plain old convenient kmalloc, which works exactly as it does in the
root cgroup.
If we ever grow more users and charge more kernel memory then they might
be making similar assumptions and tweaking the allocation/charge context and we
would end up in a bigger mess. It makes much more sense to have
allocation and charge context consistent.
What new users? Why can't they just call kmalloc?

Thanks,
Vladimir
Michal Hocko
2015-09-01 18:39:00 UTC
Permalink
[...]
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
How the fallback is implemented and whether trying other node before
reclaiming from the preferred one is reasonable I dunno. This is for
SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
for memcg enabled setups subtly different. And that is bad.
Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
inspecting whether a NUMA node has free pages makes SLAB behave subtly
differently: SLAB will walk over all NUMA nodes for nothing instead of
invoking memcg reclaim once a free page is found.
So you are saying that the SLAB kmem accounting in this particular path
is suboptimal because the fallback mode doesn't retry local node with
the reclaim enabled before falling back to other nodes?
I'm just pointing out some subtle behavior changes in slab you were
opposed to.
I guess we are still not on the same page here. If the slab has a subtle
behavior (and from what you are saying it seems it has the same behavior
at the global scope) then we should strive to fix it rather than making
it more obscure just to not expose GFP_NOWAIT to memcg which is not
handled properly currently wrt. high limit (more on that below) which
was the primary motivation for the patch AFAIU.
Post by Vladimir Davydov
Post by Michal Hocko
I would consider it quite surprising as well even for the global case
because __GFP_THISNODE doesn't wake up kswapd to make room on that node.
Post by Vladimir Davydov
You are talking about memcg/kmem accounting as if it were done in the
buddy allocator on top of which the slab layer is built knowing nothing
about memcg accounting on the lower layer. That's not true and that
simply can't be true. Kmem accounting is implemented at the slab layer.
Memcg provides its memcg_charge_slab/uncharge methods solely for
slab core, so it's OK to have some calling conventions between them.
What we are really obliged to do is to preserve behavior of slab's
external API, i.e. kmalloc and friends.
I guess I understand what you are saying here but it sounds like special
casing which tries to be clever because the current code understands
both the lower level allocator and kmem charge paths to decide how to
What do you mean by saying "it understands the lower level allocator"?
I mean it requires/abuses special behavior from the page allocator like
__GFP_THISNODE && !wait for the hot path.
Post by Vladimir Davydov
AFAIK we have memcg callbacks only in special places, like page fault
handler or kmalloc.
But anybody might opt in to be charged. I can see that some other buffers
which are not even accounted right now will be charged in the future.
Post by Vladimir Davydov
Post by Michal Hocko
juggle with them. This is imho bad and hard to maintain long term.
We already juggle. Just grep where and how we insert
mem_cgroup_try_charge.
We should always preserve the gfp context (at least its reclaim
part). If we do not then it is a bug.
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
2. SLUB. Someone calls kmalloc and there are enough free high-order
pages. If there is no memcg limit, we will allocate a high-order
slab page, which is in accordance with SLUB internal logic. With a
memcg limit set, we are likely to fail to charge a high-order page
(because we currently try to charge high-order pages w/o __GFP_WAIT)
and fall back on a low-order page. The latter is unexpected and
unjustified.
And this case is very similar, and I would even argue that it shows more
brokenness with your patch. The SLUB allocator has _explicitly_ asked
for an allocation _without_ reclaim because that would be unnecessarily
costly and there is another, less expensive fallback. But memcg would
You are ignoring the fact that, in contrast to alloc_pages, for memcg
there is practically no difference between charging a 4-order page or a
1-order page.
But this is an implementation detail which might change anytime in
the future.
The fact that memcg reclaim does not invoke the compactor is indeed an
implementation detail, but how can it change?
Compaction is indeed not something memcg reclaim cares about right now
or will care about in the foreseeable future. I meant something else. order-1 vs.
order-N differ in the reclaim target, which then controls the potential
latency of the reclaim. The fact that order-1 and order-4 do not really
make any difference _right now_ because of the large SWAP_CLUSTER_MAX is
the implementation detail I was referring to.
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
OTOH, using 1-order pages where we could go with 4-order
pages increases page fragmentation at the global level. This subtly
breaks internal SLUB optimization. Once again, kmem accounting is not
something staying aside from slab core, it's a part of slab core.
This is certainly true and it is what you get when you put an additional
constraint on top of an existing one. You simply cannot get both the
great performance _and_ a local memory restriction.
So what? We shouldn't even try?
Of course you can try. Then the question is what the costs/benefits are
(both performance and maintainability). I didn't say those two patches
are incorrect (the original kmalloc gfp mask is obeyed).
They just seem to target the wrong layer IMO. Alternative solutions were
not attempted and measured for typical workloads. If we find out that
addressing GFP_NOWAIT at memcg level will be viable for most reasonable
loads and corner cases are at least not causing runaways which would be
hard to address then let's put workarounds where they are necessary.
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
be ignoring this with your patch AFAIU and break the optimization. There
are other cases like that. E.g. THP pages are allocated without GFP_WAIT
when defrag is disabled.
It might be wrong. If we can't find a contiguous 2MB page, we should
probably give up instead of calling the compactor. For memcg it might be
better to reclaim some space for a 2MB page right now and map a 2MB page
instead of reclaiming space for 512 4KB pages a moment later, because in
the memcg case there is absolutely no difference between reclaiming 2MB for
a huge page and 2MB for 512 4KB pages.
Or maybe the whole reclaim just doesn't pay off because the TLB savings
will never compensate for the reclaim. The defrag knob basically says
that we shouldn't try to opportunistically prepare room for the THP
page.
And why is it called "defrag" then?
Do not ask me about the naming. If this was only about compaction then
the allocator might be told about that by a special GFP flag. The memcg
could be in line with that. But the point remains. If the defrag is
a knob to make the page-fault THP path lighter, then doing no memcg reclaim
is reasonable.

[...]
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
Quoting my e-mail to Tejun explaining why using task_work won't help if
slab/slub is not fixed:
: Generally speaking, handing over reclaim responsibility to task_work
: won't help, because there might be cases when a process spends quite a
: lot of time in kernel invoking lots of GFP_KERNEL allocations before
: returning to userspace. Without fixing slab/slub, such a process will
: charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
: memory.max. If there are no other active processes in the cgroup, the
: cgroup can stay with memory.high excess for a relatively long time
: (suppose the process was throttled in kernel), possibly hurting the rest
: of the system. What is worse, if the process happens to invoke a real
: GFP_NOWAIT allocation when it's about to hit the limit, it will fail.
For a kmalloc user that's completely unexpected.
We have global reclaim, which handles global memory pressure. And
until the hard limit is enforced I do not see what the huge problem is
here. Sure, the high limit can be exceeded, but that is to be expected.
What exactly is to be expected? Is it OK if memory.high is just ignored?
It is not OK for it to be ignored altogether. The high limit is where the
throttling should start. And we currently do not handle GFP_NOWAIT, which
is something to be solved. We shouldn't remove GFP_NOWAIT callers as a
workaround.

There are more things to do here. We can perform the reclaim from the
delayed context where the direct reclaim is not allowed/requested. And
we can start failing GFP_NOWAIT on an excessive high limit breach when
the delayed reclaim doesn't catch up with the demand. This is basically
what we do on the global level.
If even this is not sufficient and the kernel allows for a lot of
allocations in a single run, which would be something to look at in
the first place, then we have global mechanisms to mitigate that.
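
A rough sketch of that scheme (an assumption about a possible
implementation, not merged code; the slack threshold, the high_work
member and all the _sketch names are made up for illustration):

#define HIGH_SLACK_PAGES 1024	/* hypothetical tolerated excess */

struct memcg_sketch {
	unsigned long high;		/* memory.high, in pages */
	atomic_long_t usage;		/* pages charged */
	struct work_struct high_work;	/* assumed deferred-reclaim worker */
};

static int reclaim_high_sketch(struct memcg_sketch *memcg,
			       unsigned long nr_pages);	/* direct reclaim */

static int charge_high_sketch(struct memcg_sketch *memcg, gfp_t gfp,
			      unsigned long nr_pages)
{
	unsigned long usage = atomic_long_read(&memcg->usage);

	if (usage + nr_pages <= memcg->high)
		return 0;

	/* Strong context: reclaim directly, as memory.high does today. */
	if (gfp & __GFP_WAIT)
		return reclaim_high_sketch(memcg, nr_pages);

	/* Weak context: let a worker catch up with the excess... */
	schedule_work(&memcg->high_work);

	/* ...but start failing once the breach has grown excessive. */
	if (usage > memcg->high + HIGH_SLACK_PAGES)
		return -ENOMEM;
	return 0;
}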

memory.high is opportunistic memory isolation. It doesn't guarantee
complete isolation. The hard limit is for that purpose.
Post by Vladimir Davydov
Post by Michal Hocko
Same as failing allocations for the hard limit enforcement.
If a kmem allocation fails, your app is likely to fail too. Nobody
expects read/write to fail with ENOMEM when there seems to be enough
reclaimable memory. If we try to fix the GFP_NOWAIT problem only by
using task_work reclaim, it won't be a complete fix, because a failure
may still occur as I described above.
You cannot have a system which cannot tolerate failures and at the same
time require memory restrictions; these two requirements simply go against
each other. Moreover, GFP_NOWAIT context is really light and should always
have a fallback mode, otherwise you get what you are describing - failures with
reclaimable memory. And this is very much the case for the global case
as well.
Post by Vladimir Davydov
Post by Michal Hocko
Maybe moving the whole high limit reclaim to the delayed context is not what
we will end up with; we may reduce this to only GFP_NOWAIT or other weak
reclaim contexts. This is to be discussed of course.
Yeah, but w/o fixing kmalloc it may happen that *every* allocation will
be GFP_NOWAIT. It'd complicate the implementation.
OK, but that is already the case globally. MM, resp. memcg, has
to say when to stop it. The global case handles that at the page
allocator layer and memcg should do something similar at the charge
level.

[...]
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
What do you mean by saying "all over the place"? It's a fix for kmem
implementation, to be more exact for the part of it residing in the slab
core.
I meant in the two slab allocators currently, because of the implementation
details which are spread over three different places - the page allocator,
the memcg charging code and the respective slab allocator specifics.
If we remove kmem accounting, we will still have implementation details
spread over page allocator, reclaimer, rmap, memcg. Slab is not the
worst part of it IMO. Anyway, kmem accounting can't be implemented
solely in memcg.
The current state is quite complex already and making it even more
complex by making allocation and charge context inconsistent is not really
desirable.
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
Everyone else, except a couple of kmem users issuing alloc_page
directly like threadinfo, will use kmalloc and know nothing about what's going
on there and how all this accounting stuff is handled - they will just
use plain old convenient kmalloc, which works exactly as it does in the
root cgroup.
If we ever grow more users and charge more kernel memory then they might
be making similar assumptions and tweaking the allocation/charge context and we
would end up in a bigger mess. It makes much more sense to have
allocation and charge context consistent.
What new users? Why can't they just call kmalloc?
What about direct users of the page allocator? Why should they pay the cost
of more complex/expensive code paths when they do not need sub-page
sizes?
--
Michal Hocko
SUSE Labs
Vladimir Davydov
2015-09-02 09:31:08 UTC
Permalink
[
I'll try to summarize my point in one hunk instead of spreading it all
over the e-mail, because IMO it's becoming kind of difficult to
follow. If you think that there's a question I dodged, please let me
know and I'll try to address it separately.

Also, adding Johannes to Cc (I noticed that I accidentally left him
out), because this discussion seems to be fundamental and may affect
our further steps dramatically.
]
Post by Michal Hocko
[...]
Post by Vladimir Davydov
Post by Michal Hocko
Post by Vladimir Davydov
Post by Michal Hocko
How the fallback is implemented and whether trying other node before
reclaiming from the preferred one is reasonable I dunno. This is for
SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
for memcg enabled setups subtly different. And that is bad.
Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
inspecting whether a NUMA node has free pages makes SLAB behave subtly
differently: SLAB will walk over all NUMA nodes for nothing instead of
invoking memcg reclaim once a free page is found.
So you are saying that the SLAB kmem accounting in this particular path
is suboptimal because the fallback mode doesn't retry local node with
the reclaim enabled before falling back to other nodes?
I'm just pointing out some subtle behavior changes in slab you were
opposed to.
I guess we are still not on the same page here. If the slab has a subtle
behavior (and from what you are saying it seems it has the same behavior
at the global scope) then we should strive to fix it rather than making
it more obscure just to not expose GFP_NOWAIT to memcg which is not
handled properly currently wrt. high limit (more on that below) which
was the primary motivation for the patch AFAIU.
Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
alloc_pages with the caller's context, it does the job normally done by
alloc_pages itself. That is not something commonly done elsewhere.
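
Schematically (a sketch of the pattern, not the exact mm/slab.c code):

/* What slab does - the page allocator's job, done by hand: */
page = alloc_pages_node(node,
			(flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT,
			order);
if (!page)
	page = alloc_pages(flags, order);	/* fallback, full context */

/* ...versus what a normal user does - one call, one honest context: */
page = alloc_pages(flags, order);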

Leaving the slab charge path as is looks really ugly to me. Look, slab
iterates over all nodes, inspecting whether they have free pages, and fails
even if they do due to the memcg constraint...

My point is that what slab does is a pretty low-level thing; normal
users call alloc_pages or kmalloc with flags corresponding to their
context. Of course, there may be special users optimistically trying
GFP_NOWAIT, but they aren't numerous, and that simplifies things for
memcg a lot. I mean if we can rely on the fact that the number of
GFP_NOWAIT allocations that can occur in a row is limited we can use
direct reclaim (like memory.high) and/or task_work reclaim to fix
GFP_NOWAIT failures. Otherwise, we have to mimic the global alloc with
most of its heuristics. I don't think that copying those heuristics is the
right thing to do, because in memcg case the same problems may be
resolved much easier, because we don't actually experience real memory
shortage when hitting the limit.

Moreover, we already treat some flags differently than slab does, for
simplicity. E.g. we let __GFP_NOFAIL allocations go uncharged
instead of retrying infinitely. We ignore the __GFP_THISNODE flag, as we
just cannot take it into account. We ignore allocation order, because
that makes no sense for memcg.

To sum it up: basically, there are two ways of handling kmemcg charges:

1. Make the memcg try_charge mimic alloc_pages behavior.
2. Make API functions (kmalloc, etc) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low level subsys (slab) and memcg private.

Way 1 might look appealing at first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise under global memory
pressure but are impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly to alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use it in
places where users depend on buddy allocator peculiarities? There are
not many such users.

I understand that the idea of way 1 is to provide a well-defined memcg
API independent of the rest of the code, but that's just impossible. You
need special casing anyway. E.g. you need those get/put_kmem_cache
helpers, which exist solely for SLAB/SLUB. You need all this special
stuff for growing per-memcg array in list_lru and kmem_cache, which
exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
even handle kmem_cache destruction on memcg offline differently for SLAB
and SLUB for performance reasons.

Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
for optimization, but their API is well defined, so we just make kmalloc
work as expected while providing inter-subsys calls, like
memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
mentioned kmem users that allocate memory using alloc_pages. There is an
API function for them too, alloc_kmem_pages. Everything behind the API
is hidden and may be done in such a way to achieve optimal performance.
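
For the slab page path, way 2 boils down to something like this
condensed sketch (memcg_charge_slab()/memcg_uncharge_slab() are the
real hooks; the flow is simplified, not the exact patch):

static struct page *kmem_getpages_sketch(struct kmem_cache *cachep,
					 gfp_t flags, int nodeid)
{
	struct page *page;

	/* Charge first, with the caller's full context, so memcg
	 * reclaim may run here whenever __GFP_WAIT is set. */
	if (memcg_charge_slab(cachep, flags, cachep->gfporder))
		return NULL;

	/* Only the raw page allocation weakens the mask for slab's
	 * internal fast-path tricks. */
	page = alloc_pages_node(nodeid,
				(flags | __GFP_THISNODE | __GFP_NOWARN) &
				~__GFP_WAIT, cachep->gfporder);
	if (!page)
		memcg_uncharge_slab(cachep, cachep->gfporder);
	return page;
}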

Thanks,
Vladimir
Christoph Lameter
2015-09-02 18:16:56 UTC
Permalink
Post by Vladimir Davydov
Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
alloc_pages with the caller's context, it does the job normally done by
alloc_pages itself. That is not something commonly done elsewhere.
Leaving the slab charge path as is looks really ugly to me. Look, slab
iterates over all nodes, inspecting whether they have free pages, and fails
even if they do due to the memcg constraint...
Well, yes, it needs to do that due to the way NUMA support was designed in.
SLAB needs to check the per-node caches for present objects before
going to more remote nodes. Sorry about this. I realized the design issue
in 2006 and SLUB was the result in 2007 of an alternate design to let the
page allocator do its proper job.
Post by Vladimir Davydov
1. Make the memcg try_charge mimic alloc_pages behavior.
2. Make API functions (kmalloc, etc) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low level subsys (slab) and memcg private.
Way 1 might look appealing at first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise under global memory
pressure but are impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly to alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use it in
places where users depend on buddy allocator peculiarities? There are
not many such users.
Would it be possible to have a special alloc_pages_memcg with different
semantics?

On the other hand alloc_pages() has grown to handle all the special cases.
Why can't it also handle the special memcg case? There are numerous other
allocators that cache memory in the kernel from networking to
the bizarre compressed swap approaches. How does memcg handle that? Isn't
that situation similar to what the slab allocators do?
Post by Vladimir Davydov
exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
even handle kmem_cache destruction on memcg offline differently for SLAB
and SLUB for performance reasons.
Ugly. Internal allocator design impacts container handling.
Post by Vladimir Davydov
Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
for optimization, but their API is well defined, so we just make kmalloc
work as expected while providing inter-subsys calls, like
memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
mentioned kmem users that allocate memory using alloc_pages. There is an
API function for them too, alloc_kmem_pages. Everything behind the API
is hidden and may be done in such a way to achieve optimal performance.
Can we also hide cgroups memory handling behind the page based schemes
without having extra handling for the slab allocators?

Vladimir Davydov
2015-09-03 09:36:58 UTC
Permalink
Post by Christoph Lameter
Post by Vladimir Davydov
Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
alloc_pages with the caller's context, it does the job normally done by
alloc_pages itself. That is not something commonly done elsewhere.
Leaving the slab charge path as is looks really ugly to me. Look, slab
iterates over all nodes, inspecting whether they have free pages, and fails
even if they do due to the memcg constraint...
Well, yes, it needs to do that due to the way NUMA support was designed in.
SLAB needs to check the per-node caches for present objects before
going to more remote nodes. Sorry about this. I realized the design issue
in 2006 and SLUB was the result in 2007 of an alternate design to let the
page allocator do its proper job.
Yeah, SLUB is OK in this respect.
Post by Christoph Lameter
Post by Vladimir Davydov
1. Make the memcg try_charge mimic alloc_pages behavior.
2. Make API functions (kmalloc, etc) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low level subsys (slab) and memcg private.
Way 1 might look appealing at first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise under global memory
pressure but are impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly to alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use it in
places where users depend on buddy allocator peculiarities? There are
not many such users.
Would it be possible to have a special alloc_pages_memcg with different
semantics?
On the other hand alloc_pages() has grown to handle all the special cases.
Why can't it also handle the special memcg case? There are numerous other
Because we don't want to place memcg handling in alloc_pages(). AFAIU
this is because memcg by its design works at a higher layer than buddy
alloc. We can't just charge a page on alloc and uncharge it on free.
Sometimes we need to charge a page to a memcg which is different from
the current one, sometimes we need to move a page charge between cgroups
adjusting lru in the meantime (e.g. for handling readahead or swapin).
Placing memcg charging in alloc_pages() would IMO only obscure memcg
logic, because handling of the same page would be spread over subsystems
at different layers. I may be completely wrong though.
Post by Christoph Lameter
allocators that cache memory in the kernel from networking to
the bizarre compressed swap approaches. How does memcg handle that? Isn't
Frontswap/zswap entries are accounted to the memsw counter like conventional
swap. I don't think we need to charge them to mem, because zswap size is
limited. The user allows some RAM to be used as swap transparently to
running processes, so charging them to mem would be unexpected IMO.

Skbs are charged to a different counter, but not charged to kmem for
now. It is to be fixed.
Post by Christoph Lameter
that situation similar to what the slab allocators do?
I wouldn't say so. Other users just use kmalloc or alloc_pages to grow
their buffers. kmalloc is accounted. For those who work at page
granularity and hence call alloc_pages directly, there is the
alloc_kmem_pages helper.
Post by Christoph Lameter
Post by Vladimir Davydov
exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
even handle kmem_cache destruction on memcg offline differently for SLAB
and SLUB for performance reasons.
Ugly. Internal allocator design impacts container handling.
The point is that memcg charges pages, while kmalloc works at a finer
level of granularity. As a result, we have two orthogonal strategies for
charging kmalloc:

1. Teach memcg to charge arbitrarily sized chunks and store info about
the memcg near each active object in the slab.
2. Create a per-memcg copy of each kmem cache (this is the scheme
currently in use).

Whichever way we choose, memcg and slab have to cooperate and so slab
internal design impacts memcg handling.
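
For reference, a condensed sketch of scheme 2 (memcg_kmem_get_cache()
is the real entry point; the allocation call is a placeholder):

static void *kmalloc_accounted_sketch(struct kmem_cache *cachep, gfp_t flags)
{
	/* Redirect to the current memcg's private copy of the cache
	 * (created lazily), so every object in a given slab page is
	 * owned by exactly one memcg. */
	cachep = memcg_kmem_get_cache(cachep, flags);

	return slab_alloc_from_cache_sketch(cachep, flags);	/* placeholder */
}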
Post by Christoph Lameter
Post by Vladimir Davydov
Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
for optimization, but their API is well defined, so we just make kmalloc
work as expected while providing inter-subsys calls, like
memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
mentioned kmem users that allocate memory using alloc_pages. There is an
API function for them too, alloc_kmem_pages. Everything behind the API
is hidden and may be done in such a way to achieve optimal performance.
Can we also hide cgroups memory handling behind the page based schemes
without having extra handling for the slab allocators?
I doubt it - see above.

Thanks,
Vladimir
Tejun Heo
2015-09-03 16:32:52 UTC
Permalink
Hello, Vladimir.

On Wed, Sep 02, 2015 at 12:30:39PM +0300, Vladimir Davydov wrote:
..
Post by Vladimir Davydov
1. Make the memcg try_charge mimic alloc_pages behavior.
2. Make API functions (kmalloc, etc) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low level subsys (slab) and memcg private.
Way 1 might look appealing at first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise under global memory
pressure but are impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly to alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use it in
places where users depend on buddy allocator peculiarities? There are
not many such users.
Maybe this is from inexperience but wouldn't 1 also be simpler than
the global case for the same reasons that doing 2 is simpler? The
fact that memory shortage inside a memcg usually doesn't mean global
shortage remains true whether we take 1 or 2.

That said, it is true that slab is an integral part of kmemcg and I
can't see how it can be made oblivious of memcg operations, so yeah
one way or the other slab has to know the details and we may have to
do some unusual things at that layer.
Post by Vladimir Davydov
I understand that the idea of way 1 is to provide a well-defined memcg
API independent of the rest of the code, but that's just impossible. You
need special casing anyway. E.g. you need those get/put_kmem_cache
helpers, which exist solely for SLAB/SLUB. You need all this special
stuff for growing per-memcg array in list_lru and kmem_cache, which
exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
even handle kmem_cache destruction on memcg offline differently for SLAB
and SLUB for performance reasons.
It isn't a black or white thing. Sure, slab should be involved in
kmemcg but at the same time if we can keep the amount of exposure in
check, that's the better way to go.
Post by Vladimir Davydov
Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
for optimization, but their API is well defined, so we just make kmalloc
work as expected while providing inter-subsys calls, like
memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
mentioned kmem users that allocate memory using alloc_pages. There is an
API function for them too, alloc_kmem_pages. Everything behind the API
is hidden and may be done in such a way to achieve optimal performance.
Ditto. Nobody is arguing that we can get it out completely but at the
same time handling of GFP_NOWAIT seems like a pretty fundamental
property that we'd wanna maintain at the memcg boundary.

You said elsewhere that GFP_NOWAIT happening back-to-back is unlikely.
I'm not sure how much we can commit to that statement. GFP_KERNEL
allocating a huge amount of memory in a single go is a kernel bug.
A GFP_NOWAIT optimization in a hot path which is accessible to userland
isn't, and we'll be growing more and more of them. We need to be
protected against back-to-back GFP_NOWAIT allocations.

Thanks.
--
tejun
Vladimir Davydov
2015-09-04 11:16:16 UTC
Permalink
Post by Tejun Heo
...
Post by Vladimir Davydov
1. Make the memcg try_charge mimic alloc_pages behavior.
2. Make API functions (kmalloc, etc) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low level subsys (slab) and memcg private.
Way 1 might look appealing at first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise under global memory
pressure but are impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly to alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use it in
places where users depend on buddy allocator peculiarities? There are
not many such users.
Maybe this is from inexperience but wouldn't 1 also be simpler than
the global case for the same reasons that doing 2 is simpler? The
fact that memory shortage inside a memcg usually doesn't mean global
shortage remains true whether we take 1 or 2.
That said, it is true that slab is an integral part of kmemcg and I
can't see how it can be made oblivious of memcg operations, so yeah
one way or the other slab has to know the details and we may have to
do some unusual things at that layer.
Post by Vladimir Davydov
I understand that the idea of way 1 is to provide a well-defined memcg
API independent of the rest of the code, but that's just impossible. You
need special casing anyway. E.g. you need those get/put_kmem_cache
helpers, which exist solely for SLAB/SLUB. You need all this special
stuff for growing per-memcg array in list_lru and kmem_cache, which
exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
even handle kmem_cache destruction on memcg offline differently for SLAB
and SLUB for performance reasons.
It isn't a black or white thing. Sure, slab should be involved in
kmemcg but at the same time if we can keep the amount of exposure in
check, that's the better way to go.
Post by Vladimir Davydov
Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
for optimization, but their API is well defined, so we just make kmalloc
work as expected while providing inter-subsys calls, like
memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
mentioned kmem users that allocate memory using alloc_pages. There is an
API function for them too, alloc_kmem_pages. Everything behind the API
is hidden and may be done in such a way to achieve optimal performance.
Ditto. Nobody is arguing that we can get it out completely but at the
same time handling of GFP_NOWAIT seems like a pretty fundamental
property that we'd wanna maintain at the memcg boundary.
Agreed, but SLAB/SLUB aren't just allocating with GFP_NOWAIT. They're
doing pretty low-level tricks which aren't common in the rest of the
system.

Inspecting all nodes with __GFP_THISNODE and w/o __GFP_WAIT before
invoking the reclaimer is something that can and should be done by the
buddy allocator. I've never seen anyone doing things like this apart
from SLAB (note that SLUB doesn't do this). SLAB does it for historical
reasons. We could fix it, but that would require rewriting SLAB code to
a great extent, which isn't desirable, because we could easily break
something.
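
To illustrate, the trick looks roughly like this (a condensed sketch of
the behavior described above, not the literal mm/slab.c code; the
helper name is made up):

#include <linux/gfp.h>
#include <linux/nodemask.h>

/*
 * Probe every node without triggering reclaim first; only then fall
 * back to any allowed node with the caller's full gfp context.
 */
static struct page *slab_grow_pages_sketch(gfp_t flags, unsigned int order)
{
	/* steps 1-2: stay on the probed node, no reclaim, no warnings */
	gfp_t probe = (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
	struct page *page;
	int nid;

	for_each_online_node(nid) {
		page = alloc_pages_node(nid, probe, order);
		if (page)
			return page;
	}
	/* step 3: any node, reclaim enabled if the caller allows it */
	return alloc_pages(flags, order);
}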

Trying a high-order page before falling back on a lower order is not
something really common. It implicitly relies on the fact that
reclaiming memory for a new contiguous high-order page is much more
expensive than getting the same amount of order-1 pages. This is true
for the buddy alloc, but not for memcg. That's why playing such a trick
with try_charge is wrong IMO. If such a trick becomes common, I think we
will have to introduce a helper for it, because otherwise a change in
buddy alloc internal logic (e.g. a defrag optimization making high-order
pages cheaper) may affect its users.
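
Roughly, the trick is this (a simplified sketch of SLUB's
allocate_slab() behavior; the helper and its signature are made up):

#include <linux/gfp.h>

/*
 * Opportunistically try the preferred high order without reclaim or
 * retries, then fall back to the minimum order with the caller's full
 * gfp context.
 */
static struct page *slub_alloc_slab_sketch(gfp_t flags,
					   unsigned int oo_order,
					   unsigned int min_order)
{
	gfp_t opt = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_WAIT;
	struct page *page;

	page = alloc_pages(opt, oo_order);	/* cheap attempt, may fail */
	if (!page)
		page = alloc_pages(flags, min_order);	/* robust fallback */
	return page;
}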

That said, I totally agree that memcg should handle GFP_NOWAIT, but I'm
opposed to the idea that it should handle tricks that rely on internal
buddy alloc logic, like those used by SLAB and SLUB. We'd better strive
to hide such tricks in buddy alloc helpers and never use them directly.

That's why I think we need these patches, and they aren't workarounds
that can be reverted once try_charge has been taught to handle
GFP_NOWAIT properly.
Post by Tejun Heo
You said elsewhere that GFP_NOWAIT allocations happening back-to-back
are unlikely. I'm not sure how much we can commit to that statement.
GFP_KERNEL allocating a huge amount of memory in a single go is a kernel
bug. A GFP_NOWAIT optimization in a hot path which is accessible to
userland isn't, and we'll be growing more and more of them. We need to
be protected against back-to-back GFP_NOWAIT allocations.
AFAIU if someone tries to allocate with GFP_NOWAIT (i.e. w/o
__GFP_NOFAIL or __GFP_HIGH), he/she must be prepared for allocation
failures, so there should be a safe fallback path, which fixes things
up in normal context. It doesn't mean we shouldn't do anything to
satisfy such optimistic requests from memcg, but we may occasionally
fail them.
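
The usual contract looks something like this (a hypothetical caller,
just to show the pattern):

#include <linux/slab.h>
#include <linux/spinlock.h>

/*
 * Hypothetical GFP_NOWAIT user: allocate optimistically under a
 * spinlock; on failure, fix things up in normal context, where
 * sleeping (and hence reclaim) is allowed.
 */
static void *get_buf(size_t size, spinlock_t *lock)
{
	void *buf = kmalloc(size, GFP_NOWAIT);	/* may fail, that's OK */

	if (!buf) {
		spin_unlock(lock);		/* the safe fallback path */
		buf = kmalloc(size, GFP_KERNEL);
		spin_lock(lock);
		/* caller must revalidate state: the lock was dropped */
	}
	return buf;
}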

OTOH if someone allocates with GFP_KERNEL, he/she should be prepared to
get NULL, but in that case the whole operation will usually be aborted.
Therefore, with the possibility of every GFP_KERNEL allocation being
transformed into GFP_NOWAIT inside slab, memcg has to be extra cautious,
because failing what looks like a usual GFP_NOWAIT may then result not
in a fallback to a slow path, but in user-visible effects like failing
to open a file with ENOMEM. Avoiding that is really difficult, and I
doubt it's worth complicating memcg code, because we can just fix
SLAB/SLUB.

Regarding __GFP_NOFAIL and __GFP_HIGH, IMO we can let them go uncharged
or charge them forcefully even if they breach the limit, because there
shouldn't be many of them (if there really were a lot of them, they
could deplete the memory reserves and hang the system).
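
The forceful variant could look like this (a sketch assuming the
page_counter API with 0-on-success semantics; the helper itself is
hypothetical):

#include <linux/gfp.h>
#include <linux/page_counter.h>

/*
 * Hypothetical charge path: high-priority requests are charged
 * forcefully even if that means breaching the limit.  Assumes
 * page_counter_try_charge() returns 0 on success, -ENOMEM on failure.
 */
static int try_charge_sketch(struct mem_cgroup *memcg, gfp_t gfp_mask,
			     unsigned long nr_pages)
{
	struct page_counter *fail;

	if (!page_counter_try_charge(&memcg->memory, nr_pages, &fail))
		return 0;			/* fits under the limit */

	if (gfp_mask & (__GFP_HIGH | __GFP_NOFAIL)) {
		/* force it through; such callers should be rare */
		page_counter_charge(&memcg->memory, nr_pages);
		return 0;
	}
	return -ENOMEM;				/* caller reclaims or fails */
}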

If all these assumptions are true, we don't need to do anything (apart,
maybe, from forcefully charging high-prio allocations) for kmemcg to
work satisfactorily. For optimizing optimistic GFP_NOWAIT callers, one
can use memory.high instead of or along with memory.max. Reclaiming
memory.high in the kernel while holding various locks can result in
prio inversions though, but that's a different story, which could be
fixed by task_work reclaim.
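
The task_work idea in a nutshell (a hypothetical sketch: the
memcg_over_high/memcg_high_work task fields and the nr_over_high()
helper don't exist, they're only here for illustration):

#include <linux/sched.h>
#include <linux/swap.h>
#include <linux/task_work.h>

/* runs on return to userspace, where sleeping is safe */
static void high_reclaim_fn(struct callback_head *work)
{
	struct mem_cgroup *memcg = current->memcg_over_high;

	try_to_free_mem_cgroup_pages(memcg, nr_over_high(memcg),
				     GFP_KERNEL, true);
}

/* on the charge path, once usage exceeds memory.high: */
static void queue_high_reclaim(struct mem_cgroup *memcg)
{
	current->memcg_over_high = memcg;
	init_task_work(&current->memcg_high_work, high_reclaim_fn);
	task_work_add(current, &current->memcg_high_work, true);
}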

I admit I may be mistaken, but if I'm right, we may end up with really
complex memcg reclaim logic trying to closely mimic the behavior of the
buddy alloc with all its historic peculiarities. That's why I don't want
to rush ahead "fixing" memcg reclaim before an agreement among all
interested people is reached...

Thanks,
Vladimir
Tejun Heo
2015-09-04 15:44:58 UTC
Hello, Vladimir.
Post by Vladimir Davydov
Trying a high-order page before falling back on a lower order is not
something really common. It implicitly relies on the fact that
reclaiming memory for a new contiguous high-order page is much more
expensive than getting the same amount of order-1 pages. This is true
for the buddy alloc, but not for memcg. That's why playing such a trick
with try_charge is wrong IMO. If such a trick becomes common, I think we
will have to introduce a helper for it, because otherwise a change in
buddy alloc internal logic (e.g. a defrag optimization making high-order
pages cheaper) may affect its users.
I'm having trouble following why this matters. The layering here is
pretty clear regardless of how slab is trespassing into the page
allocator's role. memcg of course doesn't care whether an allocation
is high-order or order-1. All it does is impose extra restrictions
when allocating memory, and all that's necessary is to reasonably
satisfy the expectations expressed by the specified gfp mask.
Post by Vladimir Davydov
That said, I totally agree that memcg should handle GFP_NOWAIT, but I'm
opposed to the idea that it should handle tricks that rely on internal
buddy alloc logic, like those used by SLAB and SLUB. We'd better strive
to hide such tricks in buddy alloc helpers and never use them directly.
None of these really matter once memcg handles GFP_NOWAIT in a
reasonable manner, right? memcg doesn't need all the fancy tricks of
the page allocator. All it needs to do is honor the intentions expressed
by the gfp mask in a reasonable way, w/o systematic failures.
Post by Vladimir Davydov
That's why I think we need these patches, and they aren't workarounds
that can be reverted once try_charge has been taught to handle
GFP_NOWAIT properly.
So, if these are separate slab improvements, I have no objections, but
independent of that we need to be able to handle back-to-back
GFP_NOWAIT cases, and with the high limit, punting to the return path
should work well enough.
Post by Vladimir Davydov
Post by Tejun Heo
You said elsewhere that GFP_NOWAIT allocations happening back-to-back
are unlikely. I'm not sure how much we can commit to that statement.
GFP_KERNEL allocating a huge amount of memory in a single go is a kernel
bug. A GFP_NOWAIT optimization in a hot path which is accessible to
userland isn't, and we'll be growing more and more of them. We need to
be protected against back-to-back GFP_NOWAIT allocations.
AFAIU if someone tries to allocate with GFP_NOWAIT (i.e. w/o
__GFP_NOFAIL or __GFP_HIGH), he/she must be prepared for allocation
failures, so there should be a safe fallback path, which fixes things
up in normal context. It doesn't mean we shouldn't do anything to
satisfy such optimistic requests from memcg, but we may occasionally
fail them.
Yes, it can fail under stress or if unlucky; however, it shouldn't
fail consistently under nominal conditions or be able to run over the
high limit unchecked.
Post by Vladimir Davydov
OTOH if someone allocates with GFP_KERNEL, he/she should be prepared to
get NULL, but in that case the whole operation will usually be aborted.
Therefore, with the possibility of every GFP_KERNEL allocation being
transformed into GFP_NOWAIT inside slab, memcg has to be extra cautious,
because failing what looks like a usual GFP_NOWAIT may then result not
in a fallback to a slow path, but in user-visible effects like failing
to open a file with ENOMEM. Avoiding that is really difficult, and I
doubt it's worth complicating memcg code, because we can just fix
SLAB/SLUB.
I'm not following you at all here. slab too, of course, should fall
back to a more robust gfp mask if NOWAIT fails, and as long as those
failures are exceptions, it's fine.
Post by Vladimir Davydov
Regarding __GFP_NOFAIL and __GFP_HIGH, IMO we can let them go uncharged
or charge them forcefully even if they breach the limit, because there
shouldn't be many of them (if there really were a lot of them, they
could deplete the memory reserves and hang the system).
If all these assumptions are true, we don't need to do anything (apart,
maybe, from forcefully charging high-prio allocations) for kmemcg to
work satisfactorily. For optimizing optimistic GFP_NOWAIT callers, one
can use memory.high instead of or along with memory.max. Reclaiming
memory.high in the kernel while holding various locks can result in
prio inversions though, but that's a different story, which could be
fixed by task_work reclaim.
GFP_NOWAIT has a systematic problem which needs to be fixed.
Post by Vladimir Davydov
I admit I may be mistaken, but if I'm right, we may end up with really
complex memcg reclaim logic trying to closely mimic the behavior of the
buddy alloc with all its historic peculiarities. That's why I don't want
to rush ahead "fixing" memcg reclaim before an agreement among all
interested people is reached...
I think that's a bit out of proportion. I'm not suggesting bringing
in all the complexities of global reclaim. There's no reason to, and
what memcg deals with is inherently way simpler than actual memory
allocation. The original patch was about fixing systematic failures
around GFP_NOWAIT close to the high limit. We might want to do
background reclaim close to max, but as long as the high limit functions
correctly, that's much less of a problem, at least on the v2 interface.

Thanks.
--
tejun
Vladimir Davydov
2015-09-04 18:21:54 UTC
Hi Tejun, Michal

On Fri, Sep 04, 2015 at 11:44:48AM -0400, Tejun Heo wrote:
...
Post by Tejun Heo
Post by Vladimir Davydov
I admit I may be mistaken, but if I'm right, we may end up with really
complex memcg reclaim logic trying to closely mimic the behavior of the
buddy alloc with all its historic peculiarities. That's why I don't want
to rush ahead "fixing" memcg reclaim before an agreement among all
interested people is reached...
I think that's a bit out of proportion. I'm not suggesting bringing
in all the complexities of global reclaim. There's no reason to, and
what memcg deals with is inherently way simpler than actual memory
allocation. The original patch was about fixing systematic failures
around GFP_NOWAIT close to the high limit. We might want to do
background reclaim close to max, but as long as the high limit functions
correctly, that's much less of a problem, at least on the v2 interface.
Looking through this thread once again and weighing my arguments against
yours, I'm starting to understand that I'm totally wrong and these
patches are not proper fixes for the problem.

Having these patches in the kernel only helps when we are hitting the
hard limit, which shouldn't occur often if memory.high works properly.
Even if memory.high is not used, the only negative effect we would see
w/o them is allocating a slab from the wrong node or getting a low-order
page where we could get a high-order one. Both should be rare, and
neither is critical. I think I got carried away with all those obscure
"reclaimer peculiarities" at some point.

Now I think the task_work reclaim initially proposed by Tejun would be
a much better fix.

I'm terribly sorry for being so annoying and stubborn and want to thank
you for all your feedback!

Thanks,
Vladimir
Tejun Heo
2015-09-04 19:30:12 UTC
Hello, Vladimir.
Post by Vladimir Davydov
Now I think the task_work reclaim initially proposed by Tejun would be
a much better fix.
Cool, I'll update the patch.
Post by Vladimir Davydov
I'm terribly sorry for being so annoying and stubborn and want to thank
you for all your feedback!
Heh, I'm not all that confident about my position. A lot of it could
be from lack of experience and failing to see the gradients. Please
keep me in check if I get lost.

Thanks a lot!
--
tejun
Michal Hocko
2015-09-04 14:38:55 UTC
Post by Vladimir Davydov
[
I'll try to summarize my point in one hunk instead of spreading it all
over the e-mail, because IMO it's becoming kind of difficult to
follow. If you think there's a question I've dodged, please let me
know and I'll try to address it separately.
Also, adding Johannes to Cc (I noticed that I accidentally left him
out), because this discussion seems to be fundamental and may affect
our further steps dramatically.
]
[...]
Post by Vladimir Davydov
Post by Michal Hocko
I guess we are still not on the same page here. If the slab has a
subtle behavior (and from what you are saying it seems it has the same
behavior at the global scope), then we should strive to fix it rather
than making it more obscure just to avoid exposing GFP_NOWAIT to memcg,
which currently is not handled properly wrt. the high limit (more on
that below) and which was the primary motivation for the patch AFAIU.
Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
alloc_pages with the caller's context, it does the job normally done by
alloc_pages itself. That's not something done commonly.
Leaving the slab charge path as is looks really ugly to me. Look, slab
iterates over all nodes, checking whether they have free pages, and
fails even when they do because of the memcg constraint...
Yes, I understand what you are saying. The way SLAB does its thing
is really subtle. The special combination of flags even prevents
background reclaim, which is weird. There was probably a good reason
for that, but the point I've tried to make is that if the heuristic
relies on non-reclaiming behavior in the global case, then memcg should
copy that as much as possible. The allocator has to be prepared for the
non-sleeping allocation to fail, and the fact that memcg causes that
sooner is just natural, because that is what memcg is used for.

I see how you try to optimize around this subtle behavior, but that
only makes it even more subtle in the long term.
Post by Vladimir Davydov
My point is that what slab does is a pretty low-level thing; normal
users call alloc_pages or kmalloc with flags corresponding to their
context. Of course, there may be special users optimistically trying
GFP_NOWAIT, but they aren't massive, and that simplifies things for
memcg a lot.
memcg code _absolutely_ has to deal with NOWAIT requests somehow. I can
see more and more of them coming long term, because it makes a lot of
sense to do an opportunistic allocation with a fallback. And that was
the whole point. You have started by tweaking SL.B, whereas memcg is
where we should start: see the resulting behavior and then think about
an SL.B-specific fix.
Post by Vladimir Davydov
I mean, if we can rely on the fact that the number of GFP_NOWAIT
allocations that can occur in a row is limited, we can use direct
reclaim (as memory.high does) and/or task_work reclaim to fix GFP_NOWAIT
failures. Otherwise, we have to mimic the global alloc with most of its
heuristics. I don't think that copying those heuristics is the right
thing to do, because in the memcg case the same problems may be resolved
much more easily, since we don't actually experience real memory
shortage when hitting the limit.
I am not really sure I understand what you mean here. What kind of
heuristics do you have in mind? All the memcg code cares about is to
keep the high limit contained and to converge as much as possible.
Post by Vladimir Davydov
Moreover, we already treat some flags not in the same way as in the
case of slab, for simplicity. E.g. we let __GFP_NOFAIL allocations go
uncharged instead of retrying infinitely.
Yes, we rely on the global MM to handle those, which is a reasonable
compromise IMO. Such a strong liability cannot realistically be handled
inside memcg without causing more problems.
Post by Vladimir Davydov
We ignore the __GFP_THISNODE thing; we just cannot take it into account.
Yes, because it is an allocation mode, not a reclaim-related one. There
is a reason it is not part of GFP_RECLAIM_MASK.
Post by Vladimir Davydov
We ignore allocation order, because that makes no sense for memcg.
We are not ignoring it completely, because we base our reclaim target on
it.
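
E.g. roughly (a sketch with assumed 0-on-success try_charge semantics,
not the actual memcontrol.c code; assumes the caller may sleep):

#include <linux/gfp.h>
#include <linux/page_counter.h>
#include <linux/swap.h>

/* hypothetical: the reclaim target scales with the allocation order */
static int charge_order_sketch(struct mem_cgroup *memcg, gfp_t gfp_mask,
			       unsigned int order)
{
	unsigned long nr_pages = 1UL << order;
	struct page_counter *fail;

	while (page_counter_try_charge(&memcg->memory, nr_pages, &fail)) {
		if (!try_to_free_mem_cgroup_pages(memcg, nr_pages,
						  gfp_mask, true))
			return -ENOMEM;	/* nothing reclaimable */
	}
	return 0;
}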
Post by Vladimir Davydov
1. Make the memcg try_charge mimic alloc_pages behavior.
2. Make API functions (kmalloc, etc.) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low-level subsys (slab) and memcg private.
Way 1 might look appealing at first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise under global memory
pressure but are impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly into alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use in
places where users depend on buddy allocator peculiarities? There are
not many such users.
Because the more consistent the allocation and charging paths are in
their reclaim behavior, the easier the system will be to understand and
maintain.
Post by Vladimir Davydov
I understand that the idea of way 1 is to provide a well-defined memcg
API independent of the rest of the code, but that's just impossible. You
need special casing anyway. E.g. you need those get/put_kmem_cache
helpers, which exist solely for SLAB/SLUB. You need all this special
stuff for growing per-memcg array in list_lru and kmem_cache, which
exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
even handle kmem_cache destruction on memcg offline differently for SLAB
and SLUB for performance reasons.
Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
for optimization, but their API is well defined, so we just make kmalloc
work as expected while providing inter-subsys calls, like
memcg_charge_slab, for SLAB/SLUB that have their own conventions.
I do agree that we might end up needing SL.B-specific hacks but, again,
let's get there only when we see that the memcg code cannot cope by
default. E.g. our currently non-existent NOWAIT logic would fail too
often because of the high limit, which would lead to non-optimal NUMA
behavior in SLAB.
Post by Vladimir Davydov
You mentioned kmem users that allocate memory using alloc_pages. There
is an API function for them too, alloc_kmem_pages. Everything behind
the API is hidden and may be done in such a way as to achieve optimal
performance.
--
Michal Hocko
SUSE Labs