[patch 0/7] improve memcg oom killer robustness v2
Johannes Weiner
2013-08-03 16:59:53 UTC
Changes in version 2:
o use user_mode() instead of open coding it on s390 (Heiko Carstens)
o clean up memcg OOM enable/disable toggling (Michal Hocko & KOSAKI
Motohiro)
o add a separate patch to rework and document OOM locking
o fix a problem with lost wakeups when sleeping on the OOM lock
o fix OOM unlocking & wakeups with userspace OOM handling

The memcg code can trap tasks in the context of the failing allocation
until an OOM situation is resolved. They can hold all kinds of locks
(fs, mm) at this point, which makes it prone to deadlocking.

This series converts memcg OOM handling into a two-step process that
is started in the charge context, but any waiting is done after the
fault stack is fully unwound.

Patches 1-4 prepare architecture handlers to support the new memcg
requirements, but in doing so they also remove old cruft and unify
out-of-memory behavior across architectures.

Patch 5 disables the memcg OOM handling for syscalls, readahead, and
kernel faults, because they can gracefully unwind the stack with
-ENOMEM. OOM handling is restricted to user-triggered faults that
have no other option.

Patch 6 reworks memcg's hierarchical OOM locking to make it a little
more obvious what is going on in there: reduce locked regions, rename
locking functions, reorder and document.

Patch 7 implements the two-part OOM handling such that tasks are never
trapped with the full charge stack in an OOM situation.
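
As a rough illustration of the two-step idea described above, here is a
small userspace model in C. It is only a sketch: struct task_model and
every *_model function are hypothetical names made up for this note and
are not the identifiers used in the patches. The charge step only records
that an OOM happened and returns -ENOMEM; the waiting step runs after the
modelled fault stack has dropped its locks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-task state; not the kernel's task_struct. */
struct task_model {
	bool oom_pending;	/* set in the charge context, consumed later */
	int locks_held;		/* locks taken on the way down the fault stack */
	int oom_waits;		/* times the task waited for OOM resolution */
};

/* Step 1: the charge fails deep in the fault stack. Record it, never sleep. */
static int charge_model(struct task_model *t, bool over_limit)
{
	if (!over_limit)
		return 0;
	t->oom_pending = true;
	return -ENOMEM;		/* passed up while the stack unwinds */
}

/* Models the fault path: take locks, try to charge, drop locks on unwind. */
static int handle_fault_model(struct task_model *t, bool over_limit)
{
	int ret;

	t->locks_held += 2;	/* e.g. one fs lock and one mm lock */
	ret = charge_model(t, over_limit);
	t->locks_held -= 2;
	return ret;
}

/* Step 2: runs after unwinding, so waiting cannot block a lock holder. */
static bool oom_synchronize_model(struct task_model *t)
{
	if (!t->oom_pending)
		return false;
	assert(t->locks_held == 0);	/* the property the series is after */
	t->oom_pending = false;
	t->oom_waits++;			/* stands in for sleeping on the OOM waitqueue */
	return true;
}

/* Models the arch fault handler: fault first, deal with OOM afterwards. */
int page_fault_model(struct task_model *t, bool over_limit)
{
	int ret = handle_fault_model(t, over_limit);

	if (ret)
		oom_synchronize_model(t);
	return ret;
}
```

The point of the split is visible in the assert: by the time the task
waits, it no longer holds any of the locks it took while faulting.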

arch/alpha/mm/fault.c | 7 +-
arch/arc/mm/fault.c | 11 +--
arch/arm/mm/fault.c | 23 +++--
arch/arm64/mm/fault.c | 23 +++--
arch/avr32/mm/fault.c | 4 +-
arch/cris/mm/fault.c | 6 +-
arch/frv/mm/fault.c | 10 +-
arch/hexagon/mm/vm_fault.c | 6 +-
arch/ia64/mm/fault.c | 6 +-
arch/m32r/mm/fault.c | 10 +-
arch/m68k/mm/fault.c | 2 +
arch/metag/mm/fault.c | 6 +-
arch/microblaze/mm/fault.c | 7 +-
arch/mips/mm/fault.c | 8 +-
arch/mn10300/mm/fault.c | 2 +
arch/openrisc/mm/fault.c | 1 +
arch/parisc/mm/fault.c | 7 +-
arch/powerpc/mm/fault.c | 7 +-
arch/s390/mm/fault.c | 2 +
arch/score/mm/fault.c | 13 ++-
arch/sh/mm/fault.c | 9 +-
arch/sparc/mm/fault_32.c | 12 ++-
arch/sparc/mm/fault_64.c | 8 +-
arch/tile/mm/fault.c | 13 +--
arch/um/kernel/trap.c | 22 +++--
arch/unicore32/mm/fault.c | 22 +++--
arch/x86/mm/fault.c | 43 ++++-----
arch/xtensa/mm/fault.c | 2 +
include/linux/memcontrol.h | 65 +++++++++++++
include/linux/mm.h | 1 +
include/linux/sched.h | 7 ++
mm/filemap.c | 11 ++-
mm/memcontrol.c | 229 +++++++++++++++++++++++++++++----------------
mm/memory.c | 43 +++++++--
mm/oom_kill.c | 7 +-
35 files changed, 444 insertions(+), 211 deletions(-)
Johannes Weiner
2013-08-03 16:59:55 UTC
Kernel faults are expected to handle OOM conditions gracefully (gup,
uaccess etc.), so they should never invoke the OOM killer. Reserve
this for faults triggered in user context when it is the only option.

Most architectures already do this; fix up the remaining few.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Reviewed-by: Michal Hocko <***@suse.cz>
Acked-by: KOSAKI Motohiro <***@jp.fujitsu.com>
---
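The reordering in the hunks below boils down to checking the fault
origin before looking at the OOM bit. A minimal standalone sketch of
that decision order follows; the MODEL_* values, the enum, and the
function names are invented for illustration and do not match the
kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's VM_FAULT_* constants. */
#define MODEL_FAULT_OOM		0x1u
#define MODEL_FAULT_SIGBUS	0x2u

enum fault_outcome_model {
	OUTCOME_OK,		/* fault handled, nothing to do */
	OUTCOME_NO_CONTEXT,	/* kernel fault: exception fixup or oops path */
	OUTCOME_OOM_KILLER,	/* user fault that ran out of memory */
	OUTCOME_SIGBUS,		/* user fault that could not be fixed up */
};

/* Models the tail of a fault handler after handle_mm_fault() returned. */
enum fault_outcome_model fault_tail_model(unsigned int fault, bool from_user)
{
	if (!(fault & (MODEL_FAULT_OOM | MODEL_FAULT_SIGBUS)))
		return OUTCOME_OK;

	/* Checked first, so kernel faults never reach the OOM killer. */
	if (!from_user)
		return OUTCOME_NO_CONTEXT;

	if (fault & MODEL_FAULT_OOM)
		return OUTCOME_OOM_KILLER;

	return OUTCOME_SIGBUS;
}

/* Usage example: the same OOM result is treated differently by origin. */
void fault_tail_selfcheck(void)
{
	assert(fault_tail_model(MODEL_FAULT_OOM, true) == OUTCOME_OOM_KILLER);
	assert(fault_tail_model(MODEL_FAULT_OOM, false) == OUTCOME_NO_CONTEXT);
}
```
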
arch/arm/mm/fault.c | 14 +++++++-------
arch/arm64/mm/fault.c | 14 +++++++-------
arch/avr32/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 ++
arch/um/kernel/trap.c | 2 ++
arch/unicore32/mm/fault.c | 14 +++++++-------
6 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c97f794..217bcbf 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -349,6 +349,13 @@ retry:
if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
return 0;

+ /*
+ * If we are in kernel mode at this point, we
+ * have no context to handle this fault with.
+ */
+ if (!user_mode(regs))
+ goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
* We ran out of memory, call the OOM killer, and return to
@@ -359,13 +366,6 @@ retry:
return 0;
}

- /*
- * If we are in kernel mode at this point, we
- * have no context to handle this fault with.
- */
- if (!user_mode(regs))
- goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
* We had some memory, but were unable to
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 0ecac89..dab1cfd 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -294,6 +294,13 @@ retry:
VM_FAULT_BADACCESS))))
return 0;

+ /*
+ * If we are in kernel mode at this point, we have no context to
+ * handle this fault with.
+ */
+ if (!user_mode(regs))
+ goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
* We ran out of memory, call the OOM killer, and return to
@@ -304,13 +311,6 @@ retry:
return 0;
}

- /*
- * If we are in kernel mode at this point, we have no context to
- * handle this fault with.
- */
- if (!user_mode(regs))
- goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
* We had some memory, but were unable to successfully fix up
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index b2f2d2d..2ca27b0 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -228,9 +228,9 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- pagefault_out_of_memory();
if (!user_mode(regs))
goto no_context;
+ pagefault_out_of_memory();
return;

do_sigbus:
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 85df1cd..94d3a31 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -241,6 +241,8 @@ out_of_memory:
* (which will retry the fault, or kill us if we got oom-killed).
*/
up_read(&mm->mmap_sem);
+ if (!user_mode(regs))
+ goto no_context;
pagefault_out_of_memory();
return;

diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 089f398..b2f5adf 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -124,6 +124,8 @@ out_of_memory:
* (which will retry the fault, or kill us if we got oom-killed).
*/
up_read(&mm->mmap_sem);
+ if (!is_user)
+ goto out_nosemaphore;
pagefault_out_of_memory();
return 0;
}
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index f9b5c10..8ed3c45 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -278,6 +278,13 @@ retry:
(VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
return 0;

+ /*
+ * If we are in kernel mode at this point, we
+ * have no context to handle this fault with.
+ */
+ if (!user_mode(regs))
+ goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
* We ran out of memory, call the OOM killer, and return to
@@ -288,13 +295,6 @@ retry:
return 0;
}

- /*
- * If we are in kernel mode at this point, we
- * have no context to handle this fault with.
- */
- if (!user_mode(regs))
- goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
* We had some memory, but were unable to
--
1.8.3.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2013-08-03 16:59:54 UTC
Back before smart OOM killing, when faulting tasks were killed
directly on allocation failures, the arch-specific fault handlers
needed special protection for the init process.

Now that all fault handlers call into the generic OOM killer (609838c
"mm: invoke oom-killer from remaining unconverted page fault
handlers"), which already provides init protection, the arch-specific
leftovers can be removed.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Reviewed-by: Michal Hocko <***@suse.cz>
Acked-by: KOSAKI Motohiro <***@jp.fujitsu.com>
---
arch/arc/mm/fault.c | 5 -----
arch/score/mm/fault.c | 6 ------
arch/tile/mm/fault.c | 6 ------
3 files changed, 17 deletions(-)

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 0fd1f0d..6b0bb41 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -122,7 +122,6 @@ good_area:
goto bad_area;
}

-survive:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -201,10 +200,6 @@ no_context:
die("Oops", regs, address);

out_of_memory:
- if (is_global_init(tsk)) {
- yield();
- goto survive;
- }
up_read(&mm->mmap_sem);

if (user_mode(regs)) {
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 6b18fb0..4b71a62 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -100,7 +100,6 @@ good_area:
goto bad_area;
}

-survive:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -167,11 +166,6 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
if (!user_mode(regs))
goto no_context;
pagefault_out_of_memory();
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index f7f99f9..ac553ee 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -430,7 +430,6 @@ good_area:
goto bad_area;
}

- survive:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -568,11 +567,6 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
if (is_kernel_mode)
goto no_context;
pagefault_out_of_memory();
--
1.8.3.2

Vineet Gupta
2013-08-06 06:34:48 UTC
Hi Johannes,

Thanks for the cleanup.

On 08/03/2013 10:29 PM, Johannes Weiner wrote:
> Back before smart OOM killing, when faulting tasks were killed
> directly on allocation failures, the arch-specific fault handlers
> needed special protection for the init process.
>
> Now that all fault handlers call into the generic OOM killer (609838c
> "mm: invoke oom-killer from remaining unconverted page fault
> handlers"), which already provides init protection, the arch-specific
> leftovers can be removed.
>
> Signed-off-by: Johannes Weiner <***@cmpxchg.org>
> Reviewed-by: Michal Hocko <***@suse.cz>
> Acked-by: KOSAKI Motohiro <***@jp.fujitsu.com>
> ---
> arch/arc/mm/fault.c | 5 -----
> arch/score/mm/fault.c | 6 ------
> arch/tile/mm/fault.c | 6 ------
> 3 files changed, 17 deletions(-)

Acked-by: Vineet Gupta <***@synopsys.com> [arch/arc bits]

-Vineet

Johannes Weiner
2013-08-03 16:59:56 UTC
Unlike global OOM handling, memory cgroup code will invoke the OOM
killer in any OOM situation because it has no way of telling faults
occurring in kernel context - which could be handled more gracefully -
from user-triggered faults.

Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Reviewed-by: Michal Hocko <***@suse.cz>
---
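Most of the per-architecture hunks below follow one pattern: start
from the common flags, then OR in the user and write bits once they
are known. A self-contained sketch of that pattern is given here. The
MODEL_FLAG_* values are illustrative only (the real flags live in
include/linux/mm.h), and may_invoke_memcg_oom_model() is an assumption
about how generic code could consume the flag, based on the cover
letter rather than on this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's FAULT_FLAG_* constants. */
#define MODEL_FLAG_WRITE	0x01u
#define MODEL_FLAG_ALLOW_RETRY	0x02u
#define MODEL_FLAG_KILLABLE	0x04u
#define MODEL_FLAG_USER		0x08u

/* Mirrors the pattern in the arch handlers: base flags, then origin/access. */
unsigned int build_fault_flags_model(bool from_user, bool is_write)
{
	unsigned int flags = MODEL_FLAG_ALLOW_RETRY | MODEL_FLAG_KILLABLE;

	if (from_user)
		flags |= MODEL_FLAG_USER;
	if (is_write)
		flags |= MODEL_FLAG_WRITE;
	return flags;
}

/* Assumed consumer: only user-triggered faults may enter memcg OOM handling. */
bool may_invoke_memcg_oom_model(unsigned int flags)
{
	return (flags & MODEL_FLAG_USER) != 0;
}

/* Usage example: a kernel-mode write fault keeps WRITE but not USER. */
void fault_flags_selfcheck(void)
{
	unsigned int flags = build_fault_flags_model(false, true);

	assert(flags & MODEL_FLAG_WRITE);
	assert(!may_invoke_memcg_oom_model(flags));
}
```
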
arch/alpha/mm/fault.c | 7 ++++---
arch/arc/mm/fault.c | 6 ++++--
arch/arm/mm/fault.c | 9 ++++++---
arch/arm64/mm/fault.c | 9 ++++++---
arch/avr32/mm/fault.c | 2 ++
arch/cris/mm/fault.c | 6 ++++--
arch/frv/mm/fault.c | 10 ++++++----
arch/hexagon/mm/vm_fault.c | 6 ++++--
arch/ia64/mm/fault.c | 6 ++++--
arch/m32r/mm/fault.c | 10 ++++++----
arch/m68k/mm/fault.c | 2 ++
arch/metag/mm/fault.c | 6 ++++--
arch/microblaze/mm/fault.c | 7 +++++--
arch/mips/mm/fault.c | 6 ++++--
arch/mn10300/mm/fault.c | 2 ++
arch/openrisc/mm/fault.c | 1 +
arch/parisc/mm/fault.c | 7 +++++--
arch/powerpc/mm/fault.c | 7 ++++---
arch/s390/mm/fault.c | 2 ++
arch/score/mm/fault.c | 7 ++++++-
arch/sh/mm/fault.c | 9 ++++++---
arch/sparc/mm/fault_32.c | 12 +++++++++---
arch/sparc/mm/fault_64.c | 8 +++++---
arch/tile/mm/fault.c | 7 +++++--
arch/um/kernel/trap.c | 20 ++++++++++++--------
arch/unicore32/mm/fault.c | 8 ++++++--
arch/x86/mm/fault.c | 8 +++++---
arch/xtensa/mm/fault.c | 2 ++
include/linux/mm.h | 1 +
29 files changed, 132 insertions(+), 61 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 0c4132d..98838a0 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -89,8 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
const struct exception_table_entry *fixup;
int fault, si_code = SEGV_MAPERR;
siginfo_t info;
- unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (cause > 0 ? FAULT_FLAG_WRITE : 0));
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
(or is suppressed by the PALcode). Support that for older CPUs
@@ -115,7 +114,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
if (address >= TASK_SIZE)
goto vmalloc_fault;
#endif
-
+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -142,6 +142,7 @@ retry:
} else {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
}

/* If for any reason at all we couldn't handle the fault,
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 6b0bb41..d63f3de 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -60,8 +60,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address)
siginfo_t info;
int fault, ret;
int write = regs->ecr_cause & ECR_C_PROTV_STORE; /* ST/EX */
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

/*
* We fault-in kernel-space virtual memory on-demand. The
@@ -89,6 +88,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long address)
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -117,6 +118,7 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 217bcbf..eb8830a 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -261,9 +261,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
- int write = fsr & FSR_WRITE;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

if (notify_page_fault(regs, fsr))
return 0;
@@ -282,6 +280,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (fsr & FSR_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index dab1cfd..12205b4 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -208,9 +208,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
- bool write = (esr & ESR_WRITE) && !(esr & ESR_CM);
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -226,6 +224,11 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if ((esr & ESR_WRITE) && !(esr & ESR_CM))
+ flags |= FAULT_FLAG_WRITE;
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index 2ca27b0..0eca933 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -86,6 +86,8 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)

local_irq_enable();

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);

diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 73312ab..1790f22 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -58,8 +58,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
struct vm_area_struct * vma;
siginfo_t info;
int fault;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- ((writeaccess & 1) ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

D(printk(KERN_DEBUG
"Page fault for %lX on %X at %lX, prot %d write %d\n",
@@ -117,6 +116,8 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -155,6 +156,7 @@ retry:
} else if (writeaccess == 1) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 331c1e2..9a66372 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -34,11 +34,11 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
struct vm_area_struct *vma;
struct mm_struct *mm;
unsigned long _pme, lrai, lrad, fixup;
+ unsigned long flags = 0;
siginfo_t info;
pgd_t *pge;
pud_t *pue;
pte_t *pte;
- int write;
int fault;

#if 0
@@ -81,6 +81,9 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(__frame))
+ flags |= FAULT_FLAG_USER;
+
down_read(&mm->mmap_sem);

vma = find_vma(mm, ear0);
@@ -129,7 +132,6 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
*/
good_area:
info.si_code = SEGV_ACCERR;
- write = 0;
switch (esr0 & ESR0_ATXC) {
default:
/* handle write to write protected page */
@@ -140,7 +142,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
#endif
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
- write = 1;
+ flags |= FAULT_FLAG_WRITE;
break;

/* handle read from protected page */
@@ -162,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, ear0, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, ear0, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 1bd276d..8704c93 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -53,8 +53,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
int si_code = SEGV_MAPERR;
int fault;
const struct exception_table_entry *fixup;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (cause > 0 ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

/*
* If we're in an interrupt or have no user context,
@@ -65,6 +64,8 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)

local_irq_enable();

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -96,6 +97,7 @@ good_area:
case FLT_STORE:
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
break;
}

diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 6cf0341..7225dad 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -90,8 +90,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
| (((isr >> IA64_ISR_W_BIT) & 1UL) << VM_WRITE_BIT));

- flags |= ((mask & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
-
/* mmap_sem is performance critical.... */
prefetchw(&mm->mmap_sem);

@@ -119,6 +117,10 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
if (notify_page_fault(regs, TRAP_BRKPT))
return;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (mask & VM_WRITE)
+ flags |= FAULT_FLAG_WRITE;
retry:
down_read(&mm->mmap_sem);

diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 3cdfa9c..e9c6a80 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -78,7 +78,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
struct mm_struct *mm;
struct vm_area_struct * vma;
unsigned long page, addr;
- int write;
+ unsigned long flags = 0;
int fault;
siginfo_t info;

@@ -117,6 +117,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
if (in_atomic() || !mm)
goto bad_area_nosemaphore;

+ if (error_code & ACE_USERMODE)
+ flags |= FAULT_FLAG_USER;
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -166,14 +169,13 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
*/
good_area:
info.si_code = SEGV_ACCERR;
- write = 0;
switch (error_code & (ACE_WRITE|ACE_PROTECTION)) {
default: /* 3: write, present */
/* fall through */
case ACE_WRITE: /* write, not present */
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
- write++;
+ flags |= FAULT_FLAG_WRITE;
break;
case ACE_PROTECTION: /* read, present */
case 0: /* read, not present */
@@ -194,7 +196,7 @@ good_area:
*/
addr = (address & PAGE_MASK);
set_thread_fault_code(error_code);
- fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, addr, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index a563727..eb1d61f 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -88,6 +88,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);

diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index 8fddf46..332680e 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -53,8 +53,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
struct vm_area_struct *vma, *prev_vma;
siginfo_t info;
int fault;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write_access ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;

@@ -109,6 +108,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);

@@ -121,6 +122,7 @@ good_area:
if (write_access) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 731f739..fa4cf52 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -92,8 +92,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
int code = SEGV_MAPERR;
int is_write = error_code & ESR_S;
int fault;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (is_write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

regs->ear = address;
regs->esr = error_code;
@@ -121,6 +120,9 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
die("Weird page fault", regs, SIGSEGV);
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -199,6 +201,7 @@ good_area:
if (unlikely(is_write)) {
if (unlikely(!(vma->vm_flags & VM_WRITE)))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
/* protection fault */
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 94d3a31..becc42b 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -42,8 +42,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
const int field = sizeof(unsigned long) * 2;
siginfo_t info;
int fault;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

#if 0
printk("Cpu%d[%s:%d:%0*lx:%ld:%0*lx]\n", raw_smp_processor_id(),
@@ -93,6 +92,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
if (in_atomic() || !mm)
goto bad_area_nosemaphore;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -114,6 +115,7 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (cpu_has_rixi) {
if (address == regs->cp0_epc && !(vma->vm_flags & VM_EXEC)) {
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 8a2e6de..3516cbd 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -171,6 +171,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
if (in_atomic() || !mm)
goto no_context;

+ if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);

diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 4a41f84..0703acf 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -86,6 +86,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
if (user_mode(regs)) {
/* Exception was in userspace: reenable interrupts */
local_irq_enable();
+ flags |= FAULT_FLAG_USER;
} else {
/* If exception was in a syscall, then IRQ's may have
* been enabled or disabled. If they were enabled,
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index f247a34..d10d27a 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -180,6 +180,10 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (acc_type & VM_WRITE)
+ flags |= FAULT_FLAG_WRITE;
retry:
down_read(&mm->mmap_sem);
vma = find_vma_prev(mm, address, &prev_vma);
@@ -203,8 +207,7 @@ good_area:
* fault.
*/

- fault = handle_mm_fault(mm, vma, address,
- flags | ((acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0));
+ fault = handle_mm_fault(mm, vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 8726779..d9196c9 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -223,9 +223,6 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
is_write = error_code & ESR_DST;
#endif /* CONFIG_4xx || CONFIG_BOOKE */

- if (is_write)
- flags |= FAULT_FLAG_WRITE;
-
#ifdef CONFIG_PPC_ICSWX
/*
* we need to do this early because this "data storage
@@ -280,6 +277,9 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -408,6 +408,7 @@ good_area:
} else if (is_write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
/* protection fault */
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index f00aefb..35b81d6 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -302,6 +302,8 @@ static inline int do_exception(struct pt_regs *regs, int access)
address = trans_exc_code & __FAIL_ADDR_MASK;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
flags |= FAULT_FLAG_WRITE;
down_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 4b71a62..52238983 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
const int field = sizeof(unsigned long) * 2;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -75,6 +76,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
if (in_atomic() || !mm)
goto bad_area_nosemaphore;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
if (!vma)
@@ -95,6 +99,7 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
goto bad_area;
@@ -105,7 +110,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 1f49c28..541dc61 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -400,9 +400,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
struct mm_struct *mm;
struct vm_area_struct * vma;
int fault;
- int write = error_code & FAULT_CODE_WRITE;
- unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0));
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -476,6 +474,11 @@ good_area:

set_thread_fault_code(error_code);

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (error_code & FAULT_CODE_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index e98bfda..59dbd46 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -177,8 +177,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
unsigned long g2;
int from_user = !(regs->psr & PSR_PS);
int fault, code;
- unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0));
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

if (text_fault)
address = regs->pc;
@@ -235,6 +234,11 @@ good_area:
goto bad_area;
}

+ if (from_user)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -383,6 +387,7 @@ static void force_user_fault(unsigned long address, int write)
struct vm_area_struct *vma;
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ unsigned int flags = FAULT_FLAG_USER;
int code;

code = SEGV_MAPERR;
@@ -402,11 +407,12 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
}
- switch (handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0)) {
+ switch (handle_mm_fault(mm, vma, address, flags)) {
case VM_FAULT_SIGBUS:
case VM_FAULT_OOM:
goto do_sigbus;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 5062ff3..c08b9bb 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -314,8 +314,9 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
} else {
bad_kernel_pc(regs, address);
return;
- }
- }
+ }
+ } else
+ flags |= FAULT_FLAG_USER;

/*
* If we're in an interrupt or have no user
@@ -418,13 +419,14 @@ good_area:
vma->vm_file != NULL)
set_thread_fault_code(fault_code |
FAULT_CODE_BLKCOMMIT);
+
+ flags |= FAULT_FLAG_WRITE;
} else {
/* Allow reads even for write-only mappings */
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
}

- flags |= ((fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0);
fault = handle_mm_fault(mm, vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index ac553ee..3ff289f 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -280,8 +280,7 @@ static int handle_page_fault(struct pt_regs *regs,
if (!is_page_fault)
write = 1;

- flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0));
+ flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

is_kernel_mode = (EX1_PL(regs->ex1) != USER_PL);

@@ -365,6 +364,9 @@ static int handle_page_fault(struct pt_regs *regs,
goto bad_area_nosemaphore;
}

+ if (!is_kernel_mode)
+ flags |= FAULT_FLAG_USER;
+
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
@@ -425,6 +427,7 @@ good_area:
#endif
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!is_page_fault || !(vma->vm_flags & VM_READ))
goto bad_area;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index b2f5adf..5c3aef7 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -30,8 +30,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
pmd_t *pmd;
pte_t *pte;
int err = -EFAULT;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (is_write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

*code_out = SEGV_MAPERR;

@@ -42,6 +41,8 @@ int handle_page_fault(unsigned long address, unsigned long ip,
if (in_atomic())
goto out_nosemaphore;

+ if (is_user)
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -58,12 +59,15 @@ retry:

good_area:
*code_out = SEGV_ACCERR;
- if (is_write && !(vma->vm_flags & VM_WRITE))
- goto out;
-
- /* Don't require VM_READ|VM_EXEC for write faults! */
- if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
- goto out;
+ if (is_write) {
+ if (!(vma->vm_flags & VM_WRITE))
+ goto out;
+ flags |= FAULT_FLAG_WRITE;
+ } else {
+ /* Don't require VM_READ|VM_EXEC for write faults! */
+ if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
+ goto out;
+ }

do {
int fault;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 8ed3c45..0dc922d 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -209,8 +209,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- ((!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -222,6 +221,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (!(fsr ^ 0x12))
+ flags |= FAULT_FLAG_WRITE;
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 654be4a..6d77c38 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1011,9 +1011,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
unsigned long address;
struct mm_struct *mm;
int fault;
- int write = error_code & PF_WRITE;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -1083,6 +1081,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+ flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1109,6 +1108,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
return;
}

+ if (error_code & PF_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 4b7bc8d..70fa7bc 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -72,6 +72,8 @@ void do_page_fault(struct pt_regs *regs)
address, exccause, regs->pc, is_write? "w":"", is_exec? "x":"");
#endif

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5c82dc..c51fc32 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -170,6 +170,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+#define FAULT_FLAG_USER 0x80 /* The fault originated in userspace */

/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
--
1.8.3.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Andrew Morton
2013-08-05 22:06:18 UTC
On Sat, 3 Aug 2013 12:59:56 -0400 Johannes Weiner <***@cmpxchg.org> wrote:

> Unlike global OOM handling, memory cgroup code will invoke the OOM
> killer in any OOM situation because it has no way of telling faults
> occurring in kernel context - which could be handled more gracefully -
> from user-triggered faults.
>
> Pass a flag that identifies faults originating in user space from the
> architecture-specific fault handlers to generic code so that memcg OOM
> handling can be improved.

arch/arm64/mm/fault.c has changed. Here's what I came up with:

--- a/arch/arm64/mm/fault.c~arch-mm-pass-userspace-fault-flag-to-generic-fault-handler
+++ a/arch/arm64/mm/fault.c
@@ -199,13 +199,6 @@ static int __kprobes do_page_fault(unsig
unsigned long vm_flags = VM_READ | VM_WRITE | VM_EXEC;
unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

- if (esr & ESR_LNX_EXEC) {
- vm_flags = VM_EXEC;
- } else if ((esr & ESR_WRITE) && !(esr & ESR_CM)) {
- vm_flags = VM_WRITE;
- mm_flags |= FAULT_FLAG_WRITE;
- }
-
tsk = current;
mm = tsk->mm;

@@ -220,6 +213,16 @@ static int __kprobes do_page_fault(unsig
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ mm_flags |= FAULT_FLAG_USER;
+
+ if (esr & ESR_LNX_EXEC) {
+ vm_flags = VM_EXEC;
+ } else if ((esr & ESR_WRITE) && !(esr & ESR_CM)) {
+ vm_flags = VM_WRITE;
+ mm_flags |= FAULT_FLAG_WRITE;
+ }
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,

But I'm not terribly confident in it.

Johannes Weiner
2013-08-05 22:25:39 UTC
On Mon, Aug 05, 2013 at 03:06:18PM -0700, Andrew Morton wrote:
> On Sat, 3 Aug 2013 12:59:56 -0400 Johannes Weiner <***@cmpxchg.org> wrote:
>
> > Unlike global OOM handling, memory cgroup code will invoke the OOM
> > killer in any OOM situation because it has no way of telling faults
> > occurring in kernel context - which could be handled more gracefully -
> > from user-triggered faults.
> >
> > Pass a flag that identifies faults originating in user space from the
> > architecture-specific fault handlers to generic code so that memcg OOM
> > handling can be improved.
>
> arch/arm64/mm/fault.c has changed. Here's what I came up with:
>
> --- a/arch/arm64/mm/fault.c~arch-mm-pass-userspace-fault-flag-to-generic-fault-handler
> +++ a/arch/arm64/mm/fault.c
> @@ -199,13 +199,6 @@ static int __kprobes do_page_fault(unsig
> unsigned long vm_flags = VM_READ | VM_WRITE | VM_EXEC;
> unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> - if (esr & ESR_LNX_EXEC) {
> - vm_flags = VM_EXEC;
> - } else if ((esr & ESR_WRITE) && !(esr & ESR_CM)) {
> - vm_flags = VM_WRITE;
> - mm_flags |= FAULT_FLAG_WRITE;
> - }
> -
> tsk = current;
> mm = tsk->mm;
>
> @@ -220,6 +213,16 @@ static int __kprobes do_page_fault(unsig
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + mm_flags |= FAULT_FLAG_USER;
> +
> + if (esr & ESR_LNX_EXEC) {
> + vm_flags = VM_EXEC;
> + } else if ((esr & ESR_WRITE) && !(esr & ESR_CM)) {
> + vm_flags = VM_WRITE;
> + mm_flags |= FAULT_FLAG_WRITE;
> + }
> +
> /*
> * As per x86, we may deadlock here. However, since the kernel only
> * validly references user space from well defined areas of the code,
>
> But I'm not terribly confident in it.

It looks good to me. They added the vm_flags but they are not used
any earlier than the mm_flags (__do_page_fault), which I moved to the
same location as you did in this fixup.

Thanks!

Johannes Weiner
2013-08-03 16:59:57 UTC
The x86 fault handler bails in the middle of error handling when the
task has a fatal signal pending. For a subsequent patch this is a
problem in OOM situations because it relies on
pagefault_out_of_memory() being called even when the task has been
killed, to perform proper per-task OOM state unwinding.

Shortcutting the fault like this is a rather minor optimization that
saves a few instructions in rare cases. Just remove it for
user-triggered faults.

Use the opportunity to split the fault retry handling from actual
fault errors and add locking documentation that reads surprisingly
similar to ARM's.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Reviewed-by: Michal Hocko <***@suse.cz>
Acked-by: KOSAKI Motohiro <***@jp.fujitsu.com>
---
arch/x86/mm/fault.c | 35 +++++++++++++++++------------------
1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6d77c38..3aaeffc 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,23 +842,15 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
}

-static noinline int
+static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address, 0, 0);
- return 1;
+ if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+ up_read(&current->mm->mmap_sem);
+ no_context(regs, error_code, address, 0, 0);
+ return;
}
- if (!(fault & VM_FAULT_ERROR))
- return 0;

if (fault & VM_FAULT_OOM) {
/* Kernel mode? Handle exceptions or die: */
@@ -866,7 +858,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
up_read(&current->mm->mmap_sem);
no_context(regs, error_code, address,
SIGSEGV, SEGV_MAPERR);
- return 1;
+ return;
}

up_read(&current->mm->mmap_sem);
@@ -884,7 +876,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
else
BUG();
}
- return 1;
}

static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1189,9 +1180,17 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, flags);

- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
- if (mm_fault_error(regs, error_code, address, fault))
- return;
+ /*
+ * If we need to retry but a fatal signal is pending, handle the
+ * signal first. We do not need to release the mmap_sem because it
+ * would already be released in __lock_page_or_retry in mm/filemap.c.
+ */
+ if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+ return;
+
+ if (unlikely(fault & VM_FAULT_ERROR)) {
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}

/*
--
1.8.3.2

Johannes Weiner
2013-08-03 16:59:58 UTC
System calls and kernel faults (uaccess, gup) can handle an out of
memory situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really
the only option available.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
include/linux/memcontrol.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 3 +++
mm/filemap.c | 11 ++++++++++-
mm/memcontrol.c | 2 +-
mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
5 files changed, 88 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7b4d9d7..9c449c1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -125,6 +125,37 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);

+/**
+ * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
+ * @new: true to enable, false to disable
+ *
+ * Toggle whether a failed memcg charge should invoke the OOM killer
+ * or just return -ENOMEM. Returns the previous toggle state.
+ */
+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+ bool old;
+
+ old = current->memcg_oom.may_oom;
+ current->memcg_oom.may_oom = new;
+
+ return old;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+ bool old = mem_cgroup_toggle_oom(true);
+
+ WARN_ON(old == true);
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+ bool old = mem_cgroup_toggle_oom(false);
+
+ WARN_ON(old == false);
+}
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -348,6 +379,19 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
{
}

+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+ return false;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc09d21..4b3effc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1398,6 +1398,9 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
unsigned int memcg_kmem_skip_account;
+ struct memcg_oom_info {
+ unsigned int may_oom:1;
+ } memcg_oom;
#endif
#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
diff --git a/mm/filemap.c b/mm/filemap.c
index a6981fe..4a73e1a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1618,6 +1618,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
struct page *page;
+ bool memcg_oom;
pgoff_t size;
int ret = 0;

@@ -1626,7 +1627,11 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;

/*
- * Do we have something in the page cache already?
+ * Do we have something in the page cache already? Either
+ * way, try readahead, but disable the memcg OOM killer for it
+ * as readahead is optional and no errors are propagated up
+ * the fault stack. The OOM killer is enabled while trying to
+ * instantiate the faulting page individually below.
*/
page = find_get_page(mapping, offset);
if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
@@ -1634,10 +1639,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* We found the page, so try async readahead before
* waiting for the lock.
*/
+ memcg_oom = mem_cgroup_toggle_oom(false);
do_async_mmap_readahead(vma, ra, file, page, offset);
+ mem_cgroup_toggle_oom(memcg_oom);
} else if (!page) {
/* No page in the page cache at all */
+ memcg_oom = mem_cgroup_toggle_oom(false);
do_sync_mmap_readahead(vma, ra, file, offset);
+ mem_cgroup_toggle_oom(memcg_oom);
count_vm_event(PGMAJFAULT);
mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00a7a66..30ae46a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2614,7 +2614,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return CHARGE_RETRY;

/* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
+ if (!oom_check || !current->memcg_oom.may_oom)
return CHARGE_NOMEM;
/* check OOM */
if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
diff --git a/mm/memory.c b/mm/memory.c
index f2ab2a8..58ef726 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3752,22 +3752,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3851,6 +3843,34 @@ retry:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ /*
+ * Enable the memcg OOM handling for faults triggered in user
+ * space. Kernel faults are handled more gracefully.
+ */
+ if (flags & FAULT_FLAG_USER)
+ mem_cgroup_enable_oom();
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (flags & FAULT_FLAG_USER)
+ mem_cgroup_disable_oom();
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
--
1.8.3.2

Michal Hocko
2013-08-05 09:18:02 UTC
On Sat 03-08-13 12:59:58, Johannes Weiner wrote:
> System calls and kernel faults (uaccess, gup) can handle an out of
> memory situation gracefully and just return -ENOMEM.
>
> Enable the memcg OOM killer only for user faults, where it's really
> the only option available.
>
> Signed-off-by: Johannes Weiner <***@cmpxchg.org>

Looks better
Acked-by: Michal Hocko <***@suse.cz>

Thanks

--
Michal Hocko
SUSE Labs

Johannes Weiner
2013-08-03 16:59:59 UTC
The memcg OOM handler open-codes a sleeping lock for OOM serialization
(trylock, wait, repeat) because the required locking is so specific to
memcg hierarchies. However, it would be nice if this construct would
be clearly recognizable and not be as obfuscated as it is right now.
Clean up as follows:

1. Remove the return value of mem_cgroup_oom_unlock()

2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
makes it more obvious that the task has to be on the waitqueue
before attempting to OOM-trylock the hierarchy, to not miss any
wakeups before going to sleep. It just didn't matter until now
because it was all lumped together into the global memcg_oom_lock
spinlock section.

4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
It is protected by the hierarchical OOM-lock.

5. The memcg_oom_lock spinlock is only required to propagate the OOM
lock in any given hierarchy atomically. Restrict its scope to
mem_cgroup_oom_(trylock|unlock).

6. Do not wake up the waitqueue unconditionally at the end of the
function. Only the lockholder has to wake up the next in line
after releasing the lock.

Note that the lockholder kicks off the OOM-killer, which in turn
leads to wakeups from the uncharges of the exiting task. But a
contender is not guaranteed to see them if it enters the OOM path
after the OOM kills but before the lockholder releases the lock.
Thus there has to be an explicit wakeup after releasing the lock.

7. Put the OOM task on the waitqueue before marking the hierarchy as
under OOM as that is the point where we start to receive wakeups.
No point in listening before being on the waitqueue.

8. Likewise, unmark the hierarchy before finishing the sleep, for
symmetry.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Acked-by: Michal Hocko <***@suse.cz>
---
mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++--------------------------
1 file changed, 47 insertions(+), 38 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 30ae46a..3d0c1d3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2076,15 +2076,18 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
return total;
}

+static DEFINE_SPINLOCK(memcg_oom_lock);
+
/*
* Check OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
- * Has to be called with memcg_oom_lock
*/
-static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
+static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter, *failed = NULL;

+ spin_lock(&memcg_oom_lock);
+
for_each_mem_cgroup_tree(iter, memcg) {
if (iter->oom_lock) {
/*
@@ -2098,33 +2101,33 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
iter->oom_lock = true;
}

- if (!failed)
- return true;
-
- /*
- * OK, we failed to lock the whole subtree so we have to clean up
- * what we set up to the failing subtree
- */
- for_each_mem_cgroup_tree(iter, memcg) {
- if (iter == failed) {
- mem_cgroup_iter_break(memcg, iter);
- break;
+ if (failed) {
+ /*
+ * OK, we failed to lock the whole subtree so we have
+ * to clean up what we set up to the failing subtree
+ */
+ for_each_mem_cgroup_tree(iter, memcg) {
+ if (iter == failed) {
+ mem_cgroup_iter_break(memcg, iter);
+ break;
+ }
+ iter->oom_lock = false;
}
- iter->oom_lock = false;
- }
- return false;
+ }
+
+ spin_unlock(&memcg_oom_lock);
+
+ return !failed;
}

-/*
- * Has to be called with memcg_oom_lock
- */
-static int mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter;

+ spin_lock(&memcg_oom_lock);
for_each_mem_cgroup_tree(iter, memcg)
iter->oom_lock = false;
- return 0;
+ spin_unlock(&memcg_oom_lock);
}

static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
@@ -2148,7 +2151,6 @@ static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
atomic_add_unless(&iter->under_oom, -1, 0);
}

-static DEFINE_SPINLOCK(memcg_oom_lock);
static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);

struct oom_wait_info {
@@ -2195,45 +2197,52 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
int order)
{
struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked;

owait.memcg = memcg;
owait.wait.flags = 0;
owait.wait.func = memcg_oom_wake_function;
owait.wait.private = current;
INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);

- /* At first, try to OOM lock hierarchy under memcg.*/
- spin_lock(&memcg_oom_lock);
- locked = mem_cgroup_oom_lock(memcg);
/*
+ * As with any blocking lock, a contender needs to start
+ * listening for wakeups before attempting the trylock,
+ * otherwise it can miss the wakeup from the unlock and sleep
+ * indefinitely. This is just open-coded because our locking
+ * is so particular to memcg hierarchies.
+ *
* Even if signal_pending(), we can't quit charge() loop without
* accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
* under OOM is always welcomed, use TASK_KILLABLE here.
*/
prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- if (!locked || memcg->oom_kill_disable)
- need_to_kill = false;
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
if (locked)
mem_cgroup_oom_notify(memcg);
- spin_unlock(&memcg_oom_lock);

- if (need_to_kill) {
+ if (locked && !memcg->oom_kill_disable) {
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask, order);
} else {
schedule();
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
}
- spin_lock(&memcg_oom_lock);
- if (locked)
- mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);

- mem_cgroup_unmark_under_oom(memcg);
+ if (locked) {
+ mem_cgroup_oom_unlock(memcg);
+ /*
+ * There is no guarantee that an OOM-lock contender
+ * sees the wakeups triggered by the OOM kill
+ * uncharges. Wake any sleepers explicitly.
+ */
+ memcg_oom_recover(memcg);
+ }

if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
return false;
--
1.8.3.2
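
The tail of mem_cgroup_oom_trylock() above is easier to follow in isolation. Below is a minimal userspace sketch in C of the same idea: try to lock every group in a subtree, and on a conflict roll back only the groups locked so far. It is a model under stated assumptions, not the kernel code: struct group, subtree_oom_trylock() and subtree_oom_unlock() are hypothetical names, the hierarchy is flattened into an array in walk order, and the memcg_oom_lock spinlock is omitted because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; only the flag we need. */
struct group {
	bool oom_lock;
};

/*
 * Try to OOM-lock all n groups of a subtree, given in walk order.
 * Returns true if every group was locked. On a conflict, the groups
 * locked by this call are unlocked again and false is returned; the
 * group that was already locked by somebody else is left untouched.
 */
static bool subtree_oom_trylock(struct group *tree, size_t n)
{
	size_t failed = n;	/* index of the first conflict, n if none */
	size_t i;

	assert(tree != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (tree[i].oom_lock) {
			failed = i;
			break;
		}
		tree[i].oom_lock = true;
	}

	if (failed == n)
		return true;

	/* Roll back, stopping right before the conflicting group. */
	for (i = 0; i < failed; i++)
		tree[i].oom_lock = false;

	return false;
}

/* Unconditionally release the OOM lock on the whole subtree. */
static void subtree_oom_unlock(struct group *tree, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		tree[i].oom_lock = false;
}
```

The point of stopping the rollback at the conflicting group is that its lock belongs to another contender; clearing it would let two tasks believe they own the same part of the hierarchy.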

Johannes Weiner
2013-08-03 17:00:00 UTC
The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex. The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations. In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit. But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/<pid>,
which tries to acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM
and makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
(either the kernel OOM killer or a userspace handler), don't go to
sleep in the charge context. Instead, remember the OOMing memcg in
the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.
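
The two cases above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel implementation: struct memcg_model, struct task_model, model_charge_oom() and model_oom_synchronize() are hypothetical names, the OOM lock is a single flag rather than a hierarchy, "sleeping" is reduced to a return value, and the locks_held counter exists only to express the rule that phase two runs with the fault stack unwound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real kernel structures. */
struct memcg_model {
	int oom_wakeups;	/* bumped whenever sleepers are woken */
	bool oom_lock;		/* somebody is handling the OOM */
};

struct task_model {
	bool may_oom;
	bool in_memcg_oom;
	bool oom_locked;
	int wakeups;		/* snapshot of oom_wakeups at OOM time */
	struct memcg_model *wait_on_memcg;
	int locks_held;		/* locks held by the task right now */
};

enum sync_result { SYNC_NOT_MEMCG, SYNC_RESTART, SYNC_WOULD_SLEEP };

/* Phase 1, charge context: kill or record state, but never sleep. */
static void model_charge_oom(struct task_model *t, struct memcg_model *m,
			     bool kill_disabled)
{
	int wakeups;
	bool locked;

	if (!t->may_oom)
		return;
	t->in_memcg_oom = true;

	wakeups = m->oom_wakeups;	/* snapshot before the trylock */
	locked = !m->oom_lock;
	if (locked)
		m->oom_lock = true;

	if (locked && !kill_disabled) {
		/* Case 1: the kill would happen here; then unlock and wake. */
		m->oom_lock = false;
		m->oom_wakeups++;
	} else {
		/* Case 2: remember the context, wait after unwinding. */
		t->oom_locked = locked;
		t->wakeups = wakeups;
		t->wait_on_memcg = m;
	}
}

/* Phase 2, end of the fault: the only place a task may wait. */
static enum sync_result model_oom_synchronize(struct task_model *t)
{
	struct memcg_model *m;
	enum sync_result res = SYNC_RESTART;

	assert(t->locks_held == 0);	/* stack must be unwound */

	if (!t->in_memcg_oom)
		return SYNC_NOT_MEMCG;

	m = t->wait_on_memcg;
	if (m) {
		/* Only wait if no wakeup happened since the snapshot. */
		if (m->oom_wakeups == t->wakeups)
			res = SYNC_WOULD_SLEEP;
		if (t->oom_locked) {
			m->oom_lock = false;
			m->oom_wakeups++;
		}
		t->oom_locked = false;
		t->wait_on_memcg = NULL;
	}
	t->in_memcg_oom = false;
	return res;
}
```

In this model a task that found the OOM already being handled returns from phase one immediately and only waits in phase two, where it holds no locks the victim or a userspace handler could need.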

Reported-by: azurIt <***@pobox.sk>
Debugged-by: Michal Hocko <***@suse.cz>
Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
include/linux/memcontrol.h | 21 +++++++
include/linux/sched.h | 4 ++
mm/memcontrol.c | 154 +++++++++++++++++++++++++++++++--------------
mm/memory.c | 3 +
mm/oom_kill.c | 7 ++-
5 files changed, 140 insertions(+), 49 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9c449c1..cb84058 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -131,6 +131,10 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
*
* Toggle whether a failed memcg charge should invoke the OOM killer
* or just return -ENOMEM. Returns the previous toggle state.
+ *
+ * NOTE: Any path that enables the OOM killer before charging must
+ * call mem_cgroup_oom_synchronize() afterward to finalize the
+ * OOM handling and clean up.
*/
static inline bool mem_cgroup_toggle_oom(bool new)
{
@@ -156,6 +160,13 @@ static inline void mem_cgroup_disable_oom(void)
WARN_ON(old == false);
}

+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return p->memcg_oom.in_memcg_oom;
+}
+
+bool mem_cgroup_oom_synchronize(void);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -392,6 +403,16 @@ static inline void mem_cgroup_disable_oom(void)
{
}

+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4b3effc..4593e27 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,10 @@ struct task_struct {
unsigned int memcg_kmem_skip_account;
struct memcg_oom_info {
unsigned int may_oom:1;
+ unsigned int in_memcg_oom:1;
+ unsigned int oom_locked:1;
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
} memcg_oom;
#endif
#ifdef CONFIG_UPROBES
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3d0c1d3..b30c67a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -280,6 +280,7 @@ struct mem_cgroup {

bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;

int swappiness;
/* OOM-Killer disable */
@@ -2180,6 +2181,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -2191,19 +2193,17 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}

/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
- int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- struct oom_wait_info owait;
bool locked;
+ int wakeups;

- owait.memcg = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
+ if (!current->memcg_oom.may_oom)
+ return;
+
+ current->memcg_oom.in_memcg_oom = 1;

/*
* As with any blocking lock, a contender needs to start
@@ -2211,12 +2211,8 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
* otherwise it can miss the wakeup from the unlock and sleep
* indefinitely. This is just open-coded because our locking
* is so particular to memcg hierarchies.
- *
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
*/
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ wakeups = atomic_read(&memcg->oom_wakeups);
mem_cgroup_mark_under_oom(memcg);

locked = mem_cgroup_oom_trylock(memcg);
@@ -2226,15 +2222,95 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,

if (locked && !memcg->oom_kill_disable) {
mem_cgroup_unmark_under_oom(memcg);
- finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask, order);
+ mem_cgroup_oom_unlock(memcg);
+ /*
+ * There is no guarantee that an OOM-lock contender
+ * sees the wakeups triggered by the OOM kill
+ * uncharges. Wake any sleepers explicitly.
+ */
+ memcg_oom_recover(memcg);
} else {
- schedule();
- mem_cgroup_unmark_under_oom(memcg);
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this
+ * is a page fault and somebody else is handling the
+ * OOM already, we need to sleep on the OOM waitqueue
+ * for this memcg until the situation is resolved.
+ * Which can take some time because it might be
+ * handled by a userspace task.
+ *
+ * However, this is the charge context, which means
+ * that we may sit on a large call stack and hold
+ * various filesystem locks, the mmap_sem etc. and we
+ * don't want the OOM handler to deadlock on them
+ * while we sit here and wait. Store the current OOM
+ * context in the task_struct, then return -ENOMEM.
+ * At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check
+ * back with us by calling
+ * mem_cgroup_oom_synchronize(), possibly putting the
+ * task to sleep.
+ */
+ current->memcg_oom.oom_locked = locked;
+ current->memcg_oom.wakeups = wakeups;
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
}
+}
+
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ *
+ * This has to be called at the end of a page fault if the memcg
+ * OOM handler was enabled and the fault is returning %VM_FAULT_OOM.
+ *
+ * Memcg supports userspace OOM handling, so failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to put the task to sleep and clean up the
+ * OOM state.
+ *
+ * Returns %true if an ongoing memcg OOM situation was detected and
+ * finalized, %false otherwise.
+ */
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;
+
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
+ return false;
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_memcg;
+
+ owait.memcg = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);

- if (locked) {
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_memcg:
+ mem_cgroup_unmark_under_oom(memcg);
+ if (current->memcg_oom.oom_locked) {
mem_cgroup_oom_unlock(memcg);
/*
* There is no guarantee that an OOM-lock contender
@@ -2243,11 +2319,10 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
*/
memcg_oom_recover(memcg);
}
-
- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
- return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}

@@ -2560,12 +2635,11 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages, unsigned int min_pages,
- bool oom_check)
+ bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2622,14 +2696,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check || !current->memcg_oom.may_oom)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2732,7 +2802,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2740,14 +2810,8 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
- oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
+ nr_pages, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2760,16 +2824,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 58ef726..91da6fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3868,6 +3868,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_disable_oom();

+ if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM)))
+ mem_cgroup_oom_synchronize();
+
return ret;
}

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 98e75f2..314e9d2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,9 +678,12 @@ out:
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist = node_zonelist(first_online_node,
- GFP_KERNEL);
+ struct zonelist *zonelist;

+ if (mem_cgroup_oom_synchronize())
+ return;
+
+ zonelist = node_zonelist(first_online_node, GFP_KERNEL);
if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
clear_zonelist_oom(zonelist, GFP_KERNEL);
--
1.8.3.2
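
The change to the charge loop is small but easy to misread in diff form. The following userspace C sketch models only the control flow for a charge that keeps failing with CHARGE_NOMEM. It is an illustration under assumptions: model_failing_charge() and struct charge_trace are hypothetical, the reclaim_retries parameter stands in for the initial value of nr_oom_retries (assumed here to start at MEM_CGROUP_RECLAIM_RETRIES, which this diff does not show), and no reclaim or OOM kill actually takes place.

```c
#include <assert.h>
#include <stdbool.h>

/* What the modelled loop did before bailing out with -ENOMEM. */
struct charge_trace {
	int attempts;		/* calls to the charge function */
	int oom_invocations;	/* attempts that had invoke_oom set */
};

/*
 * Model of the retry loop when every charge attempt fails with
 * CHARGE_NOMEM. With OOM handling enabled, the loop retries until
 * the retry budget is used up, invokes the OOM path on the final
 * attempt and then fails instead of looping on.
 */
static struct charge_trace model_failing_charge(bool oom, int reclaim_retries)
{
	struct charge_trace trace = { 0, 0 };
	int nr_oom_retries = reclaim_retries;

	assert(reclaim_retries >= 0);

	for (;;) {
		bool invoke_oom = oom && !nr_oom_retries;

		trace.attempts++;
		if (invoke_oom)
			trace.oom_invocations++;

		/* CHARGE_NOMEM: give up if OOM is off or was just invoked. */
		if (!oom || invoke_oom)
			break;
		nr_oom_retries--;
	}
	return trace;
}
```

With a budget of five retries and OOM handling enabled, the model makes six attempts and invokes the OOM path exactly once, on the last one; with OOM handling disabled it fails after the first attempt.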

Michal Hocko
2013-08-05 09:54:29 UTC
On Sat 03-08-13 13:00:00, Johannes Weiner wrote:
> The memcg OOM handling is incredibly fragile and can deadlock. When a
> task fails to charge memory, it invokes the OOM killer and loops right
> there in the charge code until it succeeds. Comparably, any other
> task that enters the charge path at this point will go to a waitqueue
> right then and there and sleep until the OOM situation is resolved.
> The problem is that these tasks may hold filesystem locks and the
> mmap_sem; locks that the selected OOM victim may need to exit.
>
> For example, in one reported case, the task invoking the OOM killer
> was about to charge a page cache page during a write(), which holds
> the i_mutex. The OOM killer selected a task that was just entering
> truncate() and trying to acquire the i_mutex:
>
> OOM invoking task:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> OOM kill victim:
> [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The OOM handling task will retry the charge indefinitely while the OOM
> killed task is not releasing any resources.
>
> A similar scenario can happen when the kernel OOM killer for a memcg
> is disabled and a userspace task is in charge of resolving OOM
> situations. In this case, ALL tasks that enter the OOM path will be
> made to sleep on the OOM waitqueue and wait for userspace to free
> resources or increase the group's limit. But a userspace OOM handler
> is prone to deadlock itself on the locks held by the waiting tasks.
> For example one of the sleeping tasks may be stuck in a brk() call
> with the mmap_sem held for writing but the userspace handler, in order
> to pick an optimal victim, may need to read files from /proc/<pid>,
> which tries to acquire the same mmap_sem for reading and deadlocks.
>
> This patch changes the way tasks behave after detecting a memcg OOM
> and makes sure nobody loops or sleeps with locks held:
>
> 1. When OOMing in a user fault, invoke the OOM killer and restart the
> fault instead of looping on the charge attempt. This way, the OOM
> victim can not get stuck on locks the looping task may hold.
>
> 2. When OOMing in a user fault but somebody else is handling it
> (either the kernel OOM killer or a userspace handler), don't go to
> sleep in the charge context. Instead, remember the OOMing memcg in
> the task struct and then fully unwind the page fault stack with
> -ENOMEM. pagefault_out_of_memory() will then call back into the
> memcg code to check if the -ENOMEM came from the memcg, and then
> either put the task to sleep on the memcg's OOM waitqueue or just
> restart the fault. The OOM victim can no longer get stuck on any
> lock a sleeping task may hold.
>
> Reported-by: azurIt <azurit-***@public.gmane.org>
> Debugged-by: Michal Hocko <mhocko-***@public.gmane.org>
> Signed-off-by: Johannes Weiner <hannes-***@public.gmane.org>

I was wondering whether we should add a task_in_memcg_oom() check to
the return-to-userspace path just in case, but this should be OK for
now, and new users of mem_cgroup_enable_oom() will be fought against hard.

Acked-by: Michal Hocko <mhocko-***@public.gmane.org>

Thanks


--
Michal Hocko
SUSE Labs
Johannes Weiner
2013-08-05 20:56:04 UTC
On Mon, Aug 05, 2013 at 11:54:29AM +0200, Michal Hocko wrote:
> On Sat 03-08-13 13:00:00, Johannes Weiner wrote:
> > The memcg OOM handling is incredibly fragile and can deadlock. When a
> > task fails to charge memory, it invokes the OOM killer and loops right
> > there in the charge code until it succeeds. Comparably, any other
> > task that enters the charge path at this point will go to a waitqueue
> > right then and there and sleep until the OOM situation is resolved.
> > The problem is that these tasks may hold filesystem locks and the
> > mmap_sem; locks that the selected OOM victim may need to exit.
> >
> > For example, in one reported case, the task invoking the OOM killer
> > was about to charge a page cache page during a write(), which holds
> > the i_mutex. The OOM killer selected a task that was just entering
> > truncate() and trying to acquire the i_mutex:
> >
> > OOM invoking task:
> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> > [<ffffffff8111156a>] do_sync_write+0xea/0x130
> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> > [<ffffffff81112381>] sys_write+0x51/0x90
> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > OOM kill victim:
> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> > [<ffffffff81121c90>] do_last+0x250/0xa30
> > [<ffffffff81122547>] path_openat+0xd7/0x440
> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> > [<ffffffff8110f950>] sys_open+0x20/0x30
> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > The OOM handling task will retry the charge indefinitely while the OOM
> > killed task is not releasing any resources.
> >
> > A similar scenario can happen when the kernel OOM killer for a memcg
> > is disabled and a userspace task is in charge of resolving OOM
> > situations. In this case, ALL tasks that enter the OOM path will be
> > made to sleep on the OOM waitqueue and wait for userspace to free
> > resources or increase the group's limit. But a userspace OOM handler
> > is prone to deadlock itself on the locks held by the waiting tasks.
> > For example one of the sleeping tasks may be stuck in a brk() call
> > with the mmap_sem held for writing but the userspace handler, in order
> > to pick an optimal victim, may need to read files from /proc/<pid>,
> > which tries to acquire the same mmap_sem for reading and deadlocks.
> >
> > This patch changes the way tasks behave after detecting a memcg OOM
> > and makes sure nobody loops or sleeps with locks held:
> >
> > 1. When OOMing in a user fault, invoke the OOM killer and restart the
> > fault instead of looping on the charge attempt. This way, the OOM
> > victim can not get stuck on locks the looping task may hold.
> >
> > 2. When OOMing in a user fault but somebody else is handling it
> > (either the kernel OOM killer or a userspace handler), don't go to
> > sleep in the charge context. Instead, remember the OOMing memcg in
> > the task struct and then fully unwind the page fault stack with
> > -ENOMEM. pagefault_out_of_memory() will then call back into the
> > memcg code to check if the -ENOMEM came from the memcg, and then
> > either put the task to sleep on the memcg's OOM waitqueue or just
> > restart the fault. The OOM victim can no longer get stuck on any
> > lock a sleeping task may hold.
> >
> > Reported-by: azurIt <***@pobox.sk>
> > Debugged-by: Michal Hocko <***@suse.cz>
> > Signed-off-by: Johannes Weiner <***@cmpxchg.org>
>
> I was wondering whether we should also check task_in_memcg_oom on the
> return-to-userspace path, just in case, but this should be OK for now, and
> new users of mem_cgroup_enable_oom will be fought against hard.

Absolutely, I would have liked it to be at the lowest possible point
in the stack as well, but this seemed like a good trade off. And I
expect the sites enabling and disabling memcg OOM killing to be fairly
static.
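
As a rough userspace illustration of what such an enable/disable site
looks like, the sketch below saves the previous state, disables OOM
handling around optional work, and restores it afterward, in the spirit
of mem_cgroup_toggle_oom() in this series. struct task, toggle_oom() and
optional_readahead() are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task state (current->memcg_oom). */
struct task {
	bool may_oom;
};

/* Set the new state and hand back the previous one. */
static bool toggle_oom(struct task *t, bool new_state)
{
	bool old;

	assert(t != NULL);
	old = t->may_oom;
	t->may_oom = new_state;
	return old;
}

/*
 * Optional work that must never trigger OOM handling, whatever state
 * the caller was in. Returns whether OOM handling was allowed while
 * the optional work ran (it never is).
 */
static bool optional_readahead(struct task *t)
{
	bool saved = toggle_oom(t, false);
	bool allowed_inside = t->may_oom;

	toggle_oom(t, saved);	/* restore the caller's state */
	return allowed_inside;
}
```

Because the previous state is restored rather than unconditionally
re-enabled, the same helper is safe to call from both user faults and
paths where OOM handling was never enabled.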

> Acked-by: Michal Hocko <***@suse.cz>

Thanks!
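
The two cases from the changelog quoted above can be modelled in plain
userspace C. This is only a sketch under simplifying assumptions: every
charge fails, sleeping is replaced by a counter, and struct memcg,
struct task, charge(), user_fault() and oom_synchronize() are
hypothetical names that merely mirror the roles of the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the memcg and the per-task OOM state. */
struct memcg {
	int refs;	/* models css_get()/css_put() */
	int kills;	/* OOM kills issued for this group */
	int sleepers;	/* tasks that waited on the OOM waitqueue */
};

struct task {
	bool may_oom;
	bool in_memcg_oom;
	struct memcg *wait_on_memcg;
	int locks_held;
};

/* Step 1, charge context: note the OOM and fail. Never sleep here. */
static int charge(struct task *t, struct memcg *cg, bool handled_elsewhere)
{
	if (!t->may_oom)
		return -ENOMEM;		/* syscalls etc. just see the error */

	t->in_memcg_oom = true;
	if (!handled_elsewhere) {
		cg->kills++;		/* case 1: kill, then restart the fault */
	} else {
		cg->refs++;		/* case 2: remember where to wait */
		t->wait_on_memcg = cg;
	}
	return -ENOMEM;
}

/* A user fault: locks are held only while the charge is attempted. */
static int user_fault(struct task *t, struct memcg *cg, bool handled_elsewhere)
{
	int ret;

	t->may_oom = true;
	t->locks_held++;		/* e.g. mmap_sem */
	ret = charge(t, cg, handled_elsewhere);
	t->locks_held--;		/* fault stack fully unwound */
	t->may_oom = false;
	return ret;
}

/* Step 2, after unwinding: wait if needed, then clear the state. */
static bool oom_synchronize(struct task *t)
{
	struct memcg *cg = t->wait_on_memcg;

	if (!t->in_memcg_oom)
		return false;		/* not a memcg OOM */

	assert(t->locks_held == 0);	/* nothing held while waiting */
	if (cg) {
		cg->sleepers++;		/* stands in for the waitqueue sleep */
		cg->refs--;
		t->wait_on_memcg = NULL;
	}
	t->in_memcg_oom = false;
	return true;			/* caller restarts the fault */
}
```

The key property is the assertion in oom_synchronize(): by the time a
task would sleep, it no longer holds anything the OOM victim or a
userspace handler could block on.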
Johannes Weiner
2013-08-03 17:08:31 UTC
Hi azur,

here is the x86-only rollup of the series for 3.2.

Thanks!
Johannes
---

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..314fe53 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,30 +842,22 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
}

-static noinline int
+static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address);
- return 1;
+ if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+ up_read(&current->mm->mmap_sem);
+ no_context(regs, error_code, address);
+ return;
}
- if (!(fault & VM_FAULT_ERROR))
- return 0;

if (fault & VM_FAULT_OOM) {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & PF_USER)) {
up_read(&current->mm->mmap_sem);
no_context(regs, error_code, address);
- return 1;
+ return;
}

out_of_memory(regs, error_code, address);
@@ -876,7 +868,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
else
BUG();
}
- return 1;
}

static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1070,6 +1061,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+ flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1167,9 +1159,17 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, flags);

- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
- if (mm_fault_error(regs, error_code, address, fault))
- return;
+ /*
+ * If we need to retry but a fatal signal is pending, handle the
+ * signal first. We do not need to release the mmap_sem because it
+ * would already be released in __lock_page_or_retry in mm/filemap.c.
+ */
+ if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+ return;
+
+ if (unlikely(fault & VM_FAULT_ERROR)) {
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}

/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..b113c0f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,48 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

+/**
+ * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
+ * @new: true to enable, false to disable
+ *
+ * Toggle whether a failed memcg charge should invoke the OOM killer
+ * or just return -ENOMEM. Returns the previous toggle state.
+ *
+ * NOTE: Any path that enables the OOM killer before charging must
+ * call mem_cgroup_oom_synchronize() afterward to finalize the
+ * OOM handling and clean up.
+ */
+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+ bool old;
+
+ old = current->memcg_oom.may_oom;
+ current->memcg_oom.may_oom = new;
+
+ return old;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+ bool old = mem_cgroup_toggle_oom(true);
+
+ WARN_ON(old == true);
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+ bool old = mem_cgroup_toggle_oom(false);
+
+ WARN_ON(old == false);
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return p->memcg_oom.in_memcg_oom;
+}
+
+bool mem_cgroup_oom_synchronize(void);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -333,6 +375,29 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+ return false;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..846b82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */

/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..3f2562c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
#include <linux/latencytop.h>
#include <linux/cred.h>
#include <linux/llist.h>
+#include <linux/stacktrace.h>

#include <asm/processor.h>

@@ -1568,6 +1569,15 @@ struct task_struct {
unsigned long nr_pages; /* uncharged usage */
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
+ struct memcg_oom_info {
+ unsigned int may_oom:1;
+ unsigned int in_memcg_oom:1;
+ unsigned int oom_locked:1;
+ struct stack_trace trace;
+ unsigned long trace_entries[16];
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/filemap.c b/mm/filemap.c
index 5f0a3c9..030774a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1661,6 +1661,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
struct page *page;
+ bool memcg_oom;
pgoff_t size;
int ret = 0;

@@ -1669,7 +1670,11 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;

/*
- * Do we have something in the page cache already?
+ * Do we have something in the page cache already? Either
+ * way, try readahead, but disable the memcg OOM killer for it
+ * as readahead is optional and no errors are propagated up
+ * the fault stack. The OOM killer is enabled while trying to
+ * instantiate the faulting page individually below.
*/
page = find_get_page(mapping, offset);
if (likely(page)) {
@@ -1677,10 +1682,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* We found the page, so try async readahead before
* waiting for the lock.
*/
+ memcg_oom = mem_cgroup_toggle_oom(false);
do_async_mmap_readahead(vma, ra, file, page, offset);
+ mem_cgroup_toggle_oom(memcg_oom);
} else {
/* No page in the page cache at all */
+ memcg_oom = mem_cgroup_toggle_oom(false);
do_sync_mmap_readahead(vma, ra, file, offset);
+ mem_cgroup_toggle_oom(memcg_oom);
count_vm_event(PGMAJFAULT);
mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..83acd11 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/stacktrace.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -249,6 +250,7 @@ struct mem_cgroup {

bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;

atomic_t refcnt;

@@ -1743,16 +1745,19 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
return total;
}

+static DEFINE_SPINLOCK(memcg_oom_lock);
+
/*
* Check OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
- * Has to be called with memcg_oom_lock
*/
-static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
+static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter, *failed = NULL;
bool cond = true;

+ spin_lock(&memcg_oom_lock);
+
for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
if (iter->oom_lock) {
/*
@@ -1765,34 +1770,34 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
iter->oom_lock = true;
}

- if (!failed)
- return true;
-
- /*
- * OK, we failed to lock the whole subtree so we have to clean up
- * what we set up to the failing subtree
- */
- cond = true;
- for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
- if (iter == failed) {
- cond = false;
- continue;
+ if (failed) {
+ /*
+ * OK, we failed to lock the whole subtree so we have
+ * to clean up what we set up to the failing subtree
+ */
+ cond = true;
+ for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
+ if (iter == failed) {
+ cond = false;
+ continue;
+ }
+ iter->oom_lock = false;
}
- iter->oom_lock = false;
}
- return false;
+
+ spin_unlock(&memcg_oom_lock);
+
+ return !failed;
}

-/*
- * Has to be called with memcg_oom_lock
- */
-static int mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter;

+ spin_lock(&memcg_oom_lock);
for_each_mem_cgroup_tree(iter, memcg)
iter->oom_lock = false;
- return 0;
+ spin_unlock(&memcg_oom_lock);
}

static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
@@ -1816,7 +1821,6 @@ static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
atomic_add_unless(&iter->under_oom, -1, 0);
}

-static DEFINE_SPINLOCK(memcg_oom_lock);
static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);

struct oom_wait_info {
@@ -1846,6 +1850,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -1857,55 +1862,142 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}

/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked;
+ int wakeups;

- owait.mem = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ if (!current->memcg_oom.may_oom)
+ return;
+
+ current->memcg_oom.in_memcg_oom = 1;
+
+ current->memcg_oom.trace.nr_entries = 0;
+ current->memcg_oom.trace.max_entries = 16;
+ current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
+ current->memcg_oom.trace.skip = 1;
+ save_stack_trace(&current->memcg_oom.trace);

- /* At first, try to OOM lock hierarchy under memcg.*/
- spin_lock(&memcg_oom_lock);
- locked = mem_cgroup_oom_lock(memcg);
/*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
+ * As with any blocking lock, a contender needs to start
+ * listening for wakeups before attempting the trylock,
+ * otherwise it can miss the wakeup from the unlock and sleep
+ * indefinitely. This is just open-coded because our locking
+ * is so particular to memcg hierarchies.
*/
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- if (!locked || memcg->oom_kill_disable)
- need_to_kill = false;
+ wakeups = atomic_read(&memcg->oom_wakeups);
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
if (locked)
mem_cgroup_oom_notify(memcg);
- spin_unlock(&memcg_oom_lock);

- if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ if (locked && !memcg->oom_kill_disable) {
+ mem_cgroup_unmark_under_oom(memcg);
mem_cgroup_out_of_memory(memcg, mask);
+ mem_cgroup_oom_unlock(memcg);
+ /*
+ * There is no guarantee that an OOM-lock contender
+ * sees the wakeups triggered by the OOM kill
+ * uncharges. Wake any sleepers explicitly.
+ */
+ memcg_oom_recover(memcg);
} else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this
+ * is a page fault and somebody else is handling the
+ * OOM already, we need to sleep on the OOM waitqueue
+ * for this memcg until the situation is resolved.
+ * Which can take some time because it might be
+ * handled by a userspace task.
+ *
+ * However, this is the charge context, which means
+ * that we may sit on a large call stack and hold
+ * various filesystem locks, the mmap_sem etc. and we
+ * don't want the OOM handler to deadlock on them
+ * while we sit here and wait. Store the current OOM
+ * context in the task_struct, then return -ENOMEM.
+ * At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check
+ * back with us by calling
+ * mem_cgroup_oom_synchronize(), possibly putting the
+ * task to sleep.
+ */
+ current->memcg_oom.oom_locked = locked;
+ current->memcg_oom.wakeups = wakeups;
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
}
- spin_lock(&memcg_oom_lock);
- if (locked)
- mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+}

- mem_cgroup_unmark_under_oom(memcg);
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ *
+ * This has to be called at the end of a page fault if the memcg
+ * OOM handler was enabled and the fault is returning %VM_FAULT_OOM.
+ *
+ * Memcg supports userspace OOM handling, so failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to put the task to sleep and clean up the
+ * OOM state.
+ *
+ * Returns %true if an ongoing memcg OOM situation was detected and
+ * finalized, %false otherwise.
+ */
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;

- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_memcg;
+
+ owait.mem = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_memcg:
+ mem_cgroup_unmark_under_oom(memcg);
+ if (current->memcg_oom.oom_locked) {
+ mem_cgroup_oom_unlock(memcg);
+ /*
+ * There is no guarantee that an OOM-lock contender
+ * sees the wakeups triggered by the OOM kill
+ * uncharges. Wake any sleepers explicitly.
+ */
+ memcg_oom_recover(memcg);
+ }
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}

@@ -2195,11 +2287,10 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2348,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask);

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2349,7 +2436,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2357,13 +2444,7 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2376,16 +2457,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 829d437..cdbe41b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/stacktrace.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3439,22 +3440,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3503,6 +3496,40 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ /*
+ * Enable the memcg OOM handling for faults triggered in user
+ * space. Kernel faults are handled more gracefully.
+ */
+ if (flags & FAULT_FLAG_USER)
+ mem_cgroup_enable_oom();
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (flags & FAULT_FLAG_USER)
+ mem_cgroup_disable_oom();
+
+ if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) {
+ printk("Fixing unhandled memcg OOM context set up from:\n");
+ print_stack_trace(&current->memcg_oom.trace, 0);
+ mem_cgroup_oom_synchronize();
+ }
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..aa60863 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
*/
void pagefault_out_of_memory(void)
{
+ if (mem_cgroup_oom_synchronize())
+ return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
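
The lost-wakeup protection in mem_cgroup_oom() and
mem_cgroup_oom_synchronize() above boils down to snapshotting a wakeup
counter before the trylock and comparing it again right before sleeping.
A minimal single-threaded model of that idea follows; it assumes plain
ints instead of atomics and uses hypothetical names (oom_trylock(),
oom_unlock_and_wake(), should_sleep()), so it is an illustration only,
not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-memcg OOM lock and wakeup counter. */
struct memcg {
	bool oom_lock;
	int oom_wakeups;
};

static bool oom_trylock(struct memcg *cg)
{
	assert(cg != NULL);
	if (cg->oom_lock)
		return false;
	cg->oom_lock = true;
	return true;
}

/* Every unlock bumps the counter, so late contenders can notice it. */
static void oom_unlock_and_wake(struct memcg *cg)
{
	cg->oom_lock = false;
	cg->oom_wakeups++;
}

/*
 * A contender took 'snapshot' before its failed trylock. It may only
 * sleep if no wakeup has happened since then.
 */
static bool should_sleep(const struct memcg *cg, int snapshot)
{
	return cg->oom_wakeups == snapshot;
}
```

If the snapshot were taken after the failed trylock instead, an unlock
that slipped in between would go unnoticed and the contender could sleep
with nobody left to wake it.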

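The all-or-nothing behavior of mem_cgroup_oom_trylock() in the patch
above can be sketched as follows. The model is an assumption-laden
simplification: the subtree is a flat array in walk order, there is no
memcg_oom_lock spinlock because the example is single-threaded, and
struct group, subtree_trylock() and subtree_unlock() are hypothetical
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one memcg in the hierarchy. */
struct group {
	bool oom_lock;
};

/*
 * Try to mark every group in the subtree as OOM-locked. If one is
 * already locked, undo the marks set so far and report failure, leaving
 * the other owner's lock untouched.
 */
static bool subtree_trylock(struct group *nodes, size_t n)
{
	size_t i, failed = n;

	assert(nodes != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (nodes[i].oom_lock) {
			failed = i;	/* somebody else owns this part */
			break;
		}
		nodes[i].oom_lock = true;
	}

	if (failed == n)
		return true;

	/* Roll back only what this attempt set, up to the failing group. */
	for (i = 0; i < failed; i++)
		nodes[i].oom_lock = false;

	return false;
}

static void subtree_unlock(struct group *nodes, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		nodes[i].oom_lock = false;
}
```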
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
azurIt
2013-08-09 09:06:11 UTC
>Hi azur,
>
>here is the x86-only rollup of the series for 3.2.
>
>Thanks!
>Johannes



Hi Johannes,

I have been running a kernel with this new patch for 1 day now without any problems! I will report back in a few weeks or months, or if any problems occur. Thank you!

azur
azurIt
2013-08-30 19:58:52 UTC
>Hi azur,
>
>here is the x86-only rollup of the series for 3.2.
>
>Thanks!
>Johannes
>---


Johannes,

unfortunately, one problem has come up: I (again) have a cgroup which cannot be deleted :( It belongs to a user who had very high memory usage and was reaching his limit very often. Do you need any info that I can gather now?

azur

Johannes Weiner
2013-09-03 20:48:50 UTC
Hello azur,

On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
> >>Hi azur,
> >>
> >>here is the x86-only rollup of the series for 3.2.
> >>
> >>Thanks!
> >>Johannes
> >>---
> >
> >
> >Johannes,
> >
> >unfortunately, one problem has come up: I (again) have a cgroup which cannot be deleted :( It belongs to a user who had very high memory usage and was reaching his limit very often. Do you need any info that I can gather now?

Did the OOM killer go off in this group?

Was there a warning in the syslog ("Fixing unhandled memcg OOM
context")?

If it happens again, could you check if there are tasks left in the
cgroup? And provide /proc/<pid>/stack of the hung task trying to
delete the cgroup?

> Now I can definitely confirm that the problem is NOT fixed :( It happened again, but I don't have any data because I had already disabled all debug output.

Which debug output?

Do you still have access to the syslog?

It's possible that, as your system does not deadlock on the OOMing
cgroup anymore, you hit a separate bug...

Thanks!

azurIt
2013-09-04 07:53:51 UTC
>On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >>Hi azur,
>> >>
>> >>here is the x86-only rollup of the series for 3.2.
>> >>
>> >>Thanks!
>> >>Johannes
>> >>---
>> >
>> >
>> >Johannes,
>> >
>> >unfortunately, one problem has come up: I (again) have a cgroup which cannot be deleted :( It belongs to a user who had very high memory usage and was reaching his limit very often. Do you need any info that I can gather now?
>
>Did the OOM killer go off in this group?
>



# cat /cgroups/cannot_rm_01/memory.oom_control
oom_kill_disable 0
under_oom 1
#




>Was there a warning in the syslog ("Fixing unhandled memcg OOM
>context")?



I really don't know, because I don't know the exact day when it happened. I only found out on 30.8., but it could have happened any time before that. Uptime on that server is 27 days, so maybe I can grep all the syslog logs I have if it helps. I just need to find out the original name of that cgroup, because I renamed it to 'cannot_rm_01' so that my software will ignore it.



>If it happens again, could you check if there are tasks left in the
>cgroup? And provide /proc/<pid>/stack of the hung task trying to
>delete the cgroup?



# cat /cgroups/cannot_rm_01/tasks
#



>> Now I can definitely confirm that the problem is NOT fixed :( It happened again, but I don't have any data because I had already disabled all debug output.
>
>Which debug output?



Debug output from my own scripts, which are supposed to handle this situation and kill frozen processes. I have already reactivated it; it grabs the content of 'stacks' from all processes before killing them.



>Do you still have access to the syslog?



From that day (30.8.)? Yes.


>It's possible that, as your system does not deadlock on the OOMing
>cgroup anymore, you hit a separate bug...
>
>Thanks!
>

azurIt
2013-09-04 08:18:52 UTC
>Hello azur,
>
>On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >>Hi azur,
>> >>
>> >>here is the x86-only rollup of the series for 3.2.
>> >>
>> >>Thanks!
>> >>Johannes
>> >>---
>> >
>> >
>> >Johannes,
>> >
>> >unfortunately, one problem has come up: I (again) have a cgroup which cannot be deleted :( It belongs to a user who had very high memory usage and was reaching his limit very often. Do you need any info that I can gather now?
>
>Did the OOM killer go off in this group?
>
>Was there a warning in the syslog ("Fixing unhandled memcg OOM
>context")?



OK, I see this message several times in my syslog logs, and one of them is for this unremovable cgroup (but maybe none of those cgroups can be removed, should I try?). An example of the log is here (I don't know exactly where it starts and ends, so here is the full kernel log):
http://watchdog.sk/lkml/oom_syslog.gz

azur

Johannes Weiner
2013-09-05 11:54:30 UTC
Hi azur,

On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
> >Hello azur,
> >
> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
> >> >>Hi azur,
> >> >>
> >> >>here is the x86-only rollup of the series for 3.2.
> >> >>
> >> >>Thanks!
> >> >>Johannes
> >> >>---
> >> >
> >> >
> >> >Johannes,
> >> >
> >> >unfortunately, one problem has come up: I (again) have a cgroup which cannot be deleted :( It belongs to a user who had very high memory usage and was reaching his limit very often. Do you need any info that I can gather now?
> >
> >Did the OOM killer go off in this group?
> >
> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
> >context")?
>
>
>
> OK, I see this message several times in my syslog logs, and one of them is for this unremovable cgroup (but maybe none of those cgroups can be removed, should I try?). An example of the log is here (I don't know exactly where it starts and ends, so here is the full kernel log):
> http://watchdog.sk/lkml/oom_syslog.gz

There is an unfinished OOM invocation here:

Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940

__getblk() has this weird loop where it tries to instantiate the page,
frees memory on failure, then retries. If the memcg goes OOM, the OOM
path might be entered multiple times and each time leak the memcg
reference of the respective previous OOM invocation.
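
A minimal userspace sketch of that leak, for illustration only. This is not kernel code: the `model_*` names are made up, and the refcount increments and decrements merely stand in for taking and dropping a memcg reference. It assumes every entry into the OOM path pins the memcg and overwrites the task's pending context, while only one cleanup runs at the end of the fault.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real structs. */
struct model_memcg {
	int refcount;
};

struct model_task {
	struct model_memcg *wait_on_memcg;	/* pending OOM context, if any */
};

/* Charge failure path: pin the memcg and record it in the task. */
static void model_enter_oom(struct model_task *t, struct model_memcg *m)
{
	m->refcount++;			/* models taking a memcg reference */
	t->wait_on_memcg = m;		/* overwrites any earlier context */
}

/* End-of-fault cleanup: drops exactly one reference. */
static void model_finish_oom(struct model_task *t)
{
	if (!t->wait_on_memcg)
		return;
	t->wait_on_memcg->refcount--;	/* models dropping the reference */
	t->wait_on_memcg = NULL;
}

/*
 * A caller that retries a failed charge, as __getblk() does, enters the
 * OOM path once per attempt. Returns the references still pinned after
 * the single cleanup that runs at the end of the fault.
 */
static int model_leaked_refs(int attempts)
{
	struct model_memcg memcg = { .refcount = 0 };
	struct model_task task = { .wait_on_memcg = NULL };
	int i;

	for (i = 0; i < attempts; i++)
		model_enter_oom(&task, &memcg);
	model_finish_oom(&task);
	return memcg.refcount;
}
```

In this model a single OOM entry is cleaned up completely, while every additional retry leaves one reference behind, which is consistent with a cgroup that can no longer be removed.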

There are a few more find_or_create() sites that do not propagate an
error and it's incredibly hard to find out whether they are even taken
during a page fault. It's not practical to annotate them all with
memcg OOM toggles, so let's just catch all OOM contexts at the end of
handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
this like an error.
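
The catch-all can be sketched in userspace as follows. This is a simplified model, not the patch itself: `MODEL_VM_FAULT_OOM`, `struct model_oom_state` and the `model_*` functions are hypothetical, and the model never sleeps or kills anything, it only tracks whether a pending context gets dropped.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_VM_FAULT_OOM 0x1	/* hypothetical flag value for the model */

struct model_oom_state {
	bool in_memcg_oom;	/* set by a failed charge somewhere below */
	int memcg_refs;		/* references pinned by that OOM context */
};

/* Drop the pending context; with wait == false nothing sleeps or kills. */
static bool model_oom_synchronize(struct model_oom_state *s, bool wait)
{
	if (!s->in_memcg_oom)
		return false;
	(void)wait;		/* the model never blocks in either case */
	s->memcg_refs = 0;
	s->in_memcg_oom = false;
	return true;
}

/* Tail of the fault handler: catch contexts nobody reported upward. */
static int model_fault_exit(struct model_oom_state *s, int ret)
{
	if (s->in_memcg_oom && !(ret & MODEL_VM_FAULT_OOM))
		model_oom_synchronize(s, false);
	return ret;
}
```

A fault that returns without the OOM bit has its context cleared quietly here; a fault that does return the OOM bit keeps its context, so the regular out-of-memory path can still find and finalize it.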

azur, here is a patch on top of your modified 3.2. Note that Michal
might be onto something and we are looking at multiple issues here,
but the log excerpt above suggests this fix is required either way.

---
From: Johannes Weiner <***@cmpxchg.org>
Subject: [patch] mm: memcg: handle non-error OOM situations more gracefully

Many places that can trigger a memcg OOM situation return gracefully
and don't propagate VM_FAULT_OOM up the fault stack.

It's not practical to annotate all of them to disable the memcg OOM
killer. Instead, just clean up any set OOM state without warning in
case the fault is not returning VM_FAULT_OOM.

Also fail charges immediately when the current task already is in an
OOM context. Otherwise, the previous context gets overwritten and the
memcg reference is leaked.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
include/linux/memcontrol.h | 40 ++++++----------------------------------
include/linux/sched.h | 3 ---
mm/filemap.c | 11 +----------
mm/memcontrol.c | 15 ++++++++-------
mm/memory.c | 8 ++------
mm/oom_kill.c | 2 +-
6 files changed, 18 insertions(+), 61 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b113c0f..7c43903 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,39 +120,16 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

-/**
- * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
- * @new: true to enable, false to disable
- *
- * Toggle whether a failed memcg charge should invoke the OOM killer
- * or just return -ENOMEM. Returns the previous toggle state.
- *
- * NOTE: Any path that enables the OOM killer before charging must
- * call mem_cgroup_oom_synchronize() afterward to finalize the
- * OOM handling and clean up.
- */
-static inline bool mem_cgroup_toggle_oom(bool new)
-{
- bool old;
-
- old = current->memcg_oom.may_oom;
- current->memcg_oom.may_oom = new;
-
- return old;
-}
-
static inline void mem_cgroup_enable_oom(void)
{
- bool old = mem_cgroup_toggle_oom(true);
-
- WARN_ON(old == true);
+ WARN_ON(current->memcg_oom.may_oom);
+ current->memcg_oom.may_oom = true;
}

static inline void mem_cgroup_disable_oom(void)
{
- bool old = mem_cgroup_toggle_oom(false);
-
- WARN_ON(old == false);
+ WARN_ON(!current->memcg_oom.may_oom);
+ current->memcg_oom.may_oom = false;
}

static inline bool task_in_memcg_oom(struct task_struct *p)
@@ -160,7 +137,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
return p->memcg_oom.in_memcg_oom;
}

-bool mem_cgroup_oom_synchronize(void);
+bool mem_cgroup_oom_synchronize(bool wait);

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
@@ -375,11 +352,6 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

-static inline bool mem_cgroup_toggle_oom(bool new)
-{
- return false;
-}
-
static inline void mem_cgroup_enable_oom(void)
{
}
@@ -393,7 +365,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
return false;
}

-static inline bool mem_cgroup_oom_synchronize(void)
+static inline bool mem_cgroup_oom_synchronize(bool wait)
{
return false;
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3f2562c..70a62fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,7 +91,6 @@ struct sched_param {
#include <linux/latencytop.h>
#include <linux/cred.h>
#include <linux/llist.h>
-#include <linux/stacktrace.h>

#include <asm/processor.h>

@@ -1573,8 +1572,6 @@ struct task_struct {
unsigned int may_oom:1;
unsigned int in_memcg_oom:1;
unsigned int oom_locked:1;
- struct stack_trace trace;
- unsigned long trace_entries[16];
int wakeups;
struct mem_cgroup *wait_on_memcg;
} memcg_oom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 030774a..5f0a3c9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1661,7 +1661,6 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
struct page *page;
- bool memcg_oom;
pgoff_t size;
int ret = 0;

@@ -1670,11 +1669,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;

/*
- * Do we have something in the page cache already? Either
- * way, try readahead, but disable the memcg OOM killer for it
- * as readahead is optional and no errors are propagated up
- * the fault stack. The OOM killer is enabled while trying to
- * instantiate the faulting page individually below.
+ * Do we have something in the page cache already?
*/
page = find_get_page(mapping, offset);
if (likely(page)) {
@@ -1682,14 +1677,10 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* We found the page, so try async readahead before
* waiting for the lock.
*/
- memcg_oom = mem_cgroup_toggle_oom(false);
do_async_mmap_readahead(vma, ra, file, page, offset);
- mem_cgroup_toggle_oom(memcg_oom);
} else {
/* No page in the page cache at all */
- memcg_oom = mem_cgroup_toggle_oom(false);
do_sync_mmap_readahead(vma, ra, file, offset);
- mem_cgroup_toggle_oom(memcg_oom);
count_vm_event(PGMAJFAULT);
mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 83acd11..ebd07f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1874,12 +1874,6 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)

current->memcg_oom.in_memcg_oom = 1;

- current->memcg_oom.trace.nr_entries = 0;
- current->memcg_oom.trace.max_entries = 16;
- current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
- current->memcg_oom.trace.skip = 1;
- save_stack_trace(&current->memcg_oom.trace);
-
/*
* As with any blocking lock, a contender needs to start
* listening for wakeups before attempting the trylock,
@@ -1935,6 +1929,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)

/**
* mem_cgroup_oom_synchronize - complete memcg OOM handling
+ * @wait: wait for OOM handler or just clear the OOM state
*
* This has to be called at the end of a page fault if the the memcg
* OOM handler was enabled and the fault is returning %VM_FAULT_OOM.
@@ -1950,7 +1945,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
* Returns %true if an ongoing memcg OOM situation was detected and
* finalized, %false otherwise.
*/
-bool mem_cgroup_oom_synchronize(void)
+bool mem_cgroup_oom_synchronize(bool wait)
{
struct oom_wait_info owait;
struct mem_cgroup *memcg;
@@ -1969,6 +1964,9 @@ bool mem_cgroup_oom_synchronize(void)
if (!memcg)
goto out;

+ if (!wait)
+ goto out_memcg;
+
if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
goto out_memcg;

@@ -2369,6 +2367,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
struct mem_cgroup *memcg = NULL;
int ret;

+ if (unlikely(current->memcg_oom.in_memcg_oom))
+ goto nomem;
+
/*
* Unlike gloval-vm's OOM-kill, we're not in memory shortage
* in system level. So, allow to go ahead dying process in addition to
diff --git a/mm/memory.c b/mm/memory.c
index cdbe41b..cdad471 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,7 +57,6 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
-#include <linux/stacktrace.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3521,11 +3520,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_disable_oom();

- if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) {
- printk("Fixing unhandled memcg OOM context set up from:\n");
- print_stack_trace(&current->memcg_oom.trace, 0);
- mem_cgroup_oom_synchronize();
- }
+ if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
+ mem_cgroup_oom_synchronize(false);

return ret;
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index aa60863..3bf664c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,7 +785,7 @@ out:
*/
void pagefault_out_of_memory(void)
{
- if (mem_cgroup_oom_synchronize())
+ if (mem_cgroup_oom_synchronize(true))
return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Michal Hocko
2013-09-05 12:43:52 UTC
On Thu 05-09-13 07:54:30, Johannes Weiner wrote:
> On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
> > Ok, i see this message several times in my syslog logs, one of them
> > is also for this unremovable cgroup (but maybe all of them cannot
> > be removed, should i try?). Example of the log is here (don't know
> > where exactly it starts and ends so here is the full kernel log):
> > http://watchdog.sk/lkml/oom_syslog.gz
>
> There is an unfinished OOM invocation here:
>
> Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
> Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
> Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
> Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
> Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
> Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
> Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
> Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
> Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
> Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
> Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
> Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
> Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
> Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
> Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
> Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
> Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
>
> __getblk() has this weird loop where it tries to instantiate the page,
> frees memory on failure, then retries. If the memcg goes OOM, the OOM
> path might be entered multiple times and each time leak the memcg
> reference of the respective previous OOM invocation.

Very well spotted, Johannes!

> There are a few more find_or_create() sites that do not propagate an
> error and it's incredibly hard to find out whether they are even taken
> during a page fault. It's not practical to annotate them all with
> memcg OOM toggles, so let's just catch all OOM contexts at the end of
> handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
> this like an error.
>
> azur, here is a patch on top of your modified 3.2. Note that Michal
> might be onto something and we are looking at multiple issues here,
> but the log excerpt above suggests this fix is required either way.
>
> ---
> From: Johannes Weiner <***@cmpxchg.org>
> Subject: [patch] mm: memcg: handle non-error OOM situations more gracefully
>
> Many places that can trigger a memcg OOM situation return gracefully
> and don't propagate VM_FAULT_OOM up the fault stack.
>
> It's not practical to annotate all of them to disable the memcg OOM
> killer. Instead, just clean up any set OOM state without warning in
> case the fault is not returning VM_FAULT_OOM.
>
> Also fail charges immediately when the current task already is in an
> OOM context. Otherwise, the previous context gets overwritten and the
> memcg reference is leaked.

This is getting way trickier than I expected and hoped for. The
above should work, although I cannot say I love it. I am afraid we do not
have many choices left without polluting every single place which
can charge, though :/

> Signed-off-by: Johannes Weiner <***@cmpxchg.org>

I guess this should be correct but I have to think about it some more.

Two minor comments below.

> ---
> include/linux/memcontrol.h | 40 ++++++----------------------------------
> include/linux/sched.h | 3 ---
> mm/filemap.c | 11 +----------
> mm/memcontrol.c | 15 ++++++++-------
> mm/memory.c | 8 ++------
> mm/oom_kill.c | 2 +-
> 6 files changed, 18 insertions(+), 61 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b113c0f..7c43903 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -120,39 +120,16 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> struct task_struct *p);
>
> -/**
> - * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
> - * @new: true to enable, false to disable
> - *
> - * Toggle whether a failed memcg charge should invoke the OOM killer
> - * or just return -ENOMEM. Returns the previous toggle state.
> - *
> - * NOTE: Any path that enables the OOM killer before charging must
> - * call mem_cgroup_oom_synchronize() afterward to finalize the
> - * OOM handling and clean up.
> - */
> -static inline bool mem_cgroup_toggle_oom(bool new)
> -{
> - bool old;
> -
> - old = current->memcg_oom.may_oom;
> - current->memcg_oom.may_oom = new;
> -
> - return old;
> -}

I will not miss this guy.

[...]
> diff --git a/mm/memory.c b/mm/memory.c
> index cdbe41b..cdad471 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -57,7 +57,6 @@
> #include <linux/swapops.h>
> #include <linux/elf.h>
> #include <linux/gfp.h>
> -#include <linux/stacktrace.h>
>
> #include <asm/io.h>
> #include <asm/pgalloc.h>
> @@ -3521,11 +3520,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (flags & FAULT_FLAG_USER)
> mem_cgroup_disable_oom();
>
> - if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) {
> - printk("Fixing unhandled memcg OOM context set up from:\n");
> - print_stack_trace(&current->memcg_oom.trace, 0);
> - mem_cgroup_oom_synchronize();
> - }
> + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> + mem_cgroup_oom_synchronize(false);

This deserves a fat comment /me thinks

[...]
--
Michal Hocko
SUSE Labs

Johannes Weiner
2013-09-05 16:18:17 UTC
On Thu, Sep 05, 2013 at 02:43:52PM +0200, Michal Hocko wrote:
> On Thu 05-09-13 07:54:30, Johannes Weiner wrote:
> > There are a few more find_or_create() sites that do not propagate an
> > error and it's incredibly hard to find out whether they are even taken
> > during a page fault. It's not practical to annotate them all with
> > memcg OOM toggles, so let's just catch all OOM contexts at the end of
> > handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
> > this like an error.
> >
> > azur, here is a patch on top of your modified 3.2. Note that Michal
> > might be onto something and we are looking at multiple issues here,
> > but the log excerpt above suggests this fix is required either way.
> >
> > ---
> > From: Johannes Weiner <***@cmpxchg.org>
> > Subject: [patch] mm: memcg: handle non-error OOM situations more gracefully
> >
> > Many places that can trigger a memcg OOM situation return gracefully
> > and don't propagate VM_FAULT_OOM up the fault stack.
> >
> > It's not practical to annotate all of them to disable the memcg OOM
> > killer. Instead, just clean up any set OOM state without warning in
> > case the fault is not returning VM_FAULT_OOM.
> >
> > Also fail charges immediately when the current task already is in an
> > OOM context. Otherwise, the previous context gets overwritten and the
> > memcg reference is leaked.
>
> This is getting way trickier than I expected and hoped for. The
> above should work, although I cannot say I love it. I am afraid we do not
> have many choices left without polluting every single place which
> can charge, though :/

I thought it was less tricky, actually, since we don't need to mess
around with the selective nested OOM toggling anymore.

Thinking more about it, the whole thing can be made even simpler.

The series currently keeps the locking & killing in the direct charge
path and then only waits in the synchronize path, which requires quite
a bit of state communication. Wouldn't it be simpler to just do
everything in the unwind path? We would only have to remember the
memcg and the gfp_mask, and then the synchronize path would decide
whether to kill, wait, or just clean up (!VM_FAULT_OOM case). This
would also have the benefit that we really don't invoke the OOM killer
when the fault is overall successful. I'm attaching a draft below.
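
A rough userspace model of that split, for illustration only. The names are hypothetical (`model_*`, `got_oom_lock`), and the decision order is an assumption based on the description above: clean up when the fault did not fail with OOM, kill when this task holds the OOM lock and kernel OOM killing is enabled for the group, otherwise wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_action { MODEL_NONE, MODEL_CLEANUP, MODEL_KILL, MODEL_WAIT };

struct model_memcg {
	bool oom_kill_disable;	/* userspace handles OOM for this group */
};

struct model_task_oom {
	struct model_memcg *memcg;	/* remembered by the failed charge */
	unsigned int gfp_mask;		/* allocation flags of that charge */
};

/* Charge context: record the OOM, never block here. */
static void model_remember_oom(struct model_task_oom *t,
			       struct model_memcg *m, unsigned int gfp_mask)
{
	t->memcg = m;
	t->gfp_mask = gfp_mask;
}

/*
 * Unwind path: decide what to do once the fault stack is gone.
 * got_oom_lock is a stand-in for winning the memcg OOM lock.
 */
static enum model_action model_synchronize(struct model_task_oom *t,
					   bool handle, bool got_oom_lock)
{
	enum model_action action;

	if (!t->memcg)
		return MODEL_NONE;
	if (!handle)
		action = MODEL_CLEANUP;
	else if (got_oom_lock && !t->memcg->oom_kill_disable)
		action = MODEL_KILL;
	else
		action = MODEL_WAIT;
	t->memcg = NULL;	/* the context is consumed in every case */
	return action;
}
```

The only state carried from the charge path to the unwind path in this model is the memcg pointer and the gfp mask; everything else is decided after the locks of the fault stack have been released.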

> > include/linux/memcontrol.h | 40 ++++++----------------------------------
> > include/linux/sched.h | 3 ---
> > mm/filemap.c | 11 +----------
> > mm/memcontrol.c | 15 ++++++++-------
> > mm/memory.c | 8 ++------
> > mm/oom_kill.c | 2 +-
> > 6 files changed, 18 insertions(+), 61 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index b113c0f..7c43903 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -120,39 +120,16 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> > struct task_struct *p);
> >
> > -/**
> > - * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
> > - * @new: true to enable, false to disable
> > - *
> > - * Toggle whether a failed memcg charge should invoke the OOM killer
> > - * or just return -ENOMEM. Returns the previous toggle state.
> > - *
> > - * NOTE: Any path that enables the OOM killer before charging must
> > - * call mem_cgroup_oom_synchronize() afterward to finalize the
> > - * OOM handling and clean up.
> > - */
> > -static inline bool mem_cgroup_toggle_oom(bool new)
> > -{
> > - bool old;
> > -
> > - old = current->memcg_oom.may_oom;
> > - current->memcg_oom.may_oom = new;
> > -
> > - return old;
> > -}
>
> I will not miss this guy.

Me neither!

> > diff --git a/mm/memory.c b/mm/memory.c
> > index cdbe41b..cdad471 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -57,7 +57,6 @@
> > #include <linux/swapops.h>
> > #include <linux/elf.h>
> > #include <linux/gfp.h>
> > -#include <linux/stacktrace.h>
> >
> > #include <asm/io.h>
> > #include <asm/pgalloc.h>
> > @@ -3521,11 +3520,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > if (flags & FAULT_FLAG_USER)
> > mem_cgroup_disable_oom();
> >
> > - if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) {
> > - printk("Fixing unhandled memcg OOM context set up from:\n");
> > - print_stack_trace(&current->memcg_oom.trace, 0);
> > - mem_cgroup_oom_synchronize();
> > - }
> > + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> > + mem_cgroup_oom_synchronize(false);
>
> This deserves a fat comment /me thinks

Yes. As per your other email, I also folded it into the
FAULT_FLAG_USER branch.

Here is an updated draft. Things changed slightly throughout the
series, so I'm sending a complete replacement of the last patch. It's
a much simpler change at this point, IMO, and keeps the (cleaned up)
OOM handling code as it was (mem_cgroup_handle_oom is basically just
renamed to mem_cgroup_oom_synchronize).

---
From: Johannes Weiner <***@cmpxchg.org>
Subject: [patch] mm: memcg: do not trap chargers with full callstack on OOM

The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex. The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations. In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit. But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example, one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/<pid>,
which tries to acquire the same mmap_sem for reading and deadlocks.

To fix this, never do any OOM handling directly in the charge context.
When an OOM situation is detected, let the task remember the memcg and
then handle the OOM (kill or wait) only after the page fault stack is
unwound and about to return to userspace.

Reported-by: azurIt <***@pobox.sk>
Debugged-by: Michal Hocko <***@suse.cz>
Not-yet-Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
include/linux/memcontrol.h | 17 ++++++++
include/linux/sched.h | 2 +
mm/memcontrol.c | 96 +++++++++++++++++++++++++++++++---------------
mm/memory.c | 11 +++++-
mm/oom_kill.c | 2 +
5 files changed, 96 insertions(+), 32 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b344b3a..325da07 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -132,6 +132,13 @@ static inline void mem_cgroup_disable_oom(void)
current->memcg_oom.may_oom = 0;
}

+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return p->memcg_oom.memcg;
+}
+
+bool mem_cgroup_oom_synchronize(bool wait);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -353,6 +360,16 @@ static inline void mem_cgroup_disable_oom(void)
{
}

+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(bool wait)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21834a9..fb1f145 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1569,6 +1569,8 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
struct memcg_oom_info {
+ struct mem_cgroup *memcg;
+ gfp_t gfp_mask;
unsigned int may_oom:1;
} memcg_oom;
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36bb58f..56643fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1858,14 +1858,59 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
memcg_wakeup_oom(memcg);
}

-/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
+{
+ if (!current->memcg_oom.may_oom)
+ return;
+ /*
+ * We are in the middle of the charge context here, so we
+ * don't want to block when potentially sitting on a callstack
+ * that holds all kinds of filesystem and mm locks.
+ *
+ * Also, the caller may handle a failed allocation gracefully
+ * (like optional page cache readahead) and so an OOM killer
+ * invocation might not even be necessary.
+ *
+ * That's why we don't do anything here except remember the
+ * OOM context and then deal with it at the end of the page
+ * fault when the stack is unwound, the locks are released,
+ * and when we know whether the fault was overall successful.
+ */
+ css_get(&memcg->css);
+ current->memcg_oom.memcg = memcg;
+ current->memcg_oom.gfp_mask = mask;
+}
+
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ * @handle: actually kill/wait or just clean up the OOM state
+ *
+ * This has to be called at the end of a page fault if the memcg OOM
+ * handler was enabled.
+ *
+ * Memcg supports userspace OOM handling where failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to complete the OOM handling.
+ *
+ * Returns %true if an ongoing memcg OOM situation was detected and
+ * completed, %false otherwise.
*/
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+bool mem_cgroup_oom_synchronize(bool handle)
{
+ struct mem_cgroup *memcg = current->memcg_oom.memcg;
struct oom_wait_info owait;
bool locked;

+ /* OOM is global, do not handle */
+ if (!memcg)
+ return false;
+
+ if (!handle)
+ goto cleanup;
+
owait.mem = memcg;
owait.wait.flags = 0;
owait.wait.func = memcg_oom_wake_function;
@@ -1894,7 +1939,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
if (locked && !memcg->oom_kill_disable) {
mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
- mem_cgroup_out_of_memory(memcg, mask);
+ mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask);
} else {
schedule();
mem_cgroup_unmark_under_oom(memcg);
@@ -1910,11 +1955,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
*/
memcg_oom_recover(memcg);
}
-
- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
- return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+cleanup:
+ current->memcg_oom.memcg = NULL;
+ css_put(&memcg->css);
return true;
}

@@ -2204,11 +2247,10 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2266,14 +2308,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check || !current->memcg_oom.may_oom)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask);

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2301,6 +2339,12 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto bypass;

/*
+ * Task already OOMed, just get out of here.
+ */
+ if (unlikely(current->memcg_oom.memcg))
+ goto nomem;
+
+ /*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
@@ -2358,7 +2402,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2366,13 +2410,7 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2385,16 +2423,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 7b66056..20c43a0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3517,8 +3517,17 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,

ret = __handle_mm_fault(mm, vma, address, flags);

- if (flags & FAULT_FLAG_USER)
+ if (flags & FAULT_FLAG_USER) {
mem_cgroup_disable_oom();
+ /*
+ * The task may have entered a memcg OOM situation but
+ * if the allocation error was handled gracefully (no
+ * VM_FAULT_OOM), there is no need to kill anything.
+ * Just clean up the OOM state peacefully.
+ */
+ if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
+ mem_cgroup_oom_synchronize(false);
+ }

return ret;
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..3bf664c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
*/
void pagefault_out_of_memory(void)
{
+ if (mem_cgroup_oom_synchronize(true))
+ return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
--
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Michal Hocko
2013-09-09 12:36:25 UTC
On Thu 05-09-13 12:18:17, Johannes Weiner wrote:
[...]
> From: Johannes Weiner <***@cmpxchg.org>
> Subject: [patch] mm: memcg: do not trap chargers with full callstack on OOM
>
[...]
>
> To fix this, never do any OOM handling directly in the charge context.
> When an OOM situation is detected, let the task remember the memcg and
> then handle the OOM (kill or wait) only after the page fault stack is
> unwound and about to return to userspace.

OK, this is indeed nicer because the oom setup is trivial and the
handling is not split into two parts and everything happens close to
out_of_memory where it is expected.
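
For readers following along, the two-step scheme can be sketched as a
small userspace model. This is only an illustration under stated
assumptions: struct memcg_model, struct task_model and the model_*()
functions are hypothetical stand-ins, not kernel code. The charge
context only records the OOM and takes a reference; the kill-or-wait
(or the silent cleanup) happens once the fault stack is unwound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, not the real layout. */
struct memcg_model {
	int refcount;	/* models the css reference count */
	int kills;	/* how often the kill-or-wait step ran */
};

struct task_model {
	struct memcg_model *oom_memcg;	/* models current->memcg_oom.memcg */
	bool may_oom;			/* models current->memcg_oom.may_oom */
};

/* Step 1, charge context: only remember the OOM, never block or kill. */
static void model_oom_record(struct task_model *t, struct memcg_model *m)
{
	if (!t->may_oom)
		return;
	m->refcount++;		/* css_get() in the patch */
	t->oom_memcg = m;
}

/* Step 2, stack unwound: kill/wait if @handle, otherwise only clean up. */
static bool model_oom_synchronize(struct task_model *t, bool handle)
{
	struct memcg_model *m = t->oom_memcg;

	if (!m)
		return false;	/* no memcg OOM pending for this task */
	if (handle)
		m->kills++;	/* stands in for the kill-or-wait logic */
	t->oom_memcg = NULL;
	assert(m->refcount > 0);
	m->refcount--;		/* css_put() in the patch */
	return true;
}

/*
 * End of a user fault, folding handle_mm_fault() and
 * pagefault_out_of_memory() into one function for brevity.
 */
static bool model_fault_end(struct task_model *t, bool fault_oom)
{
	t->may_oom = false;	/* mem_cgroup_disable_oom() */
	if (t->oom_memcg && !fault_oom)
		return model_oom_synchronize(t, false);
	if (fault_oom)
		return model_oom_synchronize(t, true);
	return false;
}
```

In this model a fault that records an OOM but ends without
VM_FAULT_OOM drops the reference without any kill, while a fault that
does end in VM_FAULT_OOM runs the kill-or-wait step exactly once.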

> Reported-by: azurIt <***@pobox.sk>
> Debugged-by: Michal Hocko <***@suse.cz>
> Not-yet-Signed-off-by: Johannes Weiner <***@cmpxchg.org>

Acked-by: Michal Hocko <***@suse.cz>

Thanks!

> ---
> include/linux/memcontrol.h | 17 ++++++++
> include/linux/sched.h | 2 +
> mm/memcontrol.c | 96 +++++++++++++++++++++++++++++++---------------
> mm/memory.c | 11 +++++-
> mm/oom_kill.c | 2 +
> 5 files changed, 96 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b344b3a..325da07 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -132,6 +132,13 @@ static inline void mem_cgroup_disable_oom(void)
> current->memcg_oom.may_oom = 0;
> }
>
> +static inline bool task_in_memcg_oom(struct task_struct *p)
> +{
> + return p->memcg_oom.memcg;
> +}
> +
> +bool mem_cgroup_oom_synchronize(bool wait);
> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> extern int do_swap_account;
> #endif
> @@ -353,6 +360,16 @@ static inline void mem_cgroup_disable_oom(void)
> {
> }
>
> +static inline bool task_in_memcg_oom(struct task_struct *p)
> +{
> + return false;
> +}
> +
> +static inline bool mem_cgroup_oom_synchronize(bool wait)
> +{
> + return false;
> +}
> +
> static inline void mem_cgroup_inc_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx)
> {
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 21834a9..fb1f145 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1569,6 +1569,8 @@ struct task_struct {
> unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
> } memcg_batch;
> struct memcg_oom_info {
> + struct mem_cgroup *memcg;
> + gfp_t gfp_mask;
> unsigned int may_oom:1;
> } memcg_oom;
> #endif
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36bb58f..56643fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1858,14 +1858,59 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> memcg_wakeup_oom(memcg);
> }
>
> -/*
> - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
> +{
> + if (!current->memcg_oom.may_oom)
> + return;
> + /*
> + * We are in the middle of the charge context here, so we
> + * don't want to block when potentially sitting on a callstack
> + * that holds all kinds of filesystem and mm locks.
> + *
> + * Also, the caller may handle a failed allocation gracefully
> + * (like optional page cache readahead) and so an OOM killer
> + * invocation might not even be necessary.
> + *
> + * That's why we don't do anything here except remember the
> + * OOM context and then deal with it at the end of the page
> + * fault when the stack is unwound, the locks are released,
> + * and when we know whether the fault was overall successful.
> + */
> + css_get(&memcg->css);
> + current->memcg_oom.memcg = memcg;
> + current->memcg_oom.gfp_mask = mask;
> +}
> +
> +/**
> + * mem_cgroup_oom_synchronize - complete memcg OOM handling
> + * @handle: actually kill/wait or just clean up the OOM state
> + *
> + * This has to be called at the end of a page fault if the memcg OOM
> + * handler was enabled.
> + *
> + * Memcg supports userspace OOM handling where failed allocations must
> + * sleep on a waitqueue until the userspace task resolves the
> + * situation. Sleeping directly in the charge context with all kinds
> + * of locks held is not a good idea, instead we remember an OOM state
> + * in the task and mem_cgroup_oom_synchronize() has to be called at
> + * the end of the page fault to complete the OOM handling.
> + *
> + * Returns %true if an ongoing memcg OOM situation was detected and
> + * completed, %false otherwise.
> */
> -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> +bool mem_cgroup_oom_synchronize(bool handle)
> {
> + struct mem_cgroup *memcg = current->memcg_oom.memcg;
> struct oom_wait_info owait;
> bool locked;
>
> + /* OOM is global, do not handle */
> + if (!memcg)
> + return false;
> +
> + if (!handle)
> + goto cleanup;
> +
> owait.mem = memcg;
> owait.wait.flags = 0;
> owait.wait.func = memcg_oom_wake_function;
> @@ -1894,7 +1939,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> if (locked && !memcg->oom_kill_disable) {
> mem_cgroup_unmark_under_oom(memcg);
> finish_wait(&memcg_oom_waitq, &owait.wait);
> - mem_cgroup_out_of_memory(memcg, mask);
> + mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask);
> } else {
> schedule();
> mem_cgroup_unmark_under_oom(memcg);
> @@ -1910,11 +1955,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> */
> memcg_oom_recover(memcg);
> }
> -
> - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> - return false;
> - /* Give chance to dying process */
> - schedule_timeout_uninterruptible(1);
> +cleanup:
> + current->memcg_oom.memcg = NULL;
> + css_put(&memcg->css);
> return true;
> }
>
> @@ -2204,11 +2247,10 @@ enum {
> CHARGE_RETRY, /* need to retry but retry is not bad */
> CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
> CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
> - CHARGE_OOM_DIE, /* the current is killed because of OOM */
> };
>
> static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> - unsigned int nr_pages, bool oom_check)
> + unsigned int nr_pages, bool invoke_oom)
> {
> unsigned long csize = nr_pages * PAGE_SIZE;
> struct mem_cgroup *mem_over_limit;
> @@ -2266,14 +2308,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (mem_cgroup_wait_acct_move(mem_over_limit))
> return CHARGE_RETRY;
>
> - /* If we don't need to call oom-killer at el, return immediately */
> - if (!oom_check || !current->memcg_oom.may_oom)
> - return CHARGE_NOMEM;
> - /* check OOM */
> - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
> - return CHARGE_OOM_DIE;
> + if (invoke_oom)
> + mem_cgroup_oom(mem_over_limit, gfp_mask);
>
> - return CHARGE_RETRY;
> + return CHARGE_NOMEM;
> }
>
> /*
> @@ -2301,6 +2339,12 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> goto bypass;
>
> /*
> + * Task already OOMed, just get out of here.
> + */
> + if (unlikely(current->memcg_oom.memcg))
> + goto nomem;
> +
> + /*
> * We always charge the cgroup the mm_struct belongs to.
> * The mm_struct's mem_cgroup changes on task migration if the
> * thread group leader migrates. It's possible that mm is not
> @@ -2358,7 +2402,7 @@ again:
> }
>
> do {
> - bool oom_check;
> + bool invoke_oom = oom && !nr_oom_retries;
>
> /* If killed, bypass charge */
> if (fatal_signal_pending(current)) {
> @@ -2366,13 +2410,7 @@ again:
> goto bypass;
> }
>
> - oom_check = false;
> - if (oom && !nr_oom_retries) {
> - oom_check = true;
> - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> - }
> -
> - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
> + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
> switch (ret) {
> case CHARGE_OK:
> break;
> @@ -2385,16 +2423,12 @@ again:
> css_put(&memcg->css);
> goto nomem;
> case CHARGE_NOMEM: /* OOM routine works */
> - if (!oom) {
> + if (!oom || invoke_oom) {
> css_put(&memcg->css);
> goto nomem;
> }
> - /* If oom, we never return -ENOMEM */
> nr_oom_retries--;
> break;
> - case CHARGE_OOM_DIE: /* Killed by OOM Killer */
> - css_put(&memcg->css);
> - goto bypass;
> }
> } while (ret != CHARGE_OK);
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7b66056..20c43a0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3517,8 +3517,17 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>
> ret = __handle_mm_fault(mm, vma, address, flags);
>
> - if (flags & FAULT_FLAG_USER)
> + if (flags & FAULT_FLAG_USER) {
> mem_cgroup_disable_oom();
> + /*
> + * The task may have entered a memcg OOM situation but
> + * if the allocation error was handled gracefully (no
> + * VM_FAULT_OOM), there is no need to kill anything.
> + * Just clean up the OOM state peacefully.
> + */
> + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> + mem_cgroup_oom_synchronize(false);
> + }
>
> return ret;
> }
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 069b64e..3bf664c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -785,6 +785,8 @@ out:
> */
> void pagefault_out_of_memory(void)
> {
> + if (mem_cgroup_oom_synchronize(true))
> + return;
> if (try_set_system_oom()) {
> out_of_memory(NULL, 0, 0, NULL);
> clear_system_oom();
> --
> 1.8.4
>

--
Michal Hocko
SUSE Labs

Michal Hocko
2013-09-09 12:56:59 UTC
[Adding Glauber - the full patch is here https://lkml.org/lkml/2013/9/5/319]

On Mon 09-09-13 14:36:25, Michal Hocko wrote:
> On Thu 05-09-13 12:18:17, Johannes Weiner wrote:
> [...]
> > From: Johannes Weiner <***@cmpxchg.org>
> > Subject: [patch] mm: memcg: do not trap chargers with full callstack on OOM
> >
> [...]
> >
> > To fix this, never do any OOM handling directly in the charge context.
> > When an OOM situation is detected, let the task remember the memcg and
> > then handle the OOM (kill or wait) only after the page fault stack is
> > unwound and about to return to userspace.
>
> OK, this is indeed nicer because the oom setup is trivial and the
> handling is not split into two parts and everything happens close to
> out_of_memory where it is expected.

Hmm, wait a second. I completely forgot about the kmem charging
path during the review.

So while previously memcg_charge_kmem could have oom killed a
task if it couldn't charge to the u-limit after it managed
to charge the k-limit, now it would simply fail because there is no
mem_cgroup_{enable,disable}_oom around the __mem_cgroup_try_charge it
relies on. The allocation will fail in the end but I am not sure whether
the missing oom is an issue or not for existing use cases.

My original objection about oom triggered from kmem paths was that oom
is not kmem aware so the oom decisions might be totally bogus. But we
still have that:

/*
* Conditions under which we can wait for the oom_killer. Those are
* the same conditions tested by the core page allocator
*/
may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY);

_memcg = memcg;
ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
&_memcg, may_oom);

I do not mind having may_oom = false unconditionally in that path but I
would like to hear from Glauber first.
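
To make the concern concrete, here is a minimal userspace sketch of
the decision. The MODEL_GFP_* bit values and both function names are
placeholders made up for this example (the real __GFP_FS and
__GFP_NORETRY values depend on the kernel version). It shows that
with the patch the call-site may_oom is no longer sufficient: the
per-task flag, which is only set around user faults, has to be set
as well.

```c
#include <assert.h>	/* only for checks a caller might add */
#include <stdbool.h>

/* Placeholder flag bits; the real gfp bit values differ. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_FS		0x1u
#define MODEL_GFP_NORETRY	0x2u

/* Same condition as the kmem charge site quoted above. */
static bool kmem_may_oom(model_gfp_t gfp)
{
	return (gfp & MODEL_GFP_FS) && !(gfp & MODEL_GFP_NORETRY);
}

/*
 * With the patch, an OOM is only recorded if the task-level flag is
 * set too, and nothing sets it around a plain kmem charge.
 */
static bool kmem_oom_recorded(model_gfp_t gfp, bool task_may_oom)
{
	return kmem_may_oom(gfp) && task_may_oom;
}
```

So kmem_oom_recorded(MODEL_GFP_FS, false) is false even though the
call site asked for OOM handling, which is the behavior change
described above.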

--
Michal Hocko
SUSE Labs

Johannes Weiner
2013-09-12 12:59:38 UTC
On Mon, Sep 09, 2013 at 02:56:59PM +0200, Michal Hocko wrote:
> [Adding Glauber - the full patch is here https://lkml.org/lkml/2013/9/5/319]
>
> On Mon 09-09-13 14:36:25, Michal Hocko wrote:
> > On Thu 05-09-13 12:18:17, Johannes Weiner wrote:
> > [...]
> > > From: Johannes Weiner <***@cmpxchg.org>
> > > Subject: [patch] mm: memcg: do not trap chargers with full callstack on OOM
> > >
> > [...]
> > >
> > > To fix this, never do any OOM handling directly in the charge context.
> > > When an OOM situation is detected, let the task remember the memcg and
> > > then handle the OOM (kill or wait) only after the page fault stack is
> > > unwound and about to return to userspace.
> >
> > OK, this is indeed nicer because the oom setup is trivial and the
> > handling is not split into two parts and everything happens close to
> > out_of_memory where it is expected.
>
> Hmm, wait a second. I completely forgot about the kmem charging
> path during the review.
>
> So while previously memcg_charge_kmem could have oom killed a
> task if it couldn't charge to the u-limit after it managed
> to charge the k-limit, now it would simply fail because there is no
> mem_cgroup_{enable,disable}_oom around the __mem_cgroup_try_charge it
> relies on. The allocation will fail in the end but I am not sure whether
> the missing oom is an issue or not for existing use cases.

Kernel sites should be able to handle -ENOMEM, right? And if this
nests inside a userspace fault, it'll still enter OOM.

> My original objection about oom triggered from kmem paths was that oom
> is not kmem aware so the oom decisions might be totally bogus. But we
> still have that:

Well, k should be a fraction of u+k on any reasonable setup, so there
are always appropriate candidates to take down.

> /*
> * Conditions under which we can wait for the oom_killer. Those are
> * the same conditions tested by the core page allocator
> */
> may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY);
>
> _memcg = memcg;
> ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
> &_memcg, may_oom);
>
> I do not mind having may_oom = false unconditionally in that path but I
> would like to hear from Glauber first.

The patch I just sent to azur puts this conditional into try_charge(),
so I'd just change the kmem site to pass `true'.

Michal Hocko
2013-09-16 14:03:18 UTC
[Sorry for the late reply. I am in pre-long-vacation mode trying to
clean up my desk]

On Thu 12-09-13 08:59:38, Johannes Weiner wrote:
> On Mon, Sep 09, 2013 at 02:56:59PM +0200, Michal Hocko wrote:
[...]
> > Hmm, wait a second. I completely forgot about the kmem charging
> > path during the review.
> >
> > So while previously memcg_charge_kmem could have oom killed a
> > task if it couldn't charge to the u-limit after it managed
> > to charge the k-limit, now it would simply fail because there is no
> > mem_cgroup_{enable,disable}_oom around the __mem_cgroup_try_charge it
> > relies on. The allocation will fail in the end but I am not sure whether
> > the missing oom is an issue or not for existing use cases.
>
> Kernel sites should be able to handle -ENOMEM, right? And if this
> nests inside a userspace fault, it'll still enter OOM.

Yes, I am not concerned about page faults or the kernel not being able
to handle ENOMEM. I was more worried about somebody relying on a kmalloc
allocation to trigger OOM (e.g. a fork bomb hitting the kmem limit). This
wouldn't be a good idea in the first place but I wanted to hear back
from those who use kmem accounting for something real.

I would rather see no-oom from kmalloc until oom is kmem aware.

> > My original objection about oom triggered from kmem paths was that oom
> > is not kmem aware so the oom decisions might be totally bogus. But we
> > still have that:
>
> Well, k should be a fraction of u+k on any reasonable setup, so there
> are always appropriate candidates to take down.
>
> > /*
> > * Conditions under which we can wait for the oom_killer. Those are
> > * the same conditions tested by the core page allocator
> > */
> > may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY);
> >
> > _memcg = memcg;
> > ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
> > &_memcg, may_oom);
> >
> > I do not mind having may_oom = false unconditionally in that path but I
> > would like to hear from Glauber first.
>
> The patch I just sent to azur puts this conditional into try_charge(),
> so I'd just change the kmem site to pass `true'.

It seems that your previous patch got merged already (3812c8c8). Could
you post your new version on top of the merged one, please? I am getting
lost in the current patch flow.

I will try to review it before I leave (on Friday).

Thanks!
--
Michal Hocko
SUSE Labs
Michal Hocko
2013-09-05 13:24:15 UTC
On Thu 05-09-13 07:54:30, Johannes Weiner wrote:
[...]
> From: Johannes Weiner <***@cmpxchg.org>
> Subject: [patch] mm: memcg: handle non-error OOM situations more gracefully
>
> Many places that can trigger a memcg OOM situation return gracefully
> and don't propagate VM_FAULT_OOM up the fault stack.
>
> It's not practical to annotate all of them to disable the memcg OOM
> killer. Instead, just clean up any set OOM state without warning in
> case the fault is not returning VM_FAULT_OOM.
>
> Also fail charges immediately when the current task already is in an
> OOM context. Otherwise, the previous context gets overwritten and the
> memcg reference is leaked.

Could you paste find_or_create_page called from __getblk as an example
here, please? So that we do not have to scratch our heads again later...
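
As a rough illustration of that case, the leak can be reproduced in a
small userspace model. Everything here is hypothetical (the struct
names, the model_*() helpers, and the bounded retry count, since the
real __getblk() loop keeps retrying rather than stopping after a fixed
number of attempts). Each failed charge takes a reference and
overwrites the task's OOM context, but the end-of-fault cleanup drops
only one reference. The guard models the "fail charges immediately"
part of the changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's data structures. */
struct memcg_model { int refcount; };
struct task_model { struct memcg_model *oom_memcg; };

/* A charge that always hits the limit; returns false for -ENOMEM. */
static bool model_charge(struct task_model *t, struct memcg_model *m,
			 bool guard)
{
	/* The fix: a task already in an OOM context fails right away. */
	if (guard && t->oom_memcg)
		return false;
	m->refcount++;		/* css_get() for the new OOM context */
	t->oom_memcg = m;	/* overwrites any earlier context */
	return false;
}

/* Caller in the style of __getblk(): retry after a failure. */
static void model_retry_loop(struct task_model *t, struct memcg_model *m,
			     int tries, bool guard)
{
	for (int i = 0; i < tries; i++)
		if (model_charge(t, m, guard))
			return;
}

/* End-of-fault cleanup drops exactly one reference. */
static void model_cleanup(struct task_model *t)
{
	if (!t->oom_memcg)
		return;
	assert(t->oom_memcg->refcount > 0);
	t->oom_memcg->refcount--;
	t->oom_memcg = NULL;
}

/* Number of references still held after the fault has finished. */
static int model_leaked_refs(int tries, bool guard)
{
	struct memcg_model m = { .refcount = 1 };	/* base reference */
	struct task_model t = { .oom_memcg = NULL };

	model_retry_loop(&t, &m, tries, guard);
	model_cleanup(&t);
	return m.refcount - 1;
}
```

Without the guard, every retry after the first leaves one reference
behind, which would match a cgroup that can never be removed. With
the guard the count comes back to zero.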

Also the task_in_memcg_oom check could be moved into the
mem_cgroup_disable_oom branch to reduce the overhead for in-kernel
faults. The overhead shouldn't be noticeable, so I am not sure this is
that important.

> Signed-off-by: Johannes Weiner <***@cmpxchg.org>

I do not see any easier way to fix this without going back to the
old behavior, which is much worse.

Acked-by: Michal Hocko <***@suse.cz>

Thanks!

> diff --git a/mm/memory.c b/mm/memory.c
> index cdbe41b..cdad471 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -57,7 +57,6 @@
> #include <linux/swapops.h>
> #include <linux/elf.h>
> #include <linux/gfp.h>
> -#include <linux/stacktrace.h>
>
> #include <asm/io.h>
> #include <asm/pgalloc.h>
> @@ -3521,11 +3520,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (flags & FAULT_FLAG_USER)
> mem_cgroup_disable_oom();
>
> - if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) {
> - printk("Fixing unhandled memcg OOM context set up from:\n");
> - print_stack_trace(&current->memcg_oom.trace, 0);
> - mem_cgroup_oom_synchronize();
> - }
> + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> + mem_cgroup_oom_synchronize(false);
>
> return ret;
> }
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index aa60863..3bf664c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -785,7 +785,7 @@ out:
> */
> void pagefault_out_of_memory(void)
> {
> - if (mem_cgroup_oom_synchronize())
> + if (mem_cgroup_oom_synchronize(true))
> return;
> if (try_set_system_oom()) {
> out_of_memory(NULL, 0, 0, NULL);
> --
> 1.8.4
>

--
Michal Hocko
SUSE Labs

azurIt
2013-09-09 13:10:10 UTC
>Hi azur,
>
>On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
>> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >Hello azur,
>> >
>> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >> >>Hi azur,
>> >> >>
>> >> >>here is the x86-only rollup of the series for 3.2.
>> >> >>
>> >> >>Thanks!
>> >> >>Johannes
>> >> >>---
>> >> >
>> >> >
>> >> >Johannes,
>> >> >
>> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
>> >
>> >Did the OOM killer go off in this group?
>> >
>> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
>> >context")?
>>
>>
>>
>> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
>> http://watchdog.sk/lkml/oom_syslog.gz
>There is an unfinished OOM invocation here:
>
> Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
> Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
> Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
> Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
> Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
> Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
> Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
> Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
> Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
> Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
> Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
> Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
> Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
> Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
> Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
> Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
> Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
>
>__getblk() has this weird loop where it tries to instantiate the page,
>frees memory on failure, then retries. If the memcg goes OOM, the OOM
>path might be entered multiple times and each time leak the memcg
>reference of the respective previous OOM invocation.
>
>There are a few more find_or_create() sites that do not propagate an
>error and it's incredibly hard to find out whether they are even taken
>during a page fault. It's not practical to annotate them all with
>memcg OOM toggles, so let's just catch all OOM contexts at the end of
>handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
>this like an error.
>
>azur, here is a patch on top of your modified 3.2. Note that Michal
>might be onto something and we are looking at multiple issues here,
>but the log excerpt above suggests this fix is required either way.

Johannes, is this still up to date? Thank you.

azur

>---
>From: Johannes Weiner <***@cmpxchg.org>
>Subject: [patch] mm: memcg: handle non-error OOM situations more gracefully
>
>Many places that can trigger a memcg OOM situation return gracefully
>and don't propagate VM_FAULT_OOM up the fault stack.
>
>It's not practical to annotate all of them to disable the memcg OOM
>killer. Instead, just clean up any set OOM state without warning in
>case the fault is not returning VM_FAULT_OOM.
>
>Also fail charges immediately when the current task already is in an
>OOM context. Otherwise, the previous context gets overwritten and the
>memcg reference is leaked.
>
>Signed-off-by: Johannes Weiner <***@cmpxchg.org>
>---
> include/linux/memcontrol.h | 40 ++++++----------------------------------
> include/linux/sched.h | 3 ---
> mm/filemap.c | 11 +----------
> mm/memcontrol.c | 15 ++++++++-------
> mm/memory.c | 8 ++------
> mm/oom_kill.c | 2 +-
> 6 files changed, 18 insertions(+), 61 deletions(-)
>
>diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>index b113c0f..7c43903 100644
>--- a/include/linux/memcontrol.h
>+++ b/include/linux/memcontrol.h
>@@ -120,39 +120,16 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> struct task_struct *p);
>
>-/**
>- * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
>- * @new: true to enable, false to disable
>- *
>- * Toggle whether a failed memcg charge should invoke the OOM killer
>- * or just return -ENOMEM. Returns the previous toggle state.
>- *
>- * NOTE: Any path that enables the OOM killer before charging must
>- * call mem_cgroup_oom_synchronize() afterward to finalize the
>- * OOM handling and clean up.
>- */
>-static inline bool mem_cgroup_toggle_oom(bool new)
>-{
>- bool old;
>-
>- old = current->memcg_oom.may_oom;
>- current->memcg_oom.may_oom = new;
>-
>- return old;
>-}
>-
> static inline void mem_cgroup_enable_oom(void)
> {
>- bool old = mem_cgroup_toggle_oom(true);
>-
>- WARN_ON(old == true);
>+ WARN_ON(current->memcg_oom.may_oom);
>+ current->memcg_oom.may_oom = true;
> }
>
> static inline void mem_cgroup_disable_oom(void)
> {
>- bool old = mem_cgroup_toggle_oom(false);
>-
>- WARN_ON(old == false);
>+ WARN_ON(!current->memcg_oom.may_oom);
>+ current->memcg_oom.may_oom = false;
> }
>
> static inline bool task_in_memcg_oom(struct task_struct *p)
>@@ -160,7 +137,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
> return p->memcg_oom.in_memcg_oom;
> }
>
>-bool mem_cgroup_oom_synchronize(void);
>+bool mem_cgroup_oom_synchronize(bool wait);
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> extern int do_swap_account;
>@@ -375,11 +352,6 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> {
> }
>
>-static inline bool mem_cgroup_toggle_oom(bool new)
>-{
>- return false;
>-}
>-
> static inline void mem_cgroup_enable_oom(void)
> {
> }
>@@ -393,7 +365,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
> return false;
> }
>
>-static inline bool mem_cgroup_oom_synchronize(void)
>+static inline bool mem_cgroup_oom_synchronize(bool wait)
> {
> return false;
> }
>diff --git a/include/linux/sched.h b/include/linux/sched.h
>index 3f2562c..70a62fd 100644
>--- a/include/linux/sched.h
>+++ b/include/linux/sched.h
>@@ -91,7 +91,6 @@ struct sched_param {
> #include <linux/latencytop.h>
> #include <linux/cred.h>
> #include <linux/llist.h>
>-#include <linux/stacktrace.h>
>
> #include <asm/processor.h>
>
>@@ -1573,8 +1572,6 @@ struct task_struct {
> unsigned int may_oom:1;
> unsigned int in_memcg_oom:1;
> unsigned int oom_locked:1;
>- struct stack_trace trace;
>- unsigned long trace_entries[16];
> int wakeups;
> struct mem_cgroup *wait_on_memcg;
> } memcg_oom;
>diff --git a/mm/filemap.c b/mm/filemap.c
>index 030774a..5f0a3c9 100644
>--- a/mm/filemap.c
>+++ b/mm/filemap.c
>@@ -1661,7 +1661,6 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> struct inode *inode = mapping->host;
> pgoff_t offset = vmf->pgoff;
> struct page *page;
>- bool memcg_oom;
> pgoff_t size;
> int ret = 0;
>
>@@ -1670,11 +1669,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
>
> /*
>- * Do we have something in the page cache already? Either
>- * way, try readahead, but disable the memcg OOM killer for it
>- * as readahead is optional and no errors are propagated up
>- * the fault stack. The OOM killer is enabled while trying to
>- * instantiate the faulting page individually below.
>+ * Do we have something in the page cache already?
> */
> page = find_get_page(mapping, offset);
> if (likely(page)) {
>@@ -1682,14 +1677,10 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> * We found the page, so try async readahead before
> * waiting for the lock.
> */
>- memcg_oom = mem_cgroup_toggle_oom(false);
> do_async_mmap_readahead(vma, ra, file, page, offset);
>- mem_cgroup_toggle_oom(memcg_oom);
> } else {
> /* No page in the page cache at all */
>- memcg_oom = mem_cgroup_toggle_oom(false);
> do_sync_mmap_readahead(vma, ra, file, offset);
>- mem_cgroup_toggle_oom(memcg_oom);
> count_vm_event(PGMAJFAULT);
> mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> ret = VM_FAULT_MAJOR;
>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>index 83acd11..ebd07f3 100644
>--- a/mm/memcontrol.c
>+++ b/mm/memcontrol.c
>@@ -1874,12 +1874,6 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
>
> current->memcg_oom.in_memcg_oom = 1;
>
>- current->memcg_oom.trace.nr_entries = 0;
>- current->memcg_oom.trace.max_entries = 16;
>- current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
>- current->memcg_oom.trace.skip = 1;
>- save_stack_trace(&current->memcg_oom.trace);
>-
> /*
> * As with any blocking lock, a contender needs to start
> * listening for wakeups before attempting the trylock,
>@@ -1935,6 +1929,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
>
> /**
> * mem_cgroup_oom_synchronize - complete memcg OOM handling
>+ * @wait: wait for OOM handler or just clear the OOM state
> *
> * This has to be called at the end of a page fault if the the memcg
> * OOM handler was enabled and the fault is returning %VM_FAULT_OOM.
>@@ -1950,7 +1945,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
> * Returns %true if an ongoing memcg OOM situation was detected and
> * finalized, %false otherwise.
> */
>-bool mem_cgroup_oom_synchronize(void)
>+bool mem_cgroup_oom_synchronize(bool wait)
> {
> struct oom_wait_info owait;
> struct mem_cgroup *memcg;
>@@ -1969,6 +1964,9 @@ bool mem_cgroup_oom_synchronize(void)
> if (!memcg)
> goto out;
>
>+ if (!wait)
>+ goto out_memcg;
>+
> if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> goto out_memcg;
>
>@@ -2369,6 +2367,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> struct mem_cgroup *memcg = NULL;
> int ret;
>
>+ if (unlikely(current->memcg_oom.in_memcg_oom))
>+ goto nomem;
>+
> /*
> * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> * in system level. So, allow to go ahead dying process in addition to
>diff --git a/mm/memory.c b/mm/memory.c
>index cdbe41b..cdad471 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -57,7 +57,6 @@
> #include <linux/swapops.h>
> #include <linux/elf.h>
> #include <linux/gfp.h>
>-#include <linux/stacktrace.h>
>
> #include <asm/io.h>
> #include <asm/pgalloc.h>
>@@ -3521,11 +3520,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (flags & FAULT_FLAG_USER)
> mem_cgroup_disable_oom();
>
>- if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) {
>- printk("Fixing unhandled memcg OOM context set up from:\n");
>- print_stack_trace(&current->memcg_oom.trace, 0);
>- mem_cgroup_oom_synchronize();
>- }
>+ if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
>+ mem_cgroup_oom_synchronize(false);
>
> return ret;
> }
>diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>index aa60863..3bf664c 100644
>--- a/mm/oom_kill.c
>+++ b/mm/oom_kill.c
>@@ -785,7 +785,7 @@ out:
> */
> void pagefault_out_of_memory(void)
> {
>- if (mem_cgroup_oom_synchronize())
>+ if (mem_cgroup_oom_synchronize(true))
> return;
> if (try_set_system_oom()) {
> out_of_memory(NULL, 0, 0, NULL);
>--
>1.8.4
>
>
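
The reference leak described in the quoted patch can be reproduced in a small userspace model. This is only a sketch under stated assumptions, not kernel code: `memcg_model`, `task_model` and the `model_*` functions are hypothetical stand-ins for `struct mem_cgroup`, `current->memcg_oom`, the failing charge path and `mem_cgroup_oom_synchronize(false)`, and the integer `refs` stands in for the css reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup. */
struct memcg_model {
	int refs;			/* models the css reference count */
};

/* Hypothetical stand-in for current->memcg_oom. */
struct task_model {
	struct memcg_model *oom_memcg;	/* recorded OOM context, if any */
};

/*
 * Models a failed charge that records an OOM context.  With guard
 * set, a task that already has a context bails out first, similar to
 * the "goto nomem" check the patch adds to __mem_cgroup_try_charge().
 * Always returns false to model -ENOMEM.
 */
static bool model_charge_fail(struct task_model *t, struct memcg_model *m,
			      bool guard)
{
	if (guard && t->oom_memcg)
		return false;	/* keep the existing context and its ref */

	m->refs++;		/* models css_get() */
	t->oom_memcg = m;	/* overwrites any previous context */
	return false;
}

/*
 * Models mem_cgroup_oom_synchronize(false): drop the recorded context
 * and its reference without killing or waiting.
 */
static bool model_oom_cleanup(struct task_model *t)
{
	struct memcg_model *m = t->oom_memcg;

	if (!m)
		return false;
	assert(m->refs > 0);
	t->oom_memcg = NULL;
	m->refs--;		/* models css_put() */
	return true;
}

/*
 * Retry a failing charge `tries` times (like the __getblk() loop),
 * clean up once at the end of the fault, and return the number of
 * references still held.  Non-zero means a leaked reference.
 */
static int model_fault(int tries, bool guard)
{
	struct memcg_model m = { 0 };
	struct task_model t = { NULL };
	int i;

	for (i = 0; i < tries; i++)
		model_charge_fail(&t, &m, guard);
	model_oom_cleanup(&t);
	return m.refs;
}
```

In this model, `model_fault(3, false)` leaves two references behind, one per overwritten context, which corresponds to a cgroup that can no longer be removed. `model_fault(3, true)` returns 0 because only the first failure records a context.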

Johannes Weiner
2013-09-09 17:28:49 UTC
On Mon, Sep 09, 2013 at 03:10:10PM +0200, azurIt wrote:
> >Hi azur,
> >
> >On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
> >> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
> >> >Hello azur,
> >> >
> >> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
> >> >> >>Hi azur,
> >> >> >>
> >> >> >>here is the x86-only rollup of the series for 3.2.
> >> >> >>
> >> >> >>Thanks!
> >> >> >>Johannes
> >> >> >>---
> >> >> >
> >> >> >
> >> >> >Johannes,
> >> >> >
> >> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
> >> >
> >> >Did the OOM killer go off in this group?
> >> >
> >> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
> >> >context")?
> >>
> >>
> >>
> >> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
> >> http://watchdog.sk/lkml/oom_syslog.gz
> >There is an unfinished OOM invocation here:
> >
> > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
> > Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
> > Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
> > Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
> > Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
> > Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
> > Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
> > Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
> > Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
> > Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
> > Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
> > Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
> > Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
> > Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
> > Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
> > Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
> > Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
> >
> >__getblk() has this weird loop where it tries to instantiate the page,
> >frees memory on failure, then retries. If the memcg goes OOM, the OOM
> >path might be entered multiple times and each time leak the memcg
> >reference of the respective previous OOM invocation.
> >
> >There are a few more find_or_create() sites that do not propagate an
> >error and it's incredibly hard to find out whether they are even taken
> >during a page fault. It's not practical to annotate them all with
> >memcg OOM toggles, so let's just catch all OOM contexts at the end of
> >handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
> >this like an error.
> >
> >azur, here is a patch on top of your modified 3.2. Note that Michal
> >might be onto something and we are looking at multiple issues here,
> >but the log excerpt above suggests this fix is required either way.
>
>
>
>
> Johannes, is this still up to date? Thank you.

No, please use the following on top of 3.2 (i.e. full replacement, not
incremental to what you have):

---

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..314fe53 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,30 +842,22 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
}

-static noinline int
+static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address);
- return 1;
+ if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+ up_read(&current->mm->mmap_sem);
+ no_context(regs, error_code, address);
+ return;
}
- if (!(fault & VM_FAULT_ERROR))
- return 0;

if (fault & VM_FAULT_OOM) {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & PF_USER)) {
up_read(&current->mm->mmap_sem);
no_context(regs, error_code, address);
- return 1;
+ return;
}

out_of_memory(regs, error_code, address);
@@ -876,7 +868,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
else
BUG();
}
- return 1;
}

static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1070,6 +1061,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+ flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1167,9 +1159,17 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, flags);

- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
- if (mm_fault_error(regs, error_code, address, fault))
- return;
+ /*
+ * If we need to retry but a fatal signal is pending, handle the
+ * signal first. We do not need to release the mmap_sem because it
+ * would already be released in __lock_page_or_retry in mm/filemap.c.
+ */
+ if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+ return;
+
+ if (unlikely(fault & VM_FAULT_ERROR)) {
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}

/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..325da07 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,25 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

+static inline void mem_cgroup_enable_oom(void)
+{
+ WARN_ON(current->memcg_oom.may_oom);
+ current->memcg_oom.may_oom = 1;
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+ WARN_ON(!current->memcg_oom.may_oom);
+ current->memcg_oom.may_oom = 0;
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return p->memcg_oom.memcg;
+}
+
+bool mem_cgroup_oom_synchronize(bool wait);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -333,6 +352,24 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline void mem_cgroup_enable_oom(void)
+{
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(bool wait)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..846b82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */

/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..fb1f145 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1568,6 +1568,11 @@ struct task_struct {
unsigned long nr_pages; /* uncharged usage */
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
+ struct memcg_oom_info {
+ struct mem_cgroup *memcg;
+ gfp_t gfp_mask;
+ unsigned int may_oom:1;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..56643fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1743,16 +1743,19 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
return total;
}

+static DEFINE_SPINLOCK(memcg_oom_lock);
+
/*
* Check OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
- * Has to be called with memcg_oom_lock
*/
-static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
+static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter, *failed = NULL;
bool cond = true;

+ spin_lock(&memcg_oom_lock);
+
for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
if (iter->oom_lock) {
/*
@@ -1765,34 +1768,34 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
iter->oom_lock = true;
}

- if (!failed)
- return true;
-
- /*
- * OK, we failed to lock the whole subtree so we have to clean up
- * what we set up to the failing subtree
- */
- cond = true;
- for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
- if (iter == failed) {
- cond = false;
- continue;
+ if (failed) {
+ /*
+ * OK, we failed to lock the whole subtree so we have
+ * to clean up what we set up to the failing subtree
+ */
+ cond = true;
+ for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
+ if (iter == failed) {
+ cond = false;
+ continue;
+ }
+ iter->oom_lock = false;
}
- iter->oom_lock = false;
}
- return false;
+
+ spin_unlock(&memcg_oom_lock);
+
+ return !failed;
}

-/*
- * Has to be called with memcg_oom_lock
- */
-static int mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter;

+ spin_lock(&memcg_oom_lock);
for_each_mem_cgroup_tree(iter, memcg)
iter->oom_lock = false;
- return 0;
+ spin_unlock(&memcg_oom_lock);
}

static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
@@ -1816,7 +1819,6 @@ static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
atomic_add_unless(&iter->under_oom, -1, 0);
}

-static DEFINE_SPINLOCK(memcg_oom_lock);
static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);

struct oom_wait_info {
@@ -1856,56 +1858,106 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
memcg_wakeup_oom(memcg);
}

-/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
+{
+ if (!current->memcg_oom.may_oom)
+ return;
+ /*
+ * We are in the middle of the charge context here, so we
+ * don't want to block when potentially sitting on a callstack
+ * that holds all kinds of filesystem and mm locks.
+ *
+ * Also, the caller may handle a failed allocation gracefully
+ * (like optional page cache readahead) and so an OOM killer
+ * invocation might not even be necessary.
+ *
+ * That's why we don't do anything here except remember the
+ * OOM context and then deal with it at the end of the page
+ * fault when the stack is unwound, the locks are released,
+ * and when we know whether the fault was overall successful.
+ */
+ css_get(&memcg->css);
+ current->memcg_oom.memcg = memcg;
+ current->memcg_oom.gfp_mask = mask;
+}
+
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ * @handle: actually kill/wait or just clean up the OOM state
+ *
+ * This has to be called at the end of a page fault if the memcg OOM
+ * handler was enabled.
+ *
+ * Memcg supports userspace OOM handling where failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to complete the OOM handling.
+ *
+ * Returns %true if an ongoing memcg OOM situation was detected and
+ * completed, %false otherwise.
*/
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+bool mem_cgroup_oom_synchronize(bool handle)
{
+ struct mem_cgroup *memcg = current->memcg_oom.memcg;
struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked;
+
+ /* OOM is global, do not handle */
+ if (!memcg)
+ return false;
+
+ if (!handle)
+ goto cleanup;

owait.mem = memcg;
owait.wait.flags = 0;
owait.wait.func = memcg_oom_wake_function;
owait.wait.private = current;
INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);

- /* At first, try to OOM lock hierarchy under memcg.*/
- spin_lock(&memcg_oom_lock);
- locked = mem_cgroup_oom_lock(memcg);
/*
+ * As with any blocking lock, a contender needs to start
+ * listening for wakeups before attempting the trylock,
+ * otherwise it can miss the wakeup from the unlock and sleep
+ * indefinitely. This is just open-coded because our locking
+ * is so particular to memcg hierarchies.
+ *
* Even if signal_pending(), we can't quit charge() loop without
* accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
* under OOM is always welcomed, use TASK_KILLABLE here.
*/
prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- if (!locked || memcg->oom_kill_disable)
- need_to_kill = false;
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
if (locked)
mem_cgroup_oom_notify(memcg);
- spin_unlock(&memcg_oom_lock);

- if (need_to_kill) {
+ if (locked && !memcg->oom_kill_disable) {
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
- mem_cgroup_out_of_memory(memcg, mask);
+ mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask);
} else {
schedule();
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
}
- spin_lock(&memcg_oom_lock);
- if (locked)
- mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
-
- mem_cgroup_unmark_under_oom(memcg);

- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
- return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+ if (locked) {
+ mem_cgroup_oom_unlock(memcg);
+ /*
+ * There is no guarantee that an OOM-lock contender
+ * sees the wakeups triggered by the OOM kill
+ * uncharges. Wake any sleepers explicitly.
+ */
+ memcg_oom_recover(memcg);
+ }
+cleanup:
+ current->memcg_oom.memcg = NULL;
+ css_put(&memcg->css);
return true;
}

@@ -2195,11 +2247,10 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2308,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask);

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2292,6 +2339,12 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto bypass;

/*
+ * Task already OOMed, just get out of here.
+ */
+ if (unlikely(current->memcg_oom.memcg))
+ goto nomem;
+
+ /*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
@@ -2349,7 +2402,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2357,13 +2410,7 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2376,16 +2423,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 829d437..20c43a0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3439,22 +3439,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3503,6 +3495,43 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ /*
+ * Enable the memcg OOM handling for faults triggered in user
+ * space. Kernel faults are handled more gracefully.
+ */
+ if (flags & FAULT_FLAG_USER)
+ mem_cgroup_enable_oom();
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (flags & FAULT_FLAG_USER) {
+ mem_cgroup_disable_oom();
+ /*
+ * The task may have entered a memcg OOM situation but
+ * if the allocation error was handled gracefully (no
+ * VM_FAULT_OOM), there is no need to kill anything.
+ * Just clean up the OOM state peacefully.
+ */
+ if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
+ mem_cgroup_oom_synchronize(false);
+ }
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..3bf664c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
*/
void pagefault_out_of_memory(void)
{
+ if (mem_cgroup_oom_synchronize(true))
+ return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
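
The all-or-nothing behaviour of `mem_cgroup_oom_trylock()` in the patch above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: `group_model` and the `model_*` functions are hypothetical, the hierarchy is flattened into an array in iteration order, and the `memcg_oom_lock` spinlock is left out because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the oom_lock flag of one memcg. */
struct group_model {
	bool oom_lock;
};

/*
 * Try to mark every group in tree[0..n) as OOM-locked.  If a group is
 * already locked by someone else, undo only the marks set by this
 * call and report failure, so the other holder's lock stays intact.
 */
static bool model_oom_trylock(struct group_model *tree, size_t n)
{
	size_t i, failed = n;	/* n means "no conflict found" */

	assert(tree != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (tree[i].oom_lock) {
			failed = i;	/* somebody else holds this one */
			break;
		}
		tree[i].oom_lock = true;
	}

	/* Roll back what we set up, stopping before the conflict. */
	if (failed != n)
		for (i = 0; i < failed; i++)
			tree[i].oom_lock = false;

	return failed == n;
}

/* Release the whole subtree; only valid after a successful trylock. */
static void model_oom_unlock(struct group_model *tree, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		tree[i].oom_lock = false;
}
```

A caller that gets `false` back holds nothing and would go to sleep on the waitqueue; a caller that gets `true` owns every group in the subtree until it calls `model_oom_unlock()`.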

azurIt
2013-09-09 19:59:17 UTC
>On Mon, Sep 09, 2013 at 03:10:10PM +0200, azurIt wrote:
>> >Hi azur,
>> >
>> >On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
>> >> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >> >Hello azur,
>> >> >
>> >> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >> >> >>Hi azur,
>> >> >> >>
>> >> >> >>here is the x86-only rollup of the series for 3.2.
>> >> >> >>
>> >> >> >>Thanks!
>> >> >> >>Johannes
>> >> >> >>---
>> >> >> >
>> >> >> >
>> >> >> >Johannes,
>> >> >> >
>> >> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
>> >> >
>> >> >Did the OOM killer go off in this group?
>> >> >
>> >> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
>> >> >context")?
>> >>
>> >>
>> >>
>> >> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
>> >> http://watchdog.sk/lkml/oom_syslog.gz
>> >There is an unfinished OOM invocation here:
>> >
>> > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
>> > Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
>> > Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
>> > Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
>> > Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
>> > Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
>> > Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
>> > Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
>> > Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
>> > Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
>> > Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
>> > Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
>> > Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
>> > Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
>> > Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
>> > Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
>> > Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
>> >
>> >__getblk() has this weird loop where it tries to instantiate the page,
>> >frees memory on failure, then retries. If the memcg goes OOM, the OOM
>> >path might be entered multiple times and each time leak the memcg
>> >reference of the respective previous OOM invocation.
>> >
>> >There are a few more find_or_create() sites that do not propagate an
>> >error and it's incredibly hard to find out whether they are even taken
>> >during a page fault. It's not practical to annotate them all with
>> >memcg OOM toggles, so let's just catch all OOM contexts at the end of
>> >handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
>> >this like an error.
>> >
>> >azur, here is a patch on top of your modified 3.2. Note that Michal
>> >might be onto something and we are looking at multiple issues here,
>> >but the log excerpt above suggests this fix is required either way.
>>
>>
>>
>>
>> Johannes, is this still up to date? Thank you.
>
>No, please use the following on top of 3.2 (i.e. full replacement, not
>incremental to what you have):



Unfortunately it didn't compile:




LD vmlinux.o
MODPOST vmlinux.o
WARNING: modpost: Found 4924 section mismatch(es).
To see full details build your kernel with:
'make CONFIG_DEBUG_SECTION_MISMATCH=y'
GEN .version
CHK include/generated/compile.h
UPD include/generated/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
arch/x86/built-in.o: In function `do_page_fault':
(.text+0x26a77): undefined reference to `handle_mm_fault'
mm/built-in.o: In function `fixup_user_fault':
(.text+0x224d3): undefined reference to `handle_mm_fault'
mm/built-in.o: In function `__get_user_pages':
(.text+0x24a0f): undefined reference to `handle_mm_fault'
make: *** [.tmp_vmlinux1] Error 1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: ***@kvack.org
Johannes Weiner
2013-09-09 20:12:38 UTC
On Mon, Sep 09, 2013 at 09:59:17PM +0200, azurIt wrote:
> Unfortunately it didn't compile:
>
>
>
>
> LD vmlinux.o
> MODPOST vmlinux.o
> WARNING: modpost: Found 4924 section mismatch(es).
> To see full details build your kernel with:
> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
> GEN .version
> CHK include/generated/compile.h
> UPD include/generated/compile.h
> CC init/version.o
> LD init/built-in.o
> LD .tmp_vmlinux1
> arch/x86/built-in.o: In function `do_page_fault':
> (.text+0x26a77): undefined reference to `handle_mm_fault'
> mm/built-in.o: In function `fixup_user_fault':
> (.text+0x224d3): undefined reference to `handle_mm_fault'
> mm/built-in.o: In function `__get_user_pages':
> (.text+0x24a0f): undefined reference to `handle_mm_fault'
> make: *** [.tmp_vmlinux1] Error 1

Oops, sorry about that. Must be configuration dependent because it
works for me (and handle_mm_fault is obviously defined).

Do you have warnings earlier in the compilation? You can use make -s
to filter out everything but warnings.

Or send me your configuration so I can try to reproduce it here.

Thanks!

azurIt
2013-09-09 20:18:39 UTC
>Oops, sorry about that. Must be configuration dependent because it
>works for me (and handle_mm_fault is obviously defined).
>
>Do you have warnings earlier in the compilation? You can use make -s
>to filter out everything but warnings.
>
>Or send me your configuration so I can try to reproduce it here.
>
>Thanks!




Here it is:
http://watchdog.sk/lkml/config

azurIt
2013-09-09 21:08:12 UTC
>Oops, sorry about that. Must be configuration dependent because it
>works for me (and handle_mm_fault is obviously defined).
>
>Do you have warnings earlier in the compilation? You can use make -s
>to filter out everything but warnings.
>
>Or send me your configuration so I can try to reproduce it here.
>
>Thanks!



I'm soooooo sorry Johannes! It was my fault - I had to modify your patch a little because of grsecurity and I did it wrong (24 + 4 apparently isn't 27, haha ;) ).

Everything compiled fine now, thank you very much. I will install the new kernel tonight.

azur

azurIt
2013-09-10 18:13:59 UTC
>Oops, sorry about that. Must be configuration dependent because it
>works for me (and handle_mm_fault is obviously defined).
>
>Do you have warnings earlier in the compilation? You can use make -s
>to filter out everything but warnings.
>
>Or send me your configuration so I can try to reproduce it here.
>
>Thanks!


Johannes,

the server went down early in the morning; the symptoms were similar to before: huge I/O. I can't tell what exactly happened since I wasn't able to log in, even on the console. But I have some info:
- applications were able to write to the HDD, so it wasn't deadlocked as before
- here is how it looked on the graphs: http://watchdog.sk/lkml/graphs.jpg
- the server stopped responding at 6:36 and was down between 6:54 and 7:02 (I had to hard reboot it); I was woken at 6:36 by a really creepy sound from my phone ;)
- my 'load check' script successfully killed apache at 6:41, but it didn't help, as you can see
- I have one screenshot with info from atop at 6:44; it looks like the I/O was done by init (??!): http://watchdog.sk/lkml/atop.jpg (ignore the swap warning, I have no swap)
- other types of logs are also available
- nothing like this happened before

What do you think? I'm now running the kernel with your previous patch, not the newest one.


azur

Johannes Weiner
2013-09-10 18:37:40 UTC
On Tue, Sep 10, 2013 at 08:13:59PM +0200, azurIt wrote:
> >On Mon, Sep 09, 2013 at 09:59:17PM +0200, azurIt wrote:
> >> >On Mon, Sep 09, 2013 at 03:10:10PM +0200, azurIt wrote:
> >> >> >Hi azur,
> >> >> >
> >> >> >On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
> >> >> >> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
> >> >> >> >Hello azur,
> >> >> >> >
> >> >> >> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
> >> >> >> >> >>Hi azur,
> >> >> >> >> >>
> >> >> >> >> >>here is the x86-only rollup of the series for 3.2.
> >> >> >> >> >>
> >> >> >> >> >>Thanks!
> >> >> >> >> >>Johannes
> >> >> >> >> >>---
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >Johannes,
> >> >> >> >> >
> >> >> >> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
> >> >> >> >
> >> >> >> >Did the OOM killer go off in this group?
> >> >> >> >
> >> >> >> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
> >> >> >> >context")?
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
> >> >> >> http://watchdog.sk/lkml/oom_syslog.gz
> >> >> >There is an unfinished OOM invocation here:
> >> >> >
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
> >> >> >
> >> >> >__getblk() has this weird loop where it tries to instantiate the page,
> >> >> >frees memory on failure, then retries. If the memcg goes OOM, the OOM
> >> >> >path might be entered multiple times and each time leak the memcg
> >> >> >reference of the respective previous OOM invocation.
> >> >> >
> >> >> >There are a few more find_or_create() sites that do not propagate an
> >> >> >error and it's incredibly hard to find out whether they are even taken
> >> >> >during a page fault. It's not practical to annotate them all with
> >> >> >memcg OOM toggles, so let's just catch all OOM contexts at the end of
> >> >> >handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
> >> >> >this like an error.
> >> >> >
> >> >> >azur, here is a patch on top of your modified 3.2. Note that Michal
> >> >> >might be onto something and we are looking at multiple issues here,
> >> >> >but the log excerpt above suggests this fix is required either way.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> Johannes, is this still up to date? Thank you.
> >> >
> >> >No, please use the following on top of 3.2 (i.e. full replacement, not
> >> >incremental to what you have):
> >>
> >>
> >>
> >> Unfortunately it didn't compile:
> >>
> >>
> >>
> >>
> >> LD vmlinux.o
> >> MODPOST vmlinux.o
> >> WARNING: modpost: Found 4924 section mismatch(es).
> >> To see full details build your kernel with:
> >> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
> >> GEN .version
> >> CHK include/generated/compile.h
> >> UPD include/generated/compile.h
> >> CC init/version.o
> >> LD init/built-in.o
> >> LD .tmp_vmlinux1
> >> arch/x86/built-in.o: In function `do_page_fault':
> >> (.text+0x26a77): undefined reference to `handle_mm_fault'
> >> mm/built-in.o: In function `fixup_user_fault':
> >> (.text+0x224d3): undefined reference to `handle_mm_fault'
> >> mm/built-in.o: In function `__get_user_pages':
> >> (.text+0x24a0f): undefined reference to `handle_mm_fault'
> >> make: *** [.tmp_vmlinux1] Error 1
> >
> >Oops, sorry about that. Must be configuration dependent because it
> >works for me (and handle_mm_fault is obviously defined).
> >
> >Do you have warnings earlier in the compilation? You can use make -s
> >to filter out everything but warnings.
> >
> >Or send me your configuration so I can try to reproduce it here.
> >
> >Thanks!
>
>
> Johannes,
>
> the server went down early in the morning, the symptoms were similar as before - huge I/O. Can't tell what exactly happened since I wasn't able to login even on the console. But I have some info:
> - applications were able to write to HDD so it wasn't deadlocked as before
> - here is how it looked on graphs: http://watchdog.sk/lkml/graphs.jpg
> - server wasn't responding from 6:36, it was down between 6:54 and 7:02 (i had to hard reboot it), I was awoken at 6:36 by really creepy sound from my phone ;)
> - my 'load check' script successfully killed apache at 6:41 but it didn't help as you can see
> - i have one screen with info from atop from time 6:44, looks like i/o was done by init (??!): http://watchdog.sk/lkml/atop.jpg (ignore swap warning, i have no swap)
> - also other type of logs are available
> - nothing like this happened before

That IO from init looks really screwy, I have no idea what's going on
on that machine, but it looks like there is more than just a memcg
problem... Any chance your third-party security patches are concealing
kernel daemon activity behind the init process and the IO is actually
coming from a kernel thread like the flushers or kswapd?

Are there OOM kill messages in the syslog?

> What do you think? I'm now running kernel with your previous patch, not with the newest one.

Which one exactly? Can you attach the diff?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
azurIt
2013-09-10 19:32:53 UTC
>On Tue, Sep 10, 2013 at 08:13:59PM +0200, azurIt wrote:
>> >On Mon, Sep 09, 2013 at 09:59:17PM +0200, azurIt wrote:
>> >> >On Mon, Sep 09, 2013 at 03:10:10PM +0200, azurIt wrote:
>> >> >> >Hi azur,
>> >> >> >
>> >> >> >On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
>> >> >> >> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >> >> >> >Hello azur,
>> >> >> >> >
>> >> >> >> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >> >> >> >> >>Hi azur,
>> >> >> >> >> >>
>> >> >> >> >> >>here is the x86-only rollup of the series for 3.2.
>> >> >> >> >> >>
>> >> >> >> >> >>Thanks!
>> >> >> >> >> >>Johannes
>> >> >> >> >> >>---
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >Johannes,
>> >> >> >> >> >
>> >> >> >> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
>> >> >> >> >
>> >> >> >> >Did the OOM killer go off in this group?
>> >> >> >> >
>> >> >> >> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
>> >> >> >> >context")?
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
>> >> >> >> http://watchdog.sk/lkml/oom_syslog.gz
>> >> >> >There is an unfinished OOM invocation here:
>> >> >> >
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
>> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
>> >> >> >
>> >> >> >__getblk() has this weird loop where it tries to instantiate the page,
>> >> >> >frees memory on failure, then retries. If the memcg goes OOM, the OOM
>> >> >> >path might be entered multiple times and each time leak the memcg
>> >> >> >reference of the respective previous OOM invocation.
>> >> >> >
>> >> >> >There are a few more find_or_create() sites that do not propagate an
>> >> >> >error and it's incredibly hard to find out whether they are even taken
>> >> >> >during a page fault. It's not practical to annotate them all with
>> >> >> >memcg OOM toggles, so let's just catch all OOM contexts at the end of
>> >> >> >handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
>> >> >> >this like an error.
>> >> >> >
>> >> >> >azur, here is a patch on top of your modified 3.2. Note that Michal
>> >> >> >might be onto something and we are looking at multiple issues here,
>> >> >> >but the log excerpt above suggests this fix is required either way.
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> Johannes, is this still up to date? Thank you.
>> >> >
>> >> >No, please use the following on top of 3.2 (i.e. full replacement, not
>> >> >incremental to what you have):
>> >>
>> >>
>> >>
>> >> Unfortunately it didn't compile:
>> >>
>> >>
>> >>
>> >>
>> >> LD vmlinux.o
>> >> MODPOST vmlinux.o
>> >> WARNING: modpost: Found 4924 section mismatch(es).
>> >> To see full details build your kernel with:
>> >> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
>> >> GEN .version
>> >> CHK include/generated/compile.h
>> >> UPD include/generated/compile.h
>> >> CC init/version.o
>> >> LD init/built-in.o
>> >> LD .tmp_vmlinux1
>> >> arch/x86/built-in.o: In function `do_page_fault':
>> >> (.text+0x26a77): undefined reference to `handle_mm_fault'
>> >> mm/built-in.o: In function `fixup_user_fault':
>> >> (.text+0x224d3): undefined reference to `handle_mm_fault'
>> >> mm/built-in.o: In function `__get_user_pages':
>> >> (.text+0x24a0f): undefined reference to `handle_mm_fault'
>> >> make: *** [.tmp_vmlinux1] Error 1
>> >
>> >Oops, sorry about that. Must be configuration dependent because it
>> >works for me (and handle_mm_fault is obviously defined).
>> >
>> >Do you have warnings earlier in the compilation? You can use make -s
>> >to filter out everything but warnings.
>> >
>> >Or send me your configuration so I can try to reproduce it here.
>> >
>> >Thanks!
>>
>>
>> Johannes,
>>
>> the server went down early in the morning, the symptoms were similar as before - huge I/O. Can't tell what exactly happened since I wasn't able to login even on the console. But I have some info:
>> - applications were able to write to HDD so it wasn't deadlocked as before
>> - here is how it looked on graphs: http://watchdog.sk/lkml/graphs.jpg
>> - server wasn't responding from 6:36, it was down between 6:54 and 7:02 (i had to hard reboot it), I was awoken at 6:36 by really creepy sound from my phone ;)
>> - my 'load check' script successfully killed apache at 6:41 but it didn't help as you can see
>> - i have one screen with info from atop from time 6:44, looks like i/o was done by init (??!): http://watchdog.sk/lkml/atop.jpg (ignore swap warning, i have no swap)
>> - also other type of logs are available
>> - nothing like this happened before
>
>That IO from init looks really screwy, I have no idea what's going on
>on that machine, but it looks like there is more than just a memcg
>problem... Any chance your third-party security patches are concealing
>kernel daemon activity behind the init process and the IO is actually
>coming from a kernel thread like the flushers or kswapd?




I really cannot tell, but I have never seen this before and I have been using all of these patches for several years. Here are all the patches I'm using right now (+ your patch):
http://watchdog.sk/lkml/patches3




>Are there OOM kill messages in the syslog?



Here is full kernel log between 6:00 and 7:59:
http://watchdog.sk/lkml/kern6.log



>> What do you think? I'm now running kernel with your previous patch, not with the newest one.
>
>Which one exactly? Can you attach the diff?



I meant, the problem above occurred on the kernel with your latest patch:
http://watchdog.sk/lkml/7-2-memcg-fix.patch

but after I had to reboot the server, I booted the kernel with your previous patch:
http://watchdog.sk/lkml/7-1-memcg-fix.patch


azur

Johannes Weiner
2013-09-10 20:12:22 UTC
On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
> >On Tue, Sep 10, 2013 at 08:13:59PM +0200, azurIt wrote:
> >> >On Mon, Sep 09, 2013 at 09:59:17PM +0200, azurIt wrote:
> >> >> >On Mon, Sep 09, 2013 at 03:10:10PM +0200, azurIt wrote:
> >> >> >> >Hi azur,
> >> >> >> >
> >> >> >> >On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
> >> >> >> >> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
> >> >> >> >> >Hello azur,
> >> >> >> >> >
> >> >> >> >> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
> >> >> >> >> >> >>Hi azur,
> >> >> >> >> >> >>
> >> >> >> >> >> >>here is the x86-only rollup of the series for 3.2.
> >> >> >> >> >> >>
> >> >> >> >> >> >>Thanks!
> >> >> >> >> >> >>Johannes
> >> >> >> >> >> >>---
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >Johannes,
> >> >> >> >> >> >
> >> >> >> >> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
> >> >> >> >> >
> >> >> >> >> >Did the OOM killer go off in this group?
> >> >> >> >> >
> >> >> >> >> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
> >> >> >> >> >context")?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
> >> >> >> >> http://watchdog.sk/lkml/oom_syslog.gz
> >> >> >> >There is an unfinished OOM invocation here:
> >> >> >> >
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
> >> >> >> >
> >> >> >> >__getblk() has this weird loop where it tries to instantiate the page,
> >> >> >> >frees memory on failure, then retries. If the memcg goes OOM, the OOM
> >> >> >> >path might be entered multiple times and each time leak the memcg
> >> >> >> >reference of the respective previous OOM invocation.
> >> >> >> >
> >> >> >> >There are a few more find_or_create() sites that do not propagate an
> >> >> >> >error and it's incredibly hard to find out whether they are even taken
> >> >> >> >during a page fault. It's not practical to annotate them all with
> >> >> >> >memcg OOM toggles, so let's just catch all OOM contexts at the end of
> >> >> >> >handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
> >> >> >> >this like an error.
> >> >> >> >
> >> >> >> >azur, here is a patch on top of your modified 3.2. Note that Michal
> >> >> >> >might be onto something and we are looking at multiple issues here,
> >> >> >> >but the log excerpt above suggests this fix is required either way.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> Johannes, is this still up to date? Thank you.
> >> >> >
> >> >> >No, please use the following on top of 3.2 (i.e. full replacement, not
> >> >> >incremental to what you have):
> >> >>
> >> >>
> >> >>
> >> >> Unfortunately it didn't compile:
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> LD vmlinux.o
> >> >> MODPOST vmlinux.o
> >> >> WARNING: modpost: Found 4924 section mismatch(es).
> >> >> To see full details build your kernel with:
> >> >> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
> >> >> GEN .version
> >> >> CHK include/generated/compile.h
> >> >> UPD include/generated/compile.h
> >> >> CC init/version.o
> >> >> LD init/built-in.o
> >> >> LD .tmp_vmlinux1
> >> >> arch/x86/built-in.o: In function `do_page_fault':
> >> >> (.text+0x26a77): undefined reference to `handle_mm_fault'
> >> >> mm/built-in.o: In function `fixup_user_fault':
> >> >> (.text+0x224d3): undefined reference to `handle_mm_fault'
> >> >> mm/built-in.o: In function `__get_user_pages':
> >> >> (.text+0x24a0f): undefined reference to `handle_mm_fault'
> >> >> make: *** [.tmp_vmlinux1] Error 1
> >> >
> >> >Oops, sorry about that. Must be configuration dependent because it
> >> >works for me (and handle_mm_fault is obviously defined).
> >> >
> >> >Do you have warnings earlier in the compilation? You can use make -s
> >> >to filter out everything but warnings.
> >> >
> >> >Or send me your configuration so I can try to reproduce it here.
> >> >
> >> >Thanks!
> >>
> >>
> >> Johannes,
> >>
> >> the server went down early in the morning, the symptoms were similar as before - huge I/O. Can't tell what exactly happened since I wasn't able to login even on the console. But I have some info:
> >> - applications were able to write to HDD so it wasn't deadlocked as before
> >> - here is how it looked on graphs: http://watchdog.sk/lkml/graphs.jpg
> >> - server wasn't responding from 6:36, it was down between 6:54 and 7:02 (i had to hard reboot it), I was awoken at 6:36 by really creepy sound from my phone ;)
> >> - my 'load check' script successfully killed apache at 6:41 but it didn't help as you can see
> >> - i have one screen with info from atop from time 6:44, looks like i/o was done by init (??!): http://watchdog.sk/lkml/atop.jpg (ignore swap warning, i have no swap)
> >> - also other type of logs are available
> >> - nothing like this happened before
> >
> >That IO from init looks really screwy, I have no idea what's going on
> >on that machine, but it looks like there is more than just a memcg
> >problem... Any chance your third-party security patches are concealing
> >kernel daemon activity behind the init process and the IO is actually
> >coming from a kernel thread like the flushers or kswapd?
>
>
>
>
> I really cannot tell but I never ever saw this before and i'm using all of my patches for several years. Here are all patches which i'm using right now (+ your patch):
> http://watchdog.sk/lkml/patches3
>
>
>
>
> >Are there OOM kill messages in the syslog?
>
>
>
> Here is full kernel log between 6:00 and 7:59:
> http://watchdog.sk/lkml/kern6.log

Wow, your apaches are like the hydra. Whenever one is OOM killed,
more show up!

> >> What do you think? I'm now running kernel with your previous patch, not with the newest one.
> >
> >Which one exactly? Can you attach the diff?
>
>
>
> I meant, the problem above occurred on the kernel with your latest patch:
> http://watchdog.sk/lkml/7-2-memcg-fix.patch

The above log has the following callstack:

Sep 10 07:59:43 server01 kernel: [ 3846.337628] [<ffffffff810d19fe>] dump_header+0x7e/0x1e0
Sep 10 07:59:43 server01 kernel: [ 3846.337707] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
Sep 10 07:59:43 server01 kernel: [ 3846.337790] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
Sep 10 07:59:43 server01 kernel: [ 3846.337874] [<ffffffff81094bb0>] ? __css_put+0x50/0x90
Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
Sep 10 07:59:43 server01 kernel: [ 3846.338283] [<ffffffff81111035>] mem_cgroup_newpage_charge+0x45/0x50
Sep 10 07:59:43 server01 kernel: [ 3846.338364] [<ffffffff810f3039>] handle_pte_fault+0x609/0x940
Sep 10 07:59:43 server01 kernel: [ 3846.338451] [<ffffffff8102ab1f>] ? pte_alloc_one+0x3f/0x50
Sep 10 07:59:43 server01 kernel: [ 3846.338532] [<ffffffff8107e455>] ? sched_clock_local+0x25/0x90
Sep 10 07:59:43 server01 kernel: [ 3846.338617] [<ffffffff810f34d7>] handle_mm_fault+0x167/0x340
Sep 10 07:59:43 server01 kernel: [ 3846.338699] [<ffffffff8102714b>] do_page_fault+0x13b/0x490
Sep 10 07:59:43 server01 kernel: [ 3846.338781] [<ffffffff810f8848>] ? do_brk+0x208/0x3a0
Sep 10 07:59:43 server01 kernel: [ 3846.338865] [<ffffffff812dba22>] ? gr_learn_resource+0x42/0x1e0
Sep 10 07:59:43 server01 kernel: [ 3846.338951] [<ffffffff815cb7bf>] page_fault+0x1f/0x30

The charge code seems to be directly invoking the OOM killer, which is
not possible with 7-2-memcg-fix. Are you sure this is the right patch
for this log? This _looks_ more like what 7-1-memcg-fix was doing,
with a direct kill in the charge context and a fixup later on.
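
For anyone who wants to check their own logs against this observation, the test can be sketched as a small script. This is only an illustrative heuristic and not part of either patch: the function names `reliable_frames` and `kills_in_charge_context` are made up for this sketch, and the sample lines are copied from the callstack above. It reports whether `oom_kill_process` appears with a memcg charge function among its callers, which is the pattern described above as a direct kill in the charge context.

```python
import re

# Matches one frame of a kernel stack trace, e.g.
#   [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
# A "?" in front of the symbol marks a frame the unwinder is unsure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Sample lines taken from the callstack quoted in this thread.
SAMPLE = """\
Sep 10 07:59:43 server01 kernel: [ 3846.337628] [<ffffffff810d19fe>] dump_header+0x7e/0x1e0
Sep 10 07:59:43 server01 kernel: [ 3846.337707] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
Sep 10 07:59:43 server01 kernel: [ 3846.338283] [<ffffffff81111035>] mem_cgroup_newpage_charge+0x45/0x50
Sep 10 07:59:43 server01 kernel: [ 3846.338364] [<ffffffff810f3039>] handle_pte_fault+0x609/0x940
Sep 10 07:59:43 server01 kernel: [ 3846.338617] [<ffffffff810f34d7>] handle_mm_fault+0x167/0x340
"""


def reliable_frames(log_text):
    """Return the symbols of all reliable frames, innermost first."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        # Skip non-frame lines and frames marked with "?".
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames


def kills_in_charge_context(frames):
    """True if oom_kill_process has a memcg charge function as a caller."""
    if "oom_kill_process" not in frames:
        return False
    callers = frames[frames.index("oom_kill_process") + 1:]
    return any(
        sym.startswith("mem_cgroup_") and "charge" in sym for sym in callers
    )


if __name__ == "__main__":
    print(kills_in_charge_context(reliable_frames(SAMPLE)))
```

The check is purely name based, so it assumes the charge functions keep the `mem_cgroup_*charge*` naming seen in this trace. Compiler-generated symbols such as `T.1154` are kept as ordinary frames.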

It's somewhat eerie that you have to apply these patches manually
because of grsec: I have no way of knowing what the end result is,
especially since you had compile errors in this area before. Is grsec
making changes to the memcg code, or why else are these patches not
applying cleanly?

> but after i had to reboot the server i booted the kernel with your previous patch:
> http://watchdog.sk/lkml/7-1-memcg-fix.patch

This one still has the known memcg leak.

azurIt
2013-09-10 21:08:53 UTC
>On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
>> >On Tue, Sep 10, 2013 at 08:13:59PM +0200, azurIt wrote:
>> >> >On Mon, Sep 09, 2013 at 09:59:17PM +0200, azurIt wrote:
>> >> >> >On Mon, Sep 09, 2013 at 03:10:10PM +0200, azurIt wrote:
>> >> >> >> >Hi azur,
>> >> >> >> >
>> >> >> >> >On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
>> >> >> >> >> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >> >> >> >> >Hello azur,
>> >> >> >> >> >
>> >> >> >> >> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >> >> >> >> >> >>Hi azur,
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>here is the x86-only rollup of the series for 3.2.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>Thanks!
>> >> >> >> >> >> >>Johannes
>> >> >> >> >> >> >>---
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >Johannes,
>> >> >> >> >> >> >
>> >> >> >> >> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
>> >> >> >> >> >
>> >> >> >> >> >Did the OOM killer go off in this group?
>> >> >> >> >> >
>> >> >> >> >> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
>> >> >> >> >> >context")?
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
>> >> >> >> >> http://watchdog.sk/lkml/oom_syslog.gz
>> >> >> >> >There is an unfinished OOM invocation here:
>> >> >> >> >
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
>> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
>> >> >> >> >
>> >> >> >> >__getblk() has this weird loop where it tries to instantiate the page,
>> >> >> >> >frees memory on failure, then retries. If the memcg goes OOM, the OOM
>> >> >> >> >path might be entered multiple times and each time leak the memcg
>> >> >> >> >reference of the respective previous OOM invocation.
>> >> >> >> >
>> >> >> >> >There are a few more find_or_create() sites that do not propagate an
>> >> >> >> >error and it's incredibly hard to find out whether they are even taken
>> >> >> >> >during a page fault. It's not practical to annotate them all with
>> >> >> >> >memcg OOM toggles, so let's just catch all OOM contexts at the end of
>> >> >> >> >handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
>> >> >> >> >this like an error.
>> >> >> >> >
>> >> >> >> >azur, here is a patch on top of your modified 3.2. Note that Michal
>> >> >> >> >might be onto something and we are looking at multiple issues here,
>> >> >> >> >but the log excerpt above suggests this fix is required either way.
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Johannes, is this still up to date? Thank you.
>> >> >> >
>> >> >> >No, please use the following on top of 3.2 (i.e. full replacement, not
>> >> >> >incremental to what you have):
>> >> >>
>> >> >>
>> >> >>
>> >> >> Unfortunately it didn't compile:
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> LD vmlinux.o
>> >> >> MODPOST vmlinux.o
>> >> >> WARNING: modpost: Found 4924 section mismatch(es).
>> >> >> To see full details build your kernel with:
>> >> >> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
>> >> >> GEN .version
>> >> >> CHK include/generated/compile.h
>> >> >> UPD include/generated/compile.h
>> >> >> CC init/version.o
>> >> >> LD init/built-in.o
>> >> >> LD .tmp_vmlinux1
>> >> >> arch/x86/built-in.o: In function `do_page_fault':
>> >> >> (.text+0x26a77): undefined reference to `handle_mm_fault'
>> >> >> mm/built-in.o: In function `fixup_user_fault':
>> >> >> (.text+0x224d3): undefined reference to `handle_mm_fault'
>> >> >> mm/built-in.o: In function `__get_user_pages':
>> >> >> (.text+0x24a0f): undefined reference to `handle_mm_fault'
>> >> >> make: *** [.tmp_vmlinux1] Error 1
>> >> >
>> >> >Oops, sorry about that. Must be configuration dependent because it
>> >> >works for me (and handle_mm_fault is obviously defined).
>> >> >
>> >> >Do you have warnings earlier in the compilation? You can use make -s
>> >> >to filter out everything but warnings.
>> >> >
>> >> >Or send me your configuration so I can try to reproduce it here.
>> >> >
>> >> >Thanks!
>> >>
>> >>
>> >> Johannes,
>> >>
>> >> the server went down early in the morning, the symptoms were similar as before - huge I/O. Can't tell what exactly happened since I wasn't able to login even on the console. But I have some info:
>> >> - applications were able to write to HDD so it wasn't deadlocked as before
>> >> - here is how it looked on graphs: http://watchdog.sk/lkml/graphs.jpg
>> >> - server wasn't responding from 6:36, it was down between 6:54 and 7:02 (i had to hard reboot it), I was awoken at 6:36 by really creepy sound from my phone ;)
>> >> - my 'load check' script successfully killed apache at 6:41 but it didn't help as you can see
>> >> - i have one screen with info from atop from time 6:44, looks like i/o was done by init (??!): http://watchdog.sk/lkml/atop.jpg (ignore swap warning, i have no swap)
>> >> - also other type of logs are available
>> >> - nothing like this happened before
>> >
>> >That IO from init looks really screwy, I have no idea what's going on
>> >on that machine, but it looks like there is more than just a memcg
>> >problem... Any chance your third-party security patches are concealing
>> >kernel daemon activity behind the init process and the IO is actually
>> >coming from a kernel thread like the flushers or kswapd?
>>
>>
>>
>>
>> I really cannot tell but I never ever saw this before and i'm using all of my patches for several years. Here are all patches which i'm using right now (+ your patch):
>> http://watchdog.sk/lkml/patches3
>>
>>
>>
>>
>> >Are there OOM kill messages in the syslog?
>>
>>
>>
>> Here is full kernel log between 6:00 and 7:59:
>> http://watchdog.sk/lkml/kern6.log
>
>Wow, your apaches are like the hydra. Whenever one is OOM killed,
>more show up!



Yeah, it's supposed to do this ;)



>> >> What do you think? I'm now running kernel with your previous patch, not with the newest one.
>> >
>> >Which one exactly? Can you attach the diff?
>>
>>
>>
>> I meant, the problem above occurred on the kernel with your latest patch:
>> http://watchdog.sk/lkml/7-2-memcg-fix.patch
>
>The above log has the following callstack:
>
>Sep 10 07:59:43 server01 kernel: [ 3846.337628] [<ffffffff810d19fe>] dump_header+0x7e/0x1e0
>Sep 10 07:59:43 server01 kernel: [ 3846.337707] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
>Sep 10 07:59:43 server01 kernel: [ 3846.337790] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
>Sep 10 07:59:43 server01 kernel: [ 3846.337874] [<ffffffff81094bb0>] ? __css_put+0x50/0x90
>Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
>Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
>Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
>Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
>Sep 10 07:59:43 server01 kernel: [ 3846.338283] [<ffffffff81111035>] mem_cgroup_newpage_charge+0x45/0x50
>Sep 10 07:59:43 server01 kernel: [ 3846.338364] [<ffffffff810f3039>] handle_pte_fault+0x609/0x940
>Sep 10 07:59:43 server01 kernel: [ 3846.338451] [<ffffffff8102ab1f>] ? pte_alloc_one+0x3f/0x50
>Sep 10 07:59:43 server01 kernel: [ 3846.338532] [<ffffffff8107e455>] ? sched_clock_local+0x25/0x90
>Sep 10 07:59:43 server01 kernel: [ 3846.338617] [<ffffffff810f34d7>] handle_mm_fault+0x167/0x340
>Sep 10 07:59:43 server01 kernel: [ 3846.338699] [<ffffffff8102714b>] do_page_fault+0x13b/0x490
>Sep 10 07:59:43 server01 kernel: [ 3846.338781] [<ffffffff810f8848>] ? do_brk+0x208/0x3a0
>Sep 10 07:59:43 server01 kernel: [ 3846.338865] [<ffffffff812dba22>] ? gr_learn_resource+0x42/0x1e0
>Sep 10 07:59:43 server01 kernel: [ 3846.338951] [<ffffffff815cb7bf>] page_fault+0x1f/0x30
>
>The charge code seems to be directly invoking the OOM killer, which is
>not possible with 7-2-memcg-fix. Are you sure this is the right patch
>for this log? This _looks_ more like what 7-1-memcg-fix was doing,
>with a direct kill in the charge context and a fixup later on.
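
The reasoning in the quoted paragraph can be checked mechanically: in a kernel backtrace the innermost frame comes first, so an OOM kill issued from the charge context shows mem_cgroup_out_of_memory above one of the charge functions. A rough sketch, assuming the log is fed on stdin; the function name oom_kill_in_charge_path and its output strings are made up for illustration, and the set of charge functions it looks for is only the ones that appear in the traces in this thread:

```shell
#!/bin/sh
# Hypothetical helper (not part of any patch in this thread): reports
# whether a kernel log on stdin contains a backtrace in which
# mem_cgroup_out_of_memory is called from a memcg charge function.
oom_kill_in_charge_path() {
    awk '
        # Innermost frames are printed first, so the OOM frame precedes
        # its callers within one backtrace.
        /mem_cgroup_out_of_memory[+]/ { in_oom = 1 }
        in_oom && /mem_cgroup_charge_common[+]|mem_cgroup_cache_charge[+]/ { hit = 1 }
        # The page_fault entry frame ends the backtrace.
        / page_fault[+]/ { in_oom = 0 }
        END { print (hit ? "direct-kill-in-charge-path" : "not-seen") }
    '
}

# Excerpt of the backtrace quoted above.
oom_kill_in_charge_path <<'EOF'
Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
Sep 10 07:59:43 server01 kernel: [ 3846.338951] [<ffffffff815cb7bf>] page_fault+0x1f/0x30
EOF
```

For this excerpt the helper prints direct-kill-in-charge-path, which is the pattern described above as not possible with 7-2-memcg-fix.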




Luckily, I still have the kernel source from which that kernel was built. I tried to re-apply the 7-2-memcg-fix.patch:

# patch -p1 --dry-run < 7-2-memcg-fix.patch
patching file arch/x86/mm/fault.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
4 out of 4 hunks ignored -- saving rejects to file arch/x86/mm/fault.c.rej
patching file include/linux/memcontrol.h
Hunk #1 succeeded at 141 with fuzz 2 (offset 21 lines).
Hunk #2 succeeded at 391 with fuzz 1 (offset 39 lines).
patching file include/linux/mm.h
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file include/linux/mm.h.rej
patching file include/linux/sched.h
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file include/linux/sched.h.rej
patching file mm/memcontrol.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
10 out of 10 hunks ignored -- saving rejects to file mm/memcontrol.c.rej
patching file mm/memory.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file mm/memory.c.rej
patching file mm/oom_kill.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file mm/oom_kill.c.rej


Can you tell from this if the source has the right patch?
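
One way to read a dry-run like the one above is per file: "Reversed (or previously applied)" means patch thinks the change is already in that file, while a hunk that "succeeded" could still be applied forward, which may mean the file does not contain the change in the expected form. A small sketch, assuming the dry-run output is piped in on stdin; the function name summarise_dry_run and the labels it prints are invented here, and the result is only a hint, not proof of what the tree contains:

```shell
#!/bin/sh
# Hypothetical helper: condenses `patch --dry-run` output (stdin) into
# one line per finding, tagged with the file it belongs to.
summarise_dry_run() {
    awk '
        /^patching file /                      { file = $3 }
        /^Reversed [(]or previously applied[)]/ { print "already-applied " file }
        /^Hunk #[0-9]+ succeeded/              { print "applies-forward " file " hunk " $2 }
    '
}

# Excerpt of the dry-run output shown above.
summarise_dry_run <<'EOF'
patching file arch/x86/mm/fault.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
4 out of 4 hunks ignored -- saving rejects to file arch/x86/mm/fault.c.rej
patching file include/linux/memcontrol.h
Hunk #1 succeeded at 141 with fuzz 2 (offset 21 lines).
Hunk #2 succeeded at 391 with fuzz 1 (offset 39 lines).
patching file include/linux/mm.h
Reversed (or previously applied) patch detected! Assume -R? [n]
EOF
```

On this excerpt, include/linux/memcontrol.h is the only file reported as applies-forward, so it is the first file worth comparing by hand against the patch.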



>It's somewhat eerie that you have to manually apply these patches
>because of grsec because I have no idea of knowing what the end result
>is, especially since you had compile errors in this area before. Is
>grsec making changes to memcg code or why are these patches not
>applying cleanly?




The problem was in mm/memory.c (first hunk) because grsec added this:

        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;

+#ifdef CONFIG_PAX_SEGMEXEC
+       struct vm_area_struct *vma_m;
+#endif

        if (unlikely(is_vm_hugetlb_page(vma)))



I'm not using PaX anyway, so that code shouldn't be used. This was the only rejection, but there was a lot of fuzz too. I didn't consider that a problem; should I?




>> but after i had to reboot the server i booted the kernel with your previous patch:
>> http://watchdog.sk/lkml/7-1-memcg-fix.patch
>
>This one still has the known memcg leak.



I know, but it's the best I have that doesn't take down the server (yet).


azur
Johannes Weiner
2013-09-10 21:18:24 UTC
On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
> >> >On Tue, Sep 10, 2013 at 08:13:59PM +0200, azurIt wrote:
> >> >> >On Mon, Sep 09, 2013 at 09:59:17PM +0200, azurIt wrote:
> >> >> >> >On Mon, Sep 09, 2013 at 03:10:10PM +0200, azurIt wrote:
> >> >> >> >> >Hi azur,
> >> >> >> >> >
> >> >> >> >> >On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
> >> >> >> >> >> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
> >> >> >> >> >> >Hello azur,
> >> >> >> >> >> >
> >> >> >> >> >> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
> >> >> >> >> >> >> >>Hi azur,
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>here is the x86-only rollup of the series for 3.2.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>Thanks!
> >> >> >> >> >> >> >>Johannes
> >> >> >> >> >> >> >>---
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >Johannes,
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
> >> >> >> >> >> >
> >> >> >> >> >> >Did the OOM killer go off in this group?
> >> >> >> >> >> >
> >> >> >> >> >> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
> >> >> >> >> >> >context")?
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
> >> >> >> >> >> http://watchdog.sk/lkml/oom_syslog.gz
> >> >> >> >> >There is an unfinished OOM invocation here:
> >> >> >> >> >
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
> >> >> >> >> >
> >> >> >> >> >__getblk() has this weird loop where it tries to instantiate the page,
> >> >> >> >> >frees memory on failure, then retries. If the memcg goes OOM, the OOM
> >> >> >> >> >path might be entered multiple times and each time leak the memcg
> >> >> >> >> >reference of the respective previous OOM invocation.
> >> >> >> >> >
> >> >> >> >> >There are a few more find_or_create() sites that do not propagate an
> >> >> >> >> >error and it's incredibly hard to find out whether they are even taken
> >> >> >> >> >during a page fault. It's not practical to annotate them all with
> >> >> >> >> >memcg OOM toggles, so let's just catch all OOM contexts at the end of
> >> >> >> >> >handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
> >> >> >> >> >this like an error.
> >> >> >> >> >
> >> >> >> >> >azur, here is a patch on top of your modified 3.2. Note that Michal
> >> >> >> >> >might be onto something and we are looking at multiple issues here,
> >> >> >> >> >but the log excert above suggests this fix is required either way.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Johannes, is this still up to date? Thank you.
> >> >> >> >
> >> >> >> >No, please use the following on top of 3.2 (i.e. full replacement, not
> >> >> >> >incremental to what you have):
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> Unfortunately it didn't compile:
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> LD vmlinux.o
> >> >> >> MODPOST vmlinux.o
> >> >> >> WARNING: modpost: Found 4924 section mismatch(es).
> >> >> >> To see full details build your kernel with:
> >> >> >> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
> >> >> >> GEN .version
> >> >> >> CHK include/generated/compile.h
> >> >> >> UPD include/generated/compile.h
> >> >> >> CC init/version.o
> >> >> >> LD init/built-in.o
> >> >> >> LD .tmp_vmlinux1
> >> >> >> arch/x86/built-in.o: In function `do_page_fault':
> >> >> >> (.text+0x26a77): undefined reference to `handle_mm_fault'
> >> >> >> mm/built-in.o: In function `fixup_user_fault':
> >> >> >> (.text+0x224d3): undefined reference to `handle_mm_fault'
> >> >> >> mm/built-in.o: In function `__get_user_pages':
> >> >> >> (.text+0x24a0f): undefined reference to `handle_mm_fault'
> >> >> >> make: *** [.tmp_vmlinux1] Error 1
> >> >> >
> >> >> >Oops, sorry about that. Must be configuration dependent because it
> >> >> >works for me (and handle_mm_fault is obviously defined).
> >> >> >
> >> >> >Do you have warnings earlier in the compilation? You can use make -s
> >> >> >to filter out everything but warnings.
> >> >> >
> >> >> >Or send me your configuration so I can try to reproduce it here.
> >> >> >
> >> >> >Thanks!
> >> >>
> >> >>
> >> >> Johannes,
> >> >>
> >> >> the server went down early in the morning, the symptoms were similar as before - huge I/O. Can't tell what exactly happened since I wasn't able to login even on the console. But I have some info:
> >> >> - applications were able to write to HDD so it wasn't deadlocked as before
> >> >> - here is how it looked on graphs: http://watchdog.sk/lkml/graphs.jpg
> >> >> - server wasn't responding from 6:36, it was down between 6:54 and 7:02 (i had to hard reboot it), I was awoken at 6:36 by really creepy sound from my phone ;)
> >> >> - my 'load check' script successfully killed apache at 6:41 but it didn't help as you can see
> >> >> - i have one screen with info from atop from time 6:44, looks like i/o was done by init (??!): http://watchdog.sk/lkml/atop.jpg (ignore swap warning, i have no swap)
> >> >> - also other type of logs are available
> >> >> - nothing like this happened before
> >> >
> >> >That IO from init looks really screwy, I have no idea what's going on
> >> >on that machine, but it looks like there is more than just a memcg
> >> >problem... Any chance your thirdparty security patches are concealing
> >> >kernel daemon activity behind the init process and the IO is actually
> >> >coming from a kernel thread like the flushers or kswapd?
> >>
> >>
> >>
> >>
> >> I really cannot tell but I never ever saw this before and i'm using all of my patches for several years. Here are all patches which i'm using right now (+ your patch):
> >> http://watchdog.sk/lkml/patches3
> >>
> >>
> >>
> >>
> >> >Are there OOM kill messages in the syslog?
> >>
> >>
> >>
> >> Here is full kernel log between 6:00 and 7:59:
> >> http://watchdog.sk/lkml/kern6.log
> >
> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
> >more show up!
>
>
>
> Yeah, it's supposed to do this ;)
>
>
>
> >> >> What do you think? I'm now running kernel with your previous patch, not with the newest one.
> >> >
> >> >Which one exactly? Can you attach the diff?
> >>
> >>
> >>
> >> I meant, the problem above occured on kernel with your latest patch:
> >> http://watchdog.sk/lkml/7-2-memcg-fix.patch
> >
> >The above log has the following callstack:
> >
> >Sep 10 07:59:43 server01 kernel: [ 3846.337628] [<ffffffff810d19fe>] dump_header+0x7e/0x1e0
> >Sep 10 07:59:43 server01 kernel: [ 3846.337707] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
> >Sep 10 07:59:43 server01 kernel: [ 3846.337790] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
> >Sep 10 07:59:43 server01 kernel: [ 3846.337874] [<ffffffff81094bb0>] ? __css_put+0x50/0x90
> >Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
> >Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
> >Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
> >Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
> >Sep 10 07:59:43 server01 kernel: [ 3846.338283] [<ffffffff81111035>] mem_cgroup_newpage_charge+0x45/0x50
> >Sep 10 07:59:43 server01 kernel: [ 3846.338364] [<ffffffff810f3039>] handle_pte_fault+0x609/0x940
> >Sep 10 07:59:43 server01 kernel: [ 3846.338451] [<ffffffff8102ab1f>] ? pte_alloc_one+0x3f/0x50
> >Sep 10 07:59:43 server01 kernel: [ 3846.338532] [<ffffffff8107e455>] ? sched_clock_local+0x25/0x90
> >Sep 10 07:59:43 server01 kernel: [ 3846.338617] [<ffffffff810f34d7>] handle_mm_fault+0x167/0x340
> >Sep 10 07:59:43 server01 kernel: [ 3846.338699] [<ffffffff8102714b>] do_page_fault+0x13b/0x490
> >Sep 10 07:59:43 server01 kernel: [ 3846.338781] [<ffffffff810f8848>] ? do_brk+0x208/0x3a0
> >Sep 10 07:59:43 server01 kernel: [ 3846.338865] [<ffffffff812dba22>] ? gr_learn_resource+0x42/0x1e0
> >Sep 10 07:59:43 server01 kernel: [ 3846.338951] [<ffffffff815cb7bf>] page_fault+0x1f/0x30
> >
> >The charge code seems to be directly invoking the OOM killer, which is
> >not possible with 7-2-memcg-fix. Are you sure this is the right patch
> >for this log? This _looks_ more like what 7-1-memcg-fix was doing,
> >with a direct kill in the charge context and a fixup later on.
>
>
>
>
> Luckily, I still have the kernel source from which that kernel was built. I tried to re-apply the 7-2-memcg-fix.patch:
>
> # patch -p1 --dry-run < 7-2-memcg-fix.patch
> patching file arch/x86/mm/fault.c
> Reversed (or previously applied) patch detected! Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 4 out of 4 hunks ignored -- saving rejects to file arch/x86/mm/fault.c.rej
> patching file include/linux/memcontrol.h
> Hunk #1 succeeded at 141 with fuzz 2 (offset 21 lines).
> Hunk #2 succeeded at 391 with fuzz 1 (offset 39 lines).

Uhm, some of it applied... I have absolutely no idea what state that
tree is in now...

> patching file include/linux/mm.h
> Reversed (or previously applied) patch detected! Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 1 out of 1 hunk ignored -- saving rejects to file include/linux/mm.h.rej
> patching file include/linux/sched.h
> Reversed (or previously applied) patch detected! Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 1 out of 1 hunk ignored -- saving rejects to file include/linux/sched.h.rej
> patching file mm/memcontrol.c
> Reversed (or previously applied) patch detected! Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 10 out of 10 hunks ignored -- saving rejects to file mm/memcontrol.c.rej
> patching file mm/memory.c
> Reversed (or previously applied) patch detected! Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 2 out of 2 hunks ignored -- saving rejects to file mm/memory.c.rej
> patching file mm/oom_kill.c
> Reversed (or previously applied) patch detected! Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 1 out of 1 hunk ignored -- saving rejects to file mm/oom_kill.c.rej
>
>
> Can you tell from this if the source has the right patch?

Not reliably, I don't think. Can you send me

include/linux/memcontrol.h
mm/memcontrol.c
mm/memory.c
mm/oom_kill.c

from those sources?

It might be easier to start applying the patches from scratch... Keep
in mind that 7-2 was not an incremental fix: you need to remove the
previous memcg patches first (unlike 7-1).

> >It's somewhat eerie that you have to manually apply these patches
> >because of grsec because I have no idea of knowing what the end result
> >is, especially since you had compile errors in this area before. Is
> >grsec making changes to memcg code or why are these patches not
> >applying cleanly?
>
>
>
>
> The problem was in mm/memory.c (first hunk) because grsec added this:
>
> pgd_t *pgd;
> pud_t *pud;
> pmd_t *pmd;
> pte_t *pte;
>
> +#ifdef CONFIG_PAX_SEGMEXEC
> + struct vm_area_struct *vma_m;
> +#endif
>
> if (unlikely(is_vm_hugetlb_page(vma)))
>
>
>
> I'm not using PaX anyway, so that code shouldn't be used. This was the only rejection, but there was a lot of fuzz too. I didn't consider that a problem; should I?

It COULD be... Can you send me the files listed above after
application?

> >> but after i had to reboot the server i booted the kernel with your previous patch:
> >> http://watchdog.sk/lkml/7-1-memcg-fix.patch
> >
> >This one still has the known memcg leak.
>
>
>
> I know, but it's the best I have that doesn't take down the server (yet).

Ok. I wouldn't expect it to crash under regular load but it will
probably create hangs again when you try to remove memcgs.

azurIt
2013-09-10 21:32:47 UTC
>On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
>> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
>> >> >On Tue, Sep 10, 2013 at 08:13:59PM +0200, azurIt wrote:
>> >> >> >On Mon, Sep 09, 2013 at 09:59:17PM +0200, azurIt wrote:
>> >> >> >> >On Mon, Sep 09, 2013 at 03:10:10PM +0200, azurIt wrote:
>> >> >> >> >> >Hi azur,
>> >> >> >> >> >
>> >> >> >> >> >On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote:
>> >> >> >> >> >> > CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >> >> >> >> >> >Hello azur,
>> >> >> >> >> >> >
>> >> >> >> >> >> >On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >> >> >> >> >> >> >>Hi azur,
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>here is the x86-only rollup of the series for 3.2.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>Thanks!
>> >> >> >> >> >> >> >>Johannes
>> >> >> >> >> >> >> >>---
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >Johannes,
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
>> >> >> >> >> >> >
>> >> >> >> >> >> >Did the OOM killer go off in this group?
>> >> >> >> >> >> >
>> >> >> >> >> >> >Was there a warning in the syslog ("Fixing unhandled memcg OOM
>> >> >> >> >> >> >context")?
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> Ok, i see this message several times in my syslog logs, one of them is also for this unremovable cgroup (but maybe all of them cannot be removed, should i try?). Example of the log is here (don't know where exactly it starts and ends so here is the full kernel log):
>> >> >> >> >> >> http://watchdog.sk/lkml/oom_syslog.gz
>> >> >> >> >> >There is an unfinished OOM invocation here:
>> >> >> >> >> >
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from:
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715191] [<ffffffff811105c2>] T.1154+0x622/0x8f0
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715274] [<ffffffff8111153e>] mem_cgroup_cache_charge+0xbe/0xe0
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715357] [<ffffffff810cf31c>] add_to_page_cache_locked+0x4c/0x140
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715443] [<ffffffff810cf432>] add_to_page_cache_lru+0x22/0x50
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715526] [<ffffffff810cfdd3>] find_or_create_page+0x73/0xb0
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715608] [<ffffffff811493ba>] __getblk+0xea/0x2c0
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715692] [<ffffffff8114ca73>] __bread+0x13/0xc0
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715774] [<ffffffff81196968>] ext3_get_branch+0x98/0x140
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715859] [<ffffffff81197557>] ext3_get_blocks_handle+0xd7/0xdc0
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.715942] [<ffffffff81198304>] ext3_get_block+0xc4/0x120
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716023] [<ffffffff81155c3a>] do_mpage_readpage+0x38a/0x690
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716107] [<ffffffff81155f8f>] mpage_readpage+0x4f/0x70
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716188] [<ffffffff811973a8>] ext3_readpage+0x28/0x60
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716268] [<ffffffff810cfa48>] filemap_fault+0x308/0x560
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716350] [<ffffffff810ef898>] __do_fault+0x78/0x5a0
>> >> >> >> >> > Aug 22 13:15:21 server01 kernel: [1251422.716433] [<ffffffff810f2ab4>] handle_pte_fault+0x84/0x940
>> >> >> >> >> >
>> >> >> >> >> >__getblk() has this weird loop where it tries to instantiate the page,
>> >> >> >> >> >frees memory on failure, then retries. If the memcg goes OOM, the OOM
>> >> >> >> >> >path might be entered multiple times and each time leak the memcg
>> >> >> >> >> >reference of the respective previous OOM invocation.
>> >> >> >> >> >
>> >> >> >> >> >There are a few more find_or_create() sites that do not propagate an
>> >> >> >> >> >error and it's incredibly hard to find out whether they are even taken
>> >> >> >> >> >during a page fault. It's not practical to annotate them all with
>> >> >> >> >> >memcg OOM toggles, so let's just catch all OOM contexts at the end of
>> >> >> >> >> >handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating
>> >> >> >> >> >this like an error.
>> >> >> >> >> >
>> >> >> >> >> >azur, here is a patch on top of your modified 3.2. Note that Michal
>> >> >> >> >> >might be onto something and we are looking at multiple issues here,
>> >> >> >> >> >but the log excert above suggests this fix is required either way.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Johannes, is this still up to date? Thank you.
>> >> >> >> >
>> >> >> >> >No, please use the following on top of 3.2 (i.e. full replacement, not
>> >> >> >> >incremental to what you have):
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Unfortunately it didn't compile:
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> LD vmlinux.o
>> >> >> >> MODPOST vmlinux.o
>> >> >> >> WARNING: modpost: Found 4924 section mismatch(es).
>> >> >> >> To see full details build your kernel with:
>> >> >> >> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
>> >> >> >> GEN .version
>> >> >> >> CHK include/generated/compile.h
>> >> >> >> UPD include/generated/compile.h
>> >> >> >> CC init/version.o
>> >> >> >> LD init/built-in.o
>> >> >> >> LD .tmp_vmlinux1
>> >> >> >> arch/x86/built-in.o: In function `do_page_fault':
>> >> >> >> (.text+0x26a77): undefined reference to `handle_mm_fault'
>> >> >> >> mm/built-in.o: In function `fixup_user_fault':
>> >> >> >> (.text+0x224d3): undefined reference to `handle_mm_fault'
>> >> >> >> mm/built-in.o: In function `__get_user_pages':
>> >> >> >> (.text+0x24a0f): undefined reference to `handle_mm_fault'
>> >> >> >> make: *** [.tmp_vmlinux1] Error 1
>> >> >> >
>> >> >> >Oops, sorry about that. Must be configuration dependent because it
>> >> >> >works for me (and handle_mm_fault is obviously defined).
>> >> >> >
>> >> >> >Do you have warnings earlier in the compilation? You can use make -s
>> >> >> >to filter out everything but warnings.
>> >> >> >
>> >> >> >Or send me your configuration so I can try to reproduce it here.
>> >> >> >
>> >> >> >Thanks!
>> >> >>
>> >> >>
>> >> >> Johannes,
>> >> >>
>> >> >> the server went down early in the morning, the symptoms were similar as before - huge I/O. Can't tell what exactly happened since I wasn't able to login even on the console. But I have some info:
>> >> >> - applications were able to write to HDD so it wasn't deadlocked as before
>> >> >> - here is how it looked on graphs: http://watchdog.sk/lkml/graphs.jpg
>> >> >> - server wasn't responding from 6:36, it was down between 6:54 and 7:02 (i had to hard reboot it), I was awoken at 6:36 by really creepy sound from my phone ;)
>> >> >> - my 'load check' script successfully killed apache at 6:41 but it didn't help as you can see
>> >> >> - i have one screen with info from atop from time 6:44, looks like i/o was done by init (??!): http://watchdog.sk/lkml/atop.jpg (ignore swap warning, i have no swap)
>> >> >> - also other type of logs are available
>> >> >> - nothing like this happened before
>> >> >
>> >> >That IO from init looks really screwy, I have no idea what's going on
>> >> >on that machine, but it looks like there is more than just a memcg
>> >> >problem... Any chance your thirdparty security patches are concealing
>> >> >kernel daemon activity behind the init process and the IO is actually
>> >> >coming from a kernel thread like the flushers or kswapd?
>> >>
>> >>
>> >>
>> >>
>> >> I really cannot tell but I never ever saw this before and i'm using all of my patches for several years. Here are all patches which i'm using right now (+ your patch):
>> >> http://watchdog.sk/lkml/patches3
>> >>
>> >>
>> >>
>> >>
>> >> >Are there OOM kill messages in the syslog?
>> >>
>> >>
>> >>
>> >> Here is full kernel log between 6:00 and 7:59:
>> >> http://watchdog.sk/lkml/kern6.log
>> >
>> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
>> >more show up!
>>
>>
>>
>> Yeah, it's supposed to do this ;)
>>
>>
>>
>> >> >> What do you think? I'm now running kernel with your previous patch, not with the newest one.
>> >> >
>> >> >Which one exactly? Can you attach the diff?
>> >>
>> >>
>> >>
>> >> I meant, the problem above occured on kernel with your latest patch:
>> >> http://watchdog.sk/lkml/7-2-memcg-fix.patch
>> >
>> >The above log has the following callstack:
>> >
>> >Sep 10 07:59:43 server01 kernel: [ 3846.337628] [<ffffffff810d19fe>] dump_header+0x7e/0x1e0
>> >Sep 10 07:59:43 server01 kernel: [ 3846.337707] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
>> >Sep 10 07:59:43 server01 kernel: [ 3846.337790] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
>> >Sep 10 07:59:43 server01 kernel: [ 3846.337874] [<ffffffff81094bb0>] ? __css_put+0x50/0x90
>> >Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338283] [<ffffffff81111035>] mem_cgroup_newpage_charge+0x45/0x50
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338364] [<ffffffff810f3039>] handle_pte_fault+0x609/0x940
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338451] [<ffffffff8102ab1f>] ? pte_alloc_one+0x3f/0x50
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338532] [<ffffffff8107e455>] ? sched_clock_local+0x25/0x90
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338617] [<ffffffff810f34d7>] handle_mm_fault+0x167/0x340
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338699] [<ffffffff8102714b>] do_page_fault+0x13b/0x490
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338781] [<ffffffff810f8848>] ? do_brk+0x208/0x3a0
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338865] [<ffffffff812dba22>] ? gr_learn_resource+0x42/0x1e0
>> >Sep 10 07:59:43 server01 kernel: [ 3846.338951] [<ffffffff815cb7bf>] page_fault+0x1f/0x30
>> >
>> >The charge code seems to be directly invoking the OOM killer, which is
>> >not possible with 7-2-memcg-fix. Are you sure this is the right patch
>> >for this log? This _looks_ more like what 7-1-memcg-fix was doing,
>> >with a direct kill in the charge context and a fixup later on.
>>
>>
>>
>>
>> Luckily, I still have the kernel source from which that kernel was built. I tried to re-apply the 7-2-memcg-fix.patch:
>>
>> # patch -p1 --dry-run < 7-2-memcg-fix.patch
>> patching file arch/x86/mm/fault.c
>> Reversed (or previously applied) patch detected! Assume -R? [n]
>> Apply anyway? [n]
>> Skipping patch.
>> 4 out of 4 hunks ignored -- saving rejects to file arch/x86/mm/fault.c.rej
>> patching file include/linux/memcontrol.h
>> Hunk #1 succeeded at 141 with fuzz 2 (offset 21 lines).
>> Hunk #2 succeeded at 391 with fuzz 1 (offset 39 lines).
>
>Uhm, some of it applied... I have absolutely no idea what state that
>tree is in now...




I used '--dry-run' so it should be ok :)




>> patching file include/linux/mm.h
>> Reversed (or previously applied) patch detected! Assume -R? [n]
>> Apply anyway? [n]
>> Skipping patch.
>> 1 out of 1 hunk ignored -- saving rejects to file include/linux/mm.h.rej
>> patching file include/linux/sched.h
>> Reversed (or previously applied) patch detected! Assume -R? [n]
>> Apply anyway? [n]
>> Skipping patch.
>> 1 out of 1 hunk ignored -- saving rejects to file include/linux/sched.h.rej
>> patching file mm/memcontrol.c
>> Reversed (or previously applied) patch detected! Assume -R? [n]
>> Apply anyway? [n]
>> Skipping patch.
>> 10 out of 10 hunks ignored -- saving rejects to file mm/memcontrol.c.rej
>> patching file mm/memory.c
>> Reversed (or previously applied) patch detected! Assume -R? [n]
>> Apply anyway? [n]
>> Skipping patch.
>> 2 out of 2 hunks ignored -- saving rejects to file mm/memory.c.rej
>> patching file mm/oom_kill.c
>> Reversed (or previously applied) patch detected! Assume -R? [n]
>> Apply anyway? [n]
>> Skipping patch.
>> 1 out of 1 hunk ignored -- saving rejects to file mm/oom_kill.c.rej
>>
>>
>> Can you tell from this if the source has the right patch?
>
>Not reliably, I don't think. Can you send me
>
> include/linux/memcontrol.h
> mm/memcontrol.c
> mm/memory.c
> mm/oom_kill.c
>
>from those sources?
>
>It might be easier to start applying the patches from scratch... Keep
>in mind that 7-2 was not an incremental fix: you need to remove the
>previous memcg patches first (unlike 7-1).



Yes, I used only 7-2 from your patches. Here are the files:
http://watchdog.sk/lkml/kernel

orig - the kernel source that was used to build the kernel I was talking about earlier
new - a freshly unpacked 3.2.50, patched with all of 'my' patches


Here is how your patch was applied:

# patch -p1 < 7-2-memcg-fix.patch
patching file arch/x86/mm/fault.c
Hunk #1 succeeded at 944 (offset 102 lines).
Hunk #2 succeeded at 970 (offset 102 lines).
Hunk #3 succeeded at 1273 with fuzz 1 (offset 212 lines).
Hunk #4 succeeded at 1382 (offset 223 lines).
patching file include/linux/memcontrol.h
Hunk #1 succeeded at 122 with fuzz 2 (offset 2 lines).
Hunk #2 succeeded at 354 (offset 2 lines).
patching file include/linux/mm.h
Hunk #1 succeeded at 163 (offset 7 lines).
patching file include/linux/sched.h
Hunk #1 succeeded at 1644 (offset 76 lines).
patching file mm/memcontrol.c
Hunk #1 succeeded at 1752 (offset 9 lines).
Hunk #2 succeeded at 1777 (offset 9 lines).
Hunk #3 succeeded at 1828 (offset 9 lines).
Hunk #4 succeeded at 1867 (offset 9 lines).
Hunk #5 succeeded at 2256 (offset 9 lines).
Hunk #6 succeeded at 2317 (offset 9 lines).
Hunk #7 succeeded at 2348 (offset 9 lines).
Hunk #8 succeeded at 2411 (offset 9 lines).
Hunk #9 succeeded at 2419 (offset 9 lines).
Hunk #10 succeeded at 2432 (offset 9 lines).
patching file mm/memory.c
Hunk #1 succeeded at 3712 (offset 273 lines).
Hunk #2 succeeded at 3812 (offset 317 lines).
patching file mm/oom_kill.c



>> >It's somewhat eerie that you have to manually apply these patches
>> >because of grsec because I have no way of knowing what the end result
>> >is, especially since you had compile errors in this area before. Is
>> >grsec making changes to memcg code or why are these patches not
>> >applying cleanly?
>>
>>
>>
>>
>> The problem was in mm/memory.c (first hunk) because grsec added this:
>>
>> pgd_t *pgd;
>> pud_t *pud;
>> pmd_t *pmd;
>> pte_t *pte;
>>
>> +#ifdef CONFIG_PAX_SEGMEXEC
>> + struct vm_area_struct *vma_m;
>> +#endif
>>
>> if (unlikely(is_vm_hugetlb_page(vma)))
>>
>>
>>
>> I'm not using PAX anyway so it shouldn't be used. This was the only rejection but there was a lot of fuzz too - I wasn't considering it a problem, should I?
>
>It COULD be... Can you send me the files listed above after
>application?
>
>> >> but after i had to reboot the server i booted the kernel with your previous patch:
>> >> http://watchdog.sk/lkml/7-1-memcg-fix.patch
>> >
>> >This one still has the known memcg leak.
>>
>>
>>
>> I know but it's the best I have which doesn't take down the server (yet).
>
>Ok. I wouldn't expect it to crash under regular load but it will
>probably create hangs again when you try to remove memcgs.
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2013-09-10 22:03:29 UTC
On Tue, Sep 10, 2013 at 11:32:47PM +0200, azurIt wrote:
> >On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
> >> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
> >> >> Here is full kernel log between 6:00 and 7:59:
> >> >> http://watchdog.sk/lkml/kern6.log
> >> >
> >> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
> >> >more show up!
> >>
> >>
> >>
> >> Yeah, it's supposed to do this ;)

How are you expecting the machine to recover from an OOM situation,
though? I guess I don't really understand what these machines are
doing. But if you are overloading them like crazy, isn't that the
expected outcome?

> >> >> >> What do you think? I'm now running kernel with your previous patch, not with the newest one.
> >> >> >
> >> >> >Which one exactly? Can you attach the diff?
> >> >>
> >> >>
> >> >>
> >> >> I meant, the problem above occurred on the kernel with your latest patch:
> >> >> http://watchdog.sk/lkml/7-2-memcg-fix.patch
> >> >
> >> >The above log has the following callstack:
> >> >
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337628] [<ffffffff810d19fe>] dump_header+0x7e/0x1e0
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337707] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337790] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337874] [<ffffffff81094bb0>] ? __css_put+0x50/0x90
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338283] [<ffffffff81111035>] mem_cgroup_newpage_charge+0x45/0x50
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338364] [<ffffffff810f3039>] handle_pte_fault+0x609/0x940
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338451] [<ffffffff8102ab1f>] ? pte_alloc_one+0x3f/0x50
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338532] [<ffffffff8107e455>] ? sched_clock_local+0x25/0x90
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338617] [<ffffffff810f34d7>] handle_mm_fault+0x167/0x340
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338699] [<ffffffff8102714b>] do_page_fault+0x13b/0x490
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338781] [<ffffffff810f8848>] ? do_brk+0x208/0x3a0
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338865] [<ffffffff812dba22>] ? gr_learn_resource+0x42/0x1e0
> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338951] [<ffffffff815cb7bf>] page_fault+0x1f/0x30
> >> >
> >> >The charge code seems to be directly invoking the OOM killer, which is
> >> >not possible with 7-2-memcg-fix. Are you sure this is the right patch
> >> >for this log? This _looks_ more like what 7-1-memcg-fix was doing,
> >> >with a direct kill in the charge context and a fixup later on.
> >>
> >> Luckily, I still have the kernel source from which that kernel was built. I tried to re-apply the 7-2-memcg-fix.patch:
> >>
> >> # patch -p1 --dry-run < 7-2-memcg-fix.patch
> >> patching file arch/x86/mm/fault.c
> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> Apply anyway? [n]
> >> Skipping patch.
> >> 4 out of 4 hunks ignored -- saving rejects to file arch/x86/mm/fault.c.rej
> >> patching file include/linux/memcontrol.h
> >> Hunk #1 succeeded at 141 with fuzz 2 (offset 21 lines).
> >> Hunk #2 succeeded at 391 with fuzz 1 (offset 39 lines).
> >
> >Uhm, some of it applied... I have absolutely no idea what state that
> >tree is in now...
>
> I used '--dry-run' so it should be ok :)

Ah, right.

> >> patching file include/linux/mm.h
> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> Apply anyway? [n]
> >> Skipping patch.
> >> 1 out of 1 hunk ignored -- saving rejects to file include/linux/mm.h.rej
> >> patching file include/linux/sched.h
> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> Apply anyway? [n]
> >> Skipping patch.
> >> 1 out of 1 hunk ignored -- saving rejects to file include/linux/sched.h.rej
> >> patching file mm/memcontrol.c
> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> Apply anyway? [n]
> >> Skipping patch.
> >> 10 out of 10 hunks ignored -- saving rejects to file mm/memcontrol.c.rej
> >> patching file mm/memory.c
> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> Apply anyway? [n]
> >> Skipping patch.
> >> 2 out of 2 hunks ignored -- saving rejects to file mm/memory.c.rej
> >> patching file mm/oom_kill.c
> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> Apply anyway? [n]
> >> Skipping patch.
> >> 1 out of 1 hunk ignored -- saving rejects to file mm/oom_kill.c.rej
> >>
> >>
> >> Can you tell from this if the source has the right patch?
> >
> >Not reliably, I don't think. Can you send me
> >
> > include/linux/memcontrol.h
> > mm/memcontrol.c
> > mm/memory.c
> > mm/oom_kill.c
> >
> >from those sources?
> >
> >It might be easier to start the application from scratch... Keep in
> >mind that 7-2 was not an incremental fix, you need to remove the
> >previous memcg patches (as opposed to 7-1).
>
>
>
> Yes, I used only 7-2 from your patches. Here are the files:
> http://watchdog.sk/lkml/kernel
>
> orig - kernel source which was used to build the kernel I was talking about earlier
> new - newly unpacked and patched 3.2.50 with all of 'my' patches

Ok, thanks!

> Here is how your patch was applied:
>
> # patch -p1 < 7-2-memcg-fix.patch
> patching file arch/x86/mm/fault.c
> Hunk #1 succeeded at 944 (offset 102 lines).
> Hunk #2 succeeded at 970 (offset 102 lines).
> Hunk #3 succeeded at 1273 with fuzz 1 (offset 212 lines).
> Hunk #4 succeeded at 1382 (offset 223 lines).

Ah, I forgot about this one. Could you provide that file (fault.c) as
well please?

> patching file include/linux/memcontrol.h
> Hunk #1 succeeded at 122 with fuzz 2 (offset 2 lines).
> Hunk #2 succeeded at 354 (offset 2 lines).

Looks good, still.

> patching file include/linux/mm.h
> Hunk #1 succeeded at 163 (offset 7 lines).
> patching file include/linux/sched.h
> Hunk #1 succeeded at 1644 (offset 76 lines).
> patching file mm/memcontrol.c
> Hunk #1 succeeded at 1752 (offset 9 lines).
> Hunk #2 succeeded at 1777 (offset 9 lines).
> Hunk #3 succeeded at 1828 (offset 9 lines).
> Hunk #4 succeeded at 1867 (offset 9 lines).
> Hunk #5 succeeded at 2256 (offset 9 lines).
> Hunk #6 succeeded at 2317 (offset 9 lines).
> Hunk #7 succeeded at 2348 (offset 9 lines).
> Hunk #8 succeeded at 2411 (offset 9 lines).
> Hunk #9 succeeded at 2419 (offset 9 lines).
> Hunk #10 succeeded at 2432 (offset 9 lines).
> patching file mm/memory.c
> Hunk #1 succeeded at 3712 (offset 273 lines).
> Hunk #2 succeeded at 3812 (offset 317 lines).
> patching file mm/oom_kill.c

These look good as well.

That leaves the weird impossible stack trace. Did you double check
that this crash came from a kernel with those exact files?

I'm also confused about the freezer. You used to freeze cgroups that
were out of memory in the past, right? Are you no longer doing this?

Johannes Weiner
2013-09-11 18:03:27 UTC
On Wed, Sep 11, 2013 at 02:33:05PM +0200, azurIt wrote:
> >On Tue, Sep 10, 2013 at 11:32:47PM +0200, azurIt wrote:
> >> >On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
> >> >> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
> >> >> >> Here is full kernel log between 6:00 and 7:59:
> >> >> >> http://watchdog.sk/lkml/kern6.log
> >> >> >
> >> >> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
> >> >> >more show up!
> >> >>
> >> >>
> >> >>
> >> >> Yeah, it's supposed to do this ;)
> >
> >How are you expecting the machine to recover from an OOM situation,
> >though? I guess I don't really understand what these machines are
> >doing. But if you are overloading them like crazy, isn't that the
> >expected outcome?
>
>
>
>
>
> There's no global OOM; the server has enough memory. OOM is occurring only in cgroups (customers who simply don't want to pay for more memory).

Yes, sure, but when the cgroups are thrashing, they use the disk and
CPU to the point where the overall system is affected.

> >> >> >> >> What do you think? I'm now running kernel with your previous patch, not with the newest one.
> >> >> >> >
> >> >> >> >Which one exactly? Can you attach the diff?
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> I meant, the problem above occurred on the kernel with your latest patch:
> >> >> >> http://watchdog.sk/lkml/7-2-memcg-fix.patch
> >> >> >
> >> >> >The above log has the following callstack:
> >> >> >
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337628] [<ffffffff810d19fe>] dump_header+0x7e/0x1e0
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337707] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337790] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337874] [<ffffffff81094bb0>] ? __css_put+0x50/0x90
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338283] [<ffffffff81111035>] mem_cgroup_newpage_charge+0x45/0x50
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338364] [<ffffffff810f3039>] handle_pte_fault+0x609/0x940
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338451] [<ffffffff8102ab1f>] ? pte_alloc_one+0x3f/0x50
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338532] [<ffffffff8107e455>] ? sched_clock_local+0x25/0x90
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338617] [<ffffffff810f34d7>] handle_mm_fault+0x167/0x340
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338699] [<ffffffff8102714b>] do_page_fault+0x13b/0x490
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338781] [<ffffffff810f8848>] ? do_brk+0x208/0x3a0
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338865] [<ffffffff812dba22>] ? gr_learn_resource+0x42/0x1e0
> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338951] [<ffffffff815cb7bf>] page_fault+0x1f/0x30
> >> >> >
> >> >> >The charge code seems to be directly invoking the OOM killer, which is
> >> >> >not possible with 7-2-memcg-fix. Are you sure this is the right patch
> >> >> >for this log? This _looks_ more like what 7-1-memcg-fix was doing,
> >> >> >with a direct kill in the charge context and a fixup later on.
> >> >>
> >> >> Luckily, I still have the kernel source from which that kernel was built. I tried to re-apply the 7-2-memcg-fix.patch:
> >> >>
> >> >> # patch -p1 --dry-run < 7-2-memcg-fix.patch
> >> >> patching file arch/x86/mm/fault.c
> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> >> Apply anyway? [n]
> >> >> Skipping patch.
> >> >> 4 out of 4 hunks ignored -- saving rejects to file arch/x86/mm/fault.c.rej
> >> >> patching file include/linux/memcontrol.h
> >> >> Hunk #1 succeeded at 141 with fuzz 2 (offset 21 lines).
> >> >> Hunk #2 succeeded at 391 with fuzz 1 (offset 39 lines).
> >> >
> >> >Uhm, some of it applied... I have absolutely no idea what state that
> >> >tree is in now...
> >>
> >> I used '--dry-run' so it should be ok :)
> >
> >Ah, right.
> >
> >> >> patching file include/linux/mm.h
> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> >> Apply anyway? [n]
> >> >> Skipping patch.
> >> >> 1 out of 1 hunk ignored -- saving rejects to file include/linux/mm.h.rej
> >> >> patching file include/linux/sched.h
> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> >> Apply anyway? [n]
> >> >> Skipping patch.
> >> >> 1 out of 1 hunk ignored -- saving rejects to file include/linux/sched.h.rej
> >> >> patching file mm/memcontrol.c
> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> >> Apply anyway? [n]
> >> >> Skipping patch.
> >> >> 10 out of 10 hunks ignored -- saving rejects to file mm/memcontrol.c.rej
> >> >> patching file mm/memory.c
> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> >> Apply anyway? [n]
> >> >> Skipping patch.
> >> >> 2 out of 2 hunks ignored -- saving rejects to file mm/memory.c.rej
> >> >> patching file mm/oom_kill.c
> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
> >> >> Apply anyway? [n]
> >> >> Skipping patch.
> >> >> 1 out of 1 hunk ignored -- saving rejects to file mm/oom_kill.c.rej
> >> >>
> >> >>
> >> >> Can you tell from this if the source has the right patch?
> >> >
> >> >Not reliably, I don't think. Can you send me
> >> >
> >> > include/linux/memcontrol.h
> >> > mm/memcontrol.c
> >> > mm/memory.c
> >> > mm/oom_kill.c
> >> >
> >> >from those sources?
> >> >
> >> >It might be easier to start the application from scratch... Keep in
> >> >mind that 7-2 was not an incremental fix, you need to remove the
> >> >previous memcg patches (as opposed to 7-1).
> >>
> >>
> >>
> >> Yes, I used only 7-2 from your patches. Here are the files:
> >> http://watchdog.sk/lkml/kernel
> >>
> >> orig - kernel source which was used to build the kernel I was talking about earlier
> >> new - newly unpacked and patched 3.2.50 with all of 'my' patches
> >
> >Ok, thanks!
> >
> >> Here is how your patch was applied:
> >>
> >> # patch -p1 < 7-2-memcg-fix.patch
> >> patching file arch/x86/mm/fault.c
> >> Hunk #1 succeeded at 944 (offset 102 lines).
> >> Hunk #2 succeeded at 970 (offset 102 lines).
> >> Hunk #3 succeeded at 1273 with fuzz 1 (offset 212 lines).
> >> Hunk #4 succeeded at 1382 (offset 223 lines).
> >
> >Ah, I forgot about this one. Could you provide that file (fault.c) as
> >well please?
>
>
>
>
> I added it.

Thanks. This one looks good, too.

> >> patching file include/linux/memcontrol.h
> >> Hunk #1 succeeded at 122 with fuzz 2 (offset 2 lines).
> >> Hunk #2 succeeded at 354 (offset 2 lines).
> >
> >Looks good, still.
> >
> >> patching file include/linux/mm.h
> >> Hunk #1 succeeded at 163 (offset 7 lines).
> >> patching file include/linux/sched.h
> >> Hunk #1 succeeded at 1644 (offset 76 lines).
> >> patching file mm/memcontrol.c
> >> Hunk #1 succeeded at 1752 (offset 9 lines).
> >> Hunk #2 succeeded at 1777 (offset 9 lines).
> >> Hunk #3 succeeded at 1828 (offset 9 lines).
> >> Hunk #4 succeeded at 1867 (offset 9 lines).
> >> Hunk #5 succeeded at 2256 (offset 9 lines).
> >> Hunk #6 succeeded at 2317 (offset 9 lines).
> >> Hunk #7 succeeded at 2348 (offset 9 lines).
> >> Hunk #8 succeeded at 2411 (offset 9 lines).
> >> Hunk #9 succeeded at 2419 (offset 9 lines).
> >> Hunk #10 succeeded at 2432 (offset 9 lines).
> >> patching file mm/memory.c
> >> Hunk #1 succeeded at 3712 (offset 273 lines).
> >> Hunk #2 succeeded at 3812 (offset 317 lines).
> >> patching file mm/oom_kill.c
> >
> >These look good as well.
> >
> >That leaves the weird impossible stack trace. Did you double check
> >that this crash came from a kernel with those exact files?
>
>
>
> Yes, I'm sure.

Okay, my suspicion is that the previous patches invoked the OOM killer
right away, whereas in this latest version it's invoked only when the
fault is finished. Maybe the task that locked the group gets held up
somewhere else and then it takes too long until something is actually
killed. Meanwhile, every other allocator drops into 5 reclaim cycles
before giving up, which could explain the thrashing. And on the memcg
level we don't have BDI congestion sleeps like on the global level, so
everybody is backing off from the disk.

Here is an incremental fix to the latest version, i.e. the one that
livelocked under heavy IO, not the one you are using right now.

First, it reduces the reclaim retries from 5 to 2, which resembles the
global kswapd + ttfp somewhat. Next, NOFS/NORETRY allocators are not
allowed to kick off the OOM killer, like in the global case, so that
we don't kill things and give up just because light reclaim can't free
anything. Last, the memcg is marked under OOM when one task enters
OOM so that not everybody is livelocking in reclaim in a hopeless
situation.
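
To make the intended control flow easier to follow, here is a minimal
userspace sketch of the retry decision, assuming reclaim never frees
anything. It is not kernel code: the sk_ names and the flag values are
made up for illustration, and only the ordering of the checks mirrors
the patch below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for __GFP_FS and __GFP_NORETRY (values are arbitrary). */
#define SK_GFP_FS      0x1u
#define SK_GFP_NORETRY 0x2u

/* NOFS/NORETRY allocators are not allowed to kick off OOM handling. */
static bool sk_may_oom(unsigned int gfp_mask, bool oom)
{
	if (!(gfp_mask & SK_GFP_FS) || (gfp_mask & SK_GFP_NORETRY))
		return false;
	return oom;
}

/* Skip reclaim and enter OOM when retries ran out or the group is already marked. */
static bool sk_enter_oom(bool oom, int retries_left, int under_oom)
{
	assert(retries_left >= 0);
	if (oom && retries_left == 0)
		return true;
	return under_oom != 0;
}

/*
 * Count how many reclaim passes one charge attempt performs before it
 * gives up, under the assumption that every pass fails (CHARGE_NOMEM).
 */
static int sk_reclaim_attempts(unsigned int gfp_mask, bool oom, int under_oom)
{
	int retries = 2;	/* was 5 before this fix */
	int attempts = 0;

	oom = sk_may_oom(gfp_mask, oom);
	for (;;) {
		bool enter = sk_enter_oom(oom, retries, under_oom);

		if (!enter)
			attempts++;	/* a reclaim pass runs and frees nothing */
		if (retries == 0 || enter)
			return attempts;	/* corresponds to "goto nomem" */
		retries--;
	}
}
```

In this model a regular faulting task does two reclaim passes and then
enters OOM, a NOFS or NORETRY allocator does three passes and fails
without ever entering OOM, and any task that finds the group already
marked under OOM does no reclaim at all.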

---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56643fe..f565857 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1878,6 +1878,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
*/
css_get(&memcg->css);
current->memcg_oom.memcg = memcg;
+ mem_cgroup_mark_under_oom(memcg);
current->memcg_oom.gfp_mask = mask;
}

@@ -1929,7 +1930,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
* under OOM is always welcomed, use TASK_KILLABLE here.
*/
prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- mem_cgroup_mark_under_oom(memcg);

locked = mem_cgroup_oom_trylock(memcg);

@@ -1937,12 +1937,10 @@ bool mem_cgroup_oom_synchronize(bool handle)
mem_cgroup_oom_notify(memcg);

if (locked && !memcg->oom_kill_disable) {
- mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask);
} else {
schedule();
- mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
}

@@ -1956,6 +1954,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
memcg_oom_recover(memcg);
}
cleanup:
+ mem_cgroup_unmark_under_oom(memcg);
current->memcg_oom.memcg = NULL;
css_put(&memcg->css);
return true;
@@ -2250,7 +2249,7 @@ enum {
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool invoke_oom)
+ unsigned int nr_pages, bool enter_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2285,6 +2284,11 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (!(gfp_mask & __GFP_WAIT))
return CHARGE_WOULDBLOCK;

+ if (enter_oom) {
+ mem_cgroup_oom(mem_over_limit, gfp_mask);
+ return CHARGE_NOMEM;
+ }
+
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
gfp_mask, flags, NULL);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -2308,9 +2312,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- if (invoke_oom)
- mem_cgroup_oom(mem_over_limit, gfp_mask);
-
return CHARGE_NOMEM;
}

@@ -2325,8 +2326,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
bool oom)
{
unsigned int batch = max(CHARGE_BATCH, nr_pages);
- int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup *memcg = NULL;
+ int nr_reclaim_retries = 2;
int ret;

/*
@@ -2352,6 +2353,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
*/
if (!*ptr && !mm)
goto bypass;
+
+ if (!(gfp_mask & __GFP_FS) || (gfp_mask & __GFP_NORETRY))
+ oom = false;
again:
if (*ptr) { /* css should be a valid one */
memcg = *ptr;
@@ -2402,7 +2406,7 @@ again:
}

do {
- bool invoke_oom = oom && !nr_oom_retries;
+ bool enter_oom = false;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2410,7 +2414,13 @@ again:
goto bypass;
}

- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
+ if (oom && !nr_reclaim_retries)
+ enter_oom = true;
+
+ if (atomic_read(&memcg->under_oom))
+ enter_oom = true;
+
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, enter_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2422,12 +2432,12 @@ again:
case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
css_put(&memcg->css);
goto nomem;
- case CHARGE_NOMEM: /* OOM routine works */
- if (!oom || invoke_oom) {
+ case CHARGE_NOMEM:
+ if (!nr_reclaim_retries || enter_oom) {
css_put(&memcg->css);
goto nomem;
}
- nr_oom_retries--;
+ nr_reclaim_retries--;
break;
}
} while (ret != CHARGE_OK);
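
The effect of moving the under_oom marking can be pictured with a
small userspace model. This is only a sketch under assumptions: the
sk_ types and functions are hypothetical, a single group stands in for
the whole hierarchy, and C11 atomics replace the kernel's atomic_t. The
point it shows is that the group is marked as soon as a task records
its OOM state in the charge path, and stays marked until that task has
finished the synchronize step at the end of the fault.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct mem_cgroup and the per-task OOM state. */
struct sk_memcg {
	atomic_int under_oom;
};

struct sk_task {
	struct sk_memcg *oom_memcg;
};

/* Charge context: remember the group and mark it under OOM right away. */
static void sk_oom_enter(struct sk_task *task, struct sk_memcg *memcg)
{
	assert(task->oom_memcg == NULL);
	task->oom_memcg = memcg;
	atomic_fetch_add(&memcg->under_oom, 1);
}

/*
 * End of the fault: the kill or the wait would happen here, then the
 * mark is dropped on the common cleanup path. Returns false when the
 * task had no pending OOM state (an assumption of this sketch).
 */
static bool sk_oom_synchronize(struct sk_task *task)
{
	struct sk_memcg *memcg = task->oom_memcg;

	if (memcg == NULL)
		return false;
	atomic_fetch_sub(&memcg->under_oom, 1);
	task->oom_memcg = NULL;
	return true;
}

/* What a concurrent charger would check before bothering with reclaim. */
static bool sk_group_under_oom(struct sk_memcg *memcg)
{
	return atomic_load(&memcg->under_oom) > 0;
}
```

With two tasks entering OOM on the same group, the group reads as
under OOM until the second one has gone through sk_oom_synchronize(),
which is the window in which other chargers skip reclaim.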
azurIt
2013-09-11 18:54:48 UTC
>On Wed, Sep 11, 2013 at 02:33:05PM +0200, azurIt wrote:
>> >On Tue, Sep 10, 2013 at 11:32:47PM +0200, azurIt wrote:
>> >> >On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
>> >> >> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
>> >> >> >> Here is full kernel log between 6:00 and 7:59:
>> >> >> >> http://watchdog.sk/lkml/kern6.log
>> >> >> >
>> >> >> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
>> >> >> >more show up!
>> >> >>
>> >> >>
>> >> >>
>> >> >> Yeah, it's supposed to do this ;)
>> >
>> >How are you expecting the machine to recover from an OOM situation,
>> >though? I guess I don't really understand what these machines are
>> >doing. But if you are overloading them like crazy, isn't that the
>> >expected outcome?
>>
>>
>>
>>
>>
>> There's no global OOM; the server has enough memory. OOM is occurring only in cgroups (customers who simply don't want to pay for more memory).
>
>Yes, sure, but when the cgroups are thrashing, they use the disk and
>CPU to the point where the overall system is affected.




I didn't know there was disk usage because of this; I haven't noticed anything so far.




>> >> >> >> >> What do you think? I'm now running kernel with your previous patch, not with the newest one.
>> >> >> >> >
>> >> >> >> >Which one exactly? Can you attach the diff?
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> I meant, the problem above occurred on the kernel with your latest patch:
>> >> >> >> http://watchdog.sk/lkml/7-2-memcg-fix.patch
>> >> >> >
>> >> >> >The above log has the following callstack:
>> >> >> >
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337628] [<ffffffff810d19fe>] dump_header+0x7e/0x1e0
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337707] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337790] [<ffffffff810d18ff>] ? find_lock_task_mm+0x2f/0x70
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337874] [<ffffffff81094bb0>] ? __css_put+0x50/0x90
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.337952] [<ffffffff810d1ec5>] oom_kill_process+0x85/0x2a0
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338037] [<ffffffff810d2448>] mem_cgroup_out_of_memory+0xa8/0xf0
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338120] [<ffffffff81110858>] T.1154+0x8b8/0x8f0
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338201] [<ffffffff81110fa6>] mem_cgroup_charge_common+0x56/0xa0
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338283] [<ffffffff81111035>] mem_cgroup_newpage_charge+0x45/0x50
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338364] [<ffffffff810f3039>] handle_pte_fault+0x609/0x940
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338451] [<ffffffff8102ab1f>] ? pte_alloc_one+0x3f/0x50
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338532] [<ffffffff8107e455>] ? sched_clock_local+0x25/0x90
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338617] [<ffffffff810f34d7>] handle_mm_fault+0x167/0x340
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338699] [<ffffffff8102714b>] do_page_fault+0x13b/0x490
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338781] [<ffffffff810f8848>] ? do_brk+0x208/0x3a0
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338865] [<ffffffff812dba22>] ? gr_learn_resource+0x42/0x1e0
>> >> >> >Sep 10 07:59:43 server01 kernel: [ 3846.338951] [<ffffffff815cb7bf>] page_fault+0x1f/0x30
>> >> >> >
>> >> >> >The charge code seems to be directly invoking the OOM killer, which is
>> >> >> >not possible with 7-2-memcg-fix. Are you sure this is the right patch
>> >> >> >for this log? This _looks_ more like what 7-1-memcg-fix was doing,
>> >> >> >with a direct kill in the charge context and a fixup later on.
>> >> >>
>> >> >> Luckily, I still have the kernel source from which that kernel was built. I tried to re-apply the 7-2-memcg-fix.patch:
>> >> >>
>> >> >> # patch -p1 --dry-run < 7-2-memcg-fix.patch
>> >> >> patching file arch/x86/mm/fault.c
>> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
>> >> >> Apply anyway? [n]
>> >> >> Skipping patch.
>> >> >> 4 out of 4 hunks ignored -- saving rejects to file arch/x86/mm/fault.c.rej
>> >> >> patching file include/linux/memcontrol.h
>> >> >> Hunk #1 succeeded at 141 with fuzz 2 (offset 21 lines).
>> >> >> Hunk #2 succeeded at 391 with fuzz 1 (offset 39 lines).
>> >> >
>> >> >Uhm, some of it applied... I have absolutely no idea what state that
>> >> >tree is in now...
>> >>
>> >> I used '--dry-run' so it should be ok :)
>> >
>> >Ah, right.
>> >
>> >> >> patching file include/linux/mm.h
>> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
>> >> >> Apply anyway? [n]
>> >> >> Skipping patch.
>> >> >> 1 out of 1 hunk ignored -- saving rejects to file include/linux/mm.h.rej
>> >> >> patching file include/linux/sched.h
>> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
>> >> >> Apply anyway? [n]
>> >> >> Skipping patch.
>> >> >> 1 out of 1 hunk ignored -- saving rejects to file include/linux/sched.h.rej
>> >> >> patching file mm/memcontrol.c
>> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
>> >> >> Apply anyway? [n]
>> >> >> Skipping patch.
>> >> >> 10 out of 10 hunks ignored -- saving rejects to file mm/memcontrol.c.rej
>> >> >> patching file mm/memory.c
>> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
>> >> >> Apply anyway? [n]
>> >> >> Skipping patch.
>> >> >> 2 out of 2 hunks ignored -- saving rejects to file mm/memory.c.rej
>> >> >> patching file mm/oom_kill.c
>> >> >> Reversed (or previously applied) patch detected! Assume -R? [n]
>> >> >> Apply anyway? [n]
>> >> >> Skipping patch.
>> >> >> 1 out of 1 hunk ignored -- saving rejects to file mm/oom_kill.c.rej
>> >> >>
>> >> >>
>> >> >> Can you tell from this if the source has the right patch?
>> >> >
>> >> >Not reliably, I don't think. Can you send me
>> >> >
>> >> > include/linux/memcontrol.h
>> >> > mm/memcontrol.c
>> >> > mm/memory.c
>> >> > mm/oom_kill.c
>> >> >
>> >> >from those sources?
>> >> >
>> >> >It might be easier to start the application from scratch... Keep in
>> >> >mind that 7-2 was not an incremental fix, you need to remove the
>> >> >previous memcg patches (as opposed to 7-1).
>> >>
>> >>
>> >>
>> >> Yes, I used only 7-2 from your patches. Here are the files:
>> >> http://watchdog.sk/lkml/kernel
>> >>
>> >> orig - kernel source which was used to build the kernel I was talking about earlier
>> >> new - newly unpacked and patched 3.2.50 with all of 'my' patches
>> >
>> >Ok, thanks!
>> >
>> >> Here is how your patch was applied:
>> >>
>> >> # patch -p1 < 7-2-memcg-fix.patch
>> >> patching file arch/x86/mm/fault.c
>> >> Hunk #1 succeeded at 944 (offset 102 lines).
>> >> Hunk #2 succeeded at 970 (offset 102 lines).
>> >> Hunk #3 succeeded at 1273 with fuzz 1 (offset 212 lines).
>> >> Hunk #4 succeeded at 1382 (offset 223 lines).
>> >
>> >Ah, I forgot about this one. Could you provide that file (fault.c) as
>> >well please?
>>
>>
>>
>>
>> I added it.
>
>Thanks. This one looks good, too.
>
>> >> patching file include/linux/memcontrol.h
>> >> Hunk #1 succeeded at 122 with fuzz 2 (offset 2 lines).
>> >> Hunk #2 succeeded at 354 (offset 2 lines).
>> >
>> >Looks good, still.
>> >
>> >> patching file include/linux/mm.h
>> >> Hunk #1 succeeded at 163 (offset 7 lines).
>> >> patching file include/linux/sched.h
>> >> Hunk #1 succeeded at 1644 (offset 76 lines).
>> >> patching file mm/memcontrol.c
>> >> Hunk #1 succeeded at 1752 (offset 9 lines).
>> >> Hunk #2 succeeded at 1777 (offset 9 lines).
>> >> Hunk #3 succeeded at 1828 (offset 9 lines).
>> >> Hunk #4 succeeded at 1867 (offset 9 lines).
>> >> Hunk #5 succeeded at 2256 (offset 9 lines).
>> >> Hunk #6 succeeded at 2317 (offset 9 lines).
>> >> Hunk #7 succeeded at 2348 (offset 9 lines).
>> >> Hunk #8 succeeded at 2411 (offset 9 lines).
>> >> Hunk #9 succeeded at 2419 (offset 9 lines).
>> >> Hunk #10 succeeded at 2432 (offset 9 lines).
>> >> patching file mm/memory.c
>> >> Hunk #1 succeeded at 3712 (offset 273 lines).
>> >> Hunk #2 succeeded at 3812 (offset 317 lines).
>> >> patching file mm/oom_kill.c
>> >
>> >These look good as well.
>> >
>> >That leaves the weird impossible stack trace. Did you double check
>> >that this crash came from a kernel with those exact files?
>>
>>
>>
>> Yes, I'm sure.
>
>Okay, my suspicion is that the previous patches invoked the OOM killer
>right away, whereas in this latest version it's invoked only when the
>fault is finished. Maybe the task that locked the group gets held up
>somewhere else and then it takes too long until something is actually
>killed. Meanwhile, every other allocator drops into 5 reclaim cycles
>before giving up, which could explain the thrashing. And on the memcg
>level we don't have BDI congestion sleeps like on the global level, so
>everybody is backing off from the disk.
>
>Here is an incremental fix to the latest version, i.e. the one that
>livelocked under heavy IO, not the one you are using right now.
>
>First, it reduces the reclaim retries from 5 to 2, which resembles the
>global kswapd + ttfp somewhat. Next, NOFS/NORETRY allocators are not
>allowed to kick off the OOM killer, like in the global case, so that
>we don't kill things and give up just because light reclaim can't free
>anything. Last, the memcg is marked under OOM when one task enters
>OOM so that not everybody is livelocking in reclaim in a hopeless
>situation.



Thank you, I will boot it tonight. I also created a new server load checking and rescuing script, so I hope I won't be forced to hard reboot the server in case something similar happens again. Btw, the patch didn't apply to 3.2.51, there were probably big changes in the memory system (almost all hunks failed). I used 3.2.50 as before.

azur

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: ***@kvack.org
Johannes Weiner
2013-09-11 19:11:50 UTC
On Wed, Sep 11, 2013 at 08:54:48PM +0200, azurIt wrote:
> >On Wed, Sep 11, 2013 at 02:33:05PM +0200, azurIt wrote:
> >> >On Tue, Sep 10, 2013 at 11:32:47PM +0200, azurIt wrote:
> >> >> >On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
> >> >> >> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
> >> >> >> >> Here is full kernel log between 6:00 and 7:59:
> >> >> >> >> http://watchdog.sk/lkml/kern6.log
> >> >> >> >
> >> >> >> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
> >> >> >> >more show up!
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> Yeah, it's supposed to do this ;)
> >> >
> >> >How are you expecting the machine to recover from an OOM situation,
> >> >though? I guess I don't really understand what these machines are
> >> >doing. But if you are overloading them like crazy, isn't that the
> >> >expected outcome?
> >>
> >>
> >>
> >>
> >>
> >> There's no global OOM, server has enough of memory. OOM is occuring only in cgroups (customers who simply don't want to pay for more memory).
> >
> >Yes, sure, but when the cgroups are thrashing, they use the disk and
> >CPU to the point where the overall system is affected.
>
>
>
>
> Didn't know that there is a disk usage because of this, i never noticed anything yet.

You said there was heavy IO going on...?

> >Okay, my suspicion is that the previous patches invoked the OOM killer
> >right away, whereas in this latest version it's invoked only when the
> >fault is finished. Maybe the task that locked the group gets held up
> >somewhere else and then it takes too long until something is actually
> >killed. Meanwhile, every other allocator drops into 5 reclaim cycles
> >before giving up, which could explain the thrashing. And on the memcg
> >level we don't have BDI congestion sleeps like on the global level, so
> >everybody is backing off from the disk.
> >
> >Here is an incremental fix to the latest version, i.e. the one that
> >livelocked under heavy IO, not the one you are using right now.
> >
> >First, it reduces the reclaim retries from 5 to 2, which resembles the
> >global kswapd + ttfp somewhat. Next, NOFS/NORETRY allocators are not
> >allowed to kick off the OOM killer, like in the global case, so that
> >we don't kill things and give up just because light reclaim can't free
> >anything. Last, the memcg is marked under OOM when one task enters
> >OOM so that not everybody is livelocking in reclaim in a hopeless
> >situation.
>
>
>
> Thank you i will boot it this night. I also created a new server load checking and recuing script so i hope i won't be forced to hard reboot the server in case something similar as before happens. Btw, patch didn't apply to 3.2.51, there were probably big changes in memory system (almost all hunks failed). I used 3.2.50 as before.

Yes, please don't change the test base in the middle of this!

azurIt
2013-09-11 19:41:18 UTC
>On Wed, Sep 11, 2013 at 08:54:48PM +0200, azurIt wrote:
>> >On Wed, Sep 11, 2013 at 02:33:05PM +0200, azurIt wrote:
>> >> >On Tue, Sep 10, 2013 at 11:32:47PM +0200, azurIt wrote:
>> >> >> >On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
>> >> >> >> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
>> >> >> >> >> Here is full kernel log between 6:00 and 7:59:
>> >> >> >> >> http://watchdog.sk/lkml/kern6.log
>> >> >> >> >
>> >> >> >> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
>> >> >> >> >more show up!
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Yeah, it's supposed to do this ;)
>> >> >
>> >> >How are you expecting the machine to recover from an OOM situation,
>> >> >though? I guess I don't really understand what these machines are
>> >> >doing. But if you are overloading them like crazy, isn't that the
>> >> >expected outcome?
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> There's no global OOM, server has enough of memory. OOM is occuring only in cgroups (customers who simply don't want to pay for more memory).
>> >
>> >Yes, sure, but when the cgroups are thrashing, they use the disk and
>> >CPU to the point where the overall system is affected.
>>
>>
>>
>>
>> Didn't know that there is a disk usage because of this, i never noticed anything yet.
>
>You said there was heavy IO going on...?



Yes, there usually was heavy IO, but it was related to that deadlocking bug in the kernel (or I assume it was). I never saw heavy IO in normal conditions, even when there were lots of OOMs in cgroups. I'm not even using swap because of this, so I was assuming that a lack of memory does not cause any additional IO (or am I wrong?). And if you mean that last problem with IO from Monday, I don't know exactly what happened, but it's been a really long time since we had an IO problem so big that it also disabled root login on the console.




>> >Okay, my suspicion is that the previous patches invoked the OOM killer
>> >right away, whereas in this latest version it's invoked only when the
>> >fault is finished. Maybe the task that locked the group gets held up
>> >somewhere else and then it takes too long until something is actually
>> >killed. Meanwhile, every other allocator drops into 5 reclaim cycles
>> >before giving up, which could explain the thrashing. And on the memcg
>> >level we don't have BDI congestion sleeps like on the global level, so
>> >everybody is backing off from the disk.
>> >
>> >Here is an incremental fix to the latest version, i.e. the one that
>> >livelocked under heavy IO, not the one you are using right now.
>> >
>> >First, it reduces the reclaim retries from 5 to 2, which resembles the
>> >global kswapd + ttfp somewhat. Next, NOFS/NORETRY allocators are not
>> >allowed to kick off the OOM killer, like in the global case, so that
>> >we don't kill things and give up just because light reclaim can't free
>> >anything. Last, the memcg is marked under OOM when one task enters
>> >OOM so that not everybody is livelocking in reclaim in a hopeless
>> >situation.
>>
>>
>>
>> Thank you i will boot it this night. I also created a new server load checking and recuing script so i hope i won't be forced to hard reboot the server in case something similar as before happens. Btw, patch didn't apply to 3.2.51, there were probably big changes in memory system (almost all hunks failed). I used 3.2.50 as before.
>
>Yes, please don't change the test base in the middle of this!
>

Johannes Weiner
2013-09-11 20:04:26 UTC
On Wed, Sep 11, 2013 at 09:41:18PM +0200, azurIt wrote:
> >On Wed, Sep 11, 2013 at 08:54:48PM +0200, azurIt wrote:
> >> >On Wed, Sep 11, 2013 at 02:33:05PM +0200, azurIt wrote:
> >> >> >On Tue, Sep 10, 2013 at 11:32:47PM +0200, azurIt wrote:
> >> >> >> >On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
> >> >> >> >> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
> >> >> >> >> >> Here is full kernel log between 6:00 and 7:59:
> >> >> >> >> >> http://watchdog.sk/lkml/kern6.log
> >> >> >> >> >
> >> >> >> >> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
> >> >> >> >> >more show up!
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Yeah, it's supposed to do this ;)
> >> >> >
> >> >> >How are you expecting the machine to recover from an OOM situation,
> >> >> >though? I guess I don't really understand what these machines are
> >> >> >doing. But if you are overloading them like crazy, isn't that the
> >> >> >expected outcome?
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> There's no global OOM, server has enough of memory. OOM is occuring only in cgroups (customers who simply don't want to pay for more memory).
> >> >
> >> >Yes, sure, but when the cgroups are thrashing, they use the disk and
> >> >CPU to the point where the overall system is affected.
> >>
> >>
> >>
> >>
> >> Didn't know that there is a disk usage because of this, i never noticed anything yet.
> >
> >You said there was heavy IO going on...?
>
>
>
> Yes, there usually was a big IO but it was related to that
> deadlocking bug in kernel (or i assume it was). I never saw a big IO
> in normal conditions even when there were lots of OOM in
> cgroups. I'm even not using swap because of this so i was assuming
> that lacks of memory is not doing any additional IO (or am i
> wrong?). And if you mean that last problem with IO from Monday, i
> don't exactly know what happens but it's really long time when we
> had so big problem with IO that it disables also root login on
> console.

The deadlocking problem should be separate from this.

Even without swap, the binaries and libraries of the running tasks can
get reclaimed (and immediately faulted back from disk, i.e. thrashing).

Usually the OOM killer should kick in before tasks cannibalize each
other like that.

The patch you were using did in fact have the side effect of widening
the window between tasks entering heavy reclaim and the OOM killer
kicking in, so it could explain the IO worsening while fixing the
deadlock problem.

That followup patch tries to narrow this window by quite a bit and
tries to stop concurrent reclaim when the group is already OOM.
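
One way to check whether a group is thrashing in the sense described above (file pages reclaimed and faulted straight back from disk) is to watch its major fault counter. The following is a minimal Python sketch, assuming the cgroup v1 memory controller exposes a `memory.stat` file with a `pgmajfault` line for the group; the sample values are made up and the helper names are hypothetical.

```python
def parse_memory_stat(text):
    """Parse 'name value' lines, as found in a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats


def major_faults_per_second(before, after, interval_seconds):
    """Rate of major (disk-backed) page faults between two samples."""
    return (after["pgmajfault"] - before["pgmajfault"]) / interval_seconds


# Hypothetical samples taken 10 seconds apart; the numbers are made up.
sample_t0 = "cache 4096000\nrss 251000000\npgfault 900000\npgmajfault 1200\n"
sample_t1 = "cache 3900000\nrss 251500000\npgfault 930000\npgmajfault 1700\n"

rate = major_faults_per_second(
    parse_memory_stat(sample_t0), parse_memory_stat(sample_t1), 10.0
)
print(f"major faults/s: {rate:.1f}")
```

In practice the two samples would be read from the group's own `memory.stat` a few seconds apart; a rate that stays high while the group sits at its limit would be consistent with the thrashing described here, though it does not prove it.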

azurIt
2013-09-14 10:48:31 UTC
>On Wed, Sep 11, 2013 at 09:41:18PM +0200, azurIt wrote:
>> >On Wed, Sep 11, 2013 at 08:54:48PM +0200, azurIt wrote:
>> >> >On Wed, Sep 11, 2013 at 02:33:05PM +0200, azurIt wrote:
>> >> >> >On Tue, Sep 10, 2013 at 11:32:47PM +0200, azurIt wrote:
>> >> >> >> >On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
>> >> >> >> >> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
>> >> >> >> >> >> Here is full kernel log between 6:00 and 7:59:
>> >> >> >> >> >> http://watchdog.sk/lkml/kern6.log
>> >> >> >> >> >
>> >> >> >> >> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
>> >> >> >> >> >more show up!
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Yeah, it's supposed to do this ;)
>> >> >> >
>> >> >> >How are you expecting the machine to recover from an OOM situation,
>> >> >> >though? I guess I don't really understand what these machines are
>> >> >> >doing. But if you are overloading them like crazy, isn't that the
>> >> >> >expected outcome?
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> There's no global OOM, server has enough of memory. OOM is occuring only in cgroups (customers who simply don't want to pay for more memory).
>> >> >
>> >> >Yes, sure, but when the cgroups are thrashing, they use the disk and
>> >> >CPU to the point where the overall system is affected.
>> >>
>> >>
>> >>
>> >>
>> >> Didn't know that there is a disk usage because of this, i never noticed anything yet.
>> >
>> >You said there was heavy IO going on...?
>>
>>
>>
>> Yes, there usually was a big IO but it was related to that
>> deadlocking bug in kernel (or i assume it was). I never saw a big IO
>> in normal conditions even when there were lots of OOM in
>> cgroups. I'm even not using swap because of this so i was assuming
>> that lacks of memory is not doing any additional IO (or am i
>> wrong?). And if you mean that last problem with IO from Monday, i
>> don't exactly know what happens but it's really long time when we
>> had so big problem with IO that it disables also root login on
>> console.
>
>The deadlocking problem should be separate from this.
>
>Even without swap, the binaries and libraries of the running tasks can
>get reclaimed (and immediately faulted back from disk, i.e thrashing).
>
>Usually the OOM killer should kick in before tasks cannibalize each
>other like that.
>
>The patch you were using did in fact have the side effect of widening
>the window between tasks entering heavy reclaim and the OOM killer
>kicking in, so it could explain the IO worsening while fixing the dead
>lock problem.
>
>That followup patch tries to narrow this window by quite a bit and
>tries to stop concurrent reclaim when the group is already OOM.



Johannes,

the problem happened again, twice, but I have a little more info than before.

Here is the first occurrence, last night between 5:15 and 5:25:
- this time I kept a terminal open from another server to the problematic one, with htop running
- when the server went down I looked at it and saw one process of one user at the top, taking 97% of CPU (cgroup 1304)
- everything was stuck, so htop didn't help me much
- luckily, my new 'load check' script, which I mentioned before, was able to kill apache and everything went back to normal (success with its very first version, wow ;) )
- I checked some other logs and everything seems to point to cgroup 1304; the kernel log at 5:14-15 also shows hard OOM in that cgroup:
http://watchdog.sk/lkml/kern7.log


The second time it happened between 12:01 and 12:09, but it was in the middle of the day so I'm not attaching any logs (there would be lots of other junk, so it would be harder to read anything from them). It was related to a different cgroup than the first time.

azur

Michal Hocko
2013-09-16 13:40:14 UTC
On Sat 14-09-13 12:48:31, azurIt wrote:
[...]
> Here is the first occurence, this night between 5:15 and 5:25:
> - this time i kept opened terminal from other server to this problematic one with htop running
> - when server went down i opened it and saw one process of one user running at the top and taking 97% of CPU (cgroup 1304)

I guess you do not have a stack trace(s) for that process? That would be
extremely helpful.

> - everything was stucked so that htop didn't help me much
> - luckily, my new 'load check' script, which i was mentioning before, was able to kill apache and everything went to normal (success with it's very first version, wow ;) )
> - i checked some other logs and everything seems to point to cgroup 1304, also kernel log at 5:14-15 is showing hard OOM in that cgroup:
> http://watchdog.sk/lkml/kern7.log

I am not sure what you mean by hard OOM because there is no global OOM
in that log:
$ grep "Kill process" kern7.log | sed 's@.*]\(.*Kill process\>\).*@\1@' | sort -u
Memory cgroup out of memory: Kill process

But you had a lot of memcg OOMs in that group (1304) during that time
(and even earlier):
$ grep "\<1304\>" kern7.log
Sep 14 05:03:45 server01 kernel: [188287.778020] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:46 server01 kernel: [188287.871427] [30433] 1304 30433 181781 66426 7 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871594] [30808] 1304 30808 169111 53866 4 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871742] [30809] 1304 30809 181168 65992 2 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871890] [30811] 1304 30811 168684 53399 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.872041] [30814] 1304 30814 181102 65924 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.872189] [30815] 1304 30815 168814 53451 4 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.877731] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:46 server01 kernel: [188287.973155] [30808] 1304 30808 169111 53918 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30809] 1304 30809 181168 65992 2 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30811] 1304 30811 168684 53399 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30814] 1304 30814 181102 65924 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30815] 1304 30815 168815 53558 0 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.137540] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:47 server01 kernel: [188289.231873] [30809] 1304 30809 182662 67534 7 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232021] [30811] 1304 30811 171920 56781 4 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232171] [30814] 1304 30814 182596 67470 3 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232319] [30815] 1304 30815 171920 56778 1 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232478] [30896] 1304 30896 171918 56761 0 0 0 apache2
[...]
Sep 14 05:14:00 server01 kernel: [188902.666893] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:00 server01 kernel: [188902.742928] [ 7806] 1304 7806 178891 64008 6 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743080] [ 7910] 1304 7910 175318 60302 2 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743228] [ 7911] 1304 7911 174943 59878 1 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743376] [ 7912] 1304 7912 171568 56404 3 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743524] [ 7914] 1304 7914 174911 59879 5 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743673] [ 7915] 1304 7915 173472 58386 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.249749] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7910] 1304 7910 176278 61211 6 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7911] 1304 7911 176278 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7912] 1304 7912 173732 58655 3 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7914] 1304 7914 176269 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7915] 1304 7915 176269 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7966] 1304 7966 170385 55164 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.340992] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7911] 1304 7911 176340 61332 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7912] 1304 7912 173996 58901 1 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7914] 1304 7914 176331 61331 4 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7915] 1304 7915 176331 61331 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7966] 1304 7966 170385 55164 7 0 0 apache2
[...]

The only thing that is clear from this is that there is always one
process killed and a new one is spawned and that leads to the same
out of memory situation. So this is precisely what Johannes already
described as a Hydra load.

There is a silence in the logs:
Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB
Sep 14 05:21:24 server01 kernel: [189344.518699] grsec: From 95.103.217.66: failed fork with errno EAGAIN by /bin/dash[sh:10362] uid/euid:1387/1387 gid/egid:100/100, parent /usr/sbin/cron[cron:10144] uid/euid:0/0 gid/egid:0/0

Maybe that is what you are referring to as a stuck situation. Is pid
8453 the task you have seen consuming the CPU? If yes, then we would
need a stack for that task to find out what is going on.
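
Such silent periods can also be located mechanically. The sketch below is a rough Python illustration that uses the kernel's bracketed uptime stamp (as in `[188940.869639]`) to report gaps between consecutive log lines; the 60-second threshold and the function name are arbitrary assumptions, not something taken from the log.

```python
import re

# Kernel uptime stamp as it appears after "kernel:" in the syslog lines.
STAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def find_gaps(lines, min_gap_seconds=60.0):
    """Return (start, end, length) tuples for gaps of at least min_gap_seconds."""
    gaps = []
    previous = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue  # continuation lines carry no stamp
        current = float(match.group(1))
        if previous is not None and current - previous >= min_gap_seconds:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


sample = [
    "Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB",
    "Sep 14 05:21:24 server01 kernel: [189344.518699] grsec: From 95.103.217.66: failed fork with errno EAGAIN by /bin/dash[sh:10362]",
]

gaps = find_gaps(sample)
for start, end, length in gaps:
    print(f"silence of {length:.1f}s between {start} and {end}")
```

On the two lines quoted above this reports a single gap of roughly 404 seconds, which matches the window in which the server was reported unresponsive.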

Other than that nothing really suspicious in the log AFAICS.
--
Michal Hocko
SUSE Labs
azurIt
2013-09-16 14:01:19 UTC
>On Sat 14-09-13 12:48:31, azurIt wrote:
>[...]
>> Here is the first occurence, this night between 5:15 and 5:25:
>> - this time i kept opened terminal from other server to this problematic one with htop running
>> - when server went down i opened it and saw one process of one user running at the top and taking 97% of CPU (cgroup 1304)
>
>I guess you do not have a stack trace(s) for that process? That would be
>extremely helpful.



I'm afraid that won't be possible, as the server is completely unresponsive when it happens. Anyway, I don't think it was the fault of one process or one user.




>> - everything was stucked so that htop didn't help me much
>> - luckily, my new 'load check' script, which i was mentioning before, was able to kill apache and everything went to normal (success with it's very first version, wow ;) )
>> - i checked some other logs and everything seems to point to cgroup 1304, also kernel log at 5:14-15 is showing hard OOM in that cgroup:
>> http://watchdog.sk/lkml/kern7.log
>
>I am not sure what you mean by hard OOM because there is no global OOM
>in that log:
>$ grep "Kill process" kern7.log | sed 's@.*]\(.*Kill process\>\).*@\1@' | sort -u
> Memory cgroup out of memory: Kill process
>
>But you had a lot of memcg OOMs in that group (1304) during that time
>(and even earlier):



I meant OOM inside cgroup 1304. I'm sure this cgroup created the problem.




>$ grep "\<1304\>" kern7.log
>Sep 14 05:03:45 server01 kernel: [188287.778020] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:03:46 server01 kernel: [188287.871427] [30433] 1304 30433 181781 66426 7 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.871594] [30808] 1304 30808 169111 53866 4 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.871742] [30809] 1304 30809 181168 65992 2 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.871890] [30811] 1304 30811 168684 53399 3 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.872041] [30814] 1304 30814 181102 65924 3 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.872189] [30815] 1304 30815 168814 53451 4 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.877731] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30808] 1304 30808 169111 53918 3 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30809] 1304 30809 181168 65992 2 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30811] 1304 30811 168684 53399 3 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30814] 1304 30814 181102 65924 3 0 0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30815] 1304 30815 168815 53558 0 0 0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.137540] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:03:47 server01 kernel: [188289.231873] [30809] 1304 30809 182662 67534 7 0 0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.232021] [30811] 1304 30811 171920 56781 4 0 0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.232171] [30814] 1304 30814 182596 67470 3 0 0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.232319] [30815] 1304 30815 171920 56778 1 0 0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.232478] [30896] 1304 30896 171918 56761 0 0 0 apache2
>[...]
>Sep 14 05:14:00 server01 kernel: [188902.666893] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:14:00 server01 kernel: [188902.742928] [ 7806] 1304 7806 178891 64008 6 0 0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743080] [ 7910] 1304 7910 175318 60302 2 0 0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743228] [ 7911] 1304 7911 174943 59878 1 0 0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743376] [ 7912] 1304 7912 171568 56404 3 0 0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743524] [ 7914] 1304 7914 174911 59879 5 0 0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743673] [ 7915] 1304 7915 173472 58386 2 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.249749] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7910] 1304 7910 176278 61211 6 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7911] 1304 7911 176278 61211 7 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7912] 1304 7912 173732 58655 3 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7914] 1304 7914 176269 61211 7 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7915] 1304 7915 176269 61211 7 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7966] 1304 7966 170385 55164 7 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.340992] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7911] 1304 7911 176340 61332 2 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7912] 1304 7912 173996 58901 1 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7914] 1304 7914 176331 61331 4 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7915] 1304 7915 176331 61331 2 0 0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7966] 1304 7966 170385 55164 7 0 0 apache2
>[...]
>
>The only thing that is clear from this is that there is always one
>process killed and a new one is spawned and that leads to the same
>out of memory situation. So this is precisely what Johannes already
>described as a Hydra load.



I can't do anything about this; the processes are serving visitors to that user's web sites.




>There is a silence in the logs:
>Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB
>Sep 14 05:21:24 server01 kernel: [189344.518699] grsec: From 95.103.217.66: failed fork with errno EAGAIN by /bin/dash[sh:10362] uid/euid:1387/1387 gid/egid:100/100, parent /usr/sbin/cron[cron:10144] uid/euid:0/0 gid/egid:0/0
>
>Myabe that is what you are referring to as a stuck situation. Is pid
>8453 the task you have seen consuming the CPU? If yes, then we would
>need a stack for that task to find out what is going on.




Unfortunately I don't know the PID, but I don't think it's important. I just wanted to say that cgroup 1304 was causing the problem in this particular case (there were several signs pointing to it). As you can see in the logs, too many memcg OOMs create huge I/O, which takes down the whole server for no reason.

The same thing happens several times per day *if* I'm running a kernel with Johannes' latest patch.

azur

Michal Hocko
2013-09-16 14:06:07 UTC
On Mon 16-09-13 16:01:19, azurIt wrote:
> >On Sat 14-09-13 12:48:31, azurIt wrote:
> >[...]
> >> Here is the first occurence, this night between 5:15 and 5:25:
> >> - this time i kept opened terminal from other server to this problematic one with htop running
> >> - when server went down i opened it and saw one process of one user running at the top and taking 97% of CPU (cgroup 1304)
> >
> >I guess you do not have a stack trace(s) for that process? That would be
> >extremely helpful.
>
> I'm afraid it won't be possible as server is completely not responding
> when it happens. Anyway, i don't think it was a fault of one process
> or one user.

You can use sysrq+l via serial console to see tasks hogging the CPU or
sysrq+t to see all the existing tasks.

[...]
--
Michal Hocko
SUSE Labs

azurIt
2013-09-16 14:13:16 UTC
>On Mon 16-09-13 16:01:19, azurIt wrote:
>> >On Sat 14-09-13 12:48:31, azurIt wrote:
>> >[...]
>> >> Here is the first occurence, this night between 5:15 and 5:25:
>> >> - this time i kept opened terminal from other server to this problematic one with htop running
>> >> - when server went down i opened it and saw one process of one user running at the top and taking 97% of CPU (cgroup 1304)
>> >
>> >I guess you do not have a stack trace(s) for that process? That would be
>> >extremely helpful.
>>
>> I'm afraid it won't be possible as server is completely not responding
>> when it happens. Anyway, i don't think it was a fault of one process
>> or one user.
>
>You can use sysrq+l via serial console to see tasks hogging the CPU or
>sysrq+t to see all the existing tasks.


Doesn't work here, it just prints 'l' resp. 't'.

azur
Michal Hocko
2013-09-16 14:57:44 UTC
On Mon 16-09-13 16:13:16, azurIt wrote:
[...]
> >You can use sysrq+l via serial console to see tasks hogging the CPU or
> >sysrq+t to see all the existing tasks.
>
>
> Doesn't work here, it just prints 'l' resp. 't'.

I am using telnet for accessing my serial consoles exported by
the multiplexer or KVM and it can send sysrq via ctrl+t (Send
Break). Check your serial console setup.
--
Michal Hocko
SUSE Labs

azurIt
2013-09-16 15:05:43 UTC
>On Mon 16-09-13 16:13:16, azurIt wrote:
>[...]
>> >You can use sysrq+l via serial console to see tasks hogging the CPU or
>> >sysrq+t to see all the existing tasks.
>>
>>
>> Doesn't work here, it just prints 'l' resp. 't'.
>
>I am using telnet for accessing my serial consoles exported by
>the multiplicator or KVM and it can send sysrq via ctrl+t (Send
>Break). Check your serial console setup.



I'm using a Raritan KVM and I created keyboard macros for 'sysrq + l' and 'sysrq + t'. I'm also unable to use sysrq on my local PC. Maybe it needs to be enabled somehow?

azur

Johannes Weiner
2013-09-16 15:17:26 UTC
Permalink
On Mon, Sep 16, 2013 at 05:05:43PM +0200, azurIt wrote:
> > CC: "Johannes Weiner" <hannes-***@public.gmane.org>, "Andrew Morton" <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+***@public.gmane.org>, "David Rientjes" <rientjes-hpIqsD4AKlfQT0dZR+***@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+***@public.gmane.org>, "KOSAKI Motohiro" <kosaki.motohiro-+***@public.gmane.org>, linux-mm-***@public.gmane.org, cgroups-***@public.gmane.org, x86-DgEjT+Ai2ygdnm+***@public.gmane.org, linux-arch-***@public.gmane.org, linux-kernel-***@public.gmane.org
> >On Mon 16-09-13 16:13:16, azurIt wrote:
> >[...]
> >> >You can use sysrq+l via serial console to see tasks hogging the CPU or
> >> >sysrq+t to see all the existing tasks.
> >>
> >>
> >> Doesn't work here, it just prints 'l' resp. 't'.
> >
> >I am using telnet for accessing my serial consoles exported by
> >the multiplicator or KVM and it can send sysrq via ctrl+t (Send
> >Break). Check your serial console setup.
>
>
>
> I'm using Raritan KVM and i created keyboard macro 'sysrq + l' resp. 'sysrq + t'. I'm also unable to use it on my local PC. Maybe it needs to be enabled somehow?

Can you 'echo t >/proc/sysrq-trigger'?
azurIt
2013-09-16 15:24:05 UTC
Permalink
> CC: "Michal Hocko" <***@suse.cz>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>On Mon, Sep 16, 2013 at 05:05:43PM +0200, azurIt wrote:
>> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >On Mon 16-09-13 16:13:16, azurIt wrote:
>> >[...]
>> >> >You can use sysrq+l via serial console to see tasks hogging the CPU or
>> >> >sysrq+t to see all the existing tasks.
>> >>
>> >>
>> >> Doesn't work here, it just prints 'l' resp. 't'.
>> >
>> >I am using telnet for accessing my serial consoles exported by
>> >the multiplicator or KVM and it can send sysrq via ctrl+t (Send
>> >Break). Check your serial console setup.
>>
>>
>>
>> I'm using Raritan KVM and i created keyboard macro 'sysrq + l' resp. 'sysrq + t'. I'm also unable to use it on my local PC. Maybe it needs to be enabled somehow?
>
>Can you 'echo t >/proc/sysrq-trigger'?



# ls -la /proc/sysrq-trigger
ls: cannot access /proc/sysrq-trigger: No such file or directory

Michal Hocko
2013-09-16 15:25:48 UTC
Permalink
On Mon 16-09-13 17:05:43, azurIt wrote:
> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
> >On Mon 16-09-13 16:13:16, azurIt wrote:
> >[...]
> >> >You can use sysrq+l via serial console to see tasks hogging the CPU or
> >> >sysrq+t to see all the existing tasks.
> >>
> >>
> >> Doesn't work here, it just prints 'l' resp. 't'.
> >
> >I am using telnet for accessing my serial consoles exported by
> >the multiplicator or KVM and it can send sysrq via ctrl+t (Send
> >Break). Check your serial console setup.
>
>
>
> I'm using Raritan KVM and i created keyboard macro 'sysrq + l' resp.
> 'sysrq + t'. I'm also unable to use it on my local PC. Maybe it needs
> to be enabled somehow?

Probably yes. echo 1 > /proc/sys/kernel/sysrq should enable all sysrq
commands. You can also enable just some of them (have a look at
Documentation/sysrq.txt for more information).
--
Michal Hocko
SUSE Labs

azurIt
2013-09-16 15:40:39 UTC
Permalink
> CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>On Mon 16-09-13 17:05:43, azurIt wrote:
>> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >On Mon 16-09-13 16:13:16, azurIt wrote:
>> >[...]
>> >> >You can use sysrq+l via serial console to see tasks hogging the CPU or
>> >> >sysrq+t to see all the existing tasks.
>> >>
>> >>
>> >> Doesn't work here, it just prints 'l' resp. 't'.
>> >
>> >I am using telnet for accessing my serial consoles exported by
>> >the multiplicator or KVM and it can send sysrq via ctrl+t (Send
>> >Break). Check your serial console setup.
>>
>>
>>
>> I'm using Raritan KVM and i created keyboard macro 'sysrq + l' resp.
>> 'sysrq + t'. I'm also unable to use it on my local PC. Maybe it needs
>> to be enabled somehow?
>
>Probably yes. echo 1 > /proc/sys/kernel/sysrq should enable all sysrq
>commands. You can select also some of them (have a look at
>Documentation/sysrq.txt for more information)

# ls -la /proc/sys/kernel/sysrq
ls: cannot access /proc/sys/kernel/sysrq: No such file or directory

OK, so the problem is probably this:
# CONFIG_MAGIC_SYSRQ is not set

I will enable it at the next reboot.
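
For reference, a rough sketch of how one might check whether magic sysrq is usable before relying on it. The function name sysrq_state is made up for illustration; the mask meanings (0 = disabled, 1 = all functions enabled, anything else = a bitmask of allowed functions) are the ones described in Documentation/sysrq.txt.

```shell
#!/bin/sh
# sysrq_state [CONTROL_FILE]
# Hypothetical helper: report whether magic sysrq is available and enabled.
# CONTROL_FILE defaults to /proc/sys/kernel/sysrq; that file is absent when
# the kernel was built without CONFIG_MAGIC_SYSRQ.
sysrq_state() {
    ctl=${1:-/proc/sys/kernel/sysrq}
    if [ ! -e "$ctl" ]; then
        echo "unavailable"
        return 0
    fi
    mask=$(cat "$ctl")
    case $mask in
        0) echo "disabled" ;;
        1) echo "all-enabled" ;;
        *) echo "partial:$mask" ;;
    esac
}

# Typical use as root once sysrq is compiled in (not executed here):
#   sysrq_state
#   echo 1 > /proc/sys/kernel/sysrq     # enable all sysrq functions
#   echo t > /proc/sysrq-trigger        # dump all task stacks to the kernel log
```

The control file path is passed as an argument only so the helper can be tried against an ordinary file first.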

azur

azurIt
2013-09-16 20:52:46 UTC
Permalink
> CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>On Mon 16-09-13 17:05:43, azurIt wrote:
>> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >On Mon 16-09-13 16:13:16, azurIt wrote:
>> >[...]
>> >> >You can use sysrq+l via serial console to see tasks hogging the CPU or
>> >> >sysrq+t to see all the existing tasks.
>> >>
>> >>
>> >> Doesn't work here, it just prints 'l' resp. 't'.
>> >
>> >I am using telnet for accessing my serial consoles exported by
>> >the multiplicator or KVM and it can send sysrq via ctrl+t (Send
>> >Break). Check your serial console setup.
>>
>>
>>
>> I'm using Raritan KVM and i created keyboard macro 'sysrq + l' resp.
>> 'sysrq + t'. I'm also unable to use it on my local PC. Maybe it needs
>> to be enabled somehow?
>
>Probably yes. echo 1 > /proc/sys/kernel/sysrq should enable all sysrq
>commands. You can select also some of them (have a look at
>Documentation/sysrq.txt for more information)


It just happened again while I was watching htop on the server. I'm sure that this time it was only one process (apache) running under a user account (not root). It was taking about 100% CPU (about 100% of one core). I was able to kill it by hand inside htop, but everything was very slow and the server load immediately went to 500. I'm sure it must be related to Johannes' kernel patches, because I'm also using I/O throttling in cgroups via the Block IO controller, so users are unable to create such huge I/O. I will try to take stacks of the processes, but I'm not able to identify the problematic process, so I will have to take them from *all* apache processes while killing them.

azur

Johannes Weiner
2013-09-17 00:02:44 UTC
Permalink
On Mon, Sep 16, 2013 at 10:52:46PM +0200, azurIt wrote:
> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
> >On Mon 16-09-13 17:05:43, azurIt wrote:
> >> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
> >> >On Mon 16-09-13 16:13:16, azurIt wrote:
> >> >[...]
> >> >> >You can use sysrq+l via serial console to see tasks hogging the CPU or
> >> >> >sysrq+t to see all the existing tasks.
> >> >>
> >> >>
> >> >> Doesn't work here, it just prints 'l' resp. 't'.
> >> >
> >> >I am using telnet for accessing my serial consoles exported by
> >> >the multiplicator or KVM and it can send sysrq via ctrl+t (Send
> >> >Break). Check your serial console setup.
> >>
> >>
> >>
> >> I'm using Raritan KVM and i created keyboard macro 'sysrq + l' resp.
> >> 'sysrq + t'. I'm also unable to use it on my local PC. Maybe it needs
> >> to be enabled somehow?
> >
> >Probably yes. echo 1 > /proc/sys/kernel/sysrq should enable all sysrq
> >commands. You can select also some of them (have a look at
> >Documentation/sysrq.txt for more information)
>
>
> Now it happens again and i was just looking on the server's
> htop. I'm sure that this time it was only one process (apache)
> running under user account (not root). It was taking about 100% CPU
> (about 100% of one core). I was able to kill it by hand inside htop
> but everything was very slow, server load was immediately on
> 500. I'm sure it must be related to that Johannes kernel patches
> because i'm also using i/o throttling in cgroups via Block IO
> controller so users are unable to create such a huge I/O. I will try
> to take stacks of processes but i'm not able to identify the
> problematic process so i will have to take them from *all* apache
> processes while killing them.

It would be fantastic if you could capture those stacks. sysrq+t
captures ALL of them in one go and drops them into your syslog.

/proc/<pid>/stack for individual tasks works too.
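
Something along these lines could collect the per-task stacks for every process with a given name. This is only a sketch: dump_stacks is a made-up helper name, the process name "apache2" in the usage comment is an assumption about the setup, and reading /proc/<pid>/stack generally requires root.

```shell
#!/bin/sh
# dump_stacks NAME [PROC_ROOT]
# Hypothetical helper: print the kernel stack of every process whose
# command name (/proc/<pid>/comm) equals NAME.
# PROC_ROOT defaults to /proc and exists only to allow dry runs.
dump_stacks() {
    name=$1
    proc_root=${2:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        # Skip entries that vanished or have no readable comm file.
        [ -r "$dir/comm" ] || continue
        [ "$(cat "$dir/comm" 2>/dev/null)" = "$name" ] || continue
        echo "=== pid ${dir##*/} ($name) ==="
        cat "$dir/stack" 2>/dev/null || echo "(stack not readable)"
    done
}

# Example (as root, not executed here):
#   dump_stacks apache2 > /root/apache-stacks.$(date +%s).txt
```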

azurIt
2013-09-17 11:15:35 UTC
Permalink
______________________________________________________________
> From: Johannes Weiner <***@cmpxchg.org>
> To: azurIt <***@pobox.sk>
> Date: 17.09.2013 02:02
> Subject: Re: [patch 0/7] improve memcg oom killer robustness v2
>
> CC: "Michal Hocko" <***@suse.cz>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>On Mon, Sep 16, 2013 at 10:52:46PM +0200, azurIt wrote:
>> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >On Mon 16-09-13 17:05:43, azurIt wrote:
>> >> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >> >On Mon 16-09-13 16:13:16, azurIt wrote:
>> >> >[...]
>> >> >> >You can use sysrq+l via serial console to see tasks hogging the CPU or
>> >> >> >sysrq+t to see all the existing tasks.
>> >> >>
>> >> >>
>> >> >> Doesn't work here, it just prints 'l' resp. 't'.
>> >> >
>> >> >I am using telnet for accessing my serial consoles exported by
>> >> >the multiplicator or KVM and it can send sysrq via ctrl+t (Send
>> >> >Break). Check your serial console setup.
>> >>
>> >>
>> >>
>> >> I'm using Raritan KVM and i created keyboard macro 'sysrq + l' resp.
>> >> 'sysrq + t'. I'm also unable to use it on my local PC. Maybe it needs
>> >> to be enabled somehow?
>> >
>> >Probably yes. echo 1 > /proc/sys/kernel/sysrq should enable all sysrq
>> >commands. You can select also some of them (have a look at
>> >Documentation/sysrq.txt for more information)
>>
>>
>> Now it happens again and i was just looking on the server's
>> htop. I'm sure that this time it was only one process (apache)
>> running under user account (not root). It was taking about 100% CPU
>> (about 100% of one core). I was able to kill it by hand inside htop
>> but everything was very slow, server load was immediately on
>> 500. I'm sure it must be related to that Johannes kernel patches
>> because i'm also using i/o throttling in cgroups via Block IO
>> controller so users are unable to create such a huge I/O. I will try
>> to take stacks of processes but i'm not able to identify the
>> problematic process so i will have to take them from *all* apache
>> processes while killing them.
>
>It would be fantastic if you could capture those stacks. sysrq+t
>captures ALL of them in one go and drops them into your syslog.
>
>/proc/<pid>/stack for individual tasks works too.


Is there anything unusual in this stack?


[<ffffffff810d1a5e>] dump_header+0x7e/0x1e0
[<ffffffff810d195f>] ? find_lock_task_mm+0x2f/0x70
[<ffffffff810d1f25>] oom_kill_process+0x85/0x2a0
[<ffffffff810d24a8>] mem_cgroup_out_of_memory+0xa8/0xf0
[<ffffffff8110fb76>] mem_cgroup_oom_synchronize+0x2e6/0x310
[<ffffffff8110efc0>] ? mem_cgroup_uncharge_page+0x40/0x40
[<ffffffff810d2703>] pagefault_out_of_memory+0x13/0x130
[<ffffffff81026f6e>] mm_fault_error+0x9e/0x150
[<ffffffff81027424>] do_page_fault+0x404/0x490
[<ffffffff810f952c>] ? do_mmap_pgoff+0x3dc/0x430
[<ffffffff815cb87f>] page_fault+0x1f/0x30


The problem happened again, but my script was unable to get the stacks. I was able to see the processes that were causing problems (two this time) and I have their PIDs. The stack above is from a different process but from the same cgroup (the memcg OOM killer killed it and printed its stack into syslog).

azur
Michal Hocko
2013-09-17 14:10:13 UTC
Permalink
On Tue 17-09-13 13:15:35, azurIt wrote:
[...]
> Is something unusual on this stack?
>
>
> [<ffffffff810d1a5e>] dump_header+0x7e/0x1e0
> [<ffffffff810d195f>] ? find_lock_task_mm+0x2f/0x70
> [<ffffffff810d1f25>] oom_kill_process+0x85/0x2a0
> [<ffffffff810d24a8>] mem_cgroup_out_of_memory+0xa8/0xf0
> [<ffffffff8110fb76>] mem_cgroup_oom_synchronize+0x2e6/0x310
> [<ffffffff8110efc0>] ? mem_cgroup_uncharge_page+0x40/0x40
> [<ffffffff810d2703>] pagefault_out_of_memory+0x13/0x130
> [<ffffffff81026f6e>] mm_fault_error+0x9e/0x150
> [<ffffffff81027424>] do_page_fault+0x404/0x490
> [<ffffffff810f952c>] ? do_mmap_pgoff+0x3dc/0x430
> [<ffffffff815cb87f>] page_fault+0x1f/0x30

This is the regular memcg OOM killer, which dumps messages about what it
is going to do. So no, nothing unusual, unless it stays like that forever,
which would mean that oom_kill_process is stuck in an endless loop. But a
single stack doesn't tell us much.

Just a note. When you see something hogging a CPU and you are not sure
whether it might be in an endless loop inside the kernel, it makes sense
to take several snapshots of the stack trace and see if it changes. If
not, and the process is not sleeping (there is no schedule on the trace),
then it might be looping somewhere waiting for Godot. If it is sleeping
then it is slightly harder, because you would have to identify what it is
waiting for, which requires knowing the deeper context.
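
The snapshot comparison described above could be scripted roughly as follows. This is only a sketch: sample_stack and classify_stack_samples are made-up helper names, the one-second interval is arbitrary, and the "schedule" match is a simple heuristic for "the task is sleeping", not a definitive test.

```shell
#!/bin/sh
# sample_stack PID COUNT OUTDIR
# Hypothetical helper: save COUNT copies of /proc/PID/stack, one per second,
# as OUTDIR/stack.1 ... OUTDIR/stack.COUNT (needs root on most systems).
sample_stack() {
    pid=$1; count=$2; outdir=$3
    i=1
    while [ "$i" -le "$count" ]; do
        cat "/proc/$pid/stack" > "$outdir/stack.$i" 2>/dev/null
        i=$((i + 1))
        sleep 1
    done
}

# classify_stack_samples FILE...
# Compare saved snapshots:
#   changing          - at least one snapshot differs from the first
#   sleeping          - identical snapshots with a schedule frame on the trace
#   possibly-looping  - identical snapshots and no schedule frame
classify_stack_samples() {
    first=$1
    changed=no
    for f in "$@"; do
        cmp -s "$first" "$f" || changed=yes
    done
    if [ "$changed" = yes ]; then
        echo "changing"
    elif grep -q 'schedule' "$first"; then
        echo "sleeping"
    else
        echo "possibly-looping"
    fi
}

# Example (as root, not executed here):
#   mkdir /tmp/stacks && sample_stack 12345 5 /tmp/stacks
#   classify_stack_samples /tmp/stacks/stack.*
```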
--
Michal Hocko
SUSE Labs

azurIt
2013-09-18 14:03:04 UTC
Permalink
> CC: "Johannes Weiner" <hannes-***@public.gmane.org>, "Andrew Morton" <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+***@public.gmane.org>, "David Rientjes" <rientjes-hpIqsD4AKlfQT0dZR+***@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+***@public.gmane.org>, "KOSAKI Motohiro" <kosaki.motohiro-+***@public.gmane.org>, linux-mm-***@public.gmane.org, cgroups-***@public.gmane.org, x86-DgEjT+Ai2ygdnm+***@public.gmane.org, linux-arch-***@public.gmane.org, linux-kernel-***@public.gmane.org
>On Tue 17-09-13 13:15:35, azurIt wrote:
>[...]
>> Is something unusual on this stack?
>>
>>
>> [<ffffffff810d1a5e>] dump_header+0x7e/0x1e0
>> [<ffffffff810d195f>] ? find_lock_task_mm+0x2f/0x70
>> [<ffffffff810d1f25>] oom_kill_process+0x85/0x2a0
>> [<ffffffff810d24a8>] mem_cgroup_out_of_memory+0xa8/0xf0
>> [<ffffffff8110fb76>] mem_cgroup_oom_synchronize+0x2e6/0x310
>> [<ffffffff8110efc0>] ? mem_cgroup_uncharge_page+0x40/0x40
>> [<ffffffff810d2703>] pagefault_out_of_memory+0x13/0x130
>> [<ffffffff81026f6e>] mm_fault_error+0x9e/0x150
>> [<ffffffff81027424>] do_page_fault+0x404/0x490
>> [<ffffffff810f952c>] ? do_mmap_pgoff+0x3dc/0x430
>> [<ffffffff815cb87f>] page_fault+0x1f/0x30
>
>This is a regular memcg OOM killer. Which dumps messages about what is
>going to do. So no, nothing unusual, except if it was like that for ever
>which would mean that oom_kill_process is in the endless loop. But a
>single stack doesn't tell us much.
>
>Just a note. When you see something hogging a cpu and you are not sure
>whether it might be in an endless loop inside the kernel it makes sense
>to take several snaphosts of the stack trace and see if it changes. If
>not and the process is not sleeping (there is no schedule on the trace)
>then it might be looping somewhere waiting for Godot. If it is sleeping
>then it is slightly harder because you would have to identify what it is
>waiting for which requires to know a deeper context.
>--
>Michal Hocko
>SUSE Labs



I was finally able to get the stack of the problematic process :) I saved it twice from the same process, as Michal suggested (I wasn't able to take more). Here it is:

First (doesn't look very helpful):
[<ffffffffffffffff>] 0xffffffffffffffff


Second:
[<ffffffff810e17d1>] shrink_zone+0x481/0x650
[<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
[<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
[<ffffffff81148ccd>] free_more_memory+0x5d/0x60
[<ffffffff8114931d>] __getblk+0x14d/0x2c0
[<ffffffff8114c973>] __bread+0x13/0xc0
[<ffffffff811968a8>] ext3_get_branch+0x98/0x140
[<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
[<ffffffff81198244>] ext3_get_block+0xc4/0x120
[<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
[<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
[<ffffffff811972bd>] ext3_readpages+0x1d/0x20
[<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
[<ffffffff810d9411>] ra_submit+0x21/0x30
[<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
[<ffffffff810ef908>] __do_fault+0x78/0x5a0
[<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
[<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
[<ffffffff8102715b>] do_page_fault+0x13b/0x490
[<ffffffff815cb87f>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff


What do you think about it?

azur
Michal Hocko
2013-09-18 14:24:00 UTC
Permalink
On Wed 18-09-13 16:03:04, azurIt wrote:
[..]
> I was finally able to get stack of problematic process :) I saved it
> two times from the same process, as Michal suggested (i wasn't able to
> take more). Here it is:
>
> First (doesn't look very helpfull):
> [<ffffffffffffffff>] 0xffffffffffffffff

No it is not.

> Second:
> [<ffffffff810e17d1>] shrink_zone+0x481/0x650
> [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
> [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
> [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
> [<ffffffff8114931d>] __getblk+0x14d/0x2c0
> [<ffffffff8114c973>] __bread+0x13/0xc0
> [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
> [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
> [<ffffffff81198244>] ext3_get_block+0xc4/0x120
> [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
> [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
> [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
> [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
> [<ffffffff810d9411>] ra_submit+0x21/0x30
> [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
> [<ffffffff810ef908>] __do_fault+0x78/0x5a0
> [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
> [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
> [<ffffffff8102715b>] do_page_fault+0x13b/0x490
> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff

This is the direct reclaim path. You are simply running out of memory
globally. There is no memcg-specific code in that trace.
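
A crude way to tell the two kinds of trace seen in this thread apart is to look for memcg symbols in a saved trace. A sketch only: trace_kind is a made-up helper name, and matching on symbol names is a heuristic, not a reliable classifier.

```shell
#!/bin/sh
# trace_kind FILE
# Hypothetical helper: classify a saved stack trace by the symbols on it.
#   memcg          - some mem_cgroup_* frame is present
#   direct-reclaim - try_to_free_pages is present without memcg frames
#   other          - neither pattern matched
trace_kind() {
    if grep -q 'mem_cgroup' "$1"; then
        echo "memcg"
    elif grep -q 'try_to_free_pages' "$1"; then
        echo "direct-reclaim"
    else
        echo "other"
    fi
}

# Example (not executed here):
#   cat /proc/12345/stack > /tmp/trace.txt && trace_kind /tmp/trace.txt
```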
--
Michal Hocko
SUSE Labs

azurIt
2013-09-18 14:33:06 UTC
Permalink
> CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>On Wed 18-09-13 16:03:04, azurIt wrote:
>[..]
>> I was finally able to get stack of problematic process :) I saved it
>> two times from the same process, as Michal suggested (i wasn't able to
>> take more). Here it is:
>>
>> First (doesn't look very helpfull):
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
>No it is not.
>
>> Second:
>> [<ffffffff810e17d1>] shrink_zone+0x481/0x650
>> [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
>> [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
>> [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
>> [<ffffffff8114931d>] __getblk+0x14d/0x2c0
>> [<ffffffff8114c973>] __bread+0x13/0xc0
>> [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
>> [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
>> [<ffffffff81198244>] ext3_get_block+0xc4/0x120
>> [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
>> [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
>> [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
>> [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
>> [<ffffffff810d9411>] ra_submit+0x21/0x30
>> [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
>> [<ffffffff810ef908>] __do_fault+0x78/0x5a0
>> [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
>> [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
>> [<ffffffff8102715b>] do_page_fault+0x13b/0x490
>> [<ffffffff815cb87f>] page_fault+0x1f/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
>This is the direct reclaim path. You are simply running out of memory
>globaly. There is no memcg specific code in that trace.


No, I'm not. Here are htop and server graphs from this case:
http://watchdog.sk/lkml/htop3.jpg (here you can see actual memory usage)
http://watchdog.sk/lkml/server01.jpg

If I really were hitting a global OOM (and I'm 101% sure I'm not), where would that I/O come from? I have no swap.

Michal Hocko
2013-09-18 14:42:45 UTC
Permalink
On Wed 18-09-13 16:33:06, azurIt wrote:
> > CC: "Johannes Weiner" <hannes-***@public.gmane.org>, "Andrew Morton" <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+***@public.gmane.org>, "David Rientjes" <rientjes-hpIqsD4AKlfQT0dZR+***@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+***@public.gmane.org>, "KOSAKI Motohiro" <kosaki.motohiro-+***@public.gmane.org>, linux-mm-***@public.gmane.org, cgroups-***@public.gmane.org, x86-DgEjT+Ai2ygdnm+***@public.gmane.org, linux-arch-***@public.gmane.org, linux-kernel-***@public.gmane.org
> >On Wed 18-09-13 16:03:04, azurIt wrote:
> >[..]
> >> I was finally able to get stack of problematic process :) I saved it
> >> two times from the same process, as Michal suggested (i wasn't able to
> >> take more). Here it is:
> >>
> >> First (doesn't look very helpfull):
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >
> >No it is not.
> >
> >> Second:
> >> [<ffffffff810e17d1>] shrink_zone+0x481/0x650
> >> [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
> >> [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
> >> [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
> >> [<ffffffff8114931d>] __getblk+0x14d/0x2c0
> >> [<ffffffff8114c973>] __bread+0x13/0xc0
> >> [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
> >> [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
> >> [<ffffffff81198244>] ext3_get_block+0xc4/0x120
> >> [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
> >> [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
> >> [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
> >> [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
> >> [<ffffffff810d9411>] ra_submit+0x21/0x30
> >> [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
> >> [<ffffffff810ef908>] __do_fault+0x78/0x5a0
> >> [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
> >> [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
> >> [<ffffffff8102715b>] do_page_fault+0x13b/0x490
> >> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >
> >This is the direct reclaim path. You are simply running out of memory
> >globaly. There is no memcg specific code in that trace.
>
>
> No, i'm not. Here is htop and server graphs from this case:

Bahh, right you are. I didn't look at the trace carefully. It is
free_more_memory which calls the direct reclaim shrinking.

Sorry about the confusion.
--
Michal Hocko
SUSE Labs
azurIt
2013-09-18 18:02:39 UTC
Permalink
> CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>On Wed 18-09-13 16:33:06, azurIt wrote:
>> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >On Wed 18-09-13 16:03:04, azurIt wrote:
>> >[..]
>> >> I was finally able to get stack of problematic process :) I saved it
>> >> two times from the same process, as Michal suggested (i wasn't able to
>> >> take more). Here it is:
>> >>
>> >> First (doesn't look very helpfull):
>> >> [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> >No it is not.
>> >
>> >> Second:
>> >> [<ffffffff810e17d1>] shrink_zone+0x481/0x650
>> >> [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
>> >> [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
>> >> [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
>> >> [<ffffffff8114931d>] __getblk+0x14d/0x2c0
>> >> [<ffffffff8114c973>] __bread+0x13/0xc0
>> >> [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
>> >> [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
>> >> [<ffffffff81198244>] ext3_get_block+0xc4/0x120
>> >> [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
>> >> [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
>> >> [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
>> >> [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
>> >> [<ffffffff810d9411>] ra_submit+0x21/0x30
>> >> [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
>> >> [<ffffffff810ef908>] __do_fault+0x78/0x5a0
>> >> [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
>> >> [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
>> >> [<ffffffff8102715b>] do_page_fault+0x13b/0x490
>> >> [<ffffffff815cb87f>] page_fault+0x1f/0x30
>> >> [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> >This is the direct reclaim path. You are simply running out of memory
>> >globaly. There is no memcg specific code in that trace.
>>
>>
>> No, i'm not. Here is htop and server graphs from this case:
>
>Bahh, right you are. I didn't look at the trace carefully. It is
>free_more_memory which calls the direct reclaim shrinking.
>
>Sorry about the confusion


It happened again, and this time I got this 5 times:
[<ffffffffffffffff>] 0xffffffffffffffff

:( It's probably looping very fast, so I'll need some luck.

azur
Michal Hocko
2013-09-18 18:36:17 UTC
Permalink
On Wed 18-09-13 20:02:39, azurIt wrote:
> > CC: "Johannes Weiner" <hannes-***@public.gmane.org>, "Andrew Morton" <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+***@public.gmane.org>, "David Rientjes" <rientjes-hpIqsD4AKlfQT0dZR+***@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+***@public.gmane.org>, "KOSAKI Motohiro" <kosaki.motohiro-+***@public.gmane.org>, linux-mm-***@public.gmane.org, cgroups-***@public.gmane.org, x86-DgEjT+Ai2ygdnm+***@public.gmane.org, linux-arch-***@public.gmane.org, linux-kernel-***@public.gmane.org
> >On Wed 18-09-13 16:33:06, azurIt wrote:
> >> > CC: "Johannes Weiner" <hannes-***@public.gmane.org>, "Andrew Morton" <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+***@public.gmane.org>, "David Rientjes" <rientjes-hpIqsD4AKlfQT0dZR+***@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+***@public.gmane.org>, "KOSAKI Motohiro" <kosaki.motohiro-+***@public.gmane.org>, linux-mm-***@public.gmane.org, cgroups-***@public.gmane.org, x86-DgEjT+Ai2ygdnm+***@public.gmane.org, linux-arch-***@public.gmane.org, linux-kernel-***@public.gmane.org
> >> >On Wed 18-09-13 16:03:04, azurIt wrote:
> >> >[..]
> >> >> I was finally able to get stack of problematic process :) I saved it
> >> >> two times from the same process, as Michal suggested (i wasn't able to
> >> >> take more). Here it is:
> >> >>
> >> >> First (doesn't look very helpfull):
> >> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >> >
> >> >No it is not.
> >> >
> >> >> Second:
> >> >> [<ffffffff810e17d1>] shrink_zone+0x481/0x650
> >> >> [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
> >> >> [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
> >> >> [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
> >> >> [<ffffffff8114931d>] __getblk+0x14d/0x2c0
> >> >> [<ffffffff8114c973>] __bread+0x13/0xc0
> >> >> [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
> >> >> [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
> >> >> [<ffffffff81198244>] ext3_get_block+0xc4/0x120
> >> >> [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
> >> >> [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
> >> >> [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
> >> >> [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
> >> >> [<ffffffff810d9411>] ra_submit+0x21/0x30
> >> >> [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
> >> >> [<ffffffff810ef908>] __do_fault+0x78/0x5a0
> >> >> [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
> >> >> [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
> >> >> [<ffffffff8102715b>] do_page_fault+0x13b/0x490
> >> >> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> >> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >> >
> >> >This is the direct reclaim path. You are simply running out of memory
> >> >globally. There is no memcg specific code in that trace.
> >>
> >>
> >> No, i'm not. Here is htop and server graphs from this case:
> >
> >Bahh, right you are. I didn't look at the trace carefully. It is
> >free_more_memory which calls the direct reclaim shrinking.
> >
> >Sorry about the confusion
>
>
> Happens again and this time i got 5x this:
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> :( it's probably looping very fast so i need to have some luck

Or it is looping in the userspace.
--
Michal Hocko
SUSE Labs
Johannes Weiner
2013-09-18 18:04:55 UTC
On Wed, Sep 18, 2013 at 04:03:04PM +0200, azurIt wrote:
> >On Tue 17-09-13 13:15:35, azurIt wrote:
> >[...]
> >> Is something unusual on this stack?
> >>
> >>
> >> [<ffffffff810d1a5e>] dump_header+0x7e/0x1e0
> >> [<ffffffff810d195f>] ? find_lock_task_mm+0x2f/0x70
> >> [<ffffffff810d1f25>] oom_kill_process+0x85/0x2a0
> >> [<ffffffff810d24a8>] mem_cgroup_out_of_memory+0xa8/0xf0
> >> [<ffffffff8110fb76>] mem_cgroup_oom_synchronize+0x2e6/0x310
> >> [<ffffffff8110efc0>] ? mem_cgroup_uncharge_page+0x40/0x40
> >> [<ffffffff810d2703>] pagefault_out_of_memory+0x13/0x130
> >> [<ffffffff81026f6e>] mm_fault_error+0x9e/0x150
> >> [<ffffffff81027424>] do_page_fault+0x404/0x490
> >> [<ffffffff810f952c>] ? do_mmap_pgoff+0x3dc/0x430
> >> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> >
> >This is a regular memcg OOM killer, which dumps messages about what it
> >is going to do. So no, nothing unusual, unless it stayed like that forever,
> >which would mean that oom_kill_process is in an endless loop. But a
> >single stack doesn't tell us much.
> >
> >Just a note. When you see something hogging a cpu and you are not sure
> >whether it might be in an endless loop inside the kernel it makes sense
> >to take several snapshots of the stack trace and see if it changes. If
> >not and the process is not sleeping (there is no schedule on the trace)
> >then it might be looping somewhere waiting for Godot. If it is sleeping
> >then it is slightly harder because you would have to identify what it is
> >waiting for, which requires knowing the deeper context.
> >--
> >Michal Hocko
> >SUSE Labs
>
>
>
> I was finally able to get stack of problematic process :) I saved it two times from the same process, as Michal suggested (i wasn't able to take more). Here it is:
>
> First (doesn't look very helpful):
> [<ffffffffffffffff>] 0xffffffffffffffff
>
>
> Second:
> [<ffffffff810e17d1>] shrink_zone+0x481/0x650
> [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
> [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
> [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
> [<ffffffff8114931d>] __getblk+0x14d/0x2c0
> [<ffffffff8114c973>] __bread+0x13/0xc0
> [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
> [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
> [<ffffffff81198244>] ext3_get_block+0xc4/0x120
> [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
> [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
> [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
> [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
> [<ffffffff810d9411>] ra_submit+0x21/0x30
> [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
> [<ffffffff810ef908>] __do_fault+0x78/0x5a0
> [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
> [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
> [<ffffffff8102715b>] do_page_fault+0x13b/0x490
> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff

Ah, crap. I'm sorry. You even showed us this exact trace before in
another context, but I did not fully realize what __getblk() is doing.

My subsequent patches made a charge attempt return -ENOMEM without
reclaim if the memcg is under OOM. And so the reason you have these
reclaim livelocks is because __getblk never fails on -ENOMEM. When
the allocation returns -ENOMEM, it invokes GLOBAL DIRECT RECLAIM and
tries again in an endless loop. The memcg code would previously just
loop inside the charge, reclaiming and killing, until the allocation
succeeded. But the new code relies on the fault stack being unwound
to complete the OOM kill. And since the stack is not unwound with
__getblk() looping around the allocation there is no more memcg
reclaim AND no memcg OOM kill, thus no chance of exiting.
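
To make the loop concrete, here is a toy userspace model of it. This
is only a sketch and not kernel code: task_model, charge_model(),
fault_end_model() and getblk_model() are made-up names, and max_tries
is an artificial bound added so the simulation terminates (the real
loop has none).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task OOM state; not a kernel type. */
struct task_model {
	bool oom_pending;	/* an OOM context was recorded during this fault */
	int oom_kills;		/* OOM kills performed after the stack unwound */
};

/*
 * Models the charge attempt: the memcg is at its limit, so the charge
 * records the OOM context and fails with -ENOMEM. Nothing is reclaimed
 * or killed in here.
 */
static int charge_model(struct task_model *t)
{
	t->oom_pending = true;
	return -ENOMEM;
}

/* Models the end of the fault: the only place where the OOM kill happens. */
static void fault_end_model(struct task_model *t)
{
	if (t->oom_pending) {
		t->oom_kills++;
		t->oom_pending = false;
	}
}

/*
 * Models a caller that never gives up on -ENOMEM.
 * Returns the number of failed attempts.
 */
static int getblk_model(struct task_model *t, int max_tries)
{
	int failed = 0;

	while (failed < max_tries) {
		if (charge_model(t) == 0)
			break;
		/* global direct reclaim would run here, then we retry */
		failed++;
	}
	return failed;
}
```

Calling getblk_model() with a bound of 1000 on a zeroed task_model
uses up all 1000 attempts while oom_kills stays at 0, because
fault_end_model() can only run once the loop has returned.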

That code is weird but really old, so it may take a while to evaluate
all the callers as to whether this can be changed.

In the meantime, I would just allow __getblk to bypass the memcg limit
when it still can't charge after reclaim. Does the below get your
machine back on track?

---

diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7..83c8716 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1085,6 +1085,8 @@ grow_buffers(struct block_device *bdev, sector_t block, int size)
static struct buffer_head *
__getblk_slow(struct block_device *bdev, sector_t block, int size)
{
+ struct buffer_head *bh = NULL;
+
/* Size must be multiple of hard sectorsize */
if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
(size < 512 || size > PAGE_SIZE))) {
@@ -1097,20 +1099,23 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
return NULL;
}

+ mem_cgroup_oom_enable();
for (;;) {
- struct buffer_head * bh;
int ret;

bh = __find_get_block(bdev, block, size);
if (bh)
- return bh;
+ break;

ret = grow_buffers(bdev, block, size);
if (ret < 0)
- return NULL;
+ break;
if (ret == 0)
free_more_memory();
}
+ mem_cgroup_oom_disable();
+ mem_cgroup_oom_synchronize(false);
+ return bh;
}

/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 325da07..e441647 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,16 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

-static inline void mem_cgroup_enable_oom(void)
+static inline void mem_cgroup_oom_enable(void)
{
- WARN_ON(current->memcg_oom.may_oom);
- current->memcg_oom.may_oom = 1;
+ current->memcg_oom.may_oom++;
}

-static inline void mem_cgroup_disable_oom(void)
+static inline void mem_cgroup_oom_disable(void)
{
WARN_ON(!current->memcg_oom.may_oom);
- current->memcg_oom.may_oom = 0;
+ current->memcg_oom.may_oom--;
}

static inline bool task_in_memcg_oom(struct task_struct *p)
@@ -352,11 +351,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

-static inline void mem_cgroup_enable_oom(void)
+static inline void mem_cgroup_oom_enable(void)
{
}

-static inline void mem_cgroup_disable_oom(void)
+static inline void mem_cgroup_oom_disable(void)
{
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fb1f145..dc71a17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1571,7 +1571,7 @@ struct task_struct {
struct memcg_oom_info {
struct mem_cgroup *memcg;
gfp_t gfp_mask;
- unsigned int may_oom:1;
+ unsigned int may_oom;
} memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f565857..1441fc5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1878,7 +1878,6 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
*/
css_get(&memcg->css);
current->memcg_oom.memcg = memcg;
- mem_cgroup_mark_under_oom(memcg);
current->memcg_oom.gfp_mask = mask;
}

@@ -1930,6 +1929,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
* under OOM is always welcomed, use TASK_KILLABLE here.
*/
prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ mem_cgroup_mark_under_oom(memcg);

locked = mem_cgroup_oom_trylock(memcg);

@@ -1937,10 +1937,12 @@ bool mem_cgroup_oom_synchronize(bool handle)
mem_cgroup_oom_notify(memcg);

if (locked && !memcg->oom_kill_disable) {
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask);
} else {
schedule();
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
}

@@ -1954,7 +1956,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
memcg_oom_recover(memcg);
}
cleanup:
- mem_cgroup_unmark_under_oom(memcg);
current->memcg_oom.memcg = NULL;
css_put(&memcg->css);
return true;
@@ -2340,10 +2341,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto bypass;

/*
- * Task already OOMed, just get out of here.
+ * Task already OOMed, just allow it to finish the fault as
+ * quickly as possible to start the OOM handling.
*/
if (unlikely(current->memcg_oom.memcg))
- goto nomem;
+ goto bypass;

/*
* We always charge the cgroup the mm_struct belongs to.
@@ -2417,9 +2419,6 @@ again:
if (oom && !nr_reclaim_retries)
enter_oom = true;

- if (atomic_read(&memcg->under_oom))
- enter_oom = true;
-
ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, enter_oom);
switch (ret) {
case CHARGE_OK:
diff --git a/mm/memory.c b/mm/memory.c
index 20c43a0..3d82ef9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3513,12 +3513,12 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* space. Kernel faults are handled more gracefully.
*/
if (flags & FAULT_FLAG_USER)
- mem_cgroup_enable_oom();
+ mem_cgroup_oom_enable();

ret = __handle_mm_fault(mm, vma, address, flags);

if (flags & FAULT_FLAG_USER) {
- mem_cgroup_disable_oom();
+ mem_cgroup_oom_disable();
/*
* The task may have entered a memcg OOM situation but
* if the allocation error was handled gracefully (no
Johannes Weiner
2013-09-18 18:19:46 UTC
On Wed, Sep 18, 2013 at 02:04:55PM -0400, Johannes Weiner wrote:
> On Wed, Sep 18, 2013 at 04:03:04PM +0200, azurIt wrote:
> > >On Tue 17-09-13 13:15:35, azurIt wrote:
> > >[...]
> > >> Is something unusual on this stack?
> > >>
> > >>
> > >> [<ffffffff810d1a5e>] dump_header+0x7e/0x1e0
> > >> [<ffffffff810d195f>] ? find_lock_task_mm+0x2f/0x70
> > >> [<ffffffff810d1f25>] oom_kill_process+0x85/0x2a0
> > >> [<ffffffff810d24a8>] mem_cgroup_out_of_memory+0xa8/0xf0
> > >> [<ffffffff8110fb76>] mem_cgroup_oom_synchronize+0x2e6/0x310
> > >> [<ffffffff8110efc0>] ? mem_cgroup_uncharge_page+0x40/0x40
> > >> [<ffffffff810d2703>] pagefault_out_of_memory+0x13/0x130
> > >> [<ffffffff81026f6e>] mm_fault_error+0x9e/0x150
> > >> [<ffffffff81027424>] do_page_fault+0x404/0x490
> > >> [<ffffffff810f952c>] ? do_mmap_pgoff+0x3dc/0x430
> > >> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> > >
> > >This is a regular memcg OOM killer, which dumps messages about what it
> > >is going to do. So no, nothing unusual, unless it stayed like that forever,
> > >which would mean that oom_kill_process is in an endless loop. But a
> > >single stack doesn't tell us much.
> > >
> > >Just a note. When you see something hogging a cpu and you are not sure
> > >whether it might be in an endless loop inside the kernel it makes sense
> > >to take several snapshots of the stack trace and see if it changes. If
> > >not and the process is not sleeping (there is no schedule on the trace)
> > >then it might be looping somewhere waiting for Godot. If it is sleeping
> > >then it is slightly harder because you would have to identify what it is
> > >waiting for, which requires knowing the deeper context.
> > >--
> > >Michal Hocko
> > >SUSE Labs
> >
> >
> >
> > I was finally able to get stack of problematic process :) I saved it two times from the same process, as Michal suggested (i wasn't able to take more). Here it is:
> >
> > First (doesn't look very helpful):
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> >
> > Second:
> > [<ffffffff810e17d1>] shrink_zone+0x481/0x650
> > [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
> > [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
> > [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
> > [<ffffffff8114931d>] __getblk+0x14d/0x2c0
> > [<ffffffff8114c973>] __bread+0x13/0xc0
> > [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
> > [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
> > [<ffffffff81198244>] ext3_get_block+0xc4/0x120
> > [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
> > [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
> > [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
> > [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
> > [<ffffffff810d9411>] ra_submit+0x21/0x30
> > [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
> > [<ffffffff810ef908>] __do_fault+0x78/0x5a0
> > [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
> > [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
> > [<ffffffff8102715b>] do_page_fault+0x13b/0x490
> > [<ffffffff815cb87f>] page_fault+0x1f/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> Ah, crap. I'm sorry. You even showed us this exact trace before in
> another context, but I did not fully realize what __getblk() is doing.
>
> My subsequent patches made a charge attempt return -ENOMEM without
> reclaim if the memcg is under OOM. And so the reason you have these
> reclaim livelocks is because __getblk never fails on -ENOMEM. When
> the allocation returns -ENOMEM, it invokes GLOBAL DIRECT RECLAIM and
> tries again in an endless loop. The memcg code would previously just
> loop inside the charge, reclaiming and killing, until the allocation
> succeeded. But the new code relies on the fault stack being unwound
> to complete the OOM kill. And since the stack is not unwound with
> __getblk() looping around the allocation there is no more memcg
> reclaim AND no memcg OOM kill, thus no chance of exiting.
>
> That code is weird but really old, so it may take a while to evaluate
> all the callers as to whether this can be changed.
>
> In the meantime, I would just allow __getblk to bypass the memcg limit
> when it still can't charge after reclaim. Does the below get your
> machine back on track?

Scratch that. The idea is reasonable but the implementation is not
fully cooked yet. I'll send you an update.
Johannes Weiner
2013-09-18 19:55:04 UTC
On Wed, Sep 18, 2013 at 02:19:46PM -0400, Johannes Weiner wrote:
> On Wed, Sep 18, 2013 at 02:04:55PM -0400, Johannes Weiner wrote:
> > On Wed, Sep 18, 2013 at 04:03:04PM +0200, azurIt wrote:
> > > >On Tue 17-09-13 13:15:35, azurIt wrote:
> > > >[...]
> > > >> Is something unusual on this stack?
> > > >>
> > > >>
> > > >> [<ffffffff810d1a5e>] dump_header+0x7e/0x1e0
> > > >> [<ffffffff810d195f>] ? find_lock_task_mm+0x2f/0x70
> > > >> [<ffffffff810d1f25>] oom_kill_process+0x85/0x2a0
> > > >> [<ffffffff810d24a8>] mem_cgroup_out_of_memory+0xa8/0xf0
> > > >> [<ffffffff8110fb76>] mem_cgroup_oom_synchronize+0x2e6/0x310
> > > >> [<ffffffff8110efc0>] ? mem_cgroup_uncharge_page+0x40/0x40
> > > >> [<ffffffff810d2703>] pagefault_out_of_memory+0x13/0x130
> > > >> [<ffffffff81026f6e>] mm_fault_error+0x9e/0x150
> > > >> [<ffffffff81027424>] do_page_fault+0x404/0x490
> > > >> [<ffffffff810f952c>] ? do_mmap_pgoff+0x3dc/0x430
> > > >> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> > > >
> > > >This is a regular memcg OOM killer, which dumps messages about what it
> > > >is going to do. So no, nothing unusual, unless it stayed like that forever,
> > > >which would mean that oom_kill_process is in an endless loop. But a
> > > >single stack doesn't tell us much.
> > > >
> > > >Just a note. When you see something hogging a cpu and you are not sure
> > > >whether it might be in an endless loop inside the kernel it makes sense
> > > >to take several snapshots of the stack trace and see if it changes. If
> > > >not and the process is not sleeping (there is no schedule on the trace)
> > > >then it might be looping somewhere waiting for Godot. If it is sleeping
> > > >then it is slightly harder because you would have to identify what it is
> > > >waiting for, which requires knowing the deeper context.
> > > >--
> > > >Michal Hocko
> > > >SUSE Labs
> > >
> > >
> > >
> > > I was finally able to get stack of problematic process :) I saved it two times from the same process, as Michal suggested (i wasn't able to take more). Here it is:
> > >
> > > > First (doesn't look very helpful):
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> > >
> > >
> > > Second:
> > > [<ffffffff810e17d1>] shrink_zone+0x481/0x650
> > > [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
> > > [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
> > > [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
> > > [<ffffffff8114931d>] __getblk+0x14d/0x2c0
> > > [<ffffffff8114c973>] __bread+0x13/0xc0
> > > [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
> > > [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
> > > [<ffffffff81198244>] ext3_get_block+0xc4/0x120
> > > [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
> > > [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
> > > [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
> > > [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
> > > [<ffffffff810d9411>] ra_submit+0x21/0x30
> > > [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
> > > [<ffffffff810ef908>] __do_fault+0x78/0x5a0
> > > [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
> > > [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
> > > [<ffffffff8102715b>] do_page_fault+0x13b/0x490
> > > [<ffffffff815cb87f>] page_fault+0x1f/0x30
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > Ah, crap. I'm sorry. You even showed us this exact trace before in
> > another context, but I did not fully realize what __getblk() is doing.
> >
> > My subsequent patches made a charge attempt return -ENOMEM without
> > reclaim if the memcg is under OOM. And so the reason you have these
> > reclaim livelocks is because __getblk never fails on -ENOMEM. When
> > the allocation returns -ENOMEM, it invokes GLOBAL DIRECT RECLAIM and
> > tries again in an endless loop. The memcg code would previously just
> > loop inside the charge, reclaiming and killing, until the allocation
> > succeeded. But the new code relies on the fault stack being unwound
> > to complete the OOM kill. And since the stack is not unwound with
> > __getblk() looping around the allocation there is no more memcg
> > reclaim AND no memcg OOM kill, thus no chance of exiting.
> >
> > That code is weird but really old, so it may take a while to evaluate
> > all the callers as to whether this can be changed.
> >
> > In the meantime, I would just allow __getblk to bypass the memcg limit
> > when it still can't charge after reclaim. Does the below get your
> > machine back on track?
>
> Scratch that. The idea is reasonable but the implementation is not
> fully cooked yet. I'll send you an update.

Here is an update. Full replacement on top of 3.2 since we tried a
dead end and it would be more painful to revert individual changes.

The first bug you had was the same task entering OOM repeatedly and
leaking the memcg reference, thus creating undeletable memcgs. My
fixup added a condition that if the task already set up an OOM context
in that fault, another charge attempt would immediately return -ENOMEM
without even trying reclaim anymore. This dropped __getblk() into an
endless loop of waking the flushers and performing global reclaim and
memcg returning -ENOMEM regardless of free memory.

The update now basically only changes this -ENOMEM to bypass, so that
the memory is not accounted and the limit ignored. OOM killed tasks
are granted the same right, so that they can exit quickly and release
memory. Likewise, we want a task that hit the OOM condition also to
finish the fault quickly so that it can invoke the OOM killer.
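
As a rough illustration of the difference, here is a toy userspace
model of the bypass. It is a sketch under simplifying assumptions and
not the kernel code: memcg_model, task_model, try_charge_model() and
getblk_model() are made-up names, the may_oom check is left out, and
max_tries only exists so the simulation is bounded.

```c
#include <errno.h>

/* Hypothetical stand-ins; not kernel types. */
struct memcg_model {
	long usage;	/* pages currently accounted */
	long limit;	/* hard limit in pages */
};

struct task_model {
	struct memcg_model *oom_memcg;	/* OOM context recorded in this fault */
	long bypassed;			/* pages handed out without accounting */
};

/*
 * Models the charge path with the bypass: once the task has recorded an
 * OOM context, later charges succeed without being accounted, so the
 * fault can finish and the OOM handling can run afterwards.
 */
static int try_charge_model(struct task_model *t, struct memcg_model *cg,
			    long pages)
{
	if (t->oom_memcg) {
		t->bypassed += pages;	/* limit ignored, nothing accounted */
		return 0;
	}
	if (cg->usage + pages <= cg->limit) {
		cg->usage += pages;	/* regular accounted charge */
		return 0;
	}
	t->oom_memcg = cg;		/* remember the OOM context */
	return -ENOMEM;
}

/*
 * Models a caller that retries on -ENOMEM. Returns the attempt number
 * that succeeded, or -1 if max_tries was exhausted.
 */
static int getblk_model(struct task_model *t, struct memcg_model *cg,
			int max_tries)
{
	int tries;

	for (tries = 1; tries <= max_tries; tries++)
		if (try_charge_model(t, cg, 1) == 0)
			return tries;
	return -1;
}
```

With a memcg already at its limit, the first attempt in this model
fails and records the OOM context, and the second attempt is bypassed,
so the retry loop exits instead of spinning.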

Does the following work for you, azur?

---

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..314fe53 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,30 +842,22 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
}

-static noinline int
+static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address);
- return 1;
+ if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+ up_read(&current->mm->mmap_sem);
+ no_context(regs, error_code, address);
+ return;
}
- if (!(fault & VM_FAULT_ERROR))
- return 0;

if (fault & VM_FAULT_OOM) {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & PF_USER)) {
up_read(&current->mm->mmap_sem);
no_context(regs, error_code, address);
- return 1;
+ return;
}

out_of_memory(regs, error_code, address);
@@ -876,7 +868,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
else
BUG();
}
- return 1;
}

static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1070,6 +1061,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+ flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1167,9 +1159,17 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, flags);

- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
- if (mm_fault_error(regs, error_code, address, fault))
- return;
+ /*
+ * If we need to retry but a fatal signal is pending, handle the
+ * signal first. We do not need to release the mmap_sem because it
+ * would already be released in __lock_page_or_retry in mm/filemap.c.
+ */
+ if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+ return;
+
+ if (unlikely(fault & VM_FAULT_ERROR)) {
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}

/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..1b29ac5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,25 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

+static inline void mem_cgroup_oom_enable(void)
+{
+ WARN_ON(current->memcg_oom.may_oom);
+ current->memcg_oom.may_oom = 1;
+}
+
+static inline void mem_cgroup_oom_disable(void)
+{
+ WARN_ON(!current->memcg_oom.may_oom);
+ current->memcg_oom.may_oom = 0;
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return p->memcg_oom.memcg;
+}
+
+bool mem_cgroup_oom_synchronize(bool handle);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -333,6 +352,24 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline void mem_cgroup_oom_enable(void)
+{
+}
+
+static inline void mem_cgroup_oom_disable(void)
+{
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(bool handle)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..846b82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */

/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..fb1f145 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1568,6 +1568,11 @@ struct task_struct {
unsigned long nr_pages; /* uncharged usage */
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
+ struct memcg_oom_info {
+ struct mem_cgroup *memcg;
+ gfp_t gfp_mask;
+ unsigned int may_oom:1;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..66cc373 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1743,16 +1743,19 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
return total;
}

+static DEFINE_SPINLOCK(memcg_oom_lock);
+
/*
* Check OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
- * Has to be called with memcg_oom_lock
*/
-static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
+static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter, *failed = NULL;
bool cond = true;

+ spin_lock(&memcg_oom_lock);
+
for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
if (iter->oom_lock) {
/*
@@ -1765,34 +1768,34 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
iter->oom_lock = true;
}

- if (!failed)
- return true;
-
- /*
- * OK, we failed to lock the whole subtree so we have to clean up
- * what we set up to the failing subtree
- */
- cond = true;
- for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
- if (iter == failed) {
- cond = false;
- continue;
+ if (failed) {
+ /*
+ * OK, we failed to lock the whole subtree so we have
+ * to clean up what we set up to the failing subtree
+ */
+ cond = true;
+ for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
+ if (iter == failed) {
+ cond = false;
+ continue;
+ }
+ iter->oom_lock = false;
}
- iter->oom_lock = false;
}
- return false;
+
+ spin_unlock(&memcg_oom_lock);
+
+ return !failed;
}

-/*
- * Has to be called with memcg_oom_lock
- */
-static int mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter;

+ spin_lock(&memcg_oom_lock);
for_each_mem_cgroup_tree(iter, memcg)
iter->oom_lock = false;
- return 0;
+ spin_unlock(&memcg_oom_lock);
}

static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
@@ -1816,7 +1819,6 @@ static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
atomic_add_unless(&iter->under_oom, -1, 0);
}

-static DEFINE_SPINLOCK(memcg_oom_lock);
static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);

struct oom_wait_info {
@@ -1856,56 +1858,95 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
memcg_wakeup_oom(memcg);
}

-/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
+{
+ if (!current->memcg_oom.may_oom)
+ return;
+ /*
+ * We are in the middle of the charge context here, so we
+ * don't want to block when potentially sitting on a callstack
+ * that holds all kinds of filesystem and mm locks.
+ *
+ * Also, the caller may handle a failed allocation gracefully
+ * (like optional page cache readahead) and so an OOM killer
+ * invocation might not even be necessary.
+ *
+ * That's why we don't do anything here except remember the
+ * OOM context and then deal with it at the end of the page
+ * fault when the stack is unwound, the locks are released,
+ * and when we know whether the fault was overall successful.
+ */
+ css_get(&memcg->css);
+ current->memcg_oom.memcg = memcg;
+ current->memcg_oom.gfp_mask = mask;
+}
+
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ * @handle: actually kill/wait or just clean up the OOM state
+ *
+ * This has to be called at the end of a page fault if the memcg OOM
+ * handler was enabled.
+ *
+ * Memcg supports userspace OOM handling where failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to complete the OOM handling.
+ *
+ * Returns %true if an ongoing memcg OOM situation was detected and
+ * completed, %false otherwise.
*/
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+bool mem_cgroup_oom_synchronize(bool handle)
{
+ struct mem_cgroup *memcg = current->memcg_oom.memcg;
struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked;
+
+ /* OOM is global, do not handle */
+ if (!memcg)
+ return false;
+
+ if (!handle)
+ goto cleanup;

owait.mem = memcg;
owait.wait.flags = 0;
owait.wait.func = memcg_oom_wake_function;
owait.wait.private = current;
INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);

- /* At first, try to OOM lock hierarchy under memcg.*/
- spin_lock(&memcg_oom_lock);
- locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- if (!locked || memcg->oom_kill_disable)
- need_to_kill = false;
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
if (locked)
mem_cgroup_oom_notify(memcg);
- spin_unlock(&memcg_oom_lock);

- if (need_to_kill) {
+ if (locked && !memcg->oom_kill_disable) {
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
- mem_cgroup_out_of_memory(memcg, mask);
+ mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask);
} else {
schedule();
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
}
- spin_lock(&memcg_oom_lock);
- if (locked)
- mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
-
- mem_cgroup_unmark_under_oom(memcg);

- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
- return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+ if (locked) {
+ mem_cgroup_oom_unlock(memcg);
+ /*
+ * There is no guarantee that an OOM-lock contender
+ * sees the wakeups triggered by the OOM kill
+ * uncharges. Wake any sleepers explicitly.
+ */
+ memcg_oom_recover(memcg);
+ }
+cleanup:
+ current->memcg_oom.memcg = NULL;
+ css_put(&memcg->css);
return true;
}

@@ -2195,11 +2236,10 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2297,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask);

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2292,6 +2328,12 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto bypass;

/*
+ * Task already OOMed, let it finish quickly.
+ */
+ if (unlikely(current->memcg_oom.memcg))
+ goto bypass;
+
+ /*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
@@ -2349,7 +2391,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2357,13 +2399,7 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2376,16 +2412,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 829d437..3d82ef9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3439,22 +3439,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3503,6 +3495,43 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ /*
+ * Enable the memcg OOM handling for faults triggered in user
+ * space. Kernel faults are handled more gracefully.
+ */
+ if (flags & FAULT_FLAG_USER)
+ mem_cgroup_oom_enable();
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (flags & FAULT_FLAG_USER) {
+ mem_cgroup_oom_disable();
+ /*
+ * The task may have entered a memcg OOM situation but
+ * if the allocation error was handled gracefully (no
+ * VM_FAULT_OOM), there is no need to kill anything.
+ * Just clean up the OOM state peacefully.
+ */
+ if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
+ mem_cgroup_oom_synchronize(false);
+ }
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..3bf664c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
*/
void pagefault_out_of_memory(void)
{
+ if (mem_cgroup_oom_synchronize(true))
+ return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
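
For readers following along, the two-step scheme in the patch above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: `struct memcg_model`, `struct task_model`, `model_charge_oom()` and `model_oom_synchronize()` are hypothetical stand-ins invented for illustration, the `may_oom` field is an assumed representation of the enable/disable toggle, and all locking, waitqueues and the actual kill are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct memcg_model {
	int refcount;   /* models the css reference held while OOM state is recorded */
	int oom_kills;  /* how often the OOM handling was actually carried out */
};

struct task_model {
	struct memcg_model *oom_memcg; /* models current->memcg_oom.memcg */
	bool may_oom;                  /* assumed form of the enable/disable toggle */
};

/* Step 1: in the charge context, only remember the OOM state in the task. */
static void model_charge_oom(struct task_model *t, struct memcg_model *m)
{
	/* A task that already recorded an OOM does not record a second one. */
	if (!t->may_oom || t->oom_memcg)
		return;
	m->refcount++;
	t->oom_memcg = m;
}

/* Step 2: once the fault stack is unwound, handle or discard that state. */
static bool model_oom_synchronize(struct task_model *t, bool handle)
{
	struct memcg_model *m = t->oom_memcg;

	if (!m)
		return false;   /* nothing recorded: caller falls back to global OOM */
	if (handle)
		m->oom_kills++; /* stands in for killing or sleeping on the waitqueue */
	t->oom_memcg = NULL;    /* cleanup runs for both handle and !handle */
	m->refcount--;
	return true;
}
```

The point of the split is visible in the model: the charge step never blocks, and the reference taken there is dropped exactly once, whether the fault ends with VM_FAULT_OOM (handle) or was resolved gracefully (cleanup only).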
azurIt
2013-09-18 20:52:19 UTC
>On Wed, Sep 18, 2013 at 02:19:46PM -0400, Johannes Weiner wrote:
>> On Wed, Sep 18, 2013 at 02:04:55PM -0400, Johannes Weiner wrote:
>> > On Wed, Sep 18, 2013 at 04:03:04PM +0200, azurIt wrote:
>> > > >On Tue 17-09-13 13:15:35, azurIt wrote:
>> > > >[...]
>> > > >> Is something unusual on this stack?
>> > > >>
>> > > >>
>> > > >> [<ffffffff810d1a5e>] dump_header+0x7e/0x1e0
>> > > >> [<ffffffff810d195f>] ? find_lock_task_mm+0x2f/0x70
>> > > >> [<ffffffff810d1f25>] oom_kill_process+0x85/0x2a0
>> > > >> [<ffffffff810d24a8>] mem_cgroup_out_of_memory+0xa8/0xf0
>> > > >> [<ffffffff8110fb76>] mem_cgroup_oom_synchronize+0x2e6/0x310
>> > > >> [<ffffffff8110efc0>] ? mem_cgroup_uncharge_page+0x40/0x40
>> > > >> [<ffffffff810d2703>] pagefault_out_of_memory+0x13/0x130
>> > > >> [<ffffffff81026f6e>] mm_fault_error+0x9e/0x150
>> > > >> [<ffffffff81027424>] do_page_fault+0x404/0x490
>> > > >> [<ffffffff810f952c>] ? do_mmap_pgoff+0x3dc/0x430
>> > > >> [<ffffffff815cb87f>] page_fault+0x1f/0x30
>> > > >
>> > > >This is the regular memcg OOM killer, which dumps messages about what it is
>> > > >going to do. So no, nothing unusual, unless it stayed like that forever,
>> > > >which would mean that oom_kill_process is in an endless loop. But a
>> > > >single stack doesn't tell us much.
>> > > >
>> > > >Just a note. When you see something hogging a cpu and you are not sure
>> > > >whether it might be in an endless loop inside the kernel, it makes sense
>> > > >to take several snapshots of the stack trace and see if it changes. If
>> > > >not, and the process is not sleeping (there is no schedule on the trace),
>> > > >then it might be looping somewhere waiting for Godot. If it is sleeping
>> > > >then it is slightly harder, because you would have to identify what it is
>> > > >waiting for, which requires knowing the deeper context.
>> > > >--
>> > > >Michal Hocko
>> > > >SUSE Labs
>> > >
>> > >
>> > >
>> > > I was finally able to get the stack of the problematic process :) I saved it twice from the same process, as Michal suggested (I wasn't able to take more). Here it is:
>> > >
>> > > First (doesn't look very helpful):
>> > > [<ffffffffffffffff>] 0xffffffffffffffff
>> > >
>> > >
>> > > Second:
>> > > [<ffffffff810e17d1>] shrink_zone+0x481/0x650
>> > > [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
>> > > [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
>> > > [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
>> > > [<ffffffff8114931d>] __getblk+0x14d/0x2c0
>> > > [<ffffffff8114c973>] __bread+0x13/0xc0
>> > > [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
>> > > [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
>> > > [<ffffffff81198244>] ext3_get_block+0xc4/0x120
>> > > [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
>> > > [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
>> > > [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
>> > > [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
>> > > [<ffffffff810d9411>] ra_submit+0x21/0x30
>> > > [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
>> > > [<ffffffff810ef908>] __do_fault+0x78/0x5a0
>> > > [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
>> > > [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
>> > > [<ffffffff8102715b>] do_page_fault+0x13b/0x490
>> > > [<ffffffff815cb87f>] page_fault+0x1f/0x30
>> > > [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> > Ah, crap. I'm sorry. You even showed us this exact trace before in
>> > another context, but I did not fully realize what __getblk() is doing.
>> >
>> > My subsequent patches made a charge attempt return -ENOMEM without
>> > reclaim if the memcg is under OOM. And so the reason you have these
>> > reclaim livelocks is because __getblk never fails on -ENOMEM. When
>> > the allocation returns -ENOMEM, it invokes GLOBAL DIRECT RECLAIM and
>> > tries again in an endless loop. The memcg code would previously just
>> > loop inside the charge, reclaiming and killing, until the allocation
>> > succeeded. But the new code relies on the fault stack being unwound
>> > to complete the OOM kill. And since the stack is not unwound with
>> > __getblk() looping around the allocation there is no more memcg
>> > reclaim AND no memcg OOM kill, thus no chance of exiting.
>> >
>> > That code is weird but really old, so it may take a while to evaluate
>> > all the callers as to whether this can be changed.
>> >
>> > In the meantime, I would just allow __getblk to bypass the memcg limit
>> > when it still can't charge after reclaim. Does the below get your
>> > machine back on track?
>>
>> Scratch that. The idea is reasonable but the implementation is not
>> fully cooked yet. I'll send you an update.
>
>Here is an update. Full replacement on top of 3.2 since we tried a
>dead end and it would be more painful to revert individual changes.
>
>The first bug you had was the same task entering OOM repeatedly and
>leaking the memcg reference, thus creating undeletable memcgs. My
>fixup added a condition that if the task already set up an OOM context
>in that fault, another charge attempt would immediately return -ENOMEM
>without even trying reclaim anymore. This dropped __getblk() into an
>endless loop of waking the flushers and performing global reclaim and
>memcg returning -ENOMEM regardless of free memory.
>
>The update now basically only changes this -ENOMEM to bypass, so that
>the memory is not accounted and the limit ignored. OOM killed tasks
>are granted the same right, so that they can exit quickly and release
>memory. Likewise, we want a task that hit the OOM condition also to
>finish the fault quickly so that it can invoke the OOM killer.
>
>Does the following work for you, azur?



Compiled fine, I will install the new kernel tonight. Thank you!

azur
azurIt
2013-09-25 07:26:45 UTC

Today it is one week without any problem, so I'm *disabling* several of my scripts which were supposed to work around problems related to 'my' kernel bugs (so servers won't go down). I will also install the patches on several other servers and will report back in a few weeks if no problems occur. Thank you! :)

Btw, will it then be possible to include these patches in vanilla 3.2? Who decides that?

azur
azurIt
2013-09-26 16:54:59 UTC

Johannes,

bad news everyone! :(

Unfortunately, two different problems appeared today:

1.) This looks like my very original problem - stuck processes inside one cgroup. I took stacks from all of them over time, but the server was very slow so I had to kill them soon:
http://watchdog.sk/lkmlmemcg-bug-9.tar.gz

2.) This was just like my last problem, where a few processes were doing huge I/O. As the server was almost inoperable I barely managed to kill them, so no more info here, sorry.

azur
Johannes Weiner
2013-09-26 19:27:43 UTC
Hi azur,

On Thu, Sep 26, 2013 at 06:54:59PM +0200, azurIt wrote:
> On Wed, Sep 18, 2013 at 02:19:46PM -0400, Johannes Weiner wrote:
> >Here is an update. Full replacement on top of 3.2 since we tried a
> >dead end and it would be more painful to revert individual changes.
> >
> >The first bug you had was the same task entering OOM repeatedly and
> >leaking the memcg reference, thus creating undeletable memcgs. My
> >fixup added a condition that if the task already set up an OOM context
> >in that fault, another charge attempt would immediately return -ENOMEM
> >without even trying reclaim anymore. This dropped __getblk() into an
> >endless loop of waking the flushers and performing global reclaim and
> >memcg returning -ENOMEM regardless of free memory.
> >
> >The update now basically only changes this -ENOMEM to bypass, so that
> >the memory is not accounted and the limit ignored. OOM killed tasks
> >are granted the same right, so that they can exit quickly and release
> >memory. Likewise, we want a task that hit the OOM condition also to
> >finish the fault quickly so that it can invoke the OOM killer.
> >
> >Does the following work for you, azur?
>
>
> Johannes,
>
> bad news everyone! :(
>
> Unfortunately, two different problems appeared today:
>
> 1.) This looks like my very original problem - stuck processes inside one cgroup. I took stacks from all of them over time, but the server was very slow so I had to kill them soon:
> http://watchdog.sk/lkmlmemcg-bug-9.tar.gz
>
> 2.) This was just like my last problem, where a few processes were doing huge I/O. As the server was almost inoperable I barely managed to kill them, so no more info here, sorry.

From one of the tasks:

1380213238/11210/stack:[<ffffffff810528f1>] sys_sched_yield+0x41/0x70
1380213238/11210/stack:[<ffffffff81148ef1>] free_more_memory+0x21/0x60
1380213238/11210/stack:[<ffffffff8114957d>] __getblk+0x14d/0x2c0
1380213238/11210/stack:[<ffffffff81198a2b>] ext3_getblk+0xeb/0x240
1380213238/11210/stack:[<ffffffff8119d2df>] ext3_find_entry+0x13f/0x480
1380213238/11210/stack:[<ffffffff8119dd6d>] ext3_lookup+0x4d/0x120
1380213238/11210/stack:[<ffffffff81122a55>] d_alloc_and_lookup+0x45/0x90
1380213238/11210/stack:[<ffffffff81122ff8>] do_lookup+0x278/0x390
1380213238/11210/stack:[<ffffffff81124c40>] path_lookupat+0x120/0x800
1380213238/11210/stack:[<ffffffff81125355>] do_path_lookup+0x35/0xd0
1380213238/11210/stack:[<ffffffff811254d9>] user_path_at_empty+0x59/0xb0
1380213238/11210/stack:[<ffffffff81125541>] user_path_at+0x11/0x20
1380213238/11210/stack:[<ffffffff81115b70>] sys_faccessat+0xd0/0x200
1380213238/11210/stack:[<ffffffff81115cb8>] sys_access+0x18/0x20
1380213238/11210/stack:[<ffffffff815ccc26>] system_call_fastpath+0x18/0x1d

Should have seen this coming... it's still in that braindead
__getblk() loop, only from a syscall this time (no OOM path). The
group's memory.stat looks like this:

cache 0
rss 0
mapped_file 0
pgpgin 0
pgpgout 0
swap 0
pgfault 0
pgmajfault 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 209715200
hierarchical_memsw_limit 209715200
total_cache 0
total_rss 209715200
total_mapped_file 0
total_pgpgin 1028153297
total_pgpgout 1028102097
total_swap 0
total_pgfault 1352903120
total_pgmajfault 45342
total_inactive_anon 0
total_active_anon 209715200
total_inactive_file 0
total_active_file 0
total_unevictable 0

with anonymous pages up to the limit, and you probably don't have any
swap space available to anything in the group.
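
A check like the one described above can be scripted against memory.stat-style text. The following is a minimal sketch, not part of any patch: `stat_lookup()` and `anon_at_limit()` are hypothetical helper names, and the code assumes the "key value" one-pair-per-line layout shown in the dump, operating on a string rather than reading the cgroup file itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "key value" line in memory.stat-style text.
 * Returns true and stores the value if the key starts a line.
 */
static bool stat_lookup(const char *text, const char *key,
			unsigned long long *value)
{
	size_t keylen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Require a space after the key so "rss" cannot match "rss_huge". */
		if (!strncmp(line, key, keylen) && line[keylen] == ' ')
			return sscanf(line + keylen, "%llu", value) == 1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return false;
}

/* True if anonymous memory alone fills the hierarchical limit. */
static bool anon_at_limit(const char *text)
{
	unsigned long long limit, inactive, active;

	if (!stat_lookup(text, "hierarchical_memory_limit", &limit) ||
	    !stat_lookup(text, "total_inactive_anon", &inactive) ||
	    !stat_lookup(text, "total_active_anon", &active))
		return false;
	return inactive + active >= limit;
}
```

With the numbers from the dump (limit 209715200, total_active_anon 209715200, total_inactive_anon 0) this reports the group as full of anonymous memory, which without swap is the case where reclaim cannot make progress.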

I guess there is no way around annotating that __getblk() loop. The
best solution right now is probably to use __GFP_NOFAIL. For one, we
can let the allocation bypass the memcg limit if reclaim can't make
progress. But also, the loop is then actually happening inside the
page allocator, where it should happen, and not around ad-hoc direct
reclaim in buffer.c.
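
To illustrate the reasoning, here is a small userspace model of the retry loop. It is only a sketch: `struct group_model`, `model_charge()`, `model_getblk()` and the `MODEL_GFP_NOFAIL` value are made up for the example and are not the kernel's definitions, and the `max_tries` bound exists purely so the model terminates where the real loop would not.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_ENOMEM     12
#define MODEL_GFP_NOFAIL 0x1u  /* hypothetical flag value, not the kernel's */

struct group_model {
	unsigned long usage;     /* charged pages */
	unsigned long limit;     /* group limit in pages */
	unsigned long bypassed;  /* pages handed out without being charged */
};

/* Models the charge path: fail at the limit unless the caller must not fail. */
static int model_charge(struct group_model *g, unsigned int gfp)
{
	if (g->usage < g->limit) {
		g->usage++;
		return 0;
	}
	if (gfp & MODEL_GFP_NOFAIL) {
		g->bypassed++;   /* allocation proceeds, limit is ignored */
		return 0;
	}
	return -MODEL_ENOMEM;
}

/*
 * Models a caller that retries until the allocation succeeds, like the
 * __getblk() loop. Returns the number of attempts, or -1 if it gave up
 * after max_tries (the real loop never gives up).
 */
static int model_getblk(struct group_model *g, unsigned int gfp, int max_tries)
{
	for (int tries = 1; tries <= max_tries; tries++) {
		if (model_charge(g, gfp) == 0)
			return tries;
		/*
		 * The real loop runs global reclaim here, which cannot help
		 * when the group is full of unreclaimable anonymous pages.
		 */
	}
	return -1;
}
```

In the model, a group sitting at its limit never satisfies a plain request no matter how often the caller retries, while the same request with the no-fail flag succeeds on the first attempt and is counted as bypassed rather than charged.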

Can you try this on top of our ever-growing stack of patches?

---
fs/buffer.c | 14 ++++++++++++--
mm/memcontrol.c | 2 ++
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7..9bd0e05 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -998,9 +998,19 @@ grow_dev_page(struct block_device *bdev, sector_t block,
struct inode *inode = bdev->bd_inode;
struct page *page;
struct buffer_head *bh;
+ gfp_t gfp_mask;

- page = find_or_create_page(inode->i_mapping, index,
- (mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS)|__GFP_MOVABLE);
+ gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
+ gfp_mask |= __GFP_MOVABLE;
+ /*
+ * XXX: __getblk_slow() can not really deal with failure and
+ * will endlessly loop on improvised global reclaim. Prefer
+ * looping in the allocator rather than here, at least that
+ * code knows what it's doing.
+ */
+ gfp_mask |= __GFP_NOFAIL;
+
+ page = find_or_create_page(inode->i_mapping, index, gfp_mask);
if (!page)
return NULL;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 66cc373..5aee2fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2429,6 +2429,8 @@ done:
return 0;
nomem:
*ptr = NULL;
+ if (gfp_mask & __GFP_NOFAIL)
+ return 0;
return -ENOMEM;
bypass:
*ptr = NULL;
--
1.8.4
azurIt
2013-09-27 02:04:23 UTC
>Hi azur,
>
>On Thu, Sep 26, 2013 at 06:54:59PM +0200, azurIt wrote:
>> On Wed, Sep 18, 2013 at 02:19:46PM -0400, Johannes Weiner wrote:
>> >Here is an update. Full replacement on top of 3.2 since we tried a
>> >dead end and it would be more painful to revert individual changes.
>> >
>> >The first bug you had was the same task entering OOM repeatedly and
>> >leaking the memcg reference, thus creating undeletable memcgs. My
>> >fixup added a condition that if the task already set up an OOM context
>> >in that fault, another charge attempt would immediately return -ENOMEM
>> >without even trying reclaim anymore. This dropped __getblk() into an
>> >endless loop of waking the flushers and performing global reclaim and
>> >memcg returning -ENOMEM regardless of free memory.
>> >
>> >The update now basically only changes this -ENOMEM to bypass, so that
>> >the memory is not accounted and the limit ignored. OOM killed tasks
>> >are granted the same right, so that they can exit quickly and release
>> >memory. Likewise, we want a task that hit the OOM condition also to
>> >finish the fault quickly so that it can invoke the OOM killer.
>> >
>> >Does the following work for you, azur?
>>
>>
>> Johannes,
>>
>> bad news everyone! :(
>>
>> Unfortunaely, two different problems appears today:
>>
>> 1.) This looks like my very original problem - stucked processes inside one cgroup. I took stacks from all of them over time but server was very slow so i had to kill them soon:
>> http://watchdog.sk/lkmlmemcg-bug-9.tar.gz
>>
>> 2.) This was just like my last problem where few processes were doing huge i/o. As sever was almost unoperable i barely killed them so no more info here, sorry.
>
>From one of the tasks:
>
>1380213238/11210/stack:[<ffffffff810528f1>] sys_sched_yield+0x41/0x70
>1380213238/11210/stack:[<ffffffff81148ef1>] free_more_memory+0x21/0x60
>1380213238/11210/stack:[<ffffffff8114957d>] __getblk+0x14d/0x2c0
>1380213238/11210/stack:[<ffffffff81198a2b>] ext3_getblk+0xeb/0x240
>1380213238/11210/stack:[<ffffffff8119d2df>] ext3_find_entry+0x13f/0x480
>1380213238/11210/stack:[<ffffffff8119dd6d>] ext3_lookup+0x4d/0x120
>1380213238/11210/stack:[<ffffffff81122a55>] d_alloc_and_lookup+0x45/0x90
>1380213238/11210/stack:[<ffffffff81122ff8>] do_lookup+0x278/0x390
>1380213238/11210/stack:[<ffffffff81124c40>] path_lookupat+0x120/0x800
>1380213238/11210/stack:[<ffffffff81125355>] do_path_lookup+0x35/0xd0
>1380213238/11210/stack:[<ffffffff811254d9>] user_path_at_empty+0x59/0xb0
>1380213238/11210/stack:[<ffffffff81125541>] user_path_at+0x11/0x20
>1380213238/11210/stack:[<ffffffff81115b70>] sys_faccessat+0xd0/0x200
>1380213238/11210/stack:[<ffffffff81115cb8>] sys_access+0x18/0x20
>1380213238/11210/stack:[<ffffffff815ccc26>] system_call_fastpath+0x18/0x1d
>
>Should have seen this coming... it's still in that braindead
>__getblk() loop, only from a syscall this time (no OOM path). The
>group's memory.stat looks like this:
>
>cache 0
>rss 0
>mapped_file 0
>pgpgin 0
>pgpgout 0
>swap 0
>pgfault 0
>pgmajfault 0
>inactive_anon 0
>active_anon 0
>inactive_file 0
>active_file 0
>unevictable 0
>hierarchical_memory_limit 209715200
>hierarchical_memsw_limit 209715200
>total_cache 0
>total_rss 209715200
>total_mapped_file 0
>total_pgpgin 1028153297
>total_pgpgout 1028102097
>total_swap 0
>total_pgfault 1352903120
>total_pgmajfault 45342
>total_inactive_anon 0
>total_active_anon 209715200
>total_inactive_file 0
>total_active_file 0
>total_unevictable 0
>
>with anonymous pages to the limit and you probably don't have any swap
>space enabled to anything in the group.
>
>I guess there is no way around annotating that __getblk() loop. The
>best solution right now is probably to use __GFP_NOFAIL. For one, we
>can let the allocation bypass the memcg limit if reclaim can't make
>progress. But also, the loop is then actually happening inside the
>page allocator, where it should happen, and not around ad-hoc direct
>reclaim in buffer.c.
>
>Can you try this on top of our ever-growing stack of patches?


Installed, thank you!

azur
azurIt
2013-10-07 11:01:49 UTC
Johannes,

it looks like the problem is completely resolved :) Thank you, Michal Hocko, and everyone involved for your help and time.

One more thing:
I see that your patches are going into 3.12. Is there a chance of getting them into 3.2 as well? Is Ben Hutchings (the current maintainer of the 3.2 branch) competent to decide this? Should I contact him directly? I can't upgrade to 3.12 because stable grsecurity is for 3.2 and I don't think this will change in the near future.


azur

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2013-10-07 19:23:36 UTC
Hi azur,

On Mon, Oct 07, 2013 at 01:01:49PM +0200, azurIt wrote:
> Joahnnes,
>
> looks like the problem is completely resolved :) Thank you, Michal
> Hocko and everyone involved for help and time.

Thanks a lot for your patience. I will send out the fixes for 3.12.

> One more thing: I see that your patches are going into 3.12. Is
> there a chance to get them also into 3.2? Is Ben Hutchings (current
> maintainer of 3.2 branch) competent to decide this? Should i contact
> him directly? I can't upgrade to 3.12 because stable grsecurity is
> for 3.2 and i don't think this will change in near future.

Yes, I'll send them to stable. The original OOM killer rework was not
tagged for stable, but since we have a known deadlock problem, I think
it makes sense to include them after all.
azurIt
2013-10-09 18:44:50 UTC
>Hi azur,
>
>On Mon, Oct 07, 2013 at 01:01:49PM +0200, azurIt wrote:
>> Joahnnes,
>>
>> looks like the problem is completely resolved :) Thank you, Michal
>> Hocko and everyone involved for help and time.
>
>Thanks a lot for your patience. I will send out the fixes for 3.12.
>
>> One more thing: I see that your patches are going into 3.12. Is
>> there a chance to get them also into 3.2? Is Ben Hutchings (current
>> maintainer of 3.2 branch) competent to decide this? Should i contact
>> him directly? I can't upgrade to 3.12 because stable grsecurity is
>> for 3.2 and i don't think this will change in near future.
>
>Yes, I'll send them to stable. The original OOM killer rework was not
>tagged for stable, but since we have a known deadlock problem, I think
>it makes sense to include them after all.



Johannes,

I'm very sorry to say it, but today something strange happened.. :) I was right at the computer so I noticed it almost immediately, but I don't have much info. The server stopped responding from the net, but I was already logged in over ssh, which was working quite fine (only a little slow). I was able to run commands in the shell but I didn't do much because I was afraid it would go down for good soon. I noticed a few things:
- htop was strange because all CPUs were doing nothing (totally nothing)
- there was enough free memory
- server load was about 90 and was rising slowly
- I didn't see ANY process in 'run' state
- I also didn't see any process with strange behavior (taking much CPU, memory or so), so it wasn't obvious what to do to fix it
- I started to kill Apache processes; every time I killed some, the CPUs did some work, but it wasn't fixing the problem
- finally I did 'skill -kill apache2' in the shell and everything started to work
- server monitoring wasn't sending any data so I have no graphs
- nothing interesting in the logs

I will send more info when I get some.

azur
Johannes Weiner
2013-10-10 00:14:22 UTC
Hi azur,

On Wed, Oct 09, 2013 at 08:44:50PM +0200, azurIt wrote:
> Joahnnes,
>
> i'm very sorry to say it but today something strange happened.. :) i was just right at the computer so i noticed it almost immediately but i don't have much info. Server stoped to respond from the net but i was already logged on ssh which was working quite fine (only a little slow). I was able to run commands on shell but i didn't do much because i was afraid that it will goes down for good soon. I noticed few things:
> - htop was strange because all CPUs were doing nothing (totally nothing)
> - there were enough of free memory
> - server load was about 90 and was raising slowly
> - i didn't see ANY process in 'run' state
> - i also didn't see any process with strange behavior (taking much CPU, memory or so) so it wasn't obvious what to do to fix it
> - i started to kill Apache processes, everytime i killed some, CPUs did some work, but it wasn't fixing the problem
> - finally i did 'skill -kill apache2' in shell and everything started to work
> - server monitoring wasn't sending any data so i have no graphs
> - nothing interesting in logs
>
> I will send more info when i get some.

Somebody else reported a problem on the upstream patches as well. Any
chance you can confirm the stacks of the active but not running tasks?
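
One way to pick out such tasks from userspace is to look at the state field of /proc/<pid>/stat. The sketch below is only an illustration with made-up helper names (stat_line_state, counts_toward_load). It relies on the documented layout "pid (comm) state ..." and on the fact that the Linux load average counts both runnable (R) and uninterruptible (D) tasks, which is how a load of 90 can coexist with idle CPUs. Whether the stuck tasks here were really in D state is an assumption until stacks are available.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the state character from one line of /proc/<pid>/stat,
 * or '\0' if the line does not look like "pid (comm) state ...".
 * The command name may itself contain spaces or parentheses, so the
 * state is located relative to the LAST closing parenthesis.
 */
static char stat_line_state(const char *line)
{
	const char *close;

	assert(line != NULL);
	close = strrchr(line, ')');
	if (close == NULL || close[1] != ' ' || close[2] == '\0')
		return '\0';
	return close[2];
}

/* Running (R) and uninterruptible (D) tasks both add to the load average. */
static int counts_toward_load(char state)
{
	return state == 'R' || state == 'D';
}
```

For every task that reports D here, /proc/<pid>/stack is the interesting file to capture.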

It sounds like they are stuck on a waitqueue, the question is which
one. I forgot to disable OOM for __GFP_NOFAIL allocations, so they
could succeed and leak an OOM context. task structs are not
reinitialized between alloc & free so a different task could later try
to oom trylock a memcg that has been freed, fail, and wait
indefinitely on the OOM waitqueue. There might be a simpler
explanation but I can't think of anything right now.

But the OOM context is definitely being leaked, so please apply the
following for your next reboot:

---
mm/memcontrol.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5aee2fa..83ad39b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2341,6 +2341,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 */
 	if (!*ptr && !mm)
 		goto bypass;
+
+	if (gfp_mask & __GFP_NOFAIL)
+		oom = false;
 again:
 	if (*ptr) { /* css should be a valid one */
 		memcg = *ptr;
--
1.8.4
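
To make the leak described above easier to follow, here is a small userspace model of the fix, not kernel code. The names model_task, model_charge_at_limit and MODEL_GFP_NOFAIL are invented for this sketch, and the group is assumed to be at its limit on every call. The point it tries to show: a no-fail charge is going to be bypassed and reported as success, so it must not arm an OOM context that nobody will unwind afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value for this model; not the kernel's definition. */
#define MODEL_GFP_NOFAIL 0x1u

/* Hypothetical stand-in for the per-task OOM state. */
struct model_task {
	bool oom_context;	/* set when the task must handle a memcg OOM */
};

/*
 * Model a charge attempt against a group that is at its limit.
 * 'oom' says whether the caller allows OOM handling for this charge.
 */
static int model_charge_at_limit(struct model_task *t, unsigned int gfp,
				 bool oom)
{
	assert(t != NULL);

	/* The fix: a no-fail charge never sets up OOM handling. */
	if (gfp & MODEL_GFP_NOFAIL)
		oom = false;

	/* A failing charge with OOM enabled leaves a context to unwind. */
	if (oom)
		t->oom_context = true;

	/* No-fail charges are bypassed and reported as success. */
	if (gfp & MODEL_GFP_NOFAIL)
		return 0;

	return -ENOMEM;
}
```

Without the first check, the no-fail call would return 0 with oom_context still set. The caller sees success and has no reason to clean up, which is the leaked context the patch addresses.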
azurIt
2013-10-10 22:59:45 UTC
>On Wed, Oct 09, 2013 at 08:44:50PM +0200, azurIt wrote:
>> Joahnnes,
>>
>> i'm very sorry to say it but today something strange happened.. :) i was just right at the computer so i noticed it almost immediately but i don't have much info. Server stoped to respond from the net but i was already logged on ssh which was working quite fine (only a little slow). I was able to run commands on shell but i didn't do much because i was afraid that it will goes down for good soon. I noticed few things:
>> - htop was strange because all CPUs were doing nothing (totally nothing)
>> - there were enough of free memory
>> - server load was about 90 and was raising slowly
>> - i didn't see ANY process in 'run' state
>> - i also didn't see any process with strange behavior (taking much CPU, memory or so) so it wasn't obvious what to do to fix it
>> - i started to kill Apache processes, everytime i killed some, CPUs did some work, but it wasn't fixing the problem
>> - finally i did 'skill -kill apache2' in shell and everything started to work
>> - server monitoring wasn't sending any data so i have no graphs
>> - nothing interesting in logs
>>
>> I will send more info when i get some.
>
>Somebody else reported a problem on the upstream patches as well. Any
>chance you can confirm the stacks of the active but not running tasks?



Unfortunately, I don't have any stacks, but I will try to take some next time.



>It sounds like they are stuck on a waitqueue, the question is which
>one. I forgot to disable OOM for __GFP_NOFAIL allocations, so they
>could succeed and leak an OOM context. task structs are not
>reinitialized between alloc & free so a different task could later try
>to oom trylock a memcg that has been freed, fail, and wait
>indefinitely on the OOM waitqueue. There might be a simpler
>explanation but I can't think of anything right now.
>
>But the OOM context is definitely being leaked, so please apply the
>following for your next reboot:


It's installed, thank you!

azur
azurIt
2013-09-17 11:20:17 UTC
> CC: "Michal Hocko" <***@suse.cz>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>On Mon, Sep 16, 2013 at 10:52:46PM +0200, azurIt wrote:
>> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >On Mon 16-09-13 17:05:43, azurIt wrote:
>> >> > CC: "Johannes Weiner" <***@cmpxchg.org>, "Andrew Morton" <***@linux-foundation.org>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>> >> >On Mon 16-09-13 16:13:16, azurIt wrote:
>> >> >[...]
>> >> >> >You can use sysrq+l via serial console to see tasks hogging the CPU or
>> >> >> >sysrq+t to see all the existing tasks.
>> >> >>
>> >> >>
>> >> >> Doesn't work here, it just prints 'l' resp. 't'.
>> >> >
>> >> >I am using telnet for accessing my serial consoles exported by
>> >> >the multiplicator or KVM and it can send sysrq via ctrl+t (Send
>> >> >Break). Check your serial console setup.
>> >>
>> >>
>> >>
>> >> I'm using Raritan KVM and i created keyboard macro 'sysrq + l' resp.
>> >> 'sysrq + t'. I'm also unable to use it on my local PC. Maybe it needs
>> >> to be enabled somehow?
>> >
>> >Probably yes. echo 1 > /proc/sys/kernel/sysrq should enable all sysrq
>> >commands. You can select also some of them (have a look at
>> >Documentation/sysrq.txt for more information)
>>
>>
>> Now it happens again and i was just looking on the server's
>> htop. I'm sure that this time it was only one process (apache)
>> running under user account (not root). It was taking about 100% CPU
>> (about 100% of one core). I was able to kill it by hand inside htop
>> but everything was very slow, server load was immediately on
>> 500. I'm sure it must be related to that Johannes kernel patches
>> because i'm also using i/o throttling in cgroups via Block IO
>> controller so users are unable to create such a huge I/O. I will try
>> to take stacks of processes but i'm not able to identify the
>> problematic process so i will have to take them from *all* apache
>> processes while killing them.
>
>It would be fantastic if you could capture those stacks. sysrq+t
>captures ALL of them in one go and drops them into your syslog.
>
>/proc/<pid>/stack for individual tasks works too.



Btw, this is what it looked like:
http://watchdog.sk/lkml/htop2.jpg

azur

azurIt
2013-09-16 10:22:59 UTC
> CC: "Andrew Morton" <***@linux-foundation.org>, "Michal Hocko" <***@suse.cz>, "David Rientjes" <***@google.com>, "KAMEZAWA Hiroyuki" <***@jp.fujitsu.com>, "KOSAKI Motohiro" <***@jp.fujitsu.com>, linux-***@kvack.org, ***@vger.kernel.org, ***@kernel.org, linux-***@vger.kernel.org, linux-***@vger.kernel.org
>On Wed, Sep 11, 2013 at 09:41:18PM +0200, azurIt wrote:
>> >On Wed, Sep 11, 2013 at 08:54:48PM +0200, azurIt wrote:
>> >> >On Wed, Sep 11, 2013 at 02:33:05PM +0200, azurIt wrote:
>> >> >> >On Tue, Sep 10, 2013 at 11:32:47PM +0200, azurIt wrote:
>> >> >> >> >On Tue, Sep 10, 2013 at 11:08:53PM +0200, azurIt wrote:
>> >> >> >> >> >On Tue, Sep 10, 2013 at 09:32:53PM +0200, azurIt wrote:
>> >> >> >> >> >> Here is full kernel log between 6:00 and 7:59:
>> >> >> >> >> >> http://watchdog.sk/lkml/kern6.log
>> >> >> >> >> >
>> >> >> >> >> >Wow, your apaches are like the hydra. Whenever one is OOM killed,
>> >> >> >> >> >more show up!
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Yeah, it's supposed to do this ;)
>> >> >> >
>> >> >> >How are you expecting the machine to recover from an OOM situation,
>> >> >> >though? I guess I don't really understand what these machines are
>> >> >> >doing. But if you are overloading them like crazy, isn't that the
>> >> >> >expected outcome?
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> There's no global OOM, server has enough of memory. OOM is occuring only in cgroups (customers who simply don't want to pay for more memory).
>> >> >
>> >> >Yes, sure, but when the cgroups are thrashing, they use the disk and
>> >> >CPU to the point where the overall system is affected.
>> >>
>> >>
>> >>
>> >>
>> >> Didn't know that there is a disk usage because of this, i never noticed anything yet.
>> >
>> >You said there was heavy IO going on...?
>>
>>
>>
>> Yes, there usually was a big IO but it was related to that
>> deadlocking bug in kernel (or i assume it was). I never saw a big IO
>> in normal conditions even when there were lots of OOM in
>> cgroups. I'm even not using swap because of this so i was assuming
>> that lacks of memory is not doing any additional IO (or am i
>> wrong?). And if you mean that last problem with IO from Monday, i
>> don't exactly know what happens but it's really long time when we
>> had so big problem with IO that it disables also root login on
>> console.
>
>The deadlocking problem should be separate from this.
>
>Even without swap, the binaries and libraries of the running tasks can
>get reclaimed (and immediately faulted back from disk, i.e thrashing).
>
>Usually the OOM killer should kick in before tasks cannibalize each
>other like that.
>
>The patch you were using did in fact have the side effect of widening
>the window between tasks entering heavy reclaim and the OOM killer
>kicking in, so it could explain the IO worsening while fixing the dead
>lock problem.
>
>That followup patch tries to narrow this window by quite a bit and
>tries to stop concurrent reclaim when the group is already OOM.
>

Johannes,

unfortunately, it's happening several times per day and we cannot work like this :( I will boot the previous kernel tonight. If you have any patches that could help me or you, please send them so I can install them with this reboot. Thank you.

azur

azurIt
2013-09-04 09:45:23 UTC
>Hello azur,
>
>On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >>Hi azur,
>> >>
>> >>here is the x86-only rollup of the series for 3.2.
>> >>
>> >>Thanks!
>> >>Johannes
>> >>---
>> >
>> >
>> >Johannes,
>> >
>> >unfortunately, one problem arises: I have (again) cgroup which cannot be deleted :( it's a user who had very high memory usage and was reaching his limit very often. Do you need any info which i can gather now?
>
>Did the OOM killer go off in this group?
>
>Was there a warning in the syslog ("Fixing unhandled memcg OOM
>context")?
>
>If it happens again, could you check if there are tasks left in the
>cgroup? And provide /proc/<pid>/stack of the hung task trying to
>delete the cgroup?
>
>> Now i can definitely confirm that problem is NOT fixed :( it happened again but i don't have any data because i already disabled all debug output.
>
>Which debug output?
>
>Do you still have access to the syslog?
>
>It's possible that, as your system does not deadlock on the OOMing
>cgroup anymore, you hit a separate bug...
>
>Thanks!



My script has just detected (and killed) another frozen cgroup. I must say that I'm not 100% sure that the cgroup was really frozen, but it had 99% or more memory usage for at least 30 seconds (or, at least, it had 99% memory usage both times the script checked it). Here are the stacks of the processes inside it before they were killed:



pid: 26490
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26503
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26517
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26518
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26519
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26520
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26521
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26522
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26523
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26524
stack:
[<ffffffff81052671>] sys_sched_yield+0x41/0x70
[<ffffffff81148d91>] free_more_memory+0x21/0x60
[<ffffffff8114941d>] __getblk+0x14d/0x2c0
[<ffffffff8119888b>] ext3_getblk+0xeb/0x240
[<ffffffff811989f9>] ext3_bread+0x19/0x90
[<ffffffff8119cea3>] ext3_dx_find_entry+0x83/0x1e0
[<ffffffff8119d2e4>] ext3_find_entry+0x2e4/0x480
[<ffffffff8119dbcd>] ext3_lookup+0x4d/0x120
[<ffffffff811228f5>] d_alloc_and_lookup+0x45/0x90
[<ffffffff81125578>] __lookup_hash+0xa8/0xf0
[<ffffffff81127852>] do_last+0x312/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26526
stack:
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26531
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26533
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26536
stack:
[<ffffffff81080a45>] refrigerator+0x95/0x160
[<ffffffff8106ac2b>] get_signal_to_deliver+0x1cb/0x540
[<ffffffff8100188b>] do_signal+0x6b/0x750
[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
[<ffffffff815cb662>] retint_signal+0x3d/0x7b
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26539
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff

Michal Hocko
2013-09-04 11:57:41 UTC
On Wed 04-09-13 11:45:23, azurIt wrote:
[...]
> My script has just detected (and killed) another freezed cgroup. I
> must say that i'm not 100% sure that cgroup was really freezed but it
> has 99% or more memory usage for at least 30 seconds (well, or it has
> 99% memory usage in both two cases the script was checking it). Here
> are stacks of processes inside it before they were killed:
[...]
> pid: 26536
> stack:
> [<ffffffff81080a45>] refrigerator+0x95/0x160
> [<ffffffff8106ac2b>] get_signal_to_deliver+0x1cb/0x540
> [<ffffffff8100188b>] do_signal+0x6b/0x750
> [<ffffffff81001fc5>] do_notify_resume+0x55/0x80
> [<ffffffff815cb662>] retint_signal+0x3d/0x7b
> [<ffffffffffffffff>] 0xffffffffffffffff

[...]

This task is sitting in the refrigerator, which most probably means it
has been frozen by the freezer cgroup. I am not familiar with the
implementation, but my recollection is that you have to thaw that group
in order for the killed process to exit.
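
A minimal sketch of the freeze, kill, thaw sequence, assuming the cgroup v1 freezer controller, whose freezer.state file accepts "FROZEN" and "THAWED". The function names, the directory layout, and the injectable kill parameter are hypothetical and are not taken from the script discussed in this thread. Freezing can take a moment to complete, so a real script would probably need to wait until freezer.state reads back as FROZEN before sending signals.

```python
import os
import signal


def set_freezer_state(cgroup_dir, state):
    """Write a state ("FROZEN" or "THAWED") to the group's freezer.state file."""
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def kill_frozen_group(cgroup_dir, pids, kill=os.kill):
    """Freeze the group, send SIGKILL to every pid, then thaw the group.

    A frozen task sits in the refrigerator and does not act on SIGKILL
    until it is thawed, so the thaw happens in a finally block.
    """
    set_freezer_state(cgroup_dir, "FROZEN")
    try:
        for pid in pids:
            kill(pid, signal.SIGKILL)
    finally:
        set_freezer_state(cgroup_dir, "THAWED")
```

The kill parameter exists only so the sequence can be exercised without signalling real processes.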
--
Michal Hocko
SUSE Labs

azurIt
2013-09-04 12:10:00 UTC
>[...]
>> My script has just detected (and killed) another freezed cgroup. I
>> must say that i'm not 100% sure that cgroup was really freezed but it
>> has 99% or more memory usage for at least 30 seconds (well, or it has
>> 99% memory usage in both two cases the script was checking it). Here
>> are stacks of processes inside it before they were killed:
>[...]
>> pid: 26536
>> stack:
>> [<ffffffff81080a45>] refrigerator+0x95/0x160
>> [<ffffffff8106ac2b>] get_signal_to_deliver+0x1cb/0x540
>> [<ffffffff8100188b>] do_signal+0x6b/0x750
>> [<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>> [<ffffffff815cb662>] retint_signal+0x3d/0x7b
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
>[...]
>
>This task is sitting in the refigerator which means it has been frozen
>by the freezer cgroup most probably. I am not familiar with the
>implementation but my recollection is that you have to thaw that group
>in order the killed process can pass away.
>--
>Michal Hocko
>SUSE Labs
>



Yes, my script freezes the cgroup before killing the processes inside it. The stacks are taken after the freeze; is that a problem?

azur
Michal Hocko
2013-09-04 12:26:32 UTC
On Wed 04-09-13 14:10:00, azurIt wrote:
> >[...]
> >> My script has just detected (and killed) another freezed cgroup. I
> >> must say that i'm not 100% sure that cgroup was really freezed but it
> >> has 99% or more memory usage for at least 30 seconds (well, or it has
> >> 99% memory usage in both two cases the script was checking it). Here
> >> are stacks of processes inside it before they were killed:
> >[...]
> >> pid: 26536
> >> stack:
> >> [<ffffffff81080a45>] refrigerator+0x95/0x160
> >> [<ffffffff8106ac2b>] get_signal_to_deliver+0x1cb/0x540
> >> [<ffffffff8100188b>] do_signal+0x6b/0x750
> >> [<ffffffff81001fc5>] do_notify_resume+0x55/0x80
> >> [<ffffffff815cb662>] retint_signal+0x3d/0x7b
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >
> >[...]
> >
> >This task is sitting in the refigerator which means it has been frozen
> >by the freezer cgroup most probably. I am not familiar with the
> >implementation but my recollection is that you have to thaw that group
> >in order the killed process can pass away.
>
> Yes, my script is freezing the cgroup before killing processes inside
> it. Stacks are taken after the freeze, it that problem?

I thought you had a problem removing this particular group...
--
Michal Hocko
SUSE Labs
azurIt
2013-09-04 12:39:25 UTC
>> >[...]
>> >> My script has just detected (and killed) another freezed cgroup. I
>> >> must say that i'm not 100% sure that cgroup was really freezed but it
>> >> has 99% or more memory usage for at least 30 seconds (well, or it has
>> >> 99% memory usage in both two cases the script was checking it). Here
>> >> are stacks of processes inside it before they were killed:
>> >[...]
>> >> pid: 26536
>> >> stack:
>> >> [<ffffffff81080a45>] refrigerator+0x95/0x160
>> >> [<ffffffff8106ac2b>] get_signal_to_deliver+0x1cb/0x540
>> >> [<ffffffff8100188b>] do_signal+0x6b/0x750
>> >> [<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>> >> [<ffffffff815cb662>] retint_signal+0x3d/0x7b
>> >> [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> >[...]
>> >
>> >This task is sitting in the refigerator which means it has been frozen
>> >by the freezer cgroup most probably. I am not familiar with the
>> >implementation but my recollection is that you have to thaw that group
>> >in order the killed process can pass away.
>>
>> Yes, my script is freezing the cgroup before killing processes inside
>> it. Stacks are taken after the freeze, it that problem?
>
>I thought you had a problem to remove this particular group...



No, this one is different from the unremovable one. This one was probably hung just like when I originally reported this problem (but, as I said, I'm not 100% sure, for the reasons I described). Sorry for the confusion.

azur

azurIt
2013-09-05 09:14:30 UTC
>> >[...]
>> >> My script has just detected (and killed) another freezed cgroup. I
>> >> must say that i'm not 100% sure that cgroup was really freezed but it
>> >> has 99% or more memory usage for at least 30 seconds (well, or it has
>> >> 99% memory usage in both two cases the script was checking it). Here
>> >> are stacks of processes inside it before they were killed:
>> >[...]
>> >> pid: 26536
>> >> stack:
>> >> [<ffffffff81080a45>] refrigerator+0x95/0x160
>> >> [<ffffffff8106ac2b>] get_signal_to_deliver+0x1cb/0x540
>> >> [<ffffffff8100188b>] do_signal+0x6b/0x750
>> >> [<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>> >> [<ffffffff815cb662>] retint_signal+0x3d/0x7b
>> >> [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> >[...]
>> >
>> >This task is sitting in the refigerator which means it has been frozen
>> >by the freezer cgroup most probably. I am not familiar with the
>> >implementation but my recollection is that you have to thaw that group
>> >in order the killed process can pass away.
>>
>> Yes, my script is freezing the cgroup before killing processes inside
>> it. Stacks are taken after the freeze, it that problem?
>
>I thought you had a problem to remove this particular group...
>--
>Michal Hocko
>SUSE Labs




My script detected another frozen cgroup today; I am sending the stacks. Is there anything interesting?



pid: 947
stack:
[<ffffffff810ceefe>] sleep_on_page_killable+0xe/0x40
[<ffffffff810cee57>] __lock_page_killable+0x67/0x70
[<ffffffff810d1067>] generic_file_aio_read+0x4d7/0x790
[<ffffffff81116a8a>] do_sync_read+0xea/0x130
[<ffffffff81117a40>] vfs_read+0xf0/0x220
[<ffffffff81117c71>] sys_read+0x51/0x90
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 949
stack:
[<ffffffff810ceefe>] sleep_on_page_killable+0xe/0x40
[<ffffffff810cee57>] __lock_page_killable+0x67/0x70
[<ffffffff810d1067>] generic_file_aio_read+0x4d7/0x790
[<ffffffff81116a8a>] do_sync_read+0xea/0x130
[<ffffffff81117a40>] vfs_read+0xf0/0x220
[<ffffffff81117c71>] sys_read+0x51/0x90
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 985
stack:
[<ffffffff810ceefe>] sleep_on_page_killable+0xe/0x40
[<ffffffff810cee57>] __lock_page_killable+0x67/0x70
[<ffffffff810d1067>] generic_file_aio_read+0x4d7/0x790
[<ffffffff81116a8a>] do_sync_read+0xea/0x130
[<ffffffff81117a40>] vfs_read+0xf0/0x220
[<ffffffff81117c71>] sys_read+0x51/0x90
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 987
stack:
[<ffffffff810ceefe>] sleep_on_page_killable+0xe/0x40
[<ffffffff810cee57>] __lock_page_killable+0x67/0x70
[<ffffffff810d1067>] generic_file_aio_read+0x4d7/0x790
[<ffffffff81116a8a>] do_sync_read+0xea/0x130
[<ffffffff81117a40>] vfs_read+0xf0/0x220
[<ffffffff81117c71>] sys_read+0x51/0x90
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 1031
stack:
[<ffffffff8110f255>] mem_cgroup_oom_synchronize+0x165/0x190
[<ffffffff810d269e>] pagefault_out_of_memory+0xe/0x120
[<ffffffff81026f5e>] mm_fault_error+0x9e/0x150
[<ffffffff81027414>] do_page_fault+0x404/0x490
[<ffffffff815cb7bf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 1032
stack:
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 1036
stack:
[<ffffffff8110f255>] mem_cgroup_oom_synchronize+0x165/0x190
[<ffffffff810d269e>] pagefault_out_of_memory+0xe/0x120
[<ffffffff81026f5e>] mm_fault_error+0x9e/0x150
[<ffffffff81027414>] do_page_fault+0x404/0x490
[<ffffffff815cb7bf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 1038
stack:
[<ffffffff8110f255>] mem_cgroup_oom_synchronize+0x165/0x190
[<ffffffff810d269e>] pagefault_out_of_memory+0xe/0x120
[<ffffffff81026f5e>] mm_fault_error+0x9e/0x150
[<ffffffff81027414>] do_page_fault+0x404/0x490
[<ffffffff815cb7bf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

Michal Hocko
2013-09-05 09:53:31 UTC
On Thu 05-09-13 11:14:30, azurIt wrote:
[...]
> My script detected another freezed cgroup today, sending stacks. Is
> there anything interesting?

3 tasks are sleeping and waiting for somebody to take action to
resolve the memcg OOM. Is the memcg OOM killer enabled for that group?
If yes, which task has been selected to be killed? You can find that in
the OOM report in dmesg.

I can see a way this might happen. If the killed task happened to
allocate memory while it is exiting, then it would hit the OOM
condition again without freeing any memory, so nobody waiting on the
memcg_oom_waitq gets woken. We have a report like that:
https://lkml.org/lkml/2013/7/31/94

That issue went quiet in the meantime, so it is time to revive it.
It would definitely be good to see what happened in your case, though.
If any of the tasks below was the OOM victim, then it is very probable
that this is the same issue.

> pid: 1031
[...]
> stack:
> [<ffffffff8110f255>] mem_cgroup_oom_synchronize+0x165/0x190
> [<ffffffff810d269e>] pagefault_out_of_memory+0xe/0x120
> [<ffffffff81026f5e>] mm_fault_error+0x9e/0x150
> [<ffffffff81027414>] do_page_fault+0x404/0x490
> [<ffffffff815cb7bf>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
[...]
> pid: 1036
> stack:
> [<ffffffff8110f255>] mem_cgroup_oom_synchronize+0x165/0x190
> [<ffffffff810d269e>] pagefault_out_of_memory+0xe/0x120
> [<ffffffff81026f5e>] mm_fault_error+0x9e/0x150
> [<ffffffff81027414>] do_page_fault+0x404/0x490
> [<ffffffff815cb7bf>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> pid: 1038
> stack:
> [<ffffffff8110f255>] mem_cgroup_oom_synchronize+0x165/0x190
> [<ffffffff810d269e>] pagefault_out_of_memory+0xe/0x120
> [<ffffffff81026f5e>] mm_fault_error+0x9e/0x150
> [<ffffffff81027414>] do_page_fault+0x404/0x490
> [<ffffffff815cb7bf>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff

--
Michal Hocko
SUSE Labs

azurIt
2013-09-05 10:17:00 UTC
>[...]
>> My script detected another freezed cgroup today, sending stacks. Is
>> there anything interesting?
>
>3 tasks are sleeping and waiting for somebody to take an action to
>resolve memcg OOM. The memcg oom killer is enabled for that group? If
>yes, which task has been selected to be killed? You can find that in oom
>report in dmesg.
>
>I can see a way how this might happen. If the killed task happened to
>allocate a memory while it is exiting then it would get to the oom
>condition again without freeing any memory so nobody waiting on the
>memcg_oom_waitq gets woken. We have a report like that:
>https://lkml.org/lkml/2013/7/31/94
>
>The issue got silent in the meantime so it is time to wake it up.
>It would be definitely good to see what happened in your case though.
>If any of the bellow tasks was the oom victim then it is very probable
>this is the same issue.

Here it is:
http://watchdog.sk/lkml/kern5.log

Processes were killed by my script at about 11:05:35.

azur

Michal Hocko
2013-09-05 11:17:42 UTC
On Thu 05-09-13 12:17:00, azurIt wrote:
> >[...]
> >> My script detected another freezed cgroup today, sending stacks. Is
> >> there anything interesting?
> >
> >3 tasks are sleeping and waiting for somebody to take an action to
> >resolve memcg OOM. The memcg oom killer is enabled for that group? If
> >yes, which task has been selected to be killed? You can find that in oom
> >report in dmesg.
> >
> >I can see a way how this might happen. If the killed task happened to
> >allocate a memory while it is exiting then it would get to the oom
> >condition again without freeing any memory so nobody waiting on the
> >memcg_oom_waitq gets woken. We have a report like that:
> >https://lkml.org/lkml/2013/7/31/94
> >
> >The issue got silent in the meantime so it is time to wake it up.
> >It would be definitely good to see what happened in your case though.
> >If any of the bellow tasks was the oom victim then it is very probable
> >this is the same issue.
>
> Here it is:
> http://watchdog.sk/lkml/kern5.log

$ grep "Killed process \<103[168]\>" kern5.log
$

So none of the sleeping tasks has been killed previously.

> Processes were killed by my script

OK, I am really confused now. The log contains a lot of in-kernel memcg
oom killer messages:
$ grep "Memory cgroup out of memory:" kern5.log | wc -l
809

This suggests that the oom killer is not disabled. What exactly has your
script done?

> at about 11:05:35.

There is an oom killer striking at 11:05:35:
Sep 5 11:05:35 server02 kernel: [1751856.433101] Task in /1066/uid killed as a result of limit of /1066
[...]
Sep 5 11:05:35 server02 kernel: [1751856.539356] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Sep 5 11:05:35 server02 kernel: [1751856.539745] [ 1046] 1066 1046 228537 95491 3 0 0 apache2
Sep 5 11:05:35 server02 kernel: [1751856.539894] [ 1047] 1066 1047 228604 95488 6 0 0 apache2
Sep 5 11:05:35 server02 kernel: [1751856.540043] [ 1050] 1066 1050 228470 95452 5 0 0 apache2
Sep 5 11:05:35 server02 kernel: [1751856.540191] [ 1051] 1066 1051 228592 95521 6 0 0 apache2
Sep 5 11:05:35 server02 kernel: [1751856.540340] [ 1052] 1066 1052 228594 95546 5 0 0 apache2
Sep 5 11:05:35 server02 kernel: [1751856.540489] [ 1054] 1066 1054 228470 95453 5 0 0 apache2
Sep 5 11:05:35 server02 kernel: [1751856.540646] Memory cgroup out of memory: Kill process 1046 (apache2) score 1000 or sacrifice child

And this doesn't list any of the tasks sleeping and waiting for the OOM
to be resolved, so they must have been created after this OOM. Is this
the same group?
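
The two grep checks above can be sketched as a small log scan. This is only an illustration: the function names are hypothetical, and the regular expression assumes the "Kill process" / "Killed process" wording visible in the excerpts in this thread.

```python
import re

# Matches both "Kill process 1046 (apache2) ..." and "Killed process 1046 ...".
KILL_RE = re.compile(r"Kill(?:ed)? process (\d+)\b")


def oom_victims(log_lines):
    """Return the set of pids named as OOM victims in the given log lines."""
    victims = set()
    for line in log_lines:
        match = KILL_RE.search(line)
        if match:
            victims.add(int(match.group(1)))
    return victims


def waiters_that_were_victims(log_lines, waiting_pids):
    """Return the waiting pids that also appear as OOM victims in the log."""
    return oom_victims(log_lines) & set(waiting_pids)
```

An empty result for pids 1031, 1036 and 1038 corresponds to the empty grep output shown above.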
--
Michal Hocko
SUSE Labs
azurIt
2013-09-05 11:47:02 UTC
>On Thu 05-09-13 12:17:00, azurIt wrote:
>> >[...]
>> >> My script detected another freezed cgroup today, sending stacks. Is
>> >> there anything interesting?
>> >
>> >3 tasks are sleeping and waiting for somebody to take an action to
>> >resolve memcg OOM. The memcg oom killer is enabled for that group? If
>> >yes, which task has been selected to be killed? You can find that in oom
>> >report in dmesg.
>> >
>> >I can see a way how this might happen. If the killed task happened to
>> >allocate a memory while it is exiting then it would get to the oom
>> >condition again without freeing any memory so nobody waiting on the
>> >memcg_oom_waitq gets woken. We have a report like that:
>> >https://lkml.org/lkml/2013/7/31/94
>> >
>> >The issue got silent in the meantime so it is time to wake it up.
>> >It would be definitely good to see what happened in your case though.
>> >If any of the bellow tasks was the oom victim then it is very probable
>> >this is the same issue.
>>
>> Here it is:
>> http://watchdog.sk/lkml/kern5.log
>
>$ grep "Killed process \<103[168]\>" kern5.log
>$
>
>So none of the sleeping tasks has been killed previously.
>
>> Processes were killed by my script
>
>OK, I am really confused now. The log contains a lot of in-kernel memcg
>oom killer messages:
>$ grep "Memory cgroup out of memory:" kern5.log | wc -l
>809
>
>This suggests that the oom killer is not disabled. What exactly has you
>script done?
>
>> at about 11:05:35.
>
>There is an oom killer striking at 11:05:35:
>Sep 5 11:05:35 server02 kernel: [1751856.433101] Task in /1066/uid killed as a result of limit of /1066
>[...]
>Sep 5 11:05:35 server02 kernel: [1751856.539356] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
>Sep 5 11:05:35 server02 kernel: [1751856.539745] [ 1046] 1066 1046 228537 95491 3 0 0 apache2
>Sep 5 11:05:35 server02 kernel: [1751856.539894] [ 1047] 1066 1047 228604 95488 6 0 0 apache2
>Sep 5 11:05:35 server02 kernel: [1751856.540043] [ 1050] 1066 1050 228470 95452 5 0 0 apache2
>Sep 5 11:05:35 server02 kernel: [1751856.540191] [ 1051] 1066 1051 228592 95521 6 0 0 apache2
>Sep 5 11:05:35 server02 kernel: [1751856.540340] [ 1052] 1066 1052 228594 95546 5 0 0 apache2
>Sep 5 11:05:35 server02 kernel: [1751856.540489] [ 1054] 1066 1054 228470 95453 5 0 0 apache2
>Sep 5 11:05:35 server02 kernel: [1751856.540646] Memory cgroup out of memory: Kill process 1046 (apache2) score 1000 or sacrifice child
>
>And this doesn't list any of the tasks sleeping and waiting for oom
>resolving so they must have been created after this OOM. Is this the
>same group?




The cgroup was 1066. My script does this:
1.) It checks the memory usage of all cgroups and searches for those whose memory usage is >= 99% of their limit.
2.) If any are found, they are saved in an array of 'candidates for killing'.
3.) It sleeps for 30 seconds.
4.) It repeats (1), and if any of the cgroups found were also found in (2), it kills all processes inside them.
5.) It clears the array of saved cgroups and continues.
...

In other words, if any cgroup has memory usage >= 99% of its limit for more than 30 seconds, it is considered 'frozen' and all its processes are killed. This script is tested and was really able to resolve my original problem automatically, without the need to restart the server or cause any outage of services. But, of course, I cannot guarantee that the killed cgroup was really frozen (because of a bug in the Linux kernel); there could be some false positives. For example, a cgroup has 99% memory usage, my script detects it, the OOM killer successfully resolves the problem and, after 30 seconds, the same cgroup again has 99% memory usage and my script detects it again. This is why I'm sending the stacks here: I simply cannot tell whether there was a problem or not. I can disable the script and wait until the problem really occurs, but when it happens, our services will go down. I hope I was clear enough; if not, I can post the source code of that script.
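
The five steps above can be sketched roughly as follows. This is not the actual script: read_usage and kill_cgroup are hypothetical callbacks (one returning a mapping of cgroup name to a (usage, limit) pair, the other killing every process in a group), and only the 99% threshold and the 30 second delay come from the description.

```python
import time

THRESHOLD = 0.99     # fraction of the limit, as described above
RECHECK_DELAY = 30   # seconds between the two checks


def over_threshold(usage_by_cgroup, threshold=THRESHOLD):
    """Return the cgroups whose usage is at least threshold * limit.

    usage_by_cgroup maps a cgroup name to a (usage_bytes, limit_bytes) pair.
    """
    return {
        name
        for name, (usage, limit) in usage_by_cgroup.items()
        if limit > 0 and usage >= threshold * limit
    }


def watchdog_pass(read_usage, kill_cgroup, sleep=time.sleep):
    """Run one detection cycle and return the set of cgroups that were killed."""
    candidates = over_threshold(read_usage())    # steps 1 and 2
    if not candidates:
        return set()
    sleep(RECHECK_DELAY)                         # step 3
    still_over = over_threshold(read_usage())    # step 4: second sample
    victims = candidates & still_over
    for name in sorted(victims):
        kill_cgroup(name)
    return victims                               # step 5: candidates are discarded
```

Two point samples cannot distinguish a group that stayed at its limit for the whole interval from one that recovered and filled up again, which is the false-positive case described above.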

Michal Hocko
2013-09-05 12:03:47 UTC
On Thu 05-09-13 13:47:02, azurIt wrote:
> >On Thu 05-09-13 12:17:00, azurIt wrote:
> >> >[...]
> >> >> My script detected another freezed cgroup today, sending stacks. Is
> >> >> there anything interesting?
> >> >
> >> >3 tasks are sleeping and waiting for somebody to take an action to
> >> >resolve memcg OOM. The memcg oom killer is enabled for that group? If
> >> >yes, which task has been selected to be killed? You can find that in oom
> >> >report in dmesg.
> >> >
> >> >I can see a way how this might happen. If the killed task happened to
> >> >allocate a memory while it is exiting then it would get to the oom
> >> >condition again without freeing any memory so nobody waiting on the
> >> >memcg_oom_waitq gets woken. We have a report like that:
> >> >https://lkml.org/lkml/2013/7/31/94
> >> >
> >> >The issue got silent in the meantime so it is time to wake it up.
> >> >It would be definitely good to see what happened in your case though.
> >> >If any of the bellow tasks was the oom victim then it is very probable
> >> >this is the same issue.
> >>
> >> Here it is:
> >> http://watchdog.sk/lkml/kern5.log
> >
> >$ grep "Killed process \<103[168]\>" kern5.log
> >$
> >
> >So none of the sleeping tasks has been killed previously.
> >
> >> Processes were killed by my script
> >
> >OK, I am really confused now. The log contains a lot of in-kernel memcg
> >oom killer messages:
> >$ grep "Memory cgroup out of memory:" kern5.log | wc -l
> >809
> >
> >This suggests that the oom killer is not disabled. What exactly has you
> >script done?
> >
> >> at about 11:05:35.
> >
> >There is an oom killer striking at 11:05:35:
> >Sep 5 11:05:35 server02 kernel: [1751856.433101] Task in /1066/uid killed as a result of limit of /1066
> >[...]
> >Sep 5 11:05:35 server02 kernel: [1751856.539356] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> >Sep 5 11:05:35 server02 kernel: [1751856.539745] [ 1046] 1066 1046 228537 95491 3 0 0 apache2
> >Sep 5 11:05:35 server02 kernel: [1751856.539894] [ 1047] 1066 1047 228604 95488 6 0 0 apache2
> >Sep 5 11:05:35 server02 kernel: [1751856.540043] [ 1050] 1066 1050 228470 95452 5 0 0 apache2
> >Sep 5 11:05:35 server02 kernel: [1751856.540191] [ 1051] 1066 1051 228592 95521 6 0 0 apache2
> >Sep 5 11:05:35 server02 kernel: [1751856.540340] [ 1052] 1066 1052 228594 95546 5 0 0 apache2
> >Sep 5 11:05:35 server02 kernel: [1751856.540489] [ 1054] 1066 1054 228470 95453 5 0 0 apache2
> >Sep 5 11:05:35 server02 kernel: [1751856.540646] Memory cgroup out of memory: Kill process 1046 (apache2) score 1000 or sacrifice child
> >
> >And this doesn't list any of the tasks sleeping and waiting for oom
> >resolving so they must have been created after this OOM. Is this the
> >same group?
>
> cgroup was 1066. My script is doing this:
> 1.) It checks memory usage of all cgroups and is searching for those whos memory usage is >= 99% of their limit.
> 2.) If any are found, they are saved in an array of 'candidates for killing'.
> 3.) It sleep for 30 seconds.
> 4.) Do (1) and if any of found cgorups were also found in (2), it kills all processes inside it.
> 5.) Clear array of saved cgroups and continue.

This is racy and doesn't really tell you anything about any group being
frozen.

[...]
> But, of course, i cannot guarantee that the killed cgroup was really
> freezed (because of bug in linux kernel), there could be some false
> positives - for example, cgroup has 99% usage of memory, my script
> detected it, OOM successfully resolved the problem and, after 30
> seconds, the same cgroup has again 99% usage of it's memory and my
> script detected it again.

Exactly

> This is why i'm sending stacks here, i simply cannot tell if
> there was or wasn't a problem.

On the other hand, if those processes were stuck waiting for somebody
to resolve the OOM for a long time without any change, then yes, we
have a problem.
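
One way to look for that condition is to compare kernel stacks taken at two points in time. The sketch below is only a heuristic under stated assumptions: the function names are hypothetical, the symbol name is taken from the stacks posted earlier in this thread, and reading /proc/<pid>/stack normally requires root. A pid that shows up as a waiter in both samples is a candidate for closer inspection, not proof of a hang, since the task may have left and re-entered the OOM path in between.

```python
WAIT_SYMBOL = "mem_cgroup_oom_synchronize"


def read_stack(pid, proc_root="/proc"):
    """Return the kernel stack of a task as text."""
    with open(f"{proc_root}/{pid}/stack") as f:
        return f.read()


def oom_waiters(stacks):
    """Return the pids whose stack text contains the memcg OOM wait symbol.

    stacks maps a pid to the text read from its stack file.
    """
    return {pid for pid, text in stacks.items() if WAIT_SYMBOL in text}


def persistently_waiting(first_sample, second_sample):
    """Return the pids that were OOM waiters in both samples."""
    return oom_waiters(first_sample) & oom_waiters(second_sample)
```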

Just to be sure I got you right. You have killed all the processes from
the group you have sent stacks for, right? If that is the case I am
really curious about processes sitting in sleep_on_page_killable because
those are killable by definition.

> I can disable the script and wait until the problem really occurs but
> when it happens, our services will go down.

I definitely do not want to encourage you to let your services go down...

> Hope i was clear enough - if not, i can post the source code of that
> script.

--
Michal Hocko
SUSE Labs

azurIt
2013-09-05 12:33:43 UTC
>On Thu 05-09-13 13:47:02, azurIt wrote:
>> >On Thu 05-09-13 12:17:00, azurIt wrote:
>> >> >[...]
>> >> >> My script detected another freezed cgroup today, sending stacks. Is
>> >> >> there anything interesting?
>> >> >
>> >> >3 tasks are sleeping and waiting for somebody to take an action to
>> >> >resolve memcg OOM. The memcg oom killer is enabled for that group? If
>> >> >yes, which task has been selected to be killed? You can find that in oom
>> >> >report in dmesg.
>> >> >
>> >> >I can see a way how this might happen. If the killed task happened to
>> >> >allocate a memory while it is exiting then it would get to the oom
>> >> >condition again without freeing any memory so nobody waiting on the
>> >> >memcg_oom_waitq gets woken. We have a report like that:
>> >> >https://lkml.org/lkml/2013/7/31/94
>> >> >
>> >> >The issue got silent in the meantime so it is time to wake it up.
>> >> >It would be definitely good to see what happened in your case though.
>> >> >If any of the bellow tasks was the oom victim then it is very probable
>> >> >this is the same issue.
>> >>
>> >> Here it is:
>> >> http://watchdog.sk/lkml/kern5.log
>> >
>> >$ grep "Killed process \<103[168]\>" kern5.log
>> >$
>> >
>> >So none of the sleeping tasks has been killed previously.
>> >
>> >> Processes were killed by my script
>> >
>> >OK, I am really confused now. The log contains a lot of in-kernel memcg
>> >oom killer messages:
>> >$ grep "Memory cgroup out of memory:" kern5.log | wc -l
>> >809
>> >
>> >This suggests that the oom killer is not disabled. What exactly has you
>> >script done?
>> >
>> >> at about 11:05:35.
>> >
>> >There is an oom killer striking at 11:05:35:
>> >Sep 5 11:05:35 server02 kernel: [1751856.433101] Task in /1066/uid killed as a result of limit of /1066
>> >[...]
>> >Sep 5 11:05:35 server02 kernel: [1751856.539356] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
>> >Sep 5 11:05:35 server02 kernel: [1751856.539745] [ 1046] 1066 1046 228537 95491 3 0 0 apache2
>> >Sep 5 11:05:35 server02 kernel: [1751856.539894] [ 1047] 1066 1047 228604 95488 6 0 0 apache2
>> >Sep 5 11:05:35 server02 kernel: [1751856.540043] [ 1050] 1066 1050 228470 95452 5 0 0 apache2
>> >Sep 5 11:05:35 server02 kernel: [1751856.540191] [ 1051] 1066 1051 228592 95521 6 0 0 apache2
>> >Sep 5 11:05:35 server02 kernel: [1751856.540340] [ 1052] 1066 1052 228594 95546 5 0 0 apache2
>> >Sep 5 11:05:35 server02 kernel: [1751856.540489] [ 1054] 1066 1054 228470 95453 5 0 0 apache2
>> >Sep 5 11:05:35 server02 kernel: [1751856.540646] Memory cgroup out of memory: Kill process 1046 (apache2) score 1000 or sacrifice child
>> >
>> >And this doesn't list any of the tasks sleeping and waiting for oom
>> >resolving so they must have been created after this OOM. Is this the
>> >same group?
>>
>> cgroup was 1066. My script is doing this:
>> 1.) It checks the memory usage of all cgroups and looks for those whose memory usage is >= 99% of their limit.
>> 2.) If any are found, they are saved in an array of 'candidates for killing'.
>> 3.) It sleeps for 30 seconds.
>> 4.) It repeats (1), and if any of the cgroups found were also found in (2), it kills all processes inside them.
>> 5.) It clears the array of saved cgroups and continues.
>
>This is racy and doesn't really tell you anything about any group being
>frozen.
>
>[...]
>> But, of course, I cannot guarantee that the killed cgroup was really
>> frozen (because of a bug in the Linux kernel), there could be some false
>> positives - for example, a cgroup has 99% memory usage, my script
>> detected it, OOM successfully resolved the problem and, after 30
>> seconds, the same cgroup again has 99% usage of its memory and my
>> script detected it again.
>
>Exactly
>
>> This is why I'm sending stacks here, I simply cannot tell if
>> there was or wasn't a problem.
>
>On the other hand, if those processes were stuck waiting for somebody
>to resolve the OOM for a long time without any change, then yes, we have a
>problem.
>
>Just to be sure I got you right. You have killed all the processes from
>the group you have sent stacks for, right? If that is the case I am
>really curious about processes sitting in sleep_on_page_killable because
>those are killable by definition.


Yes, my script killed all of those processes right after taking the stacks. Here is part of the code (Python):
http://pastebin.com/WryGKxyF

The function get_tasks() reads the PIDs from a cgroup's 'tasks' file and returns them as a list (array).
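
For readers without access to the pastebin, a minimal sketch of a helper like get_tasks() and of the two-pass check described earlier in the thread might look like the following. It assumes a cgroup v1 memory controller layout (a 'tasks' file plus memory.usage_in_bytes and memory.limit_in_bytes in each group directory). The names near_limit() and repeat_offenders() are hypothetical and not taken from the actual script, which may well differ.

```python
import os


def get_tasks(cgroup_dir):
    """Return the PIDs listed in the cgroup's 'tasks' file as a list of ints."""
    with open(os.path.join(cgroup_dir, "tasks")) as f:
        return [int(line) for line in f if line.strip()]


def _read_int(path):
    """Read a single integer value from a cgroup control file."""
    with open(path) as f:
        return int(f.read().strip())


def near_limit(cgroup_dir, percent=99):
    """True if memory usage is at or above `percent` of the group's limit.

    Hypothetical helper: uses integer arithmetic to avoid float rounding.
    """
    usage = _read_int(os.path.join(cgroup_dir, "memory.usage_in_bytes"))
    limit = _read_int(os.path.join(cgroup_dir, "memory.limit_in_bytes"))
    return limit > 0 and usage * 100 >= limit * percent


def repeat_offenders(first_pass, second_pass):
    """Groups seen near their limit in both passes (steps 2 and 4)."""
    return sorted(set(first_pass) & set(second_pass))
```

A group returned by repeat_offenders() was near its limit at two sample points roughly 30 seconds apart. As noted in the replies, that alone cannot distinguish a group stuck in OOM from one that was resolved and then hit its limit again.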


azur
Michal Hocko
2013-09-05 12:45:23 UTC
On Thu 05-09-13 14:33:43, azurIt wrote:
[...]
> >Just to be sure I got you right. You have killed all the processes from
> >the group you have sent stacks for, right? If that is the case I am
> >really curious about processes sitting in sleep_on_page_killable because
> >those are killable by definition.
>
> Yes, my script killed all of those processes right after taking
> the stacks.

OK, the _after_ part is important. Has the group gone away since then?
--
Michal Hocko
SUSE Labs

azurIt
2013-09-05 13:00:44 UTC
>On Thu 05-09-13 14:33:43, azurIt wrote:
>[...]
>> >Just to be sure I got you right. You have killed all the processes from
>> >the group you have sent stacks for, right? If that is the case I am
>> >really curious about processes sitting in sleep_on_page_killable because
>> >those are killable by definition.
>>
>> Yes, my script killed all of those processes right after taking
>> the stacks.
>
>OK, the _after_ part is important. Has the group gone away since then?



If you mean whether it stopped causing problems after its processes were killed, then yes.
