Discussion:
[PATCH 00 of 11] mmu notifier #v16
Andrea Arcangeli
2008-05-07 14:38:13 UTC
Hello,

this is the last update of the mmu notifier patch.

Jack asked for a __mmu_notifier_register that can be called with the mmap_sem already held in write mode.

Here is an update with that change, plus ->release is now allowed to be left
unimplemented (a two-line change to mmu_notifier.c).

The entire diff between v15 and v16 of mmu-notifier-core was posted in a
separate email.
Andrea Arcangeli
2008-05-07 14:38:48 UTC
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115127 -7200
# Node ID c5badbefeee07518d9d1acca13e94c981420317c
# Parent e20917dcc8284b6a07cfcced13dda4cbca850a9c
get_task_mm

get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a task's mm pointer to an mm struct, because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() has reduced mm_users to zero, then
we have the lovely situation where one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.

Signed-off-by: Christoph Lameter <***@sgi.com>
Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,7 +465,8 @@ struct mm_struct *get_task_mm(struct tas
if (task->flags & PF_BORROWED_MM)
mm = NULL;
else
- atomic_inc(&mm->mm_users);
+ if (!atomic_inc_not_zero(&mm->mm_users))
+ mm = NULL;
}
task_unlock(task);
return mm;
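
For reference, a typical caller pattern looks like this (a minimal sketch;
inspect_task_mm is a placeholder name and is not part of this patch):

/*
 * Minimal sketch of a typical get_task_mm() caller: either we get an mm
 * with mm_users elevated (so mmput() can't tear it down under us), or we
 * get NULL because the mm is already going away.
 */
static int inspect_task_mm(struct task_struct *task)
{
	struct mm_struct *mm = get_task_mm(task);

	if (!mm)
		return -EINVAL;	/* exiting task or kernel thread */

	/* ... safely use mm here ... */

	mmput(mm);		/* drop the mm_users reference */
	return 0;
}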
Robin Holt
2008-05-07 16:00:33 UTC
You can drop this patch.

This turned out to be a race in xpmem. It "appeared" as if it were a
race in get_task_mm, but it really is not. The current->mm field is
cleared under the task_lock and the task_lock is grabbed by get_task_mm.

I have been testing your v15 version without this patch and have not
encountered the problem again (now that I fixed my xpmem race).

Thanks,
Robin
Post by Andrea Arcangeli
# HG changeset patch
# Date 1210115127 -7200
# Node ID c5badbefeee07518d9d1acca13e94c981420317c
# Parent e20917dcc8284b6a07cfcced13dda4cbca850a9c
get_task_mm
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andrea Arcangeli
2008-05-07 16:20:43 UTC
Post by Robin Holt
You can drop this patch.
This turned out to be a race in xpmem. It "appeared" as if it were a
race in get_task_mm, but it really is not. The current->mm field is
cleared under the task_lock and the task_lock is grabbed by get_task_mm.
100% agreed, I'll nuke it as it really seems to be a noop.
Post by Robin Holt
I have been testing your v15 version without this patch and have not
encountered the problem again (now that I fixed my xpmem race).
Great. About your other deadlock, I'm curious whether my deadlock fix for
the i_mmap_sem patch helped. That was crashing kvm with a VM holding 2G in
swap plus a swaphog allocating and freeing another 2G of swap in a
loop. I couldn't reproduce any other problem with KVM since I fixed
that bit, regardless of whether I apply only mmu-notifier-core (the 2.6.26
version) or the full patchset (post 2.6.26).
Andrea Arcangeli
2008-05-07 14:39:18 UTC
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210096013 -7200
# Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
# Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case for KVM, where a
secondary tlb miss will walk sptes in hardware and refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.
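
In code terms the required ordering is roughly the following; this is a
hedged sketch of the pattern that the ptep_clear_flush_notify wrapper added
by this patch implements, with teardown_one_mapping being a placeholder
name rather than code from the patch:

/*
 * Sketch of the teardown ordering: invalidate the primary pte and tlb,
 * then the secondary sptes/tlbs via the notifier, and only then allow
 * the page to be freed and reused.
 */
static void teardown_one_mapping(struct vm_area_struct *vma,
				 unsigned long address, pte_t *ptep,
				 struct page *page)
{
	pte_t pteval;

	flush_cache_page(vma, address, pte_pfn(*ptep));
	pteval = ptep_clear_flush(vma, address, ptep);	   /* primary pte + tlb */
	mmu_notifier_invalidate_page(vma->vm_mm, address); /* secondary spte + tlb */

	if (pte_dirty(pteval))
		set_page_dirty(page);
	page_cache_release(page);	/* only now may the page get freed */
}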

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte, because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page being freed when the pin is
released (requiring the same complex and relatively slow tlb_gather
smp-safe logic we have in zap_page_range, which can be avoided
completely if the spte unmap event doesn't require an unpin of the
page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping, freeing or doing anything else on the
primary MMU, so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it saves the
code that tears down the secondary MMU mappings from implementing
tlb_gather-like logic as in zap_page_range, which would require many
IPIs to flush the other cpus' tlbs for each fixed number of sptes
unmapped.

To give an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect), the secondary MMU mappings
will be invalidated, and the next secondary-mmu page fault will call
get_user_pages; that triggers a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it will re-establish an
updated spte or secondary-tlb mapping on the copied page. Or it will
set up a readonly spte or readonly tlb mapping if it's a guest read,
i.e. if it calls get_user_pages with write=0. This is just an example.
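
A hedged sketch of that flow from the secondary MMU side (sec_mmu_fault,
install_spte and make_spte are hypothetical driver names, not part of this
patch):

/*
 * Hypothetical secondary-mmu fault handler: resolve the fault through
 * the primary MMU with get_user_pages() (which runs do_wp_page for
 * write faults on COW mappings), then mirror the result into the
 * secondary MMU. A real driver must also refuse to install the spte
 * while an invalidate_range_begin/end critical section is pending.
 */
static int sec_mmu_fault(struct mm_struct *mm, unsigned long address, int write)
{
	struct page *page;
	int ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages(current, mm, address, 1, write, 0, &page, NULL);
	up_read(&mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	install_spte(mm, address, make_spte(page, write));	/* placeholder */
	put_page(page);		/* no long term pin with mmu notifiers */
	return 0;
}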

This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU: be it a pure tlb like
GRU, a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer in kvm, or a remote DMA in software like
XPMEM (hence the need to schedule in the XPMEM code to send the
invalidate to the remote node, while there is no need to schedule in
kvm/gru as it's an immediate event like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature while removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces hlist_del_init_rcu and documents it (fixes a comment for
hlist_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
doing anything with "mm". This allows mmu notifier users to keep
track of whether the VM is in the middle of the
invalidate_range_begin/end critical section with an atomic counter
increased in range_begin and decreased in range_end (see the sketch
after this list). No secondary MMU page fault is allowed to map any
spte or secondary tlb reference while the VM is in the middle of
range_begin/end, as any page returned by get_user_pages in that
critical section could later be freed immediately without any
further ->invalidate_page notification (invalidate_range_begin/end
works on ranges and ->invalidate_page isn't called immediately
before freeing the page). To stop all page freeing and pagetable
overwrites, the mmap_sem must be taken in write mode and all other
anon_vma/i_mmap locks must be taken in virtual address order. The
order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running
concurrently and triggering lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be
enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
advantage of mmu notifiers, but this already allows compiling a KVM
external module against a kernel with mmu notifiers enabled, and
from the next pull from kvm.git we'll start using them. GRU/XPMEM
will also be able to continue development by enabling KVM=m in
their config, until they submit all the GRU/XPMEM GPLv2 code to the
mainline kernel. Then they can also enable MMU_NOTIFIER in the same
way KVM does it (even if KVM=n). This guarantees nobody selects
MMU_NOTIFIER=y if KVM, GRU and XPMEM are all =n.
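
Here is the sketch referenced in point 2: a hedged example of how an mmu
notifier user could track the critical section (struct sec_mmu and the
sec_mmu_* names are hypothetical driver code, not part of this patch):

/*
 * Hypothetical driver-side bookkeeping for the range_start/end model:
 * while an invalidation is in flight, the secondary-mmu fault path
 * must not establish sptes from pages returned by get_user_pages.
 */
struct sec_mmu {
	struct mmu_notifier mn;
	struct mm_struct *mm;		/* mm we registered against */
	atomic_t invalidate_count;	/* nested range invalidations */
};

static void sec_mmu_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
				unsigned long start, unsigned long end)
{
	struct sec_mmu *s = container_of(mn, struct sec_mmu, mn);

	atomic_inc(&s->invalidate_count);
	/* ... drop sptes and secondary tlb entries covering [start, end) ... */
}

static void sec_mmu_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	struct sec_mmu *s = container_of(mn, struct sec_mmu, mn);

	atomic_dec(&s->invalidate_count);
}

static const struct mmu_notifier_ops sec_mmu_ops = {
	.invalidate_range_start	= sec_mmu_range_start,
	.invalidate_range_end	= sec_mmu_range_end,
};

/* Fault path helper: only install an spte when no invalidation is running. */
static int sec_mmu_may_install_spte(struct sec_mmu *s)
{
	return atomic_read(&s->invalidate_count) == 0;
}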

The mmu_notifier_register call can fail because mm_lock may not be
able to allocate the required vmalloc space. See the comment on top of
the mm_lock() implementation for the worst case memory requirements.
Because mmu_notifier_register is used when a driver starts up, a
failure can be handled gracefully. Here is an example of the change
applied to kvm to register the mmu notifiers. Usually when a driver
starts up other allocations are required anyway and -ENOMEM failure
paths exist already.

struct kvm *kvm_arch_create_vm(void)
{
	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

	if (!kvm)
		return ERR_PTR(-ENOMEM);

	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
	return kvm;
}

mmu_notifier_unregister returns void and it's reliable.
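
For symmetry with the registration example, a hedged teardown sketch that
builds on the hypothetical sec_mmu structure above (again, driver-side
placeholder names, not part of this patch):

/*
 * Hypothetical teardown path: all sptes must already be gone. The
 * unregister drops the mm_count pin that mmu_notifier_register took,
 * and it can run before or after exit_mmap.
 */
static void sec_mmu_destroy(struct sec_mmu *s)
{
	mmu_notifier_unregister(&s->mn, s->mm);
	kfree(s);
}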

Signed-off-by: Andrea Arcangeli <***@qumranet.com>
Signed-off-by: Nick Piggin <***@suse.de>
Signed-off-by: Christoph Lameter <***@sgi.com>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
tristate "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select PREEMPT_NOTIFIERS
+ select MMU_NOTIFIER
select ANON_INODES
---help---
Support hosting fully virtualized guest machines using hardware
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis
* or hlist_del_rcu(), running on this same list.
* However, it is perfectly legal to run concurrently with
* the _rcu list-traversal primitives, such as
- * hlist_for_each_entry().
+ * hlist_for_each_entry_rcu().
*/
static inline void hlist_del_rcu(struct hlist_node *n)
{
@@ -760,6 +760,34 @@ static inline void hlist_del_init(struct
if (!hlist_unhashed(n)) {
__hlist_del(n);
INIT_HLIST_NODE(n);
+ }
+}
+
+/**
+ * hlist_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: list_unhashed() on entry does return true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_add_head_rcu() or
+ * hlist_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_for_each_entry_rcu().
+ */
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+ if (!hlist_unhashed(n)) {
+ __hlist_del(n);
+ n->pprev = NULL;
}
}

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1084,6 +1084,15 @@ extern int install_special_mapping(struc
unsigned long addr, unsigned long len,
unsigned long flags, struct page **pages);

+struct mm_lock_data {
+ spinlock_t **i_mmap_locks;
+ spinlock_t **anon_vma_locks;
+ size_t nr_i_mmap_locks;
+ size_t nr_anon_vma_locks;
+};
+extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
+extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
+
extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);

extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
+#include <linux/cpumask.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -19,6 +20,7 @@
#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))

struct address_space;
+struct mmu_notifier_mm;

#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
typedef atomic_long_t mm_counter_t;
@@ -235,6 +237,9 @@ struct mm_struct {
struct file *exe_file;
unsigned long num_exe_file_vmas;
#endif
+#ifdef CONFIG_MMU_NOTIFIER
+ struct mmu_notifier_mm *mmu_notifier_mm;
+#endif
};

#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,282 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+#include <linux/srcu.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * The mmu notifier_mm structure is allocated and installed in
+ * mm->mmu_notifier_mm inside the mm_lock() protected critical section
+ * and it's released only when mm_count reaches zero in mmdrop().
+ */
+struct mmu_notifier_mm {
+ /* all mmu notifiers registered in this mm are queued in this list */
+ struct hlist_head list;
+ /* srcu structure for this mm */
+ struct srcu_struct srcu;
+ /* to serialize the list modifications and hlist_unhashed */
+ spinlock_t lock;
+};
+
+struct mmu_notifier_ops {
+ /*
+ * Called either by mmu_notifier_unregister or when the mm is
+ * being destroyed by exit_mmap, always before all pages are
+ * freed. This can run concurrently with other mmu notifier
+ * methods (the ones invoked outside the mm context) and it
+ * should tear down all secondary mmu mappings and freeze the
+ * secondary mmu. If this method isn't implemented you have to
+ * be sure that nothing could possibly write to the pages
+ * through the secondary mmu by the time the last thread with
+ * tsk->mm == mm exits.
+ *
+ * As side note: the pages freed after ->release returns could
+ * be immediately reallocated by the gart at an alias physical
+ * address with a different cache model, so if ->release isn't
+ * implemented because all _software_ driven memory accesses
+ * through the secondary mmu are terminated by the time the
+ * last thread of this mm quits, you also have to be sure that
+ * speculative _hardware_ operations can't allocate dirty
+ * cachelines in the cpu that could not be snooped and made
+ * coherent with the other read and write operations happening
+ * through the gart alias address, so leading to memory
+ * corruption.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ /*
+ * clear_flush_young is called after the VM is
+ * test-and-clearing the young/accessed bitflag in the
+ * pte. This way the VM will provide proper aging to the
+ * accesses to the page through the secondary MMUs and not
+ * only to the ones through the Linux pte.
+ */
+ int (*clear_flush_young)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * Before this is invoked any secondary MMU is still ok to
+ * read/write to the page previously pointed to by the Linux
+ * pte because the page hasn't been freed yet and it won't be
+ * freed until this returns. If required set_page_dirty has to
+ * be called internally to this method.
+ */
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * invalidate_range_start() and invalidate_range_end() must be
+ * paired and are called only when the mmap_sem and/or the
+ * locks protecting the reverse maps are held. Both functions
+ * may sleep. The subsystem must guarantee that no additional
+ * references are taken to the pages in the range established
+ * between the call to invalidate_range_start() and the
+ * matching call to invalidate_range_end().
+ *
+ * Invalidation of multiple concurrent ranges may be
+ * optionally permitted by the driver. Either way the
+ * establishment of sptes is forbidden in the range passed to
+ * invalidate_range_begin/end for the whole duration of the
+ * invalidate_range_begin/end critical section.
+ *
+ * invalidate_range_start() is called when all pages in the
+ * range are still mapped and have at least a refcount of one.
+ *
+ * invalidate_range_end() is called when all pages in the
+ * range have been unmapped and the pages have been freed by
+ * the VM.
+ *
+ * The VM will remove the page table entries and potentially
+ * the page between invalidate_range_start() and
+ * invalidate_range_end(). If the page must not be freed
+ * because of pending I/O or other circumstances then the
+ * invalidate_range_start() callback (or the initial mapping
+ * by the driver) must make sure that the refcount is kept
+ * elevated.
+ *
+ * If the driver increases the refcount when the pages are
+ * initially mapped into an address space then either
+ * invalidate_range_start() or invalidate_range_end() may
+ * decrease the refcount. If the refcount is decreased on
+ * invalidate_range_start() then the VM can free pages as page
+ * table entries are removed. If the refcount is only
+ * dropped on invalidate_range_end() then the driver itself
+ * will drop the last refcount but it must take care to flush
+ * any secondary tlb before doing the final free on the
+ * page. Pages will no longer be referenced by the linux
+ * address space but may still be referenced by sptes until
+ * the last refcount is dropped.
+ */
+ void (*invalidate_range_start)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+ void (*invalidate_range_end)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+};
+
+/*
+ * The notifier chains are protected by mmap_sem and/or the reverse map
+ * semaphores. Notifier chains are only changed when all reverse maps and
+ * the mmap_sem locks are taken.
+ *
+ * Therefore notifier chains can only be traversed when either
+ *
+ * 1. mmap_sem is held.
+ * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem).
+ * 3. No other concurrent thread can access the list (release)
+ */
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+ return unlikely(mm->mmu_notifier_mm);
+}
+
+extern int mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern int __mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_clear_flush_young(mm, address);
+ return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_page(mm, address);
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range_start(mm, start, end);
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+ mm->mmu_notifier_mm = NULL;
+}
+
+static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_mm_destroy(mm);
+}
+
+/*
+ * These two macros will eventually replace ptep_clear_flush.
+ * ptep_clear_flush is implemented as a macro itself, so this also is
+ * implemented as a macro until ptep_clear_flush is converted to an
+ * inline function, to diminish the risk of compilation failure. The
+ * invalidate_page method can over time be moved outside the PT lock
+ * and these two macros can then be removed.
+ */
+#define ptep_clear_flush_notify(__vma, __address, __ptep) \
+({ \
+ pte_t __pte; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __pte = ptep_clear_flush(___vma, ___address, __ptep); \
+ mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
+ __pte; \
+})
+
+#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
+ __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
+ ___address); \
+ __young; \
+})
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+}
+
+static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+}
+
+#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define ptep_clear_flush_notify ptep_clear_flush
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -27,6 +27,8 @@
#ifndef _LINUX_SRCU_H
#define _LINUX_SRCU_H

+#include <linux/mutex.h>
+
struct srcu_struct_array {
int c[2];
};
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -53,6 +53,7 @@
#include <linux/tty.h>
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -385,6 +386,7 @@ static struct mm_struct * mm_init(struct

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_mm_init(mm);
return mm;
}

@@ -417,6 +419,7 @@ void __mmdrop(struct mm_struct *mm)
BUG_ON(mm == &init_mm);
mm_free_pgd(mm);
destroy_context(mm);
+ mmu_notifier_mm_destroy(mm);
free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -205,3 +205,6 @@ config VIRT_TO_BUS
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ bool
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>

#include <asm/mmu_context.h>
#include <asm/cacheflush.h>
@@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}

+ mmu_notifier_invalidate_range_start(mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
+ mmu_notifier_invalidate_range_end(mm, start, start + size);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/mmu_notifier.h>

#include <asm/page.h>
#include <asm/pgtable.h>
@@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar
BUG_ON(start & ~HPAGE_MASK);
BUG_ON(end & ~HPAGE_MASK);

+ mmu_notifier_invalidate_range_start(mm, start, end);
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
+ int ret;

/*
* Don't copy ptes where a page fault will fill them correctly.
@@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);

+ /*
+ * We need to invalidate the secondary MMU mappings only when
+ * there could be a permission downgrade on the ptes of the
+ * parent mm. And a permission downgrade will only happen if
+ * is_cow_mapping() returns true.
+ */
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier_invalidate_range_start(src_mm, addr, end);
+
+ ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
- if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
- vma, addr, next))
- return -ENOMEM;
+ if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ vma, addr, next))) {
+ ret = -ENOMEM;
+ break;
+ }
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
- return 0;
+
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier_invalidate_range_end(src_mm,
+ vma->vm_start, end);
+ return ret;
}

static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
int fullmm = (*tlbp)->fullmm;
+ struct mm_struct *mm = vma->vm_mm;

+ mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;

@@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath
}
}
out:
+ mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start; /* which is now the end (or restart) address */
}

@@ -1541,10 +1562,11 @@ int apply_to_page_range(struct mm_struct
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + size;
+ unsigned long start = addr, end = addr + size;
int err;

BUG_ON(addr >= end);
+ mmu_notifier_invalidate_range_start(mm, start, end);
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -1552,6 +1574,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1753,7 +1776,7 @@ gotten:
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush(vma, address, page_table);
+ ptep_clear_flush_notify(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -26,6 +26,9 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/vmalloc.h>
+#include <linux/sort.h>
+#include <linux/mmu_notifier.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2048,6 +2051,7 @@ void exit_mmap(struct mm_struct *mm)

/* mm's last user has gone, and its about to be pulled down */
arch_exit_mmap(mm);
+ mmu_notifier_release(mm);

lru_add_drain();
flush_cache_mm(mm);
@@ -2255,3 +2259,194 @@ int install_special_mapping(struct mm_st

return 0;
}
+
+static int mm_lock_cmp(const void *a, const void *b)
+{
+ unsigned long _a = (unsigned long)*(spinlock_t **)a;
+ unsigned long _b = (unsigned long)*(spinlock_t **)b;
+
+ cond_resched();
+ if (_a < _b)
+ return -1;
+ if (_a > _b)
+ return 1;
+ return 0;
+}
+
+static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+ int anon)
+{
+ struct vm_area_struct *vma;
+ size_t i = 0;
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (anon) {
+ if (vma->anon_vma)
+ locks[i++] = &vma->anon_vma->lock;
+ } else {
+ if (vma->vm_file && vma->vm_file->f_mapping)
+ locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+ }
+ }
+
+ if (!i)
+ goto out;
+
+ sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+
+out:
+ return i;
+}
+
+static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
+ spinlock_t **locks)
+{
+ return mm_lock_sort(mm, locks, 1);
+}
+
+static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
+ spinlock_t **locks)
+{
+ return mm_lock_sort(mm, locks, 0);
+}
+
+static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+{
+ spinlock_t *last = NULL;
+ size_t i;
+
+ for (i = 0; i < nr; i++)
+ /* Multiple vmas may use the same lock. */
+ if (locks[i] != last) {
+ BUG_ON((unsigned long) last > (unsigned long) locks[i]);
+ last = locks[i];
+ if (lock)
+ spin_lock(last);
+ else
+ spin_unlock(last);
+ }
+}
+
+static inline void __mm_lock(spinlock_t **locks, size_t nr)
+{
+ mm_lock_unlock(locks, nr, 1);
+}
+
+static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+{
+ mm_lock_unlock(locks, nr, 0);
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults.
+ *
+ * The caller must take the mmap_sem in read or write mode before
+ * calling mm_lock(). The caller isn't allowed to release the mmap_sem
+ * until mm_unlock() returns.
+ *
+ * While mm_lock() itself won't strictly require the mmap_sem in write
+ * mode to be safe, in order to block all operations that could modify
+ * pagetables and free pages without need of altering the vma layout
+ * (for example populate_range() with nonlinear vmas) the mmap_sem
+ * must be taken in write mode by the caller.
+ *
+ * A single task can't take more than one mm_lock in a row or it would
+ * deadlock.
+ *
+ * The sorting is needed to avoid lock inversion deadlocks if two
+ * tasks run mm_lock at the same time on different mm that happen to
+ * share some anon_vmas/inodes but mapped in different order.
+ *
+ * mm_lock and mm_unlock are expensive operations that may have to
+ * take thousands of locks. Thanks to sort() the complexity is
+ * O(N*log(N)) where N is the number of VMAs in the mm. The max number
+ * of vmas is defined in /proc/sys/vm/max_map_count.
+ *
+ * mm_lock() can fail if memory allocation fails. The worst case
+ * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *),
+ * so around 1Mbyte, but in practice it'll be much less because
+ * normally there won't be max_map_count vmas allocated in the task
+ * that runs mm_lock().
+ *
+ * The vmalloc memory allocated by mm_lock is stored in the
+ * mm_lock_data structure that must be allocated by the caller and it
+ * must be later passed to mm_unlock that will free it after using it.
+ * Allocating the mm_lock_data structure on the stack is fine because
+ * it's only a couple of bytes in size.
+ *
+ * If mm_lock() returns -ENOMEM no memory has been allocated and the
+ * mm_lock_data structure can be freed immediately, and mm_unlock must
+ * not be called.
+ */
+int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+ spinlock_t **anon_vma_locks, **i_mmap_locks;
+
+ if (mm->map_count) {
+ anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
+ if (unlikely(!anon_vma_locks))
+ return -ENOMEM;
+
+ i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
+ if (unlikely(!i_mmap_locks)) {
+ vfree(anon_vma_locks);
+ return -ENOMEM;
+ }
+
+ /*
+ * When mm_lock_sort_anon_vma/i_mmap returns zero it
+ * means there's no lock to take and so we can free
+ * the array here without waiting for mm_unlock. mm_unlock
+ * will do nothing if nr_i_mmap/anon_vma_locks is
+ * zero.
+ */
+ data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks);
+ data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks);
+
+ if (data->nr_anon_vma_locks) {
+ __mm_lock(anon_vma_locks, data->nr_anon_vma_locks);
+ data->anon_vma_locks = anon_vma_locks;
+ } else
+ vfree(anon_vma_locks);
+
+ if (data->nr_i_mmap_locks) {
+ __mm_lock(i_mmap_locks, data->nr_i_mmap_locks);
+ data->i_mmap_locks = i_mmap_locks;
+ } else
+ vfree(i_mmap_locks);
+ }
+ return 0;
+}
+
+static void mm_unlock_vfree(spinlock_t **locks, size_t nr)
+{
+ __mm_unlock(locks, nr);
+ vfree(locks);
+}
+
+/*
+ * mm_unlock doesn't require any memory allocation and it won't fail.
+ *
+ * The mmap_sem cannot be released until mm_unlock returns.
+ *
+ * All memory has been previously allocated by mm_lock and it'll be
+ * all freed before returning. Only after mm_unlock returns, the
+ * caller is allowed to free and forget the mm_lock_data structure.
+ *
+ * mm_unlock runs in O(N) where N is the max number of VMAs in the
+ * mm. The max number of vmas is defined in
+ * /proc/sys/vm/max_map_count.
+ */
+void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+ if (mm->map_count) {
+ if (data->nr_anon_vma_locks)
+ mm_unlock_vfree(data->anon_vma_locks,
+ data->nr_anon_vma_locks);
+ if (data->nr_i_mmap_locks)
+ mm_unlock_vfree(data->i_mmap_locks,
+ data->nr_i_mmap_locks);
+ }
+}
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,292 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <***@sgi.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/srcu.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+/*
+ * This function can't run concurrently against mmu_notifier_register
+ * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
+ * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
+ * in parallel despite there being no task using this mm any more,
+ * through the vmas outside of the exit_mmap context, such as with
+ * vmtruncate. This serializes against mmu_notifier_unregister with
+ * the mmu_notifier_mm->lock in addition to SRCU and it serializes
+ * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
+ * can't go away from under us as exit_mmap holds an mm_count pin
+ * itself.
+ */
+void __mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ int srcu;
+
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
+ mn = hlist_entry(mm->mmu_notifier_mm->list.first,
+ struct mmu_notifier,
+ hlist);
+ /*
+ * We arrived before mmu_notifier_unregister so
+ * mmu_notifier_unregister will do nothing other than
+ * to wait for ->release to finish and
+ * mmu_notifier_unregister to return.
+ */
+ hlist_del_init_rcu(&mn->hlist);
+ /*
+ * SRCU here will block mmu_notifier_unregister until
+ * ->release returns.
+ */
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ /*
+ * if ->release runs before mmu_notifier_unregister it
+ * must be handled as it's the only way for the driver
+ * to flush all existing sptes and stop the driver
+ * from establishing any more sptes before all the
+ * pages in the mm are freed.
+ */
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ }
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+
+ /*
+ * synchronize_srcu here prevents mmu_notifier_release from
+ * returning to exit_mmap (which would proceed to free all pages
+ * in the mm) until the ->release method returns, if it was
+ * invoked by mmu_notifier_unregister.
+ *
+ * The mmu_notifier_mm can't go away from under us because one
+ * mm_count is held by exit_mmap.
+ */
+ synchronize_srcu(&mm->mmu_notifier_mm->srcu);
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->clear_flush_young can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed or not.
+ */
+int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0, srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->clear_flush_young)
+ young |= mn->ops->clear_flush_young(mn, mm, address);
+ }
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+
+ return young;
+}
+
+void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->invalidate_page)
+ mn->ops->invalidate_page(mn, mm, address);
+ }
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->invalidate_range_start)
+ mn->ops->invalidate_range_start(mn, mm, start, end);
+ }
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->invalidate_range_end)
+ mn->ops->invalidate_range_end(mn, mm, start, end);
+ }
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+static int do_mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ int take_mmap_sem)
+{
+ struct mm_lock_data data;
+ struct mmu_notifier_mm * mmu_notifier_mm;
+ int ret;
+
+ BUG_ON(atomic_read(&mm->mm_users) <= 0);
+
+ ret = -ENOMEM;
+ mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
+ if (unlikely(!mmu_notifier_mm))
+ goto out;
+
+ ret = init_srcu_struct(&mmu_notifier_mm->srcu);
+ if (unlikely(ret))
+ goto out_kfree;
+
+ if (take_mmap_sem)
+ down_write(&mm->mmap_sem);
+ ret = mm_lock(mm, &data);
+ if (unlikely(ret))
+ goto out_cleanup;
+
+ if (!mm_has_notifiers(mm)) {
+ INIT_HLIST_HEAD(&mmu_notifier_mm->list);
+ spin_lock_init(&mmu_notifier_mm->lock);
+ mm->mmu_notifier_mm = mmu_notifier_mm;
+ mmu_notifier_mm = NULL;
+ }
+ atomic_inc(&mm->mm_count);
+
+ /*
+ * Serialize the update against mmu_notifier_unregister. A
+ * side note: mmu_notifier_release can't run concurrently with
+ * us because we hold the mm_users pin (either implicitly as
+ * current->mm or explicitly with get_task_mm() or similar).
+ * We can't race against any other mmu notifiers either thanks
+ * to mm_lock().
+ */
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+
+ mm_unlock(mm, &data);
+out_cleanup:
+ if (take_mmap_sem)
+ up_write(&mm->mmap_sem);
+ if (mmu_notifier_mm)
+ cleanup_srcu_struct(&mmu_notifier_mm->srcu);
+out_kfree:
+ /* kfree() does nothing if mmu_notifier_mm is NULL */
+ kfree(mmu_notifier_mm);
+out:
+ BUG_ON(atomic_read(&mm->mm_users) <= 0);
+ return ret;
+}
+
+/*
+ * Must not hold mmap_sem nor any other VM related lock when calling
+ * this registration function. Must also ensure mm_users can't go down
+ * to zero while this runs to avoid races with mmu_notifier_release,
+ * so mm has to be current->mm or the mm should be pinned safely such
+ * as with get_task_mm(). If the mm is not current->mm, the mm_users
+ * pin should be released by calling mmput after mmu_notifier_register
+ * returns. mmu_notifier_unregister must be always called to
+ * unregister the notifier. mm_count is automatically pinned to allow
+ * mmu_notifier_unregister to safely run at any time later, before or
+ * after exit_mmap. ->release will always be called before exit_mmap
+ * frees the pages.
+ */
+int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ return do_mmu_notifier_register(mn, mm, 1);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+/*
+ * Same as mmu_notifier_register but here the caller must hold the
+ * mmap_sem in write mode.
+ */
+int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ return do_mmu_notifier_register(mn, mm, 0);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+/* this is called after the last mmu_notifier_unregister() returned */
+void __mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+ BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list));
+ cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu);
+ kfree(mm->mmu_notifier_mm);
+ mm->mmu_notifier_mm = LIST_POISON1; /* debug */
+}
+
+/*
+ * This releases the mm_count pin automatically and frees the mm
+ * structure if it was the last user of it. It serializes against
+ * running mmu notifiers with SRCU and against mmu_notifier_unregister
+ * with the unregister lock + SRCU. All sptes must be dropped before
+ * calling mmu_notifier_unregister. ->release or any other notifier
+ * method may be invoked concurrently with mmu_notifier_unregister,
+ * and only after mmu_notifier_unregister returned we're guaranteed
+ * that ->release or any other method can't run anymore.
+ */
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ BUG_ON(atomic_read(&mm->mm_count) <= 0);
+
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ if (!hlist_unhashed(&mn->hlist)) {
+ int srcu;
+
+ hlist_del_rcu(&mn->hlist);
+
+ /*
+ * SRCU here will force exit_mmap to wait for ->release to finish
+ * before freeing the pages.
+ */
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ /*
+ * exit_mmap will block in mmu_notifier_release to
+ * guarantee ->release is called before freeing the
+ * pages.
+ */
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+ } else
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+
+ /*
+ * Wait any running method to finish, of course including
+ * ->release if it was run by mmu_notifier_release instead of us.
+ */
+ synchronize_srcu(&mm->mmu_notifier_mm->srcu);
+
+ BUG_ON(atomic_read(&mm->mm_count) <= 0);
+
+ mmdrop(mm);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -21,6 +21,7 @@
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -198,10 +199,12 @@ success:
dirty_accountable = 1;
}

+ mmu_notifier_invalidate_range_start(mm, start, end);
if (is_vm_hugetlb_page(vma))
hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+ mmu_notifier_invalidate_range_end(mm, start, end);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -18,6 +18,7 @@
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
+ unsigned long old_start;

+ old_start = old_addr;
+ mmu_notifier_invalidate_range_start(vma->vm_mm,
+ old_start, old_end);
if (vma->vm_file) {
/*
* Subtle point from Rajesh Venkatasubramanian: before
@@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
}

#define LATENCY_LIMIT (64 * PAGE_SIZE)
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -49,6 +49,7 @@
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>

#include <asm/tlbflush.h>

@@ -287,7 +288,7 @@ static int page_referenced_one(struct pa
if (vma->vm_flags & VM_LOCKED) {
referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young(vma, address, pte))
+ } else if (ptep_clear_flush_young_notify(vma, address, pte))
referenced++;

/* Pretend the page is referenced if the task has the
@@ -457,7 +458,7 @@ static int page_mkclean_one(struct page
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
- entry = ptep_clear_flush(vma, address, pte);
+ entry = ptep_clear_flush_notify(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page
* skipped over this mm) then we should reactivate it.
*/
if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young(vma, address, pte)))) {
+ (ptep_clear_flush_young_notify(vma, address, pte)))) {
ret = SWAP_FAIL;
goto out_unmap;
}

/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));

- if (ptep_clear_flush_young(vma, address, pte))
+ if (ptep_clear_flush_young_notify(vma, address, pte))
continue;

/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
Rik van Riel
2008-05-07 17:52:55 UTC
On Wed, 07 May 2008 16:35:51 +0200
Acked-by: Rik van Riel <***@redhat.com>
--
All rights reversed.
Andrew Morton
2008-05-07 20:04:46 UTC
On Wed, 07 May 2008 16:35:51 +0200
Post by Andrea Arcangeli
# HG changeset patch
# Date 1210096013 -7200
# Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
# Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core
...
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis
* or hlist_del_rcu(), running on this same list.
* However, it is perfectly legal to run concurrently with
* the _rcu list-traversal primitives, such as
- * hlist_for_each_entry().
+ * hlist_for_each_entry_rcu().
*/
static inline void hlist_del_rcu(struct hlist_node *n)
{
@@ -760,6 +760,34 @@ static inline void hlist_del_init(struct
if (!hlist_unhashed(n)) {
__hlist_del(n);
INIT_HLIST_NODE(n);
+ }
+}
+
+/**
+ * hlist_del_init_rcu - deletes entry from hash list with re-initialization
+ *
+ * Note: list_unhashed() on entry does return true after this. It is
Should that be "does" or "does not"? "does", I suppose.

It should refer to hlist_unhashed()

The term "on entry" is a bit ambiguous - we normally use that as shorthand
to mean "on entry to the function".

So I'll change this to
Post by Andrea Arcangeli
+ * Note: hlist_unhashed() on the node returns true after this. It is
OK?

<oh, that was copied-and-pasted from similarly errant comments in that file>
Post by Andrea Arcangeli
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
+#include <linux/cpumask.h>
OK, unrelated bugfix ;)
Post by Andrea Arcangeli
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -27,6 +27,8 @@
#ifndef _LINUX_SRCU_H
#define _LINUX_SRCU_H
+#include <linux/mutex.h>
And another. Fair enough.

Andrew Morton
2008-05-07 20:13:06 UTC
On Wed, 07 May 2008 16:35:51 +0200
Post by Andrea Arcangeli
# HG changeset patch
# Date 1210096013 -7200
# Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
# Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core
The patch looks OK to me.

The proposal is that we sneak this into 2.6.26. Are there any
sufficiently-serious objections to this?

The patch will be a no-op for 2.6.26.

This is all rather unusual. For the record, could we please review the
reasons for wanting to do this?

Thanks.
Linus Torvalds
2008-05-07 20:32:12 UTC
Post by Andrew Morton
The patch looks OK to me.
As far as I can tell, authorship has been destroyed by at least two of the
patches (ie Christoph seems to be the author, but Andrea seems to have
dropped that fact).
Post by Andrew Morton
The proposal is that we sneak this into 2.6.26. Are there any
sufficiently-serious objections to this?
Yeah, too late and no upside.

That "locking" code is also too ugly to live, at least without some
serious arguments for why it has to be done that way. Sorting the locks?
In a vmalloc'ed area? And calling this something innocuous like
"mm_lock()"? Hell no.

That code needs some serious re-thinking.

Linus
Andrea Arcangeli
2008-05-07 21:59:38 UTC
Post by Linus Torvalds
Post by Andrew Morton
The patch looks OK to me.
As far as I can tell, authorship has been destroyed by at least two of the
patches (ie Christoph seems to be the author, but Andrea seems to have
dropped that fact).
I can't follow this, please be more specific.

About the patches I merged from Christoph, I didn't touch them at all
(except for fixing a kernel crashing bug in them plus some reject
fixes). Initially I didn't even add a signed-off-by: andrea, and I only
had the signed-off-by: christoph. But then he said I had to add my
signed-off-by too, while I thought at most an acked-by was
required. So if I got any attribution on Christoph's work it's only
because he explicitly requested it as it was passing through my
maintenance line. In any case, all patches except mmu-notifier-core
are irrelevant in this context and I'm entirely fine with giving
Christoph attribution for the whole patchset, including the whole
mmu-notifier-core where most of the code is mine.

We had many discussions with Christoph, Robin and Jack, but I can
assure you nobody had a single problem with regard to attribution.

About all patches except mmu-notifier-core: Christoph, Robin and
everyone (especially myself) agree those patches can't yet be merged
in 2.6.26.

With regard to the post-2.6.26 material, I think adding a config
option to make the change at compile time is ok. And there's no other
clean way to deal with it, as vmtruncate has to tear down pagetables,
and if the i_mmap_lock is a spinlock there's no way to notify
secondary mmus about it if the ->invalidate_range_start method
has to allocate an skb, send it through the network and wait for I/O
completion with schedule().
Post by Linus Torvalds
Yeah, too late and no upside.
No upside for the people setting CONFIG_KVM=n, true, but no downside
for them either; that's the important fact!

And for all the people setting CONFIG_KVM!=n, I should provide some
background here. KVM MM development is halted without this; that
includes: paging, ballooning, tlb flushing at large, pci-passthrough
removing the page pin as a whole, etc...

Everyone on kvm-devel talks about mmu-notifiers; check the last VT-d
patch from Intel where Antony (IBM/qemu/kvm) wonders how to handle
things without mmu notifiers (mlock whatever).

Rusty agreed we had to get mmu notifiers into 2.6.26, so much so that
he went as far as writing his own ultra-simple mmu notifier
implementation, unfortunately too simple as invalidate_range_start was
missing, and without it we can't remove the page pinning and avoid
doing spte=invalid;tlbflush;unpin for every group of sptes released.
And without mm_lock, invalidate_range_start can't be implemented in a
generic way (one that works for GRU/XPMEM too).
Post by Linus Torvalds
That "locking" code is also too ugly to live, at least without some
serious arguments for why it has to be done that way. Sorting the locks?
In a vmalloc'ed area? And calling this something innocuous like
"mm_lock()"? Hell no.
That's only invoked in mmu_notifier_register; mm_lock is explicitly
documented as a heavyweight function. In the KVM case it's only called
when a VM is created, which is an irrelevant cpu cost compared to the
time it takes the OS to boot in the VM... (especially without real
mode emulation, with direct NPT-like secondary-mmu paging).

mm_lock solved the fundamental race in the range_start/end
invalidation model (that will allow GRU to do a single tlb flush for
the whole range that is going to be freed by
zap_page_range/unmap_vmas/whatever). Christoph merged mm_lock in his
EMM versions of mmu notifiers, moments after I released it, I think he
wouldn't have done it if there was a better way.
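For readers who haven't looked at the patch, the idea behind mm_lock()
can be sketched roughly as below. This is only an illustration of what
is being debated, not the mmu-notifier-core code: the unlock side,
error handling and the lockdep details are left out, and the example_*
names are invented. All the i_mmap/anon_vma locks covering the mm are
collected into a vmalloc'ed array, the pointers are sorted so the locks
are always taken in one global order, and they are all held across
registration so that no invalidate_range_start/end can be running
half-way through it.

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/rmap.h>
#include <linux/sort.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>

static int example_cmp_lock(const void *a, const void *b)
{
        const spinlock_t *la = *(spinlock_t *const *)a;
        const spinlock_t *lb = *(spinlock_t *const *)b;

        if (la < lb)
                return -1;
        return la > lb;
}

/* Illustration only: take every rmap lock covering "mm" in sorted order. */
static spinlock_t **example_mm_lock(struct mm_struct *mm, int *nr)
{
        struct vm_area_struct *vma;
        spinlock_t **locks;
        int i, n = 0;

        locks = vmalloc(mm->map_count * sizeof(*locks));
        if (!locks)
                return NULL;

        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (vma->vm_file && vma->vm_file->f_mapping)
                        locks[n++] = &vma->vm_file->f_mapping->i_mmap_lock;
                else if (vma->anon_vma)
                        locks[n++] = &vma->anon_vma->lock;
        }

        /* one global order, so concurrent registrations can't deadlock */
        sort(locks, n, sizeof(*locks), example_cmp_lock, NULL);

        for (i = 0; i < n; i++)
                if (i == 0 || locks[i] != locks[i - 1]) /* skip duplicates */
                        spin_lock(locks[i]);

        *nr = n;
        return locks;
}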
Post by Linus Torvalds
That code needs some serious re-thinking.
Even if you're totally right, with Nick's mmu notifiers, Rusty's mmu
notifiers, my original mmu notifiers, Christoph's first version of my
mmu notifiers, my new mmu notifiers, Christoph's EMM version
of my new mmu notifiers, my latest mmu notifiers, all the people
making suggestions, testing the code and needing the code badly,
and further patches waiting for inclusion during 2.6.27 in this area, it
must be obvious to everyone that there's zero chance this code won't
evolve over time towards perfection; but we can't wait for it to be perfect
before we start using it or we're screwed. Even if it's entirely broken,
this will allow kvm development to continue and then we'll fix it (but
don't worry, it works great at runtime and there are no race
conditions; Jack and Robin are also using it with zero problems with
GRU and XPMEM, just in case the KVM testing going great isn't enough).

Furthermore the API has been frozen for months, and everyone agrees with
all the fundamental blocks in the mmu-notifier-core patch (to be complete,
Christoph would like to replace invalidate_page with an
invalidate_range_start/end pair, but that's a minor detail).
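For reference, a user of that frozen API boils down to filling an ops
table and registering it against an mm, roughly as below. This is a
sketch under the assumption that the v16 signatures stay as posted; the
example_* hooks are invented (bodies along the lines of the earlier
sketches), and ->release is left out since this version allows it to be
unimplemented.

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

/* Invented hooks, bodies along the lines of the earlier sketches. */
void example_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
                         unsigned long start, unsigned long end);
void example_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                       unsigned long start, unsigned long end);

static struct mmu_notifier_ops example_ops = {
        .invalidate_range_start = example_range_start,
        .invalidate_range_end   = example_range_end,
};

static struct mmu_notifier example_mn = {
        .ops = &example_ops,
};

static int example_attach(void)
{
        /*
         * Heavyweight by design: this is where mm_lock() runs, once per
         * registration (for KVM, once per VM created), never in a fast path.
         */
        return mmu_notifier_register(&example_mn, current->mm);
}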

And most importantly, we need something in now, regardless of which
API. We can handle a change of API just fine later.

mm_lock() is not even part of the mmu notifier API; it's just an
internal implementation detail, so whatever problem it has, or
whatever better name we can find, isn't a high priority right now.

If you suggest a better name now I'll fix it up immediately. I hope
the mm_lock name and whatever signed-off-by errors in the patches after
mmu-notifier-core won't really be why this doesn't go in.

Thanks a lot for taking the time to review, even if it wasn't as positive as
I hoped,
Andrea
Linus Torvalds
2008-05-07 22:18:32 UTC
Permalink
Post by Andrea Arcangeli
Post by Linus Torvalds
As far as I can tell, authorship has been destroyed by at least two of the
patches (ie Christoph seems to be the author, but Andrea seems to have
dropped that fact).
I can't follow this, please be more specific.
The patches were sent to lkml without *any* indication that you weren't
actually the author.

So if Andrew had merged them, they would have been merged as yours.
Post by Andrea Arcangeli
Post by Linus Torvalds
That "locking" code is also too ugly to live, at least without some
serious arguments for why it has to be done that way. Sorting the locks?
In a vmalloc'ed area? And calling this something innocuous like
"mm_lock()"? Hell no.
That's only invoked in mmu_notifier_register; mm_lock is explicitly
documented as a heavyweight function.
Is that an excuse for UTTER AND TOTAL CRAP?

Linus
Andrea Arcangeli
2008-05-07 22:29:14 UTC
Permalink
Post by Linus Torvalds
Post by Andrea Arcangeli
Post by Linus Torvalds
As far as I can tell, authorship has been destroyed by at least two of the
patches (ie Christoph seems to be the author, but Andrea seems to have
dropped that fact).
I can't follow this, please be more specific.
The patches were sent to lkml without *any* indication that you weren't
actually the author.
So if Andrew had merged them, they would have been merged as yours.
I rechecked and I guarantee that the patches where Christoph isn't
listed were developed by myself and he didn't write a single line of
them. In any case I expect Christoph to review (he's CCed) and to
point me to any attribution error. The only mistake I made once in that
area was to give too _little_ attribution to myself, and he asked me to
add myself in the signed-off, so I added myself at Christoph's own
request, but be sure I didn't remove him!
Roland Dreier
2008-05-07 22:32:03 UTC
Permalink
Post by Andrea Arcangeli
I rechecked and I guarantee that the patches where Christoph isn't
listed were developed by myself and he didn't write a single line of
them. In any case I expect Christoph to review (he's CCed) and to
point me to any attribution error. The only mistake I made once in that
area was to give too _little_ attribution to myself, and he asked me to
add myself in the signed-off, so I added myself at Christoph's own
request, but be sure I didn't remove him!
I think the point you're missing is that any patches written by
Christoph need a line like

From: Christoph Lameter <***@sgi.com>

at the top of the body so that Christoph becomes the author when it is
committed into git. The Signed-off-by: line needs to be preserved too
of course, but it is not sufficient by itself.
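In other words, a mail carrying one of Christoph's patches through
Andrea's tree would start its body roughly like this (an illustrative
mock-up, not one of the actual 11 mails):

From: Christoph Lameter <***@sgi.com>

<changelog text>

Signed-off-by: Christoph Lameter <***@sgi.com>
Signed-off-by: Andrea Arcangeli <***@qumranet.com>

git-am then records the in-body From: as the commit author, no matter
whose mailbox the patch arrived from.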

- R.
Andrea Arcangeli
2008-05-07 22:39:59 UTC
Permalink
Post by Roland Dreier
I think the point you're missing is that any patches written by
Christoph need a line like
From: Christoph Lameter <***@sgi.com>
at the top of the body so that Christoph becomes the author when it is
committed into git. The Signed-off-by: line needs to be preserved too
of course, but it is not sufficient by itself.
Ok, so I see the problem Linus is referring to now (I received the hint
by PM too). I thought the order of the signed-off-bys was relevant; it
clearly isn't, or we're wasting space ;)
Linus Torvalds
2008-05-07 23:05:44 UTC
Permalink
Post by Andrea Arcangeli
Ok so I see the problem Linus is referring to now (I received the hint
by PM too), I thought the order of the signed-off-by was relevant, it
clearly isn't or we're wasting space ;)
The order of the signed-offs is somewhat relevant, but no, sign-offs
don't mean authorship.

See the rules for sign-off: you can sign off on another person's patches,
even if they didn't sign off on them themselves. That's clause (b) in
particular.

So yes, quite often you'd _expect_ the first sign-off to match the author,
but that's a correlation, not a causal relationship.

Linus
Andrea Arcangeli
2008-05-07 22:38:13 UTC
Permalink
Post by Andrea Arcangeli
I rechecked and I guarantee that the patches where Christoph isn't
listed were developed by myself and he didn't write a single line of
them. In any case I expect Christoph to review (he's CCed) and to
point me to any attribution error. The only mistake I made once in that
area was to give too _little_ attribution to myself, and he asked me to
add myself in the signed-off, so I added myself at Christoph's own
request, but be sure I didn't remove him!
By PM (guess he's scared to post to this thread ;) Chris is telling
me that what you perhaps mean is that I should add a From: Christoph in the
body of the email if the first signed-off-by is from Christoph, to
indicate the first signoff was by him and that the patch in turn was
started by him. I thought the order of the signoffs was enough, but if
that From: was mandatory and missing, any such error obviously
wasn't intentional, especially given that I only left a signed-off-by:
christoph on his patches until he asked me to add my signoff
too. Correcting it is trivial, given that I carefully ordered the signoffs so
that the author is at the top of the signoff list.

At least for mmu-notifier-core given I obviously am the original
author of that code, I hope the From: of the email was enough even if
an additional From: andrea was missing in the body.

Also you can be sure that Christoph and especially Robin (XPMEM) will
be more than happy if all the patches with Christoph at the top of the
signed-off-by are merged in 2.6.26 despite there not being a From:
christoph at the top of the body ;). So I don't see a big deal here...
Linus Torvalds
2008-05-07 23:42:19 UTC
Permalink
Post by Andrea Arcangeli
At least for mmu-notifier-core given I obviously am the original
author of that code, I hope the From: of the email was enough even if
an additional From: andrea was missing in the body.
Ok, this whole series of patches has just been such a disaster that I'm
(a) disgusted that _anybody_ sent an Acked-by: for any of it, and (b) that
I'm still looking at it at all, but I am.

And quite frankly, the more I look, and the more answers from you I get,
the less I like it. And I didn't like it that much to start with, as you
may have noticed.

You say that "At least for mmu-notifier-core given I obviously am the
original author of that code", but that is not at all obvious either. One
of the reasons I stated that authorship seems to have been thrown away is
very much exactly in that first mmu-notifier-core patch:

+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <***@sgi.com>

so I would very strongly dispute that it's "obvious" that you are the
original author of the code there.

So there was a reason why I said that I thought authorship had been lost
somewhere along the way.

Linus
Linus Torvalds
2008-05-07 23:08:03 UTC
Permalink
Post by Andrea Arcangeli
I rechecked and I guarantee that the patches where Christoph isn't
listed were developed by myself and he didn't write a single line of
them.
How long have you been doing kernel development?

How about you read SubmittingPatches a few times before you show just how
clueless you are?

Hint: look for the string that says "From:".

Also look at the section that talks about "summary phrase". You got it all
wrong, and you don't even seem to realize that you got it wrong, even when
I told you.

Linus
Andrea Arcangeli
2008-05-07 14:39:45 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115129 -7200
# Node ID d60d200565abde6a8ed45271e53cde9c5c75b426
# Parent c5badbefeee07518d9d1acca13e94c981420317c
invalidate_page outside PT lock

Moves all mmu notifier methods outside the PT lock (a first, and not the
last, step to make them sleep capable).

Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -210,35 +210,6 @@ static inline void mmu_notifier_mm_destr
__mmu_notifier_mm_destroy(mm);
}

-/*
- * These two macros will sometime replace ptep_clear_flush.
- * ptep_clear_flush is impleemnted as macro itself, so this also is
- * implemented as a macro until ptep_clear_flush will converted to an
- * inline function, to diminish the risk of compilation failure. The
- * invalidate_page method over time can be moved outside the PT lock
- * and these two macros can be later removed.
- */
-#define ptep_clear_flush_notify(__vma, __address, __ptep) \
-({ \
- pte_t __pte; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __pte = ptep_clear_flush(___vma, ___address, __ptep); \
- mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
- __pte; \
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
-({ \
- int __young; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
- __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
- ___address); \
- __young; \
-})
-
#else /* CONFIG_MMU_NOTIFIER */

static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -274,9 +245,6 @@ static inline void mmu_notifier_mm_destr
{
}

-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-
#endif /* CONFIG_MMU_NOTIFIER */

#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -188,11 +188,13 @@ __xip_unmap (struct address_space * mapp
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
+ /* must invalidate_page _before_ freeing the page */
+ mmu_notifier_invalidate_page(mm, address);
page_cache_release(page);
}
}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1714,9 +1714,10 @@ static int do_wp_page(struct mm_struct *
*/
page_table = pte_offset_map_lock(mm, pmd, address,
&ptl);
- page_cache_release(old_page);
+ new_page = NULL;
if (!pte_same(*page_table, orig_pte))
goto unlock;
+ page_cache_release(old_page);

page_mkwrite = 1;
}
@@ -1732,6 +1733,7 @@ static int do_wp_page(struct mm_struct *
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
+ old_page = new_page = NULL;
goto unlock;
}

@@ -1776,7 +1778,7 @@ gotten:
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush_notify(vma, address, page_table);
+ ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
@@ -1788,12 +1790,18 @@ gotten:
} else
mem_cgroup_uncharge_page(new_page);

- if (new_page)
+unlock:
+ pte_unmap_unlock(page_table, ptl);
+
+ if (new_page) {
+ if (new_page == old_page)
+ /* cow happened, notify before releasing old_page */
+ mmu_notifier_invalidate_page(mm, address);
page_cache_release(new_page);
+ }
if (old_page)
page_cache_release(old_page);
-unlock:
- pte_unmap_unlock(page_table, ptl);
+
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -275,7 +275,7 @@ static int page_referenced_one(struct pa
unsigned long address;
pte_t *pte;
spinlock_t *ptl;
- int referenced = 0;
+ int referenced = 0, clear_flush_young = 0;

address = vma_address(page, vma);
if (address == -EFAULT)
@@ -288,8 +288,11 @@ static int page_referenced_one(struct pa
if (vma->vm_flags & VM_LOCKED) {
referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young_notify(vma, address, pte))
- referenced++;
+ } else {
+ clear_flush_young = 1;
+ if (ptep_clear_flush_young(vma, address, pte))
+ referenced++;
+ }

/* Pretend the page is referenced if the task has the
swap token and is in the middle of a page fault. */
@@ -299,6 +302,10 @@ static int page_referenced_one(struct pa

(*mapcount)--;
pte_unmap_unlock(pte, ptl);
+
+ if (clear_flush_young)
+ referenced += mmu_notifier_clear_flush_young(mm, address);
+
out:
return referenced;
}
@@ -458,7 +465,7 @@ static int page_mkclean_one(struct page
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
- entry = ptep_clear_flush_notify(vma, address, pte);
+ entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -466,6 +473,10 @@ static int page_mkclean_one(struct page
}

pte_unmap_unlock(pte, ptl);
+
+ if (ret)
+ mmu_notifier_invalidate_page(mm, address);
+
out:
return ret;
}
@@ -717,15 +728,14 @@ static int try_to_unmap_one(struct page
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young_notify(vma, address, pte)))) {
+ if (!migration && (vma->vm_flags & VM_LOCKED)) {
ret = SWAP_FAIL;
goto out_unmap;
}

/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -780,6 +790,8 @@ static int try_to_unmap_one(struct page

out_unmap:
pte_unmap_unlock(pte, ptl);
+ if (ret != SWAP_FAIL)
+ mmu_notifier_invalidate_page(mm, address);
out:
return ret;
}
@@ -818,7 +830,7 @@ static void try_to_unmap_cluster(unsigne
spinlock_t *ptl;
struct page *page;
unsigned long address;
- unsigned long end;
+ unsigned long start, end;

address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -839,6 +851,8 @@ static void try_to_unmap_cluster(unsigne
if (!pmd_present(*pmd))
return;

+ start = address;
+ mmu_notifier_invalidate_range_start(mm, start, end);
pte = pte_offset_map_lock(mm, pmd, address, &ptl);

/* Update high watermark before we lower rss */
@@ -850,12 +864,12 @@ static void try_to_unmap_cluster(unsigne
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));

- if (ptep_clear_flush_young_notify(vma, address, pte))
+ if (ptep_clear_flush_young(vma, address, pte))
continue;

/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
@@ -871,6 +885,7 @@ static void try_to_unmap_cluster(unsigne
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
+ mmu_notifier_invalidate_range_end(mm, start, end);
}

static int try_to_unmap_anon(struct page *page, int migration)
Rik van Riel
2008-05-07 17:42:04 UTC
Permalink
On Wed, 07 May 2008 16:35:53 +0200
Post by Andrea Arcangeli
# HG changeset patch
# Date 1210115129 -7200
# Node ID d60d200565abde6a8ed45271e53cde9c5c75b426
# Parent c5badbefeee07518d9d1acca13e94c981420317c
invalidate_page outside PT lock
Moves all mmu notifier methods outside the PT lock (a first, and not the
last, step to make them sleep capable).
This patch appears to undo some of the changes made by patch 01/11.

Would it be an idea to merge them into one, so the first patch
introduces the right conventions directly?
--
All rights reversed.
Andrea Arcangeli
2008-05-07 17:57:41 UTC
Permalink
Post by Rik van Riel
Would it be an idea to merge them into one, so the first patch
introduces the right conventions directly?
The only reason this isn't merged into one is that this requires
non-obvious (though not difficult) changes to the core VM code. I wanted to
keep an obviously safe approach for 2.6.26. The other conventions are
only needed by XPMEM, and XPMEM can't work without all the other patches anyway.
Andrea Arcangeli
2008-05-07 14:40:05 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115130 -7200
# Node ID 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09
# Parent d60d200565abde6a8ed45271e53cde9c5c75b426
free-pgtables

Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.

Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.

The first pointer argument to free_pgtables() can then be dropped.

Signed-off-by: Christoph Lameter <***@sgi.com>
Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru
void *private);
void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
- unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+ unsigned long ceiling);
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather **
} while (pgd++, addr = next, addr != end);
}

-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
- unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+ unsigned long ceiling)
{
+ struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@ void free_pgtables(struct mmu_gather **t
unlink_file_vma(vma);

if (is_vm_hugetlb_page(vma)) {
- hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+ tlb = tlb_gather_mmu(vma->vm_mm, 0);
+ hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
@@ -299,9 +302,11 @@ void free_pgtables(struct mmu_gather **t
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
- free_pgd_range(tlb, addr, vma->vm_end,
+ tlb = tlb_gather_mmu(vma->vm_mm, 0);
+ free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
}
+ tlb_finish_mmu(tlb, addr, vma->vm_end);
vma = next;
}
}
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1759,9 +1759,9 @@ static void unmap_region(struct mm_struc
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+ tlb_finish_mmu(tlb, start, end);
+ free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
- tlb_finish_mmu(tlb, start, end);
}

/*
@@ -2060,8 +2060,8 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ free_pgtables(vma, FIRST_USER_ADDRESS, 0);

/*
* Walk the list again, actually closing and freeing it,
Rik van Riel
2008-05-07 17:42:56 UTC
Permalink
On Wed, 07 May 2008 16:35:54 +0200
Acked-by: Rik van Riel <***@redhat.com>
--
All rights reversed.
Andrea Arcangeli
2008-05-07 14:40:43 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115131 -7200
# Node ID 20bc6a66a86ef6bd60919cc77ff51d4af741b057
# Parent 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09
unmap vmas tlb flushing

Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.

Signed-off-by: Christoph Lameter <***@sgi.com>
Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar

unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
- struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st

/**
* unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
* @vma: the starting vma
* @start_addr: virtual address at which to start unmapping
* @end_addr: virtual address at which to end unmapping
@@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st
* Unmap all pages in the vma list.
*
* We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
*
* Only addresses between `start' and `end' will be unmapped.
*
* The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns. So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
*/
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
- struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
{
@@ -883,9 +875,14 @@ unsigned long unmap_vmas(struct mmu_gath
int tlb_start_valid = 0;
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
- int fullmm = (*tlbp)->fullmm;
+ int fullmm;
+ struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;

+ lru_add_drain();
+ tlb = tlb_gather_mmu(mm, 0);
+ update_hiwater_rss(mm);
+ fullmm = tlb->fullmm;
mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -912,7 +909,7 @@ unsigned long unmap_vmas(struct mmu_gath
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
- start = unmap_page_range(*tlbp, vma,
+ start = unmap_page_range(tlb, vma,
start, end, &zap_work, details);

if (zap_work > 0) {
@@ -920,22 +917,23 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}

- tlb_finish_mmu(*tlbp, tlb_start, start);
+ tlb_finish_mmu(tlb, tlb_start, start);

if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
- *tlbp = NULL;
+ tlb = NULL;
goto out;
}
cond_resched();
}

- *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+ tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+ tlb_finish_mmu(tlb, start_addr, end_addr);
out:
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start; /* which is now the end (or restart) address */
@@ -951,18 +949,10 @@ unsigned long zap_page_range(struct vm_a
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
- struct mm_struct *mm = vma->vm_mm;
- struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;

- lru_add_drain();
- tlb = tlb_gather_mmu(mm, 0);
- update_hiwater_rss(mm);
- end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
- if (tlb)
- tlb_finish_mmu(tlb, address, end);
- return end;
+ return unmap_vmas(vma, address, end, &nr_accounted, details);
}

/*
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1751,15 +1751,10 @@ static void unmap_region(struct mm_struc
unsigned long start, unsigned long end)
{
struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
- struct mmu_gather *tlb;
unsigned long nr_accounted = 0;

- lru_add_drain();
- tlb = tlb_gather_mmu(mm, 0);
- update_hiwater_rss(mm);
- unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+ unmap_vmas(vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- tlb_finish_mmu(tlb, start, end);
free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
}
@@ -2044,7 +2039,6 @@ EXPORT_SYMBOL(do_brk);
/* Release all mmaps. */
void exit_mmap(struct mm_struct *mm)
{
- struct mmu_gather *tlb;
struct vm_area_struct *vma = mm->mmap;
unsigned long nr_accounted = 0;
unsigned long end;
@@ -2055,12 +2049,11 @@ void exit_mmap(struct mm_struct *mm)

lru_add_drain();
flush_cache_mm(mm);
- tlb = tlb_gather_mmu(mm, 1);
+
/* Don't update_hiwater_rss(mm) here, do_exit already did */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+ end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- tlb_finish_mmu(tlb, 0, end);
free_pgtables(vma, FIRST_USER_ADDRESS, 0);

/*
Rik van Riel
2008-05-07 17:56:52 UTC
Permalink
On Wed, 07 May 2008 16:35:55 +0200
Acked-by: Rik van Riel <***@redhat.com>
--
All rights reversed.
Andrea Arcangeli
2008-05-07 14:40:58 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115132 -7200
# Node ID 0621238970155f8ff2d60ca4996dcdd470f9c6ce
# Parent 20bc6a66a86ef6bd60919cc77ff51d4af741b057
rwsem contended

Add a function to rw_semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak to sched.h that works
in the same way as spinlock_needbreak().

Signed-off-by: Christoph Lameter <***@sgi.com>
Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore
*/
extern void downgrade_write(struct rw_semaphore *sem);

+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
* nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc
#endif
}

+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+ return rwsem_is_contended(sem);
+#else
+ return 0;
+#endif
+}
+
/*
* Reevaluate whether the task has signals pending delivery.
* Wake the task if so.
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho
spin_unlock_irqrestore(&sem->wait_lock, flags);
}

+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ /*
+ * Racy check for an empty list. False positives or negatives
+ * would be okay. False positive may cause a useless dropping of
+ * locks. False negatives may cause locks to be held a bit
+ * longer until the next check.
+ */
+ return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
EXPORT_SYMBOL(__init_rwsem);
EXPORT_SYMBOL(__down_read);
EXPORT_SYMBOL(__down_read_trylock);
diff --git a/lib/rwsem.c b/lib/rwsem.c
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do
return sem;
}

+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ /*
+ * Racy check for an empty list. False positives or negatives
+ * would be okay. False positive may cause a useless dropping of
+ * locks. False negatives may cause locks to be held a bit
+ * longer until the next check.
+ */
+ return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
EXPORT_SYMBOL(rwsem_down_read_failed);
EXPORT_SYMBOL(rwsem_down_write_failed);
EXPORT_SYMBOL(rwsem_wake);
Andrea Arcangeli
2008-05-07 14:41:32 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115135 -7200
# Node ID 58f716ad4d067afb6bdd1b5f7042e19d854aae0d
# Parent 0621238970155f8ff2d60ca4996dcdd470f9c6ce
i_mmap_rwsem

The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.

Signed-off-by: Andrea Arcangeli <***@qumranet.com>
Signed-off-by: Christoph Lameter <***@sgi.com>

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi
expand_stack(), it is hard to come up with a destructive scenario without
having the vmlist protection in this case.

-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
if (!vma_shareable(vma, addr))
return;

- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
out:
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}

/*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino
pgoff = offset >> PAGE_SHIFT;

i_size_write(inode, offset);
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
}
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
- spin_lock_init(&inode->i_data.i_mmap_lock);
+ init_rwsem(&inode->i_data.i_mmap_sem);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -502,7 +502,7 @@ struct address_space {
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
- spinlock_t i_mmap_lock; /* protect tree, count, list */
+ struct rw_semaphore i_mmap_sem; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,7 +735,7 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
- spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
+ struct rw_semaphore *i_mmap_sem; /* For unmap_mapping_range: */
unsigned long truncate_count; /* Compare vm_truncate_count */
};

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm
atomic_dec(&inode->i_writecount);

/* insert tmp into the share list, just after mpnt */
- spin_lock(&file->f_mapping->i_mmap_lock);
+ down_write(&file->f_mapping->i_mmap_sem);
tmp->vm_truncate_count = mpnt->vm_truncate_count;
flush_dcache_mmap_lock(file->f_mapping);
vma_prio_tree_add(tmp, mpnt);
flush_dcache_mmap_unlock(file->f_mapping);
- spin_unlock(&file->f_mapping->i_mmap_lock);
+ up_write(&file->f_mapping->i_mmap_sem);
}

/*
diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -61,16 +61,16 @@ generic_file_direct_IO(int rw, struct ki
/*
* Lock ordering:
*
- * ->i_mmap_lock (vmtruncate)
+ * ->i_mmap_sem (vmtruncate)
* ->private_lock (__free_pte->__set_page_dirty_buffers)
* ->swap_lock (exclusive_swap_page, others)
* ->mapping->tree_lock
*
* ->i_mutex
- * ->i_mmap_lock (truncate->unmap_mapping_range)
+ * ->i_mmap_sem (truncate->unmap_mapping_range)
*
* ->mmap_sem
- * ->i_mmap_lock
+ * ->i_mmap_sem
* ->page_table_lock or pte_lock (various, mainly in memory.c)
* ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock)
*
@@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki
* ->sb_lock (fs/fs-writeback.c)
* ->mapping->tree_lock (__sync_single_inode)
*
- * ->i_mmap_lock
+ * ->i_mmap_sem
* ->anon_vma.lock (vma_adjust)
*
* ->anon_vma.lock
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp
if (!page)
return;

- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
mm = vma->vm_mm;
address = vma->vm_start +
@@ -198,7 +198,7 @@ __xip_unmap (struct address_space * mapp
page_cache_release(page);
}
}
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}

/*
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -206,13 +206,13 @@ asmlinkage long sys_remap_file_pages(uns
}
goto out;
}
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
flush_dcache_mmap_lock(mapping);
vma->vm_flags |= VM_NONLINEAR;
vma_prio_tree_remove(vma, &mapping->i_mmap);
vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
flush_dcache_mmap_unlock(mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}

mmu_notifier_invalidate_range_start(mm, start, start + size);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -814,7 +814,7 @@ void __unmap_hugepage_range(struct vm_ar
struct page *page;
struct page *tmp;
/*
- * A page gathering list, protected by per file i_mmap_lock. The
+ * A page gathering list, protected by per file i_mmap_sem. The
* lock is used to avoid list corruption from multiple unmapping
* of the same page since we are using page->lru.
*/
@@ -864,9 +864,9 @@ void unmap_hugepage_range(struct vm_area
* do nothing in this case.
*/
if (vma->vm_file) {
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ down_write(&vma->vm_file->f_mapping->i_mmap_sem);
__unmap_hugepage_range(vma, start, end);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ up_write(&vma->vm_file->f_mapping->i_mmap_sem);
}
}

@@ -1111,7 +1111,7 @@ void hugetlb_change_protection(struct vm
BUG_ON(address >= end);
flush_cache_range(vma, address, end);

- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ down_write(&vma->vm_file->f_mapping->i_mmap_sem);
spin_lock(&mm->page_table_lock);
for (; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -1126,7 +1126,7 @@ void hugetlb_change_protection(struct vm
}
}
spin_unlock(&mm->page_table_lock);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ up_write(&vma->vm_file->f_mapping->i_mmap_sem);

flush_tlb_range(vma, start, end);
}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -874,7 +874,7 @@ unsigned long unmap_vmas(struct vm_area_
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
- spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+ struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL;
int fullmm;
struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;
@@ -920,8 +920,8 @@ unsigned long unmap_vmas(struct vm_area_
tlb_finish_mmu(tlb, tlb_start, start);

if (need_resched() ||
- (i_mmap_lock && spin_needbreak(i_mmap_lock))) {
- if (i_mmap_lock) {
+ (i_mmap_sem && rwsem_needbreak(i_mmap_sem))) {
+ if (i_mmap_sem) {
tlb = NULL;
goto out;
}
@@ -1829,7 +1829,7 @@ unwritable_page:
/*
* Helper functions for unmap_mapping_range().
*
- * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __
+ * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __
*
* We have to restart searching the prio_tree whenever we drop the lock,
* since the iterator is only valid while the lock is held, and anyway
@@ -1848,7 +1848,7 @@ unwritable_page:
* can't efficiently keep all vmas in step with mapping->truncate_count:
* so instead reset them all whenever it wraps back to 0 (then go to 1).
* mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_lock.
+ * i_mmap_sem.
*
* In order to make forward progress despite repeatedly restarting some
* large vma, note the restart_addr from unmap_vmas when it breaks out:
@@ -1898,7 +1898,7 @@ again:

restart_addr = zap_page_range(vma, start_addr,
end_addr - start_addr, details);
- need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+ need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem);

if (restart_addr >= end_addr) {
/* We have now completed this vma: mark it so */
@@ -1912,9 +1912,9 @@ again:
goto again;
}

- spin_unlock(details->i_mmap_lock);
+ up_write(details->i_mmap_sem);
cond_resched();
- spin_lock(details->i_mmap_lock);
+ down_write(details->i_mmap_sem);
return -EINTR;
}

@@ -2008,9 +2008,9 @@ void unmap_mapping_range(struct address_
details.last_index = hba + hlen - 1;
if (details.last_index < details.first_index)
details.last_index = ULONG_MAX;
- details.i_mmap_lock = &mapping->i_mmap_lock;
+ details.i_mmap_sem = &mapping->i_mmap_sem;

- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);

/* Protect against endless unmapping loops */
mapping->truncate_count++;
@@ -2025,7 +2025,7 @@ void unmap_mapping_range(struct address_
unmap_mapping_range_tree(&mapping->i_mmap, &details);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
EXPORT_SYMBOL(unmap_mapping_range);

diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s
if (!mapping)
return;

- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);

vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
remove_migration_pte(vma, old, new);

- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}

/*
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -189,7 +189,7 @@ error:
}

/*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires inode->i_mapping->i_mmap_sem
*/
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
@@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru

if (file) {
struct address_space *mapping = file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
__remove_shared_vm_struct(vma, file, mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
}

@@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m
mapping = vma->vm_file->f_mapping;

if (mapping) {
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
vma->vm_truncate_count = mapping->truncate_count;
}
anon_vma_lock(vma);
@@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m

anon_vma_unlock(vma);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);

mm->map_count++;
validate_mm(mm);
@@ -542,7 +542,7 @@ again: remove_next = 1 + (end > next->
mapping = file->f_mapping;
if (!(vma->vm_flags & VM_NONLINEAR))
root = &mapping->i_mmap;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
if (importer &&
vma->vm_truncate_count != next->vm_truncate_count) {
/*
@@ -626,7 +626,7 @@ again: remove_next = 1 + (end > next->
if (anon_vma)
spin_unlock(&anon_vma->lock);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);

if (remove_next) {
if (file) {
@@ -2068,7 +2068,7 @@ void exit_mmap(struct mm_struct *mm)

/* Insert vm structure into process list sorted by address
* and into the inode's i_mmap tree. If vm_file is non-NULL
- * then i_mmap_lock is taken here.
+ * then i_mmap_sem is taken here.
*/
int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
{
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -88,7 +88,7 @@ static void move_ptes(struct vm_area_str
* and we propagate stale pages into the dst afterward.
*/
mapping = vma->vm_file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
if (new_vma->vm_truncate_count &&
new_vma->vm_truncate_count != vma->vm_truncate_count)
new_vma->vm_truncate_count = 0;
@@ -120,7 +120,7 @@ static void move_ptes(struct vm_area_str
pte_unmap_nested(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
}

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -24,7 +24,7 @@
* inode->i_alloc_sem (vmtruncate_range)
* mm->mmap_sem
* page->flags PG_locked (lock_page)
- * mapping->i_mmap_lock
+ * mapping->i_mmap_sem
* anon_vma->lock
* mm->page_table_lock or pte_lock
* zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -373,14 +373,14 @@ static int page_referenced_file(struct p
* The page lock not only makes sure that page->mapping cannot
* suddenly be NULLified by truncation, it makes sure that the
* structure at mapping cannot be freed and reused yet,
- * so we can safely take mapping->i_mmap_lock.
+ * so we can safely take mapping->i_mmap_sem.
*/
BUG_ON(!PageLocked(page));

- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);

/*
- * i_mmap_lock does not stabilize mapcount at all, but mapcount
+ * i_mmap_sem does not stabilize mapcount at all, but mapcount
* is more likely to be accurate if we note it after spinning.
*/
mapcount = page_mapcount(page);
@@ -403,7 +403,7 @@ static int page_referenced_file(struct p
break;
}

- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
return referenced;
}

@@ -490,12 +490,12 @@ static int page_mkclean_file(struct addr

BUG_ON(PageAnon(page));

- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
if (vma->vm_flags & VM_SHARED)
ret += page_mkclean_one(page, vma);
}
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
return ret;
}

@@ -930,7 +930,7 @@ static int try_to_unmap_file(struct page
unsigned long max_nl_size = 0;
unsigned int mapcount;

- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
ret = try_to_unmap_one(page, vma, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
@@ -967,7 +967,6 @@ static int try_to_unmap_file(struct page
mapcount = page_mapcount(page);
if (!mapcount)
goto out;
- cond_resched_lock(&mapping->i_mmap_lock);

max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
if (max_nl_cursor == 0)
@@ -989,7 +988,6 @@ static int try_to_unmap_file(struct page
}
vma->vm_private_data = (void *) max_nl_cursor;
}
- cond_resched_lock(&mapping->i_mmap_lock);
max_nl_cursor += CLUSTER_SIZE;
} while (max_nl_cursor <= max_nl_size);

@@ -1001,7 +999,7 @@ static int try_to_unmap_file(struct page
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
vma->vm_private_data = NULL;
out:
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
return ret;
}

Andrea Arcangeli
2008-05-07 14:41:56 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115136 -7200
# Node ID 6b384bb988786aa78ef07440180e4b2948c4c6a2
# Parent 58f716ad4d067afb6bdd1b5f7042e19d854aae0d
anon-vma-rwsem

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks, and increases the potential concurrency.

Rcu is used in some contexts to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
to the anon_vma has to be established or removed. Overhead also increases
when a speculative reference is used (try_to_unmap,
page_mkclean, page migration).
- There is the potential for more frequent processor changes due to up_xxx
letting waiting tasks run first. This results in f.e. the Aim9 brk
performance test going down by 10-15%.

Signed-off-by: Christoph Lameter <***@sgi.com>
Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
* pointing to this anon_vma once its vma list is empty.
*/
struct anon_vma {
- spinlock_t lock; /* Serialize access to vma list */
+ atomic_t refcount; /* vmas on the list */
+ struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
};

@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct
kmem_cache_free(anon_vma_cachep, anon_vma);
}

+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+ atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+ if (atomic_dec_and_test(&anon_vma->refcount))
+ anon_vma_free(anon_vma);
+}
+
static inline void anon_vma_lock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
}

static inline void anon_vma_unlock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
}

/*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
return;

/*
- * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+ * We hold either the mmap_sem lock or a reference on the
+ * anon_vma. So no need to call page_lock_anon_vma.
*/
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
- spin_lock(&anon_vma->lock);
+ down_read(&anon_vma->sem);

list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);

- spin_unlock(&anon_vma->lock);
+ up_read(&anon_vma->sem);
}

/*
@@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
- int rcu_locked = 0;
+ struct anon_vma *anon_vma = NULL;
int charge = 0;

if (!newpage)
@@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get
}
/*
* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
- * we cannot notice that anon_vma is freed while we migrates a page.
+ * we cannot notice that anon_vma is freed while we migrate a page.
* This rcu_read_lock() delays freeing anon_vma pointer until the end
* of migration. File cache pages are no problem because of page_lock()
* File Caches may use write_page() or lock_page() in migration, then,
* just care Anon page here.
*/
- if (PageAnon(page)) {
- rcu_read_lock();
- rcu_locked = 1;
- }
+ if (PageAnon(page))
+ anon_vma = grab_anon_vma(page);

/*
* Corner case handling:
@@ -681,10 +680,7 @@ static int unmap_and_move(new_page_t get
if (!PageAnon(page) && PagePrivate(page)) {
/*
* Go direct to try_to_free_buffers() here because
- * a) that's what try_to_release_page() would do anyway
- * b) we may be under rcu_read_lock() here, so we can't
- * use GFP_KERNEL which is what try_to_release_page()
- * needs to be effective.
+ * that's what try_to_release_page() would do anyway
*/
try_to_free_buffers(page);
}
@@ -705,8 +701,8 @@ static int unmap_and_move(new_page_t get
} else if (charge)
mem_cgroup_end_migration(newpage);
rcu_unlock:
- if (rcu_locked)
- rcu_read_unlock();
+ if (anon_vma)
+ put_anon_vma(anon_vma);

unlock:

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -570,7 +570,7 @@ again: remove_next = 1 + (end > next->
if (vma->anon_vma)
anon_vma = vma->anon_vma;
if (anon_vma) {
- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
/*
* Easily overlooked: when mprotect shifts the boundary,
* make sure the expanding vma has anon_vma set if the
@@ -624,7 +624,7 @@ again: remove_next = 1 + (end > next->
}

if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
if (mapping)
up_write(&mapping->i_mmap_sem);

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -69,7 +69,7 @@ int anon_vma_prepare(struct vm_area_stru
if (anon_vma) {
allocated = NULL;
locked = anon_vma;
- spin_lock(&locked->lock);
+ down_write(&locked->sem);
} else {
anon_vma = anon_vma_alloc();
if (unlikely(!anon_vma))
@@ -81,6 +81,7 @@ int anon_vma_prepare(struct vm_area_stru
/* page_table_lock to protect against threads */
spin_lock(&mm->page_table_lock);
if (likely(!vma->anon_vma)) {
+ get_anon_vma(anon_vma);
vma->anon_vma = anon_vma;
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
allocated = NULL;
@@ -88,7 +89,7 @@ int anon_vma_prepare(struct vm_area_stru
spin_unlock(&mm->page_table_lock);

if (locked)
- spin_unlock(&locked->lock);
+ up_write(&locked->sem);
if (unlikely(allocated))
anon_vma_free(allocated);
}
@@ -99,14 +100,17 @@ void __anon_vma_merge(struct vm_area_str
{
BUG_ON(vma->anon_vma != next->anon_vma);
list_del(&next->anon_vma_node);
+ put_anon_vma(vma->anon_vma);
}

void __anon_vma_link(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;

- if (anon_vma)
+ if (anon_vma) {
+ get_anon_vma(anon_vma);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ }
}

void anon_vma_link(struct vm_area_struct *vma)
@@ -114,36 +118,32 @@ void anon_vma_link(struct vm_area_struct
struct anon_vma *anon_vma = vma->anon_vma;

if (anon_vma) {
- spin_lock(&anon_vma->lock);
+ get_anon_vma(anon_vma);
+ down_write(&anon_vma->sem);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
}
}

void anon_vma_unlink(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- int empty;

if (!anon_vma)
return;

- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
list_del(&vma->anon_vma_node);
-
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- spin_unlock(&anon_vma->lock);
-
- if (empty)
- anon_vma_free(anon_vma);
+ up_write(&anon_vma->sem);
+ put_anon_vma(anon_vma);
}

static void anon_vma_ctor(struct kmem_cache *cachep, void *data)
{
struct anon_vma *anon_vma = data;

- spin_lock_init(&anon_vma->lock);
+ init_rwsem(&anon_vma->sem);
+ atomic_set(&anon_vma->refcount, 0);
INIT_LIST_HEAD(&anon_vma->head);
}

@@ -157,9 +157,9 @@ void __init anon_vma_init(void)
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *grab_anon_vma(struct page *page)
{
- struct anon_vma *anon_vma;
+ struct anon_vma *anon_vma = NULL;
unsigned long anon_mapping;

rcu_read_lock();
@@ -170,17 +170,26 @@ static struct anon_vma *page_lock_anon_v
goto out;

anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
- spin_lock(&anon_vma->lock);
- return anon_vma;
+ if (!atomic_inc_not_zero(&anon_vma->refcount))
+ anon_vma = NULL;
out:
rcu_read_unlock();
- return NULL;
+ return anon_vma;
+}
+
+static struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+ struct anon_vma *anon_vma = grab_anon_vma(page);
+
+ if (anon_vma)
+ down_read(&anon_vma->sem);
+ return anon_vma;
}

static void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
- spin_unlock(&anon_vma->lock);
- rcu_read_unlock();
+ up_read(&anon_vma->sem);
+ put_anon_vma(anon_vma);
}

/*
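
For illustration, a minimal self-contained sketch of the refcount pattern the patch introduces (the type and the free helper are hypothetical stand-ins, not the kernel structures themselves, and it assumes the memory is freed via RCU or a type-stable slab so the pointer stays valid inside the read-side critical section): lookups may only take a reference with atomic_inc_not_zero() under rcu_read_lock(), so an object whose count has already dropped to zero is never resurrected while it is being torn down. This mirrors the grab_anon_vma()/put_anon_vma() pairing in the hunks above.

#include <linux/rcupdate.h>
#include <asm/atomic.h>

struct obj {
        atomic_t refcount;              /* one reference per user */
        /* ... payload ... */
};

/* Lookup side: only succeeds while the count is still non-zero. */
static struct obj *obj_get_rcu(struct obj *o)
{
        struct obj *ret = NULL;

        rcu_read_lock();
        if (o && atomic_inc_not_zero(&o->refcount))
                ret = o;                /* reference taken, teardown can't race */
        rcu_read_unlock();
        return ret;
}

static void obj_put(struct obj *o)
{
        if (atomic_dec_and_test(&o->refcount))
                obj_free(o);            /* hypothetical, e.g. kmem_cache_free() */
}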

Linus Torvalds
2008-05-07 20:58:41 UTC
Post by Andrea Arcangeli
Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It includes possible concurrency.
This also looks very debatable indeed. The only performance numbers quoted are:
Post by Andrea Arcangeli
This results in f.e. the Aim9 brk performance test to go down by 10-15%.
which just seems like a total disaster.

The whole series looks bad, in fact. Lack of authorship, bad single-line
description, and the code itself sucks so badly that it's not even funny.

NAK NAK NAK. All of it. It stinks.

Linus

Andrea Arcangeli
2008-05-07 21:27:39 UTC
Post by Linus Torvalds
This also looks very debatable indeed. The only performance numbers quoted are:
Post by Andrea Arcangeli
This results in f.e. the Aim9 brk performance test to go down by 10-15%.
which just seems like a total disaster.
The whole series looks bad, in fact. Lack of authorship, bad single-line
Glad you agree. Note that the fact the whole series looks bad is
_exactly_ why I couldn't let Christoph keep going with
mmu-notifier-core at the very end of his patchset. I had to move it to
the top to have a chance to get the KVM and GRU requirements merged
in 2.6.26.

I think the spinlock->rwsem conversion is ok under config option, as
you can see I myself complained about several of those patches, and I'll
take care they're in a mergeable state the moment I submit them. What
XPMEM requires are different semantics for the methods, and we never
had to do any blocking I/O during vmtruncate before, now we have to.
And I don't see a problem in making the conversion from
spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on
anything but ia64.

Please ignore all patches but mmu-notifier-core. I regularly forward
_only_ mmu-notifier-core to Andrew, that's the only one that is in
merge-ready status, everything else is just so XPMEM can test and we
can keep discussing it to bring it in a mergeable state like
mmu-notifier-core already is.

Linus Torvalds
2008-05-07 21:38:52 UTC
Post by Andrea Arcangeli
I think the spinlock->rwsem conversion is ok under config option, as
you can see I complained myself to various of those patches and I'll
take care they're in a mergeable state the moment I submit them. What
XPMEM requires are different semantics for the methods, and we never
had to do any blocking I/O during vmtruncate before, now we have to.
I really suspect we don't really have to, and that it would be better to
just fix the code that does that.
Post by Andrea Arcangeli
Please ignore all patches but mmu-notifier-core. I regularly forward
_only_ mmu-notifier-core to Andrew, that's the only one that is in
merge-ready status, everything else is just so XPMEM can test and we
can keep discussing it to bring it in a mergeable state like
mmu-notifier-core already is.
The thing is, I didn't like that one *either*. I thought it was the
biggest turd in the series (and by "biggest", I literally mean "most lines
of turd-ness" rather than necessarily "ugliest per se").

I literally think that mm_lock() is an unbelievable piece of utter and
horrible CRAP.

There's simply no excuse for code like that.

If you want to avoid the deadlock from taking multiple locks in order, but
there is really just a single operation that needs it, there's a really
really simple solution.

And that solution is *not* to sort the whole damn f*cking list in a
vmalloc'ed data structure prior to locking!

Damn.

No, the simple solution is to just make up a whole new upper-level lock,
and get that lock *first*. You can then take all the multiple locks at a
lower level in any order you damn well please.

And yes, it's one more lock, and yes, it serializes stuff, but:

- that code had better not be critical anyway, because if it was, then
the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_

- parallelism is overrated: it doesn't matter one effing _whit_ if
something is a hundred times more parallel, if it's also a hundred
times *SLOWER*.

So dang it, flush the whole damn series down the toilet and either forget
the thing entirely, or re-do it sanely.

And here's an admission that I lied: it wasn't *all* clearly crap. I did
like one part, namely list_del_init_rcu(), but that one should have been
in a separate patch. I'll happily apply that one.

Linus

Andrea Arcangeli
2008-05-07 22:22:40 UTC
Post by Linus Torvalds
Post by Andrea Arcangeli
had to do any blocking I/O during vmtruncate before, now we have to.
I really suspect we don't really have to, and that it would be better to
just fix the code that does that.
I'll let you discuss it with Christoph and Robin. The moment I
heard the schedule inside ->invalidate_page() requirement I reacted
the same way you did. But I don't see any other real solution for XPMEM
other than spin-looping, halting the scheduler for ages while
the ack is received from the network device.

But mm_lock is required even without XPMEM. And srcu is also required
without XPMEM to allow ->release to schedule (however downgrading srcu
to rcu will result in a very small patch, srcu and rcu are about the
same with a kernel supporting preempt=y like 2.6.26).
Post by Linus Torvalds
I literally think that mm_lock() is an unbelievable piece of utter and
horrible CRAP.
There's simply no excuse for code like that.
I think it's a great smp scalability optimization over the global lock
you're proposing below.
Post by Linus Torvalds
No, the simple solution is to just make up a whole new upper-level lock,
and get that lock *first*. You can then take all the multiple locks at a
lower level in any order you damn well please.
Unfortunately the lock you're talking about would be:

static spinlock_t global_lock = ...

There's no way to make it more granular.

So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
we'd need to take that extremely wide spinlock first (and even worse,
later it would become a rwsem when XPMEM is selected making the VM
even slower than it already becomes when XPMEM support is selected at
compile time).
Post by Linus Torvalds
- that code had better not be critical anyway, because if it was, then
the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_
mmu_notifier_register can take ages. No problem.
Post by Linus Torvalds
- parallelism is overrated: it doesn't matter one effing _whit_ if
something is a hundred times more parallel, if it's also a hundred
times *SLOWER*.
mmu_notifier_register is fine to be a hundred times slower (preempt-rt
will turn all locks in spinlocks so no problem).
Post by Linus Torvalds
And here's an admission that I lied: it wasn't *all* clearly crap. I did
like one part, namely list_del_init_rcu(), but that one should have been
in a separate patch. I'll happily apply that one.
Sure, I'll split it from the rest if the mmu-notifier-core isn't merged.

My objective has been:

1) add zero overhead to the VM before anybody starts a VM with kvm and
still zero overhead for all other tasks except the task where the
VM runs. The only exception is the unlikely(!mm->mmu_notifier_mm)
check that is optimized away too when CONFIG_KVM=n. And even for
that check my invalidate_page reduces the number of branches to the
absolute minimum possible.

2) avoid any new cacheline collision in the fast paths to allow numa
systems not to nearly-crash (mm->mmu_notifier_mm will be shared and
never written, except during the first mmu_notifier_register)

3) avoid any risk of introducing regressions in 2.6.26 (the patch must
   be obviously safe). Even if mm_lock were a bad idea like you
   say, it's orders of magnitude safer even if entirely broken than
   messing with the VM core locking in 2.6.26.

mm_lock (or whatever name you like to give it, I admit mm_lock may not
sound worrisome enough to stop people getting the idea of calling it in a
fast path) is going to be the real deal for the long term to allow
mmu_notifier_register to serialize against
invalidate_page_start/end. If I fail in 2.6.26 I'll offer
maintainership to Christoph as promised, and you'll find him pushing
for mm_lock to be merged (as XPMEM/GRU aren't technologies running on
cellphones where your global wide spinlock is optimized away at
compile time, and he also has to deal with XPMEM where such a spinlock
would need to become a rwsem as the anon_vma->sem has to be taken
after it), but let's assume you're entirely right here that
mm_lock is going to be dropped and there's a better way: it's still a
fine solution for 2.6.26.

And if you prefer, I can move the whole mm_lock() from mmap.c/mm.h to
mmu_notifier.[ch] so you don't get any pollution in the core VM; then
mm_lock will be invisible to everything but callers of
mmu_notifier_register(), and it will be trivial to remove later if
you really want to add a global spinlock, as without my mm_lock there's
no way to be more granular than a _global_ numa-wide spinlock taken
before any i_mmap_lock/anon_vma->lock.

Andrew Morton
2008-05-07 22:33:39 UTC
On Thu, 8 May 2008 00:22:05 +0200
Post by Andrea Arcangeli
Post by Linus Torvalds
No, the simple solution is to just make up a whole new upper-level lock,
and get that lock *first*. You can then take all the multiple locks at a
lower level in any order you damn well please.
static spinlock_t global_lock = ...
There's no way to make it more granular.
So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
we'd need to take that extremely wide spinlock first (and even worse,
later it would become a rwsem when XPMEM is selected making the VM
even slower than it already becomes when XPMEM support is selected at
compile time).
Nope. We only need to take the global lock before taking *two or more* of
the per-vma locks.

I really wish I'd thought of that.

Andrea Arcangeli
2008-05-07 22:44:56 UTC
Post by Andrew Morton
Nope. We only need to take the global lock before taking *two or more* of
the per-vma locks.
I really wish I'd thought of that.
I don't see how you can avoid taking the system-wide-global lock
before every single anon_vma->lock/i_mmap_lock out there without
mm_lock.

Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.

vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
than one lock and we've to still take the global-system-wide lock
_before_ this single i_mmap_lock and no other lock at all.

Please elaborate, thanks!

Andrew Morton
2008-05-07 23:01:38 UTC
On Thu, 8 May 2008 00:44:06 +0200
Post by Andrea Arcangeli
Post by Andrew Morton
Nope. We only need to take the global lock before taking *two or more* of
the per-vma locks.
I really wish I'd thought of that.
I don't see how you can avoid taking the system-wide-global lock
before every single anon_vma->lock/i_mmap_lock out there without
mm_lock.
Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.
vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
than one lock and we've to still take the global-system-wide lock
_before_ this single i_mmap_lock and no other lock at all.
Please elaborate, thanks!
umm...


CPU0:                           CPU1:

spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_lock(a->lock);

bad.

CPU0:                           CPU1:

spin_lock(global_lock)          spin_lock(global_lock);
spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_lock(a->lock);

Is OK.


CPU0:                           CPU1:

spin_lock(global_lock)
spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_unlock(b->lock);
                                spin_lock(a->lock);
                                spin_unlock(a->lock);

also OK.

As long as all code paths which can take two-or-more locks are all covered
by the global lock there is no deadlock scenario. If a thread takes just a
single instance of one of these locks without taking the global_lock then
there is also no deadlock.


Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly
added mm_lock() thing and we also take both those locks at the same time in
regular code, we're probably screwed.
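
For concreteness, a condensed sketch of the scheme described above (all names are made up): the outer lock serializes only the rare many-locks takers, paths that take a single per-object lock stay untouched, and the array is assumed to contain no duplicate locks - which is exactly the complication raised further down in the thread.

static DEFINE_SPINLOCK(outer_lock);     /* taken only by many-locks takers */

static void take_many(spinlock_t **locks, int n)
{
        int i;

        spin_lock(&outer_lock);
        for (i = 0; i < n; i++)
                spin_lock(locks[i]);    /* any order is deadlock-free now */
        spin_unlock(&outer_lock);       /* can be dropped once all are held */
}

static void drop_many(spinlock_t **locks, int n)
{
        int i;

        for (i = 0; i < n; i++)
                spin_unlock(locks[i]);  /* no outer lock needed to unlock */
}

(lockdep nesting annotations are omitted in this sketch)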

Linus Torvalds
2008-05-07 23:20:42 UTC
Post by Andrew Morton
Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly
added mm_lock() thing and we also take both those locks at the same time in
regular code, we're probably screwed.
No, just use the normal static ordering for that case: one type of lock
goes before the other kind. If those locks nest in regular code, you have
to do that *anyway*.

The code that can take many locks, will have to get the global lock *and*
order the types, but that's still trivial. It's something like

        spin_lock(&global_lock);
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (vma->anon_vma)
                        spin_lock(&vma->anon_vma->lock);
        }
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
                        spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
        }
        spin_unlock(&global_lock);

and now everybody follows the rule that "anon_vma->lock" precedes
"i_mmap_lock". So there can be no ABBA deadlock between the normal users
and the many-locks version, and there can be no ABBA deadlock between
many-locks-takers because they use the global_lock to serialize.

This really isn't rocket science, guys.

(I really hope and believe that they don't nest anyway, and that you can
just use a single for-loop for the many-lock case)

Linus

Christoph Lameter
2008-05-07 23:40:19 UTC
Post by Linus Torvalds
The code that can take many locks, will have to get the global lock *and*
order the types, but that's still trivial. It's something like
        spin_lock(&global_lock);
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (vma->anon_vma)
                        spin_lock(&vma->anon_vma->lock);
        }
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
                        spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
        }
        spin_unlock(&global_lock);
Multiple vmas may share the same mapping or refer to the same anonymous
vma. The above code will deadlock since we may take some locks multiple
times.

Linus Torvalds
2008-05-08 00:10:55 UTC
Post by Christoph Lameter
Multiple vmas may share the same mapping or refer to the same anonymous
vma. The above code will deadlock since we may take some locks multiple
times.
Ok, so that actually _is_ a problem. It would be easy enough to also add
just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing
a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's
suggestion of just using stop-machine is actually the right one just
because it's _so_ simple.

(That said, we're not running out of vm flags yet, and if we were, we
could just add another word. We're already wasting that space right now on
64-bit by calling it "unsigned long").

Linus

Robin Holt
2008-05-08 00:53:37 UTC
Post by Linus Torvalds
Post by Christoph Lameter
Multiple vmas may share the same mapping or refer to the same anonymous
vma. The above code will deadlock since we may take some locks multiple
times.
Ok, so that actually _is_ a problem. It would be easy enough to also add
just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing
a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's
suggestion of just using stop-machine is actually the right one just
because it's _so_ simple.
Also, stop-machine will not work if we come to the conclusion that
i_mmap_lock and anon_vma->lock need to be sleepable locks.

Thanks,
Robin Holt

Christoph Lameter
2008-05-08 00:56:39 UTC
Post by Linus Torvalds
Post by Christoph Lameter
Multiple vmas may share the same mapping or refer to the same anonymous
vma. The above code will deadlock since we may take some locks multiple
times.
Ok, so that actually _is_ a problem. It would be easy enough to also add
just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing
a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's
suggestion of just using stop-machine is actually the right one just
because it's _so_ simple.
Set the vma flag when we locked it and then skip when we find it locked
right? This would be in addition to the global lock?

stop-machine would work for KVM since it's a once-in-a-guest-OS-lifetime
kind of thing. But GRU, KVM and eventually Infiniband need the ability to attach
in a reasonable timeframe without causing major hiccups for other
processes.
Post by Linus Torvalds
(That said, we're not running out of vm flags yet, and if we were, we
could just add another word. We're already wasting that space right now on
64-bit by calling it "unsigned long").
We sure have enough flags.


Linus Torvalds
2008-05-08 01:09:26 UTC
Post by Christoph Lameter
Set the vma flag when we locked it and then skip when we find it locked
right? This would be in addition to the global lock?
Yes. And clear it before unlocking (and again, testing if it's already
clear - you mustn't unlock twice, so you must only unlock when the bit
was set).

You also (obviously) need to have something that guarantees that the lists
themselves are stable over the whole sequence, but I assume you already
have mmap_sem for reading (since you'd need it anyway just to follow the
list).

And if you have it for writing, it can obviously *act* as the global lock,
since it would already guarantee mutual exclusion on that mm->mmap list.

Linus
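
A sketch of the flag-based de-duplication being described here, with hypothetical field and function names (a scheme along these lines later landed in mainline as mm_take_all_locks()): the outer mutex is held for the whole take/drop window, mmap_sem keeps the vma list stable, and each shared anon_vma is locked only the first time it is encountered.

static DEFINE_MUTEX(multilock_mutex);

static void take_all_anon_vma_locks(struct mm_struct *mm)
{
        struct vm_area_struct *vma;

        /* caller holds mmap_sem, so mm->mmap cannot change under us */
        mutex_lock(&multilock_mutex);
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                struct anon_vma *av = vma->anon_vma;

                if (av && !av->multilocked) {   /* first time we see it */
                        av->multilocked = 1;
                        spin_lock(&av->lock);
                }
        }
}

static void drop_all_anon_vma_locks(struct mm_struct *mm)
{
        struct vm_area_struct *vma;

        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                struct anon_vma *av = vma->anon_vma;

                if (av && av->multilocked) {
                        av->multilocked = 0;    /* clear before unlocking */
                        spin_unlock(&av->lock);
                }
        }
        mutex_unlock(&multilock_mutex);
}

The same marking trick works for the i_mmap side with a flag bit in struct address_space; the hypothetical 'multilocked' field is only ever touched under the mutex, so concurrent multi-lock takers cannot confuse each other.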

Linus Torvalds
2008-05-08 01:42:11 UTC
Post by Christoph Lameter
Post by Linus Torvalds
(That said, we're not running out of vm flags yet, and if we were, we
could just add another word. We're already wasting that space right now on
64-bit by calling it "unsigned long").
We sure have enough flags.
Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are
unique), we need to mark the address spaces/anonvma's. So the flag would
need to be in the "struct anon_vma" (and struct address_space), not in the
vma itself. My bad. So the flag wouldn't be one of the VM_xyzzy flags, and
would require adding a new field to "struct anon_vma()"

And related to that brain-fart of mine, that obviously also means that
yes, the locking has to be stronger than "mm->mmap_sem" held for writing,
so yeah, it would have be a separate global spinlock (or perhaps a
blocking lock if you have some reason to protect anything else with this
too).

Linus

Andrea Arcangeli
2008-05-08 01:53:27 UTC
Post by Linus Torvalds
Post by Christoph Lameter
Post by Linus Torvalds
(That said, we're not running out of vm flags yet, and if we were, we
could just add another word. We're already wasting that space right now on
64-bit by calling it "unsigned long").
We sure have enough flags.
Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are
unique), we need to mark the address spaces/anonvma's. So the flag would
need to be in the "struct anon_vma" (and struct address_space), not in the
vma itself. My bad. So the flag wouldn't be one of the VM_xyzzy flags, and
would require adding a new field to "struct anon_vma()"
And related to that brain-fart of mine, that obviously also means that
yes, the locking has to be stronger than "mm->mmap_sem" held for writing,
so yeah, it would have be a separate global spinlock (or perhaps a
blocking lock if you have some reason to protect anything else with this
So because the bitflag can't prevent taking the same lock twice on two
different vmas in the same mm, we still can't remove the sorting, and
the global lock won't buy much other than reducing the collisions. I
can add that though.

I think it's more interesting to put a cap on the number of vmas to
min(1024,max_map_count). The sort time on an 8k array runs in constant
time. kvm runs with 127 vmas allocated...

Linus Torvalds
2008-05-08 01:58:41 UTC
Post by Andrea Arcangeli
So because the bitflag can't prevent taking the same lock twice on two
different vmas in the same mm, we still can't remove the sorting
Andrea.

Take five minutes. Take a deep breadth. And *think* about actually reading
what I wrote.

The bitflag *can* prevent taking the same lock twice. It just needs to be
in the right place.
Post by Andrea Arcangeli
So the flag wouldn't be one of the VM_xyzzy flags, and would require
adding a new field to "struct anon_vma()"
IOW, just make it be in that anon_vma (and the address_space). No sorting
required.
Post by Andrea Arcangeli
I think it's more interesting to put a cap on the number of vmas to
min(1024,max_map_count). The sort time on an 8k array runs in constant
time.
Shut up already. It's not constant time just because you can cap the
overhead. We're not in a university, and we care about performance, not
your made-up big-O notation.

Linus

Andrea Arcangeli
2008-05-08 02:25:07 UTC
Post by Linus Torvalds
Take five minutes. Take a deep breadth. And *think* about actually reading
what I wrote.
The bitflag *can* prevent taking the same lock twice. It just needs to be
in the right place.
It's not that I didn't read it, but to do it I've to grow every
anon_vma by 8 bytes. I thought it was implicit that the conclusion of
your email is that it couldn't possibly make sense to grow the size of
each anon_vma by 33%, when nobody loaded the kvm or gru or xpmem
kernel modules. It surely isn't my preferred solution, while capping
the number of vmas to 1024 means sort() will make around 10240 steps,
Matt call tell the exact number. The big cost shouldn't be in
sort. Even 512 vmas will be more than enough for us infact. Note that
I've a cond_resched in the sort compare function and I can re-add the
signal_pending check. I had the signal_pending check in the original
version that didn't use sort() but was doing an inner loop, I thought
signal_pending wasn't necessary after speeding it up with sort(). But
I can add it again, so then we'll only fail to abort inside sort() and
we'll be able to break the loop while taking all the spinlocks, but
with such a small array that can't be an issue and the result will
surely run faster than stop_machine with zero ram and cpu overhead for
the VM (besides stop_machine can't work or we'd need to disable
preemption between invalidate_range_start/end, even removing the xpmem
schedule-inside-mmu-notifier requirement).
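
For reference, a stripped-down sketch of the sort-based approach being defended here (simplified names, trimmed error handling - not the actual patch): collect the lock pointers, sort them with the kernel's sort(), then take them in address order while skipping duplicates, with the cond_resched() in the compare callback mentioned above.

#include <linux/mm.h>
#include <linux/sort.h>
#include <linux/vmalloc.h>
#include <linux/sched.h>

static int cmp_lock_ptr(const void *a, const void *b)
{
        unsigned long l = *(const unsigned long *)a;
        unsigned long r = *(const unsigned long *)b;

        cond_resched();         /* the compare-function reschedule point */
        return l < r ? -1 : (l > r ? 1 : 0);
}

static int take_anon_vma_locks_sorted(struct mm_struct *mm)
{
        struct vm_area_struct *vma;
        spinlock_t **locks;
        unsigned long i, nr = 0;

        locks = vmalloc(mm->map_count * sizeof(*locks));
        if (!locks)
                return -ENOMEM;

        for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (vma->anon_vma)
                        locks[nr++] = &vma->anon_vma->lock;

        sort(locks, nr, sizeof(*locks), cmp_lock_ptr, NULL);

        for (i = 0; i < nr; i++)
                if (i == 0 || locks[i] != locks[i - 1]) /* skip duplicates */
                        spin_lock(locks[i]);

        /* ... matching unlock pass and vfree(locks) on the way out ... */
        return 0;
}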

Linus Torvalds
2008-05-08 02:33:30 UTC
Andrea, I'm not interested. I've stated my standpoint: the code being
discussed is crap. We're not doing that. Not in the core VM.

I gave solutions that I think aren't crap, but I already also stated that
I have no problems not merging it _ever_ if no solution can be found. The
whole issue simply isn't even worth the pain, imnsho.

Linus

Andrea Arcangeli
2008-05-07 23:41:29 UTC
Hi Andrew,
Post by Andrew Morton
spin_lock(global_lock)
spin_lock(a->lock); spin_lock(b->lock);
================== mmu_notifier_register()
Post by Andrew Morton
spin_lock(b->lock); spin_unlock(b->lock);
spin_lock(a->lock);
spin_unlock(a->lock);
also OK.
But the problem is that we've to stop the critical section in the
place I marked with "========" while mmu_notifier_register
runs. Otherwise the driver calling mmu_notifier_register won't know if
it's safe to start establishing secondary sptes/tlbs. If the driver
establishes sptes/tlbs with get_user_pages/follow_page, the page
could be freed immediately afterwards when zap_page_range starts.

So if CPU1 doesn't take the global_lock before proceeding in
zap_page_range (inside vmtruncate i_mmap_lock that is represented as
b->lock above) we're in trouble.

What we can do is to replace the mm_lock with a
spin_lock(&global_lock) only if all places that take i_mmap_lock
take the global lock first, but that hurts scalability of the fast
paths that are performance critical like vmtruncate and
anon_vma->lock. Perhaps they're not so performance critical, but
surely much more performance critical than mmu_notifier_register ;).

The idea of polluting various scalable paths like truncate() syscall
in the VM with a global spinlock frightens me, I'd rather return to
invalidate_page() inside the PT lock removing both
invalidate_range_start/end. Then all serialization against the mmu
notifiers will be provided by the PT lock that the secondary mmu page
fault also has to take in get_user_pages (or follow_page). In any case
that is a better solution that won't slow down the VM when
MMU_NOTIFIER=y even if it's a bit slower for GRU; for KVM performance
is about the same with or without invalidate_range_start/end. I didn't
think anybody could care about how long mmu_notifier_register takes
until it returns, compared to all the heavyweight operations that happen
when starting a VM (not only in the kernel but in the guest too).

In fact, if it's security that we worry about here, we can put a cap on the
_time_ that mmu_notifier_register can take before it fails, and we
fail to start a VM if it takes more than 5sec; that's still fine as
the failure could happen for other reasons too like vmalloc shortage
and we already handle it just fine. This 5sec delay can't possibly happen in
practice anyway in the only interesting scenario, just like the
vmalloc shortage. This is obviously a superior solution to polluting
the VM with a useless global spinlock that will destroy truncate/AIM
on numa.

Anyway Christoph, I uploaded my last version here:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v16

(applies and runs fine on 26-rc1)

You're more than welcome to take over from it; I kind of feel my time
now may be better spent emulating the mmu-notifier-core with kprobes.

Linus Torvalds
2008-05-08 01:04:57 UTC
Post by Andrea Arcangeli
Hi Andrew,
Post by Andrew Morton
spin_lock(global_lock)
spin_lock(a->lock); spin_lock(b->lock);
================== mmu_notifier_register()
If mmu_notifier_register() takes the global lock, it cannot happen here.
It will be blocked (by CPU0), so there's no way it can then cause an ABBA
deadlock. It will be released when CPU0 has taken *all* the locks it
needed to take.
Post by Andrea Arcangeli
What we can do is to replace the mm_lock with a
spin_lock(&global_lock) only if all places that takes i_mmap_lock
NO!

You replace mm_lock() with the sequence that Andrew gave you (and I
described):

spin_lock(&global_lock)
.. get all locks UNORDERED ..
spin_unlock(&global_lock)

and you're now done. You have your "mm_lock()" (which still needs to be
renamed - it should be a "mmu_notifier_lock()" or something like that),
but you don't need the insane sorting. At most you apparently need a way
to recognize duplicates (so that you don't deadlock on yourself), which
looks like a simple bit-per-vma.

The global lock doesn't protect any data structures itself - it just
protects two of these mm_lock() functions from ABBA'ing on each other!

Linus

Christoph Lameter
2008-05-08 01:12:58 UTC
Post by Linus Torvalds
and you're now done. You have your "mm_lock()" (which still needs to be
renamed - it should be a "mmu_notifier_lock()" or something like that),
but you don't need the insane sorting. At most you apparently need a way
to recognize duplicates (so that you don't deadlock on yourself), which
looks like a simple bit-per-vma.
Andrea's mm_lock could have wider impact. It is the first effective
way that I have seen of temporarily holding off reclaim from an address
space. It sure is a brute force approach.

Linus Torvalds
2008-05-08 01:34:30 UTC
Post by Christoph Lameter
Post by Linus Torvalds
and you're now done. You have your "mm_lock()" (which still needs to be
renamed - it should be a "mmu_notifier_lock()" or something like that),
but you don't need the insane sorting. At most you apparently need a way
to recognize duplicates (so that you don't deadlock on yourself), which
looks like a simple bit-per-vma.
Andrea's mm_lock could have wider impact. It is the first effective
way that I have seen of temporarily holding off reclaim from an address
space. It sure is a brute force approach.
Well, I don't think the naming necessarily has to be about notifiers, but
it should be at least a *bit* more scary than "mm_lock()", to make it
clear that it's pretty dang expensive.

Even without the vmalloc and sorting, if it would be used by "normal"
things it would still be very expensive for some cases - running things
like ElectricFence, for example, will easily generate thousands and
thousands of vma's in a process.

Linus

Andrea Arcangeli
2008-05-08 02:57:38 UTC
Post by Christoph Lameter
Andrea's mm_lock could have wider impact. It is the first effective
way that I have seen of temporarily holding off reclaim from an address
space. It sure is a brute force approach.
The only improvement I can imagine on mm_lock, after changing the
name to global_mm_lock(), is to reestablish the signal_pending check in
the loop that takes the spinlocks so it can back off, and to put the cap at 512
vmas, so that the ram wasted on anon-vmas wouldn't save more than 10-100usec
at most (plus the vfree, that may be a bigger cost but we're ok to pay
it and it surely isn't security related).

Then in the long term we need to talk to Matt about returning a parameter
to the sort function to break the loop. After that we remove the 512
vma cap and mm_lock is free to run as long as it wants, like
/dev/urandom; nobody could care less how long it will run before
returning as long as it reacts to signals.

This is the right way if we want to support XPMEM/GRU efficiently and
without introducing unnecessary regressions in the VM fastpaths and VM
footprint.

Christoph Lameter
2008-05-08 03:11:12 UTC
Post by Andrea Arcangeli
to the sort function to break the loop. After that we remove the 512
vma cap and mm_lock is free to run as long as it wants like
/dev/urandom, nobody can care less how long it will run before
returning as long as it reacts to signals.
Look Linus has told you what to do. Why not simply do it?

Andrea Arcangeli
2008-05-08 03:42:08 UTC
Post by Christoph Lameter
Post by Andrea Arcangeli
to the sort function to break the loop. After that we remove the 512
vma cap and mm_lock is free to run as long as it wants like
/dev/urandom, nobody can care less how long it will run before
returning as long as it reacts to signals.
Look Linus has told you what to do. Why not simply do it?
When it looked like we could use vm_flags to remove sort, that looked
like an ok optimization; no problem with optimizations, I'm all for
them if they cost nothing to the VM fast paths and VM footprint.

But removing sort isn't worth it if it takes away ram from the VM even
when global_mm_lock will never be called.

sort is like /dev/urandom so after sort is fixed to handle signals
(and I expect Matt will help with this) we'll remove the 512 vmas cap.
In the meantime we can live with the 512 vmas cap that guarantees sort
won't take more than a few dozen usec. Removing sort() is the only
thing that the anon vma bitflag can achieve and it's clearly not worth
it and it would go in the wrong direction (fixing sort to handle
signals is clearly the right direction, if sort is a concern at all).

Adding the global lock around global_mm_lock to avoid one
global_mm_lock colliding with another global_mm_lock is sure ok
with me; if that's still wanted now that it's clear removing sort
isn't worth it, I'm neutral on this.

Christoph please go ahead and add the bitflag to anon-vma yourself if
you want. If something isn't technically right I don't do it no matter
who asks it.

Linus Torvalds
2008-05-08 04:16:39 UTC
Post by Andrea Arcangeli
But removing sort isn't worth it if it takes away ram from the VM even
when global_mm_lock will never be called.
Andrea, you really are a piece of work. Your arguments have been bogus
crap that didn't even understand what was going on from the beginning, and
now you continue to do that.

What exactly "takes away ram" from the VM?

The notion of adding a flag to "struct anon_vma"?

The one that already has a 4 byte padding thing on x86-64 just after the
spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
bytes of padding if we didn't just make the spinlock type unconditionally
32 bits rather than the 16 bits we actually _use_?

IOW, you didn't even look at it, did you?

But whatever. I clearly don't want a patch from you anyway, so ..

Linus

Andrea Arcangeli
2008-05-08 05:20:49 UTC
Post by Linus Torvalds
IOW, you didn't even look at it, did you?
Actually I looked both at the struct and at the slab alignment just in
case it was changed recently. Now after reading your mail I also
compiled it just in case.

2.6.26-rc1

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma 260 576 24 144 1 : tunables 120 60 8 : slabdata 4 4 0
(objsize 24, objperslab 144)

2.6.26-rc1 + below patch

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -27,6 +27,7 @@ struct anon_vma {
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
+ int flag:1;
};

#ifdef CONFIG_MMU

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma 250 560 32 112 1 : tunables 120 60 8 : slabdata 5 5 0
(objsize 32, objperslab 112)

Sure, it's not a big deal to grow it by 33%, it's so small anyway, but I don't
see the point in growing it. sort() can be made interruptible by signals, and
until it is we can cap it to 512 vmas, making the worst case take a few
dozen usecs; I fail to see what you have against sort().

Again: if a vma bitflag + global lock could have avoided sort and run
O(N) instead of the current O(N*log(N)) I would have done that
immediately; in fact I was in the process of doing it when you posted
the followup. Nothing personal here, just staying technical. Hope you
do too.

Pekka Enberg
2008-05-08 05:28:08 UTC
Post by Andrea Arcangeli
Actually I looked both at the struct and at the slab alignment just in
case it was changed recently. Now after reading your mail I also
compiled it just in case.
@@ -27,6 +27,7 @@ struct anon_vma {
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
+ int flag:1;
};
The one that already has a 4 byte padding thing on x86-64 just after the
spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
bytes of padding if we didn't just make the spinlock type unconditionally
32 bits rather than the 16 bits we actually _use_?
So you need to add the flag _after_ ->lock and _before_ ->head....

Pekka

Pekka Enberg
2008-05-08 05:30:40 UTC
Post by Pekka Enberg
Post by Linus Torvalds
The one that already has a 4 byte padding thing on x86-64 just after the
spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
bytes of padding if we didn't just make the spinlock type unconditionally
32 bits rather than the 16 bits we actually _use_?
So you need to add the flag _after_ ->lock and _before_ ->head....
Oh should have taken my morning coffee first, before ->lock works
obviously as well.

Andrea Arcangeli
2008-05-08 05:50:17 UTC
Post by Pekka Enberg
Post by Pekka Enberg
Post by Linus Torvalds
The one that already has a 4 byte padding thing on x86-64 just after the
spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
bytes of padding if we didn't just make the spinlock type unconditionally
32 bits rather than the 16 bits we actually _use_?
So you need to add the flag _after_ ->lock and _before_ ->head....
Oh should have taken my morning coffee first, before ->lock works
obviously as well.
Sorry, Linus is right: I didn't realize the "after the spinlock" was
literally after the spinlock; I didn't see the 4 byte padding when I
read the code and put the flag:1 in. If put between ->lock and ->head
it doesn't take more memory on x86-64, literally as described. So the
next step would be to find another place like that in the
address_space, perhaps after the private_lock using the same trick, or perhaps
the slab alignment won't actually alter the number of slabs per page
regardless.

I leave that to Christoph, he's surely better than me at doing this, I
give it up entirely and I consider my attempt to merge a total failure
and I strongly regret it.

On a side note the anon_vma will change to this when XPMEM support is
compiled in:

struct anon_vma {
- spinlock_t lock; /* Serialize access to vma list */
+ atomic_t refcount; /* vmas on the list */
+ struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas
*/
};

not sure if it'll grow in size or not after that but let's say it's
not a big deal.

Linus Torvalds
2008-05-08 15:11:01 UTC
Post by Andrea Arcangeli
Actually I looked both at the struct and at the slab alignment just in
case it was changed recently. Now after reading your mail I also
compiled it just in case.
Put the flag after the spinlock, not after the "list_head".

Also, we'd need to make it

unsigned short flag:1;

_and_ change spinlock_types.h to make the spinlock size actually match the
required size (right now we make it an "unsigned int slock" even when we
actually only use 16 bits). See the

#if (NR_CPUS < 256)

code in <asm-x86/spinlock.h>.

Linus

Linus Torvalds
2008-05-08 16:14:10 UTC
Post by Linus Torvalds
Also, we'd need to make it
unsigned short flag:1;
_and_ change spinlock_types.h to make the spinlock size actually match the
required size (right now we make it an "unsigned int slock" even when we
actually only use 16 bits).
Btw, this is an issue only on 32-bit x86, because on 64-bit one we already
have the padding due to the alignment of the 64-bit pointers in the
list_head (so there's already empty space there).

On 32-bit, the alignment of list-head is obviously just 32 bits, so right
now the structure is "perfectly packed" and doesn't have any empty space.
But that's just because the spinlock is unnecessarily big.

(Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
structure really will grow. That's a very odd configuration, though, and
not one I feel we really need to care about).

Linus
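
To make the padding argument concrete, here is the layout in question as a sketch (offsets assume x86-64, NR_CPUS < 256 and no lock debugging, so spinlock_t is 4 bytes; the 24-byte size matches the slabinfo output quoted earlier in the thread):

struct anon_vma {                       /* current layout              */
        spinlock_t lock;                /* offset 0, size 4            */
                                        /* 4-byte hole: list_head
                                         * needs 8-byte alignment      */
        struct list_head head;          /* offset 8, size 16           */
};                                      /* sizeof == 24                */

struct anon_vma_flagged {               /* hypothetical variant        */
        spinlock_t lock;                /* offset 0, size 4            */
        unsigned short flag:1;          /* fits in the existing hole   */
        struct list_head head;          /* still offset 8              */
};                                      /* sizeof still 24 on x86-64   */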

Andrea Arcangeli
2008-05-08 22:01:47 UTC
Post by Linus Torvalds
Btw, this is an issue only on 32-bit x86, because on 64-bit one we already
have the padding due to the alignment of the 64-bit pointers in the
list_head (so there's already empty space there).
On 32-bit, the alignment of list-head is obviously just 32 bits, so right
now the structure is "perfectly packed" and doesn't have any empty space.
But that's just because the spinlock is unnecessarily big.
(Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
structure really will grow. That's a very odd configuration, though, and
not one I feel we really need to care about).
I see two ways to implement it:

1) use #ifdef and make it zero overhead for 64bit only without playing
any non obvious trick.

struct anon_vma {
spinlock_t lock;
#ifdef CONFIG_MMU_NOTIFIER
int global_mm_lock:1;
#endif

struct address_space {
spinlock_t private_lock;
#ifdef CONFIG_MMU_NOTIFIER
int global_mm_lock:1;
#endif

2) add a:

#define AS_GLOBAL_MM_LOCK (__GFP_BITS_SHIFT + 2) /* global_mm_locked */

and use address_space->flags with bitops

And as Andrew pointed out to me by PM, for the anon_vma we can use the
LSB of the list.next/prev because the list can't be browsed when the
lock is taken, so taking the lock and then setting the bit and
clearing the bit before unlocking is safe. The LSB will always read 0
even if it's under list_add modification when the global spinlock isn't
taken. And after taking the anon_vma lock we can switch the LSB
from 0 to 1 without races, and the 1 will be protected by the
global spinlock.

The above solution is zero cost for 32bit too, so I prefer it.

So I now agree with you this is a great idea on how to remove sort()
and vmalloc and especially vfree without increasing the VM footprint.

I'll send an update with this for review very shortly and I hope this
goes in so KVM will be able to swap and do many other things very well
starting in 2.6.26.

Thanks a lot,
Andrea
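
A sketch of the LSB trick just described, with hypothetical helper names (not the final code): since the vma list hanging off an anon_vma is never walked without anon_vma->lock, the low bit of head.next is free to act as the "already locked by the global-lock holder" marker at zero space cost.

static inline int anon_vma_is_marked(struct anon_vma *anon_vma)
{
        /* only meaningful while the global mm lock is held */
        return (unsigned long)anon_vma->head.next & 1UL;
}

static inline void anon_vma_set_mark(struct anon_vma *anon_vma)
{
        /* called with anon_vma->lock held, under the global mm lock */
        anon_vma->head.next = (struct list_head *)
                ((unsigned long)anon_vma->head.next | 1UL);
}

static inline void anon_vma_clear_mark(struct anon_vma *anon_vma)
{
        /* cleared again before anon_vma->lock is released */
        anon_vma->head.next = (struct list_head *)
                ((unsigned long)anon_vma->head.next & ~1UL);
}

Ordinary anon_vma users always take anon_vma->lock before touching the list, so they never observe the bit set; on the address_space side an ordinary flags bit such as the AS_GLOBAL_MM_LOCK proposed above does the same job.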

Peter Zijlstra
2008-05-09 18:37:50 UTC
Post by Linus Torvalds
Post by Linus Torvalds
Also, we'd need to make it
unsigned short flag:1;
_and_ change spinlock_types.h to make the spinlock size actually match the
required size (right now we make it an "unsigned int slock" even when we
actually only use 16 bits).
Btw, this is an issue only on 32-bit x86, because on 64-bit one we already
have the padding due to the alignment of the 64-bit pointers in the
list_head (so there's already empty space there).
On 32-bit, the alignment of list-head is obviously just 32 bits, so right
now the structure is "perfectly packed" and doesn't have any empty space.
But that's just because the spinlock is unnecessarily big.
(Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
structure really will grow. That's a very odd configuration, though, and
not one I feel we really need to care about).
Another possibility, would something like this work?


/*
* null out the begin function, no new begin calls can be made
*/
rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);

/*
* lock/unlock all rmap locks in any order - this ensures that any
* pending start() will have its end() function called.
*/
mm_barrier(mm);

/*
* now that no new start() call can be made and all start()/end() pairs
* are complete we can remove the notifier.
*/
mmu_notifier_remove(mm, my_notifier);


This requires a mmu_notifier instance per attached mm and that
__mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain
the function.

But I think it's enough to ensure that:

for each start an end will be called

It can however happen that end is called without start - but we could
handle that I think.



Andrea Arcangeli
2008-05-09 18:56:29 UTC
Post by Peter Zijlstra
Another possibility, would something like this work?
/*
* null out the begin function, no new begin calls can be made
*/
rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);
/*
* lock/unlock all rmap locks in any order - this ensures that any
* pending start() will have its end() function called.
*/
mm_barrier(mm);
/*
* now that no new start() call can be made and all start()/end() pairs
* are complete we can remove the notifier.
*/
mmu_notifier_remove(mm, my_notifier);
This requires a mmu_notifier instance per attached mm and that
__mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain
the function.
for each start an end will be called
We don't need that, it's perfectly ok if start is called but end is
not, it's ok to unregister in the middle as I guarantee ->release is
called before mmu_notifier_unregister returns (if ->release is needed
at all, not the case for KVM/GRU).

Unregister is already solved with srcu/rcu without any additional
complication as we don't need the guarantee that for each start an end
will be called.
Post by Peter Zijlstra
It can however happen that end is called without start - but we could
handle that I think.
The only reason mm_lock() was introduced is to solve "register", to
guarantee that for each end there was a start. We can't handle end
called without start in the driver.

The reason the driver must be prevented from registering in the middle of
start/end is that if it ever happens the driver has no way to know it
must stop the secondary mmu page faults from calling get_user_pages and
instantiating sptes/secondarytlbs on pages that will be freed as soon as
zap_page_range starts.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2008-05-09 19:06:14 UTC
Permalink
Post by Andrea Arcangeli
Post by Peter Zijlstra
Another possibility, would something like this work?
/*
* null out the begin function, no new begin calls can be made
*/
rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);
/*
* lock/unlock all rmap locks in any order - this ensures that any
* pending start() will have its end() function called.
*/
mm_barrier(mm);
/*
* now that no new start() call can be made and all start()/end() pairs
* are complete we can remove the notifier.
*/
mmu_notifier_remove(mm, my_notifier);
This requires a mmu_notifier instance per attached mm and that
__mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain
the function.
for each start an end will be called
We don't need that, it's perfectly ok if start is called but end is
not, it's ok to unregister in the middle as I guarantee ->release is
called before mmu_notifier_unregister returns (if ->release is needed
at all, not the case for KVM/GRU).
Unregister is already solved with srcu/rcu without any additional
complication as we don't need the guarantee that for each start an end
will be called.
Post by Peter Zijlstra
It can however happen that end is called without start - but we could
handle that I think.
The only reason mm_lock() was introduced is to solve "register", to
guarantee that for each end there was a start. We can't handle end
called without start in the driver.
The reason the driver must be prevented from registering in the middle of
start/end is that, if it ever happens, the driver has no way to know it
must stop the secondary mmu page faults from calling get_user_pages and
instantiating sptes/secondary tlbs on pages that will be freed as soon as
zap_page_range starts.
Right - then I got it backwards. Never mind me then..

Andrea Arcangeli
2008-05-08 01:27:37 UTC
Permalink
Post by Linus Torvalds
You replace mm_lock() with the sequence that Andrew gave you (and I
spin_lock(&global_lock)
.. get all locks UNORDERED ..
spin_unlock(&global_lock)
and you're now done. You have your "mm_lock()" (which still needs to be
renamed - it should be a "mmu_notifier_lock()" or something like that),
but you don't need the insane sorting. At most you apparently need a way
to recognize duplicates (so that you don't deadlock on yourself), which
looks like a simple bit-per-vma.
The global lock doesn't protect any data structures itself - it just
protects two of these mm_lock() functions from ABBA'ing on each other!
I thought the thing to remove was the "get all locks". I didn't
realize the major problem was only the sorting of the array.

I'll add the global lock, it's worth it as it drops the worst-case
number of steps by a factor of log(65536). Furthermore, two concurrent
mm_notifier_lock calls will surely run faster as they'll decrease the
cacheline collisions. Since you ask to call it mmu_notifier_lock I'll
also move it to mmu_notifier.[ch] as a consequence.
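
Concretely, the plan amounts to something like this sketch, where
mm_take_all_rmap_locks()/mm_drop_all_rmap_locks() are hypothetical
helpers standing in for the existing unordered lock-all code (with
duplicate locks skipped):

/*
 * Sketch, not the final code: the global spinlock only serializes the
 * acquisition phase, so two concurrent registrations can take the
 * i_mmap/anon_vma locks in any order without ABBA'ing on each other.
 */
static DEFINE_SPINLOCK(mmu_notifier_global_lock);

void mmu_notifier_lock(struct mm_struct *mm)
{
        spin_lock(&mmu_notifier_global_lock);
        mm_take_all_rmap_locks(mm);     /* unordered, duplicates skipped */
        spin_unlock(&mmu_notifier_global_lock);
}

void mmu_notifier_unlock(struct mm_struct *mm)
{
        mm_drop_all_rmap_locks(mm);     /* any order is fine here */
}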
Benjamin Herrenschmidt
2008-05-07 23:31:21 UTC
Permalink
Post by Andrea Arcangeli
Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.
You said yourself that mmu_notifier_register can be as slow as you
want ... what about using stop_machine for it? I'm not even joking
here :-)
Post by Andrea Arcangeli
vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
than one lock and we've to still take the global-system-wide lock
_before_ this single i_mmap_lock and no other lock at all.
Ben.


Andrea Arcangeli
2008-05-07 23:45:47 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Andrea Arcangeli
Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.
You said yourself that mmu_notifier_register can be as slow as you
want ... what about you use stop_machine for it ? I'm not even joking
here :-)
We can put a cap on the time + a cap on the number of vmas. It's not
important if it fails; the only useful case is the one we know about,
and it won't be slow at all. The failure can happen because the time
cap or the vma cap triggers, or because there's a vmalloc shortage. We
handle the failure in userland of course. There are zillions of
allocations needed anyway, any one of them can fail, so this isn't a
new fail path, it's the same fail path that always existed before
mmu_notifiers existed.

I can't possibly see how adding a new system-wide lock that forces all
truncates to be serialized against each other, practically eliminating
the need for the i_mmap_lock, could be superior to an approach that
doesn't add any overhead to the VM at all, and only requires kvm to pay
an additional cost at startup.

Furthermore, the only reason I had to implement mm_lock was to fix the
invalidate_range_start/end model; if we go with only invalidate_page
and invalidate_pages called inside the PT lock, and we use the PT lock
to serialize, we don't need mm_lock anymore and no new lock from the
VM either. I tried to push for that, but everyone else wanted
invalidate_range_start/end. I did the only possible thing: make
invalidate_range_start safe, to keep everyone happy without slowing
down the VM.
Andrea Arcangeli
2008-05-08 01:35:29 UTC
Permalink
Sorry for not having completely answered this. I initially thought
stop_machine could work when you mentioned it, but I don't think it
can, even after removing xpmem's block-inside-mmu-notifier-method requirements.

For stop_machine to solve this (besides being slower and potentially
not more safe as running stop_machine in a loop isn't nice), we'd need
to prevent preemption in between invalidate_range_start/end.

I think there are two ways:

1) add global lock around mm_lock to remove the sorting

2) remove invalidate_range_start/end, nuke mm_lock as a consequence,
and replace all three with invalidate_pages issued inside the
PT lock, one invalidation for every 512 pte_t modified, so
serialization against get_user_pages becomes trivial; but this will
not be ok at all for SGI as it greatly increases their invalidation
frequency

For KVM both ways are almost the same.

I'll implement 1 now then we'll see...
Nick Piggin
2008-05-13 12:14:50 UTC
Permalink
Post by Andrea Arcangeli
Sorry for not having completely answered this. I initially thought
stop_machine could work when you mentioned it, but I don't think it
can, even after removing xpmem's block-inside-mmu-notifier-method requirements.
For stop_machine to solve this (besides being slower and potentially
not more safe as running stop_machine in a loop isn't nice), we'd need
to prevent preemption in between invalidate_range_start/end.
1) add global lock around mm_lock to remove the sorting
2) remove invalidate_range_start/end, nuke mm_lock as a consequence,
and replace all three with invalidate_pages issued inside the
PT lock, one invalidation for every 512 pte_t modified, so
serialization against get_user_pages becomes trivial; but this will
not be ok at all for SGI as it greatly increases their invalidation
frequency
This is what I suggested to begin with before this crazy locking was
developed to handle these corner cases... because I wanted the locking
to match with the tried and tested Linux core mm/ locking rather than
introducing this new idea.

I don't see why you're bending over so far backwards to accommodate
this GRU thing that we don't even have numbers for and could actually
potentially be batched up in other ways (eg. using mmu_gather or
mmu_gather-like idea).

The bare essential, matches-with-Linux-mm mmu notifiers that I first
saw of yours was pretty elegant and nice. The idea that "only one
solution must go in and handle everything perfectly" is stupid because
it is quite obvious that the sleeping invalidate idea is just an order
of magnitude or two more complex than the simple atomic invalidates
needed by you. We should and could easily have had that code upstream
long ago :(

I'm not saying we ignore the sleeping or batching cases, but we should
introduce the ideas slowly and carefully and assess the pros and cons
of each step along the way.
Post by Andrea Arcangeli
For KVM both ways are almost the same.
I'll implement 1 now then we'll see...
Benjamin Herrenschmidt
2008-05-14 05:47:44 UTC
Permalink
Post by Nick Piggin
I don't see why you're bending over so far backwards to accommodate
this GRU thing that we don't even have numbers for and could actually
potentially be batched up in other ways (eg. using mmu_gather or
mmu_gather-like idea).
I agree, we're better off generalizing the mmu_gather batching
instead...

I had some never-finished patches to use the mmu_gather for pretty much
everything except single page faults, tho various subtle differences
between archs and lack of time caused me to let them take the dust and
not finish them...

I can try to dig some of that out when I'm back from my current travel,
though it's probably worth re-doing from scratch now.

Ben.


Nick Piggin
2008-05-14 06:06:31 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Nick Piggin
I don't see why you're bending over so far backwards to accommodate
this GRU thing that we don't even have numbers for and could actually
potentially be batched up in other ways (eg. using mmu_gather or
mmu_gather-like idea).
I agree, we're better off generalizing the mmu_gather batching
instead...
Well, the first thing would be just to get rid of the whole start/end
idea, which completely departs from the standard Linux system of
clearing ptes, then flushing TLBs, then freeing memory.

The onus would then be on GRU to come up with some numbers to justify
batching, and a patch which works nicely with the rest of the Linux
mm. And yes, mmu-gather is *the* obvious first choice of places to
look if one wanted batching hooks.
Post by Benjamin Herrenschmidt
I had some never-finished patches to use the mmu_gather for pretty much
everything except single page faults, tho various subtle differences
between archs and lack of time caused me to let them take the dust and
not finish them...
I can try to dig some of that out when I'm back from my current travel,
though it's probably worth re-doing from scratch now.
I always liked the idea as you know. But I don't think that should
be mixed in with the first iteration of the mmu notifiers patch
anyway. GRU actually can work without batching, but there is simply
some (unquantified to me) penalty for not batching it. I think it
is far better to put in a clean and simple and working functionality
first. The idea that we have to unload some monster be-all-and-end-all
solution onto mainline in a single go seems counterproductive to me.
Jack Steiner
2008-05-14 13:16:25 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Nick Piggin
I don't see why you're bending over so far backwards to accommodate
this GRU thing that we don't even have numbers for and could actually
potentially be batched up in other ways (eg. using mmu_gather or
mmu_gather-like idea).
I agree, we're better off generalizing the mmu_gather batching
instead...
Unfortunately, we are at least several months away from being able to
provide numbers to justify batching - assuming it is really needed. We need
large systems running real user workloads. I wish we had that available
right now, but we don't.

It also depends on what you mean by "no batching". If you mean that the
notifier gets called for each pte that is removed from the page table, then
the overhead is clearly very high for some operations. Consider the unmap of
a very large object. A TLB flush per page will be too costly.

However, something based on the mmu_gather seems like it should provide
exactly what is needed to do efficient flushing of the TLB. The GRU does not
require that it be called in a sleepable context. As long as the notifier
callout provides the mmu_gather and vaddr range being flushed, the GRU can
efficiently do the rest.
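
A rough sketch of the kind of callout described above; the hook point,
the list and the op name are made up, not proposed code:

/*
 * Sketch only: one batched callout per flushed range, driven from the
 * mmu_gather teardown, e.g. right next to tlb_finish_mmu(tlb, start, end).
 */
static inline void mmu_notifier_flush_range(struct mmu_gather *tlb,
                                            unsigned long start,
                                            unsigned long end)
{
        struct mmu_notifier *mn;

        rcu_read_lock();
        list_for_each_entry_rcu(mn, &tlb->mm->mmu_notifier_list, list)
                if (mn->ops->flush_range)       /* hypothetical op */
                        mn->ops->flush_range(mn, tlb->mm, start, end);
        rcu_read_unlock();
}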
Post by Benjamin Herrenschmidt
I had some never-finished patches to use the mmu_gather for pretty much
everything except single page faults, tho various subtle differences
between archs and lack of time caused me to let them take the dust and
not finish them...
I can try to dig some of that out when I'm back from my current travel,
though it's probably worth re-doing from scratch now.
Ben.
-- jack
Linus Torvalds
2008-05-07 22:46:42 UTC
Permalink
Post by Andrea Arcangeli
static spinlock_t global_lock = ...
There's no way to make it more granular.
Right. So what?

It's still about a million times faster than what the code does now.

You comment about "great smp scalability optimization" just shows that
you're a moron. It is no such thing. The fact is, it's a horrible
pessimization, since even SMP will be *SLOWER*. It will just be "less
slower" when you have a million CPU's and they all try to do this at the
same time (which probably never ever happens).

In other words, "scalability" is totally meaningless. The only thing that
matters is *performance*. If the "scalable" version performs WORSE, then
it is simply worse. Not better. End of story.
Post by Andrea Arcangeli
mmu_notifier_register can take ages. No problem.
So what you're saying is that performance doesn't matter?

So why do you do the ugly crazy hundred-line implementation, when a simple
two-liner would do equally well?

Your arguments are crap.

Anyway, discussion over. This code doesn't get merged. It doesn't get
merged before 2.6.26, and it doesn't get merged _after_ either.

Rewrite the code, or not. I don't care. I'll very happily not merge crap
for the rest of my life.

Linus
Andrea Arcangeli
2008-05-07 22:58:32 UTC
Permalink
Post by Linus Torvalds
Post by Andrea Arcangeli
static spinlock_t global_lock = ...
There's no way to make it more granular.
Right. So what?
It's still about a million times faster than what the code does now.
mmu_notifier_register only runs when windows or linux or macosx
boots. Who could ever care about the msec spent in mm_lock compared to
the time it takes linux to boot?

What you're proposing is to slow down AIM and certain benchmarks 20% or
more for all users, just so you save at most 1msec when starting a VM.
Post by Linus Torvalds
Rewrite the code, or not. I don't care. I'll very happily not merge crap
for the rest of my life.
If you want the global lock I'll do it no problem, I just think it's
an obviously inferior solution for 99% of users out there (including kvm
users that will also have to take that lock while kvm userland runs).

In my view the most we should do in this area is to further reduce the
max number of locks to take if max_map_count isn't already enough.
Andrea Arcangeli
2008-05-07 23:03:32 UTC
Permalink
To remove mm_lock without adding a horrible system-wide lock before
every i_mmap_lock etc., we have to remove
invalidate_range_begin/end. Then we can return to an older approach of
doing only invalidate_page and serializing it with the PT lock against
get_user_pages. That works fine for KVM, but GRU will have to flush the
tlb once every time we drop the PT lock, that means once per 512
ptes on x86-64 etc., instead of a single time for the whole range
regardless of how large the range is.
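
Roughly, that alternative would look like this sketch, with a
hypothetical invalidate_pages() callout; this is not a posted patch:

/*
 * Sketch of the "invalidate under the PT lock" alternative: the callout
 * fires once per PT-lock section (at most once per 512 ptes on x86-64),
 * and the PT lock itself serializes it against get_user_pages
 * re-instantiating sptes/secondary tlbs.
 */
static unsigned long zap_and_notify(struct mm_struct *mm, pmd_t *pmd,
                                    unsigned long addr, unsigned long end)
{
        unsigned long start = addr;
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        do {
                if (!pte_none(*pte))
                        ptep_get_and_clear(mm, addr, pte);
        } while (pte++, addr += PAGE_SIZE, addr != end);
        mmu_notifier_invalidate_pages(mm, start, end);  /* hypothetical */
        pte_unmap_unlock(pte - 1, ptl);
        return addr;
}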
Linus Torvalds
2008-05-07 23:11:51 UTC
Permalink
Post by Andrea Arcangeli
mmu_notifier_register only runs when windows or linux or macosx
boots. Who could ever care about the msec spent in mm_lock compared to
the time it takes linux to boot?
Andrea, you're *this* close to going to my list of people who it is not
worth reading email from, and where it's better for everybody involved if
I just teach my spam-filter about it.

That code was CRAP.

That code was crap whether it's used once, or whether it's used a million
times. Stop making excuses for it just because it's not performance-
critical.

So give it up already. I told you what the non-crap solution was. It's
simpler, faster, and is about two lines of code compared to the crappy
version (which was what - 200 lines of crap with a big comment on top of
it just to explain the idiocy?).

So until you can understand the better solution, don't even bother
emailing me, ok? Because the next email I get from you that shows the
intelligence level of a gnat, I'll just give up and put you in a
spam-filter.

Because my IQ goes down just from reading your mails. I can't afford to
continue.

Linus
Robin Holt
2008-05-08 00:38:57 UTC
Permalink
Post by Linus Torvalds
Post by Andrea Arcangeli
I think the spinlock->rwsem conversion is ok under config option, as
you can see I complained myself to various of those patches and I'll
take care they're in a mergeable state the moment I submit them. What
XPMEM requires are different semantics for the methods, and we never
had to do any blocking I/O during vmtruncate before, now we have to.
I really suspect we don't really have to, and that it would be better to
just fix the code that does that.
That fix is going to be fairly difficult. I will argue impossible.

First, a little background. SGI allows one large numa-link connected
machine to be broken into separate single-system images which we call
partitions.

XPMEM allows, at its most extreme, one process on one partition to
grant access to a portion of its virtual address range to processes on
another partition. Those processes can then fault pages and directly
share the memory.

In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side. The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range(). Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier invalidate_*
callouts().

On a side note, we currently have XPMEM working on x86_64 SSI, and
ia64 cross-partition. We are in the process of getting XPMEM working
on x86_64 cross-partition in support of UV.

Thanks,
Robin Holt
Linus Torvalds
2008-05-08 00:57:45 UTC
Permalink
Post by Robin Holt
In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side. The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range(). Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier invalidate_*
callouts().
You simply will *have* to do it without locally holding all the MM
spinlocks. Because quite frankly, slowing down all the normal VM stuff for
some really esoteric hardware simply isn't acceptable. We just don't do
it.

So what is it that actually triggers one of these events?

The most obvious solution is to just queue the affected pages while
holding the spinlocks (perhaps locking them locally), and then handling
all the stuff that can block after releasing things. That's how we
normally do these things, and it works beautifully, without making
everything slower.

Sometimes we go to extremes, and actually break the locks and restart
(ugh), and it gets ugly, but even that tends to be preferable to using the
wrong locking.

The thing is, spinlocks really kick ass. Yes, they restrict what you can
do within them, but if 99.99% of all work is non-blocking, then the really
odd rare blocking case is the one that needs to accommodate, not the rest.
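
As an illustration of that split, with all names made up:

/*
 * Sketch: record under the spinlocks, do the blocking work after they
 * are dropped.  If the batch fills up, the caller drains it first
 * (not shown).
 */
#define XPMEM_BATCH 64

struct xpmem_batch {
        int nr;
        struct page *pages[XPMEM_BATCH];
};

static void xpmem_queue_page(struct xpmem_batch *b, struct page *page)
{
        /* spinlocks held: no sleeping, just remember the page */
        if (b->nr < XPMEM_BATCH)
                b->pages[b->nr++] = page;
}

static void xpmem_process_batch(struct xpmem_batch *b)
{
        int i;

        /* spinlocks dropped: the cross-partition messaging may sleep */
        for (i = 0; i < b->nr; i++)
                xpmem_invalidate_remote(b->pages[i]);   /* hypothetical, sleeps */
        b->nr = 0;
}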

Linus
Nick Piggin
2008-05-13 12:13:52 UTC
Permalink
Post by Robin Holt
Post by Linus Torvalds
Post by Andrea Arcangeli
I think the spinlock->rwsem conversion is ok under config option, as
you can see I complained myself to various of those patches and I'll
take care they're in a mergeable state the moment I submit them. What
XPMEM requires are different semantics for the methods, and we never
had to do any blocking I/O during vmtruncate before, now we have to.
I really suspect we don't really have to, and that it would be better to
just fix the code that does that.
That fix is going to be fairly difficult. I will argue impossible.
First, a little background. SGI allows one large numa-link connected
machine to be broken into separate single-system images which we call
partitions.
XPMEM allows, at its most extreme, one process on one partition to
grant access to a portion of its virtual address range to processes on
another partition. Those processes can then fault pages and directly
share the memory.
In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side. The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range(). Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier invalidate_*
callouts().
Why do you need to take mmap_sem in order to shoot down pagetables of
the process? It would be nice if this can just be done without
sleeping.
Robin Holt
2008-05-13 15:32:57 UTC
Permalink
Post by Nick Piggin
Post by Robin Holt
In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side. The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range(). Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier invalidate_*
callouts().
Why do you need to take mmap_sem in order to shoot down pagetables of
the process? It would be nice if this can just be done without
sleeping.
We are trying to shoot down page tables of a different process running
on a different instance of Linux running on Numa-link connected portions
of the same machine.

The messaging is clearly going to require sleeping. Are you suggesting
we need to rework XPC communications to not require sleeping? I think
that is going to be impossible since the transfer engine requires a
sleeping context.

Additionally, the call to zap_page_range expects to have the mmap_sem
held. I suppose we could use something other than zap_page_range and
atomically clear the process page tables. Doing that will not alleviate
the need to sleep for the messaging to the other partitions.

Thanks,
Robin
Nick Piggin
2008-05-14 04:11:53 UTC
Permalink
Post by Robin Holt
Post by Nick Piggin
Post by Robin Holt
In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side. The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range(). Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier invalidate_*
callouts().
Why do you need to take mmap_sem in order to shoot down pagetables of
the process? It would be nice if this can just be done without
sleeping.
We are trying to shoot down page tables of a different process running
on a different instance of Linux running on Numa-link connected portions
of the same machine.
Right. You can zap page tables without sleeping, if you're careful. I
don't know that we quite do that for anonymous pages at the moment, but it
should be possible with a bit of thought, I believe.
Post by Robin Holt
The messaging is clearly going to require sleeping. Are you suggesting
we need to rework XPC communications to not require sleeping? I think
that is going to be impossible since the transfer engine requires a
sleeping context.
I guess that you have found a way to perform TLB flushing within coherent
domains over the numalink interconnect without sleeping. I'm sure it would
be possible to send similar messages between non coherent domains.

So yes, I'd much rather rework such highly specialized system to fit in
closer with Linux than rework Linux to fit with these machines (and
apparently slow everyone else down).
Post by Robin Holt
Additionally, the call to zap_page_range expects to have the mmap_sem
held. I suppose we could use something other than zap_page_range and
atomically clear the process page tables.
zap_page_range does not expect to have mmap_sem held. I think for anon
pages it is always called with mmap_sem, however try_to_unmap_anon is
not (although it expects page lock to be held, I think we should be able
to avoid that).
Post by Robin Holt
Doing that will not alleviate
the need to sleep for the messaging to the other partitions.
No, but I'd venture to guess that is not impossible to implement even
on your current hardware (maybe a firmware update is needed)?

Robin Holt
2008-05-14 11:26:44 UTC
Permalink
Post by Nick Piggin
Post by Robin Holt
Post by Nick Piggin
Post by Robin Holt
In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side. The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range(). Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier invalidate_*
callouts().
Why do you need to take mmap_sem in order to shoot down pagetables of
the process? It would be nice if this can just be done without
sleeping.
We are trying to shoot down page tables of a different process running
on a different instance of Linux running on Numa-link connected portions
of the same machine.
Right. You can zap page tables without sleeping, if you're careful. I
don't know that we quite do that for anonymous pages at the moment, but it
should be possible with a bit of thought, I believe.
Post by Robin Holt
The messaging is clearly going to require sleeping. Are you suggesting
we need to rework XPC communications to not require sleeping? I think
that is going to be impossible since the transfer engine requires a
sleeping context.
I guess that you have found a way to perform TLB flushing within coherent
domains over the numalink interconnect without sleeping. I'm sure it would
be possible to send similar messages between non coherent domains.
I assume by coherent domains, you are actually talking about system
images. Our memory coherence domain on the 3700 family is 512 processors
on 128 nodes. On the 4700 family, it is 16,384 processors on 4096 nodes.
We extend a "Read-Exclusive" mode beyond the coherence domain so any
processor is able to read any cacheline on the system. We also provide
uncached access for certain types of memory beyond the coherence domain.

For the other partitions, the exporting partition does not know at what
virtual address the imported pages are mapped. The pages are frequently
mapped in a different order by the MPI library to help with MPI collective
operations.

For the exporting side to do those TLB flushes, we would need to replicate
all that importing information back to the exporting side.

Additionally, the hardware that does the TLB flushing is protected
by a spinlock on each system image. We would need to change that
simple spinlock into a type of hardware lock that would work (on 3700)
outside the processors' coherence domain. The only way to do that is to
use uncached addresses with our Atomic Memory Operations which do the
cmpxchg at the memory controller. The uncached accesses are an order
of magnitude or more slower.
Post by Nick Piggin
So yes, I'd much rather rework such highly specialized system to fit in
closer with Linux than rework Linux to fit with these machines (and
apparently slow everyone else down).
But it isn't that we are having a problem adapting to just the hardware.
One of the limiting factors is Linux on the other partition.
Post by Nick Piggin
Post by Robin Holt
Additionally, the call to zap_page_range expects to have the mmap_sem
held. I suppose we could use something other than zap_page_range and
atomically clear the process page tables.
zap_page_range does not expect to have mmap_sem held. I think for anon
pages it is always called with mmap_sem, however try_to_unmap_anon is
not (although it expects page lock to be held, I think we should be able
to avoid that).
zap_page_range calls unmap_vmas which walks to vma->next. Are you saying
that can be walked without grabbing the mmap_sem at least for read?
I feel my understanding of list management and locking completely
shifting.
Post by Nick Piggin
Post by Robin Holt
Doing that will not alleviate
the need to sleep for the messaging to the other partitions.
No, but I'd venture to guess that is not impossible to implement even
on your current hardware (maybe a firmware update is needed)?
Are you suggesting the sending side would not need to sleep or the
receiving side? Assuming you meant the sender, it spins waiting for the
remote side to acknowledge the invalidate request? We place the data
into a previously agreed upon buffer and send an interrupt. At this
point, we would need to start spinning and waiting for completion.
Let's assume we never run out of buffer space.

The receiving side receives an interrupt. The interrupt currently wakes
an XPC thread to do the work of transferring and delivering the message
to XPMEM. The data transfer which XPC does uses the BTE engine, which
takes up to 28 seconds to time out (hardware timeout before raising an
error), and the BTE code automatically retries certain types of
failure. We currently need to grab semaphores which _MAY_
be able to be reworked into other types of locks.


Thanks,
Robin
Linus Torvalds
2008-05-14 15:20:01 UTC
Permalink
Post by Robin Holt
Are you suggesting the sending side would not need to sleep or the
receiving side?
One thing to realize is that most of the time (read: pretty much *always*)
when we have the problem of wanting to sleep inside a spinlock, the
solution is actually to just move the sleeping to outside the lock, and
then have something else that serializes things.

That way, the core code (protected by the spinlock, and in all the hot
paths) doesn't sleep, but the special case code (that wants to sleep) can
have some other model of serialization that allows sleeping, and that
includes as a small part the spinlocked region.

I do not know how XPMEM actually works, or how you use it, but it
seriously sounds like that is how things *should* work. And yes, that
probably means that the mmu-notifiers as they are now are simply not
workable: they'd need to be moved up so that they are inside the mmap
semaphore but not the spinlocks.

Can it be done? I don't know. But I do know that I'm unlikely to accept a
noticeable slowdown in some very core code for a case that affects about
0.00001% of the population. In other words, I think you *have* to do it.

Linus
Robin Holt
2008-05-14 16:22:45 UTC
Permalink
Post by Linus Torvalds
Post by Robin Holt
Are you suggesting the sending side would not need to sleep or the
receiving side?
One thing to realize is that most of the time (read: pretty much *always*)
when we have the problem of wanting to sleep inside a spinlock, the
solution is actually to just move the sleeping to outside the lock, and
then have something else that serializes things.
That way, the core code (protected by the spinlock, and in all the hot
paths) doesn't sleep, but the special case code (that wants to sleep) can
have some other model of serialization that allows sleeping, and that
includes as a small part the spinlocked region.
I do not know how XPMEM actually works, or how you use it, but it
seriously sounds like that is how things *should* work. And yes, that
probably means that the mmu-notifiers as they are now are simply not
workable: they'd need to be moved up so that they are inside the mmap
semaphore but not the spinlocks.
We are in the process of attempting this now. Unfortunately for SGI,
Christoph is on vacation right now so we have been trying to work it
internally.

We are looking at two possible methods: in one, we add a callout to the
tlb flush paths for both the mmu_gather and flush_tlb_page locations;
in the other, we place a specific callout separate from the gather callouts
in the paths we are concerned with. We will look at both more carefully
before posting.


In either implementation, not all call paths would require the stall
to ensure data integrity. Would it be acceptable to always put a
sleepable stall in even if the code path did not require the pages be
unwritable prior to continuing? If we did that, I would be freed from
having a pool of invalidate threads ready for XPMEM to use for that work.
Maybe there is a better way, but the sleeping requirement we would have
on the threads makes most options seem unworkable.


Thanks,
Robin
Linus Torvalds
2008-05-14 16:58:33 UTC
Permalink
Would it be acceptable to always put a sleepable stall in even if the
code path did not require the pages be unwritable prior to continuing?
If we did that, I would be freed from having a pool of invalidate
threads ready for XPMEM to use for that work. Maybe there is a better
way, but the sleeping requirement we would have on the threads make most
options seem unworkable.
I'm not understanding the question. If you can do your management outside
of the spinlocks, then you can obviously do whatever you want, including
sleeping. It's changing the existing spinlocks to be sleepable that is not
acceptable, because it's such a performance problem.

Linus
Christoph Lameter
2008-05-14 17:57:31 UTC
Permalink
Post by Linus Torvalds
One thing to realize is that most of the time (read: pretty much *always*)
when we have the problem of wanting to sleep inside a spinlock, the
solution is actually to just move the sleeping to outside the lock, and
then have something else that serializes things.
The problem is that the code in rmap.c try_to_unmap() and friends loops
over reverse maps after taking a spinlock. The mm_struct is only known
after the rmap has been accessed. This means *inside* the spinlock.

That is why I tried to convert the locks used to scan the reverse maps to
semaphores. If that is done then one can indeed do the callouts outside of
atomic contexts.
Post by Linus Torvalds
Can it be done? I don't know. But I do know that I'm unlikely to accept a
noticeable slowdown in some very core code for a case that affects about
0.00001% of the population. In other words, I think you *have* to do it.
With larger numbers of processors, semaphores make a lot of sense since the
holdoff times on spinlocks will increase. If we go to sleep then the
processor can do something useful instead of hogging a cacheline.

An rw lock there can also increase concurrency during reclaim, especially if
the anon_vma chains and the number of address spaces mapping a page are
high.

Linus Torvalds
2008-05-14 18:34:35 UTC
Permalink
Post by Christoph Lameter
The problem is that the code in rmap.c try_to_umap() and friends loops
over reverse maps after taking a spinlock. The mm_struct is only known
after the rmap has been acccessed. This means *inside* the spinlock.
So you queue them. That's what we do with things like the dirty bit. We
need to hold various spinlocks to look up pages, but then we can't
actually call the filesystem with the spinlock held.

Converting a spinlock to a waiting lock for things like that is simply not
acceptable. You have to work with the system.

Yeah, there's only a single bit worth of information on whether a page is
dirty or not, so "queueing" that information is trivial (it's just the
return value from "page_mkclean_file()"). Some things are harder than
others, and I suspect you need some kind of "gather" structure to queue up
all the vma's that can be affected.

But it sounds like for the case of rmap, the approach would be:

- the page lock is the higher-level "sleeping lock" (which makes sense,
since this is very close to an IO event, and that is what the page lock
is generally used for)

But hey, it could be anything else - maybe you have some other even
bigger lock to allow you to handle lots of pages in one go.

- with that lock held, you do the whole rmap dance (which requires
spinlocks) and gather up the vma's and the struct mm's involved.

- outside the spinlocks you then do whatever it is you need to do.

This doesn't sound all that different from TLB shoot-down in SMP, and the
"mmu_gather" structure. Now, admittedly we can do the TLB shoot-down while
holding the spinlocks, but if we couldn't that's how we'd still do it:
it would get more involved (because we'd need to guarantee that the gather
can hold *all* the pages - right now we can just flush in the middle if we
need to), but it wouldn't be all that fundamentally different.

And no, I really haven't even wanted to look at what XPMEM really needs to
do, so maybe the above thing doesn't work for you, and you have other
issues. I'm just pointing you in a general direction, not trying to say
"this is exactly how to get there".

Linus
Christoph Lameter
2008-05-17 01:38:54 UTC
Permalink
Implementation of what Linus suggested: Defer the XPMEM processing until
after the locks are dropped. Allow immediate action by GRU/KVM.


This patch implements callbacks for device drivers that establish external
references to pages aside from the Linux rmaps. Those drivers either:

1. Do not take a refcount on pages that are mapped from devices. They
have TLB-cache-like handling and must be able to flush external references
from atomic contexts. These devices do not need to provide the _sync methods.

2. Do take a refcount on pages mapped externally. These are handled by
marking pages as to-be-invalidated in atomic contexts. Invalidation
may be started by the driver. A _sync variant for the individual or
range unmap is called when we are back in a nonatomic context. At that
point the device must complete the removal of external references
and drop its refcount.

With the mm notifier it is possible for the device driver to release external
references after the page references are removed from a process that made
them available.

With the notifier it becomes possible to get pages unpinned on request and thus
avoid issues that come with having a large number of pinned pages.

A device driver must subscribe to a process using

mm_notifier_register(struct mm_notifier *, struct mm_struct *)

The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space.

When the process terminates then first the ->release method is called to
remove all pages still mapped to the process.

Before the mm_struct is freed the ->destroy() method is called which
should dispose of the mm_notifier structure.

The following callbacks exist:

invalidate_range(notifier, mm_struct *, from , to)

Invalidate a range of addresses. The invalidation is
not required to complete immediately.

invalidate_range_sync(notifier, mm_struct *, from, to)

This is called after some invalidate_range callouts.
The driver may only return when the invalidation of the references
is completed. Callback is only called from non atomic contexts.
There is no need to provide this callback if the driver can remove
references in an atomic context.

invalidate_page(notifier, mm_struct *, struct page *page, unsigned long address)

Invalidate references to a particular page. The driver may
defer the invalidation.

invalidate_page_sync(notifier, mm_struct *, struct page *page)

Called after one or more invalidate_page() callbacks. The callback
must only return when the external references have been removed.
The callback does not need to be provided if the driver can remove
references in atomic contexts.

[NOTE] The invalidate_page_sync() callback is weird because it is called for
every notifier that supports the invalidate_page_sync() callback
if a page has PageNotifier() set. The driver must determine in an efficient
way that the page is not of interest. This is because we do not have the
mm context after we have dropped the rmap list lock.
Drivers incrementing the refcount must set and clear PageNotifier
appropriately when establishing and/or dropping a refcount!
[These conditions are similar to the rmap notifier that was introduced
in my V7 of the mmu_notifier].

There is no support for an aging callback. A device driver may simply set the
reference bit on the linux pte when the external mapping is referenced if such
support is desired.
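
For orientation, a made-up usage sketch from the driver side; the calls
and ops match the declarations in the patch below, and everything named
"mydev" is invented:

/*
 * Example driver hook-up.  invalidate_range/invalidate_page must be
 * provided (the dispatchers call them unconditionally); the _sync
 * variants are only needed by drivers that take a page refcount.
 */
static void mydev_invalidate_range(struct mm_notifier *mn,
                struct mm_struct *mm, unsigned long start, unsigned long end)
{
        /* atomic context: flush or queue the external TLB invalidation */
}

static void mydev_invalidate_page(struct mm_notifier *mn,
                struct mm_struct *mm, struct page *page, unsigned long addr)
{
        /* atomic context: drop the single external mapping */
}

static void mydev_release(struct mm_notifier *mn, struct mm_struct *mm)
{
        /* process is exiting: tear down every external mapping */
}

static void mydev_destroy(struct mm_notifier *mn, struct mm_struct *mm)
{
        kfree(mn);              /* mm_struct is about to be freed */
}

static struct mm_notifier_ops mydev_ops = {
        .invalidate_range       = mydev_invalidate_range,
        .invalidate_page        = mydev_invalidate_page,
        .release                = mydev_release,
        .destroy                = mydev_destroy,
};

static int mydev_attach(struct mm_struct *mm)
{
        struct mm_notifier *mn = kmalloc(sizeof(*mn), GFP_KERNEL);

        if (!mn)
                return -ENOMEM;
        mn->ops = &mydev_ops;
        down_write(&mm->mmap_sem);      /* registration requires mmap_sem */
        mm_notifier_register(mn, mm);
        up_write(&mm->mmap_sem);
        return 0;
}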

The patch is provisional. All functions are inlined for now. They should be wrapped like
in Andrea's series. It's probably good to have Andrea review this if we actually
decide to go this route since he is pretty good at detecting issues with complex
lock interactions in the vm. mmu notifiers V7 was rejected by Andrew because of the
strange asymmetry in invalidate_page_sync() (at that time called the rmap notifier) and
we are reintroducing that now in a lightweight form in order to be able to defer freeing
until after the rmap spinlocks have been dropped.

Jack tested this with the GRU.

Signed-off-by: Christoph Lameter <***@sgi.com>

---
fs/hugetlbfs/inode.c | 2
include/linux/mm_types.h | 3
include/linux/page-flags.h | 3
include/linux/rmap.h | 161 +++++++++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 4 +
mm/Kconfig | 4 +
mm/filemap_xip.c | 2
mm/fremap.c | 2
mm/hugetlb.c | 3
mm/memory.c | 38 ++++++++--
mm/mmap.c | 3
mm/mprotect.c | 3
mm/mremap.c | 5 +
mm/rmap.c | 11 ++-
14 files changed, 234 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/kernel/fork.c 2008-05-16 16:06:26.000000000 -0700
@@ -386,6 +386,9 @@ static struct mm_struct * mm_init(struct

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+#ifdef CONFIG_MM_NOTIFIER
+ mm->mm_notifier = NULL;
+#endif
return mm;
}

@@ -418,6 +421,7 @@ void __mmdrop(struct mm_struct *mm)
BUG_ON(mm == &init_mm);
mm_free_pgd(mm);
destroy_context(mm);
+ mm_notifier_destroy(mm);
free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/mm/filemap_xip.c 2008-05-16 16:06:26.000000000 -0700
@@ -189,6 +189,7 @@ __xip_unmap (struct address_space * mapp
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
+ mm_notifier_invalidate_page(mm, page, address);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
@@ -197,6 +198,7 @@ __xip_unmap (struct address_space * mapp
}
}
spin_unlock(&mapping->i_mmap_lock);
+ mm_notifier_invalidate_page_sync(page);
}

/*
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/mm/fremap.c 2008-05-16 16:06:26.000000000 -0700
@@ -214,7 +214,9 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}

+ mm_notifier_invalidate_range(mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
+ mm_notifier_invalidate_range_sync(mm, start, start + size);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/hugetlb.c 2008-05-16 17:50:31.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/rmap.h>

#include <asm/page.h>
#include <asm/pgtable.h>
@@ -843,6 +844,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mm_notifier_invalidate_range(mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
@@ -864,6 +866,7 @@ void unmap_hugepage_range(struct vm_area
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
__unmap_hugepage_range(vma, start, end);
spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ mm_notifier_invalidate_range_sync(vma->vm_mm, start, end);
}
}

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/mm/memory.c 2008-05-16 16:06:26.000000000 -0700
@@ -527,6 +527,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
*/
if (is_cow_mapping(vm_flags)) {
ptep_set_wrprotect(src_mm, addr, src_pte);
+ mm_notifier_invalidate_range(src_mm, addr, addr + PAGE_SIZE);
pte = pte_wrprotect(pte);
}

@@ -649,6 +650,7 @@ int copy_page_range(struct mm_struct *ds
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
+ int ret;

/*
* Don't copy ptes where a page fault will fill them correctly.
@@ -664,17 +666,30 @@ int copy_page_range(struct mm_struct *ds
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);

+ ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
- if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
- vma, addr, next))
- return -ENOMEM;
+ if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ vma, addr, next))) {
+ ret = -ENOMEM;
+ break;
+ }
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
- return 0;
+
+ /*
+ * We need to invalidate the secondary MMU mappings only when
+ * there could be a permission downgrade on the ptes of the
+ * parent mm. And a permission downgrade will only happen if
+ * is_cow_mapping() returns true.
+ */
+ if (is_cow_mapping(vma->vm_flags))
+ mm_notifier_invalidate_range_sync(src_mm, vma->vm_start, end);
+
+ return ret;
}

static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -913,6 +928,7 @@ unsigned long unmap_vmas(struct mmu_gath
}

tlb_finish_mmu(*tlbp, tlb_start, start);
+ mm_notifier_invalidate_range(vma->vm_mm, tlb_start, start);

if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
@@ -951,8 +967,10 @@ unsigned long zap_page_range(struct vm_a
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
- if (tlb)
+ if (tlb) {
tlb_finish_mmu(tlb, address, end);
+ mm_notifier_invalidate_range(mm, address, end);
+ }
return end;
}

@@ -1711,7 +1729,6 @@ static int do_wp_page(struct mm_struct *
*/
page_table = pte_offset_map_lock(mm, pmd, address,
&ptl);
- page_cache_release(old_page);
if (!pte_same(*page_table, orig_pte))
goto unlock;

@@ -1729,6 +1746,7 @@ static int do_wp_page(struct mm_struct *
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
+ old_page = NULL;
goto unlock;
}

@@ -1774,6 +1792,7 @@ gotten:
* thread doing COW.
*/
ptep_clear_flush(vma, address, page_table);
+ mm_notifier_invalidate_page(mm, old_page, address);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
@@ -1787,10 +1806,13 @@ gotten:

if (new_page)
page_cache_release(new_page);
- if (old_page)
- page_cache_release(old_page);
unlock:
pte_unmap_unlock(page_table, ptl);
+ if (old_page) {
+ mm_notifier_invalidate_page_sync(old_page);
+ page_cache_release(old_page);
+ }
+
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/mmap.c 2008-05-16 16:06:26.000000000 -0700
@@ -1759,6 +1759,8 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+ mm_notifier_invalidate_range(mm, start, end);
+ mm_notifier_invalidate_range_sync(mm, start, end);
}

/*
@@ -2048,6 +2050,7 @@ void exit_mmap(struct mm_struct *mm)

/* mm's last user has gone, and its about to be pulled down */
arch_exit_mmap(mm);
+ mm_notifier_release(mm);

lru_add_drain();
flush_cache_mm(mm);
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/mprotect.c 2008-05-16 16:06:26.000000000 -0700
@@ -21,6 +21,7 @@
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -132,6 +133,7 @@ static void change_protection(struct vm_
change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable);
} while (pgd++, addr = next, addr != end);
flush_tlb_range(vma, start, end);
+ mm_notifier_invalidate_range(vma->vm_mm, start, end);
}

int
@@ -211,6 +213,7 @@ success:
hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+ mm_notifier_invalidate_range_sync(mm, start, end);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/mm/mremap.c 2008-05-16 16:06:26.000000000 -0700
@@ -18,6 +18,7 @@
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/rmap.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -74,6 +75,7 @@ static void move_ptes(struct vm_area_str
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
+ unsigned long old_start = old_addr;

if (vma->vm_file) {
/*
@@ -100,6 +102,7 @@ static void move_ptes(struct vm_area_str
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
arch_enter_lazy_mmu_mode();

+ mm_notifier_invalidate_range(mm, old_addr, old_end);
for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
new_pte++, new_addr += PAGE_SIZE) {
if (pte_none(*old_pte))
@@ -116,6 +119,8 @@ static void move_ptes(struct vm_area_str
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+
+ mm_notifier_invalidate_range_sync(vma->vm_mm, old_start, old_end);
}

#define LATENCY_LIMIT (64 * PAGE_SIZE)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/rmap.c 2008-05-16 16:06:26.000000000 -0700
@@ -52,6 +52,9 @@

#include <asm/tlbflush.h>

+struct mm_notifier *mm_notifier_page_sync;
+DECLARE_RWSEM(mm_notifier_page_sync_sem);
+
struct kmem_cache *anon_vma_cachep;

/* This must be called under the mmap_sem. */
@@ -458,6 +461,7 @@ static int page_mkclean_one(struct page

flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
+ mm_notifier_invalidate_page(mm, page, address);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -502,8 +506,8 @@ int page_mkclean(struct page *page)
ret = 1;
}
}
+ mm_notifier_invalidate_page_sync(page);
}
-
return ret;
}
EXPORT_SYMBOL_GPL(page_mkclean);
@@ -725,6 +729,7 @@ static int try_to_unmap_one(struct page
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
+ mm_notifier_invalidate_page(mm, page, address);

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -855,6 +860,7 @@ static void try_to_unmap_cluster(unsigne
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
+ mm_notifier_invalidate_page(mm, page, address);

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
@@ -1013,8 +1019,9 @@ int try_to_unmap(struct page *page, int
else
ret = try_to_unmap_file(page, migration);

+ mm_notifier_invalidate_page_sync(page);
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
}
-
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/include/linux/rmap.h 2008-05-16 18:32:52.000000000 -0700
@@ -133,4 +133,165 @@ static inline int page_mkclean(struct pa
#define SWAP_AGAIN 1
#define SWAP_FAIL 2

+#ifdef CONFIG_MM_NOTIFIER
+
+struct mm_notifier_ops {
+ void (*invalidate_range)(struct mm_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+ void (*invalidate_range_sync)(struct mm_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+ void (*invalidate_page)(struct mm_notifier *mn, struct mm_struct *mm,
+ struct page *page, unsigned long addr);
+ void (*invalidate_page_sync)(struct mm_notifier *mn, struct mm_struct *mm,
+ struct page *page);
+ void (*release)(struct mm_notifier *mn, struct mm_struct *mm);
+ void (*destroy)(struct mm_notifier *mn, struct mm_struct *mm);
+};
+
+struct mm_notifier {
+ struct mm_notifier_ops *ops;
+ struct mm_struct *mm;
+ struct mm_notifier *next;
+ struct mm_notifier *next_page_sync;
+};
+
+extern struct mm_notifier *mm_notifier_page_sync;
+extern struct rw_semaphore mm_notifier_page_sync_sem;
+
+/*
+ * Must hold mmap_sem when calling mm_notifier_register.
+ */
+static inline void mm_notifier_register(struct mm_notifier *mn,
+ struct mm_struct *mm)
+{
+ mn->mm = mm;
+ mn->next = mm->mm_notifier;
+ rcu_assign_pointer(mm->mm_notifier, mn);
+ if (mn->ops->invalidate_page_sync) {
+ down_write(&mm_notifier_page_sync_sem);
+ mn->next_page_sync = mm_notifier_page_sync;
+ mm_notifier_page_sync = mn;
+ up_write(&mm_notifier_page_sync_sem);
+ }
+}
+
+/*
+ * Invalidate remote references in a particular address range
+ */
+static inline void mm_notifier_invalidate_range(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mm_notifier *mn;
+
+ for (mn = rcu_dereference(mm->mm_notifier); mn;
+ mn = rcu_dereference(mn->next))
+ mn->ops->invalidate_range(mn, mm, start, end);
+}
+
+/*
+ * Invalidate remote references in a particular address range.
+ * Can sleep. Only return if all remote references have been removed.
+ */
+static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mm_notifier *mn;
+
+ for (mn = rcu_dereference(mm->mm_notifier); mn;
+ mn = rcu_dereference(mn->next))
+ if (mn->ops->invalidate_range_sync)
+ mn->ops->invalidate_range_sync(mn, mm, start, end);
+}
+
+/*
+ * Invalidate remote references to a page
+ */
+static inline void mm_notifier_invalidate_page(struct mm_struct *mm,
+ struct page *page, unsigned long addr)
+{
+ struct mm_notifier *mn;
+
+ for (mn = rcu_dereference(mm->mm_notifier); mn;
+ mn = rcu_dereference(mn->next))
+ mn->ops->invalidate_page(mn, mm, page, addr);
+}
+
+/*
+ * Invalidate remote references to a particular page. Only return
+ * if all references have been removed.
+ *
+ * Note: This is an expensive function since it is not clear at the time
+ * of call to which mm_struct the page belongs. It walks through the
+ * mmlist and calls the mmu notifier ops for each address space in the
+ * system. At some point this needs to be optimized.
+ */
+static inline void mm_notifier_invalidate_page_sync(struct page *page)
+{
+ struct mm_notifier *mn;
+
+ if (!PageNotifier(page))
+ return;
+
+ down_read(&mm_notifier_page_sync_sem);
+
+ for (mn = mm_notifier_page_sync; mn; mn = mn->next_page_sync)
+ if (mn->ops->invalidate_page_sync)
+ mn->ops->invalidate_page_sync(mn, mn->mm, page);
+
+ up_read(&mm_notifier_page_sync_sem);
+}
+
+/*
+ * Invalidate all remote references before shutdown
+ */
+static inline void mm_notifier_release(struct mm_struct *mm)
+{
+ struct mm_notifier *mn;
+
+ for (mn = rcu_dereference(mm->mm_notifier); mn;
+ mn = rcu_dereference(mn->next))
+ mn->ops->release(mn, mm);
+}
+
+/*
+ * Release resources before freeing mm_struct.
+ */
+static inline void mm_notifier_destroy(struct mm_struct *mm)
+{
+ struct mm_notifier *mn;
+
+ while (mm->mm_notifier) {
+ mn = mm->mm_notifier;
+ mm->mm_notifier = mn->next;
+ if (mn->ops->invalidate_page_sync) {
+ struct mm_notifier *m;
+
+ down_write(&mm_notifier_page_sync_sem);
+
+ if (mm_notifier_page_sync != mn) {
+ for (m = mm_notifier_page_sync; m; m = m->next_page_sync)
+ if (m->next_page_sync == mn)
+ break;
+
+ m->next_page_sync = mn->next_page_sync;
+ } else
+ mm_notifier_page_sync = mn->next_page_sync;
+
+ up_write(&mm_notifier_page_sync_sem);
+ }
+ mn->ops->destroy(mn, mm);
+ }
+}
+#else
+static inline void mm_notifier_invalidate_range(struct mm_struct *mm,
+ unsigned long start, unsigned long end) {}
+static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm,
+ unsigned long start, unsigned long end) {}
+static inline void mm_notifier_invalidate_page(struct mm_struct *mm,
+ struct page *page, unsigned long address) {}
+static inline void mm_notifier_invalidate_page_sync(struct page *page) {}
+static inline void mm_notifier_release(struct mm_struct *mm) {}
+static inline void mm_notifier_destroy(struct mm_struct *mm) {}
+#endif
+
#endif /* _LINUX_RMAP_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/Kconfig 2008-05-16 16:06:26.000000000 -0700
@@ -205,3 +205,7 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MM_NOTIFIER
+ def_bool y
+
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/include/linux/mm_types.h 2008-05-16 16:06:26.000000000 -0700
@@ -244,6 +244,9 @@ struct mm_struct {
struct file *exe_file;
unsigned long num_exe_file_vmas;
#endif
+#ifdef CONFIG_MM_NOTIFIER
+ struct mm_notifier *mm_notifier;
+#endif
};

#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/include/linux/page-flags.h 2008-05-16 16:06:26.000000000 -0700
@@ -93,6 +93,7 @@ enum pageflags {
PG_mappedtodisk, /* Has blocks allocated on-disk */
PG_reclaim, /* To be reclaimed asap */
PG_buddy, /* Page is free, on buddy lists */
+ PG_notifier, /* Call notifier when page is changed/unmapped */
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
@@ -173,6 +174,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk)
PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */

+PAGEFLAG(Notifier, notifier);
+
#ifdef CONFIG_HIGHMEM
/*
* Must use a macro here due to header dependency issues. page_zone() is not
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/fs/hugetlbfs/inode.c 2008-05-16 16:06:55.000000000 -0700
@@ -442,6 +442,8 @@ hugetlb_vmtruncate_list(struct prio_tree

__unmap_hugepage_range(vma,
vma->vm_start + v_offset, vma->vm_end);
+ mm_notifier_invalidate_range_sync(vma->vm_mm,
+ vma->vm_start + v_offset, vma->vm_end);
}
}
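
For reference, a minimal sketch of how a driver could hook into the API this
patch adds. Every my_* name is invented; the only assumptions are the
mm_notifier declarations from the rmap.h hunk above and that registration
runs under mmap_sem (taken here for write, since it modifies the notifier
list).

#include <linux/rmap.h>
#include <linux/sched.h>
#include <linux/slab.h>

static void my_invalidate_range(struct mm_notifier *mn, struct mm_struct *mm,
                                unsigned long start, unsigned long end)
{
        /* tear down secondary mappings covering [start, end) */
}

static void my_invalidate_page(struct mm_notifier *mn, struct mm_struct *mm,
                               struct page *page, unsigned long addr)
{
        /* tear down the secondary mapping of this single page */
}

static void my_release(struct mm_notifier *mn, struct mm_struct *mm)
{
        /* the address space is exiting; drop all external references */
}

static void my_destroy(struct mm_notifier *mn, struct mm_struct *mm)
{
        kfree(mn);
}

static struct mm_notifier_ops my_notifier_ops = {
        .invalidate_range       = my_invalidate_range,
        .invalidate_page        = my_invalidate_page,
        .release                = my_release,
        .destroy                = my_destroy,
};

static int my_attach_mm(struct mm_struct *mm)
{
        struct mm_notifier *mn = kmalloc(sizeof(*mn), GFP_KERNEL);

        if (!mn)
                return -ENOMEM;
        mn->ops = &my_notifier_ops;

        /* the header requires mmap_sem to be held across registration */
        down_write(&mm->mmap_sem);
        mm_notifier_register(mn, mm);
        up_write(&mm->mmap_sem);
        return 0;
}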


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Nick Piggin
2008-05-15 07:58:31 UTC
Permalink
Post by Robin Holt
Post by Nick Piggin
I guess that you have found a way to perform TLB flushing within coherent
domains over the numalink interconnect without sleeping. I'm sure it would
be possible to send similar messages between non coherent domains.
I assume by coherent domains, your are actually talking about system
images.
Yes
Post by Robin Holt
Our memory coherence domain on the 3700 family is 512 processors
on 128 nodes. On the 4700 family, it is 16,384 processors on 4096 nodes.
We extend a "Read-Exclusive" mode beyond the coherence domain so any
processor is able to read any cacheline on the system. We also provide
uncached access for certain types of memory beyond the coherence domain.
Yes, I understand the basics.
Post by Robin Holt
For the other partitions, the exporting partition does not know what
virtual address the imported pages are mapped. The pages are frequently
mapped in a different order by the MPI library to help with MPI collective
operations.
For the exporting side to do those TLB flushes, we would need to replicate
all that importing information back to the exporting side.
Right. Or the exporting side could be passed tokens that it tracks itself,
rather than virtual addresses.
Post by Robin Holt
Additionally, the hardware that does the TLB flushing is protected
by a spinlock on each system image. We would need to change that
simple spinlock into a type of hardware lock that would work (on 3700)
outside the processors coherence domain. The only way to do that is to
use uncached addresses with our Atomic Memory Operations which do the
cmpxchg at the memory controller. The uncached accesses are an order
of magnitude or more slower.
I'm not sure if you're thinking about what I'm thinking of. With the
scheme I'm imagining, all you will need is some way to raise an IPI-like
interrupt on the target domain. The IPI target will have a driver to
handle the interrupt, which will determine the mm and virtual addresses
which are to be invalidated, and will then tear down those page tables
and issue hardware TLB flushes within its domain. On the Linux side,
I don't see why this can't be done.
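
A rough sketch of that shape, with every name invented (the non-sleeping
lookup and teardown helpers are of course the part being argued about):

#include <linux/interrupt.h>
#include <linux/types.h>

struct xpmem_inval_msg {                        /* hypothetical wire format */
        u64             mm_token;               /* agreed on at attach time */
        unsigned long   start;
        unsigned long   end;
};

struct xpmem_attachment;                        /* hypothetical importer state */
struct xpmem_attachment *xpmem_find_attachment(u64 token);
void xpmem_zap_imported_range(struct xpmem_attachment *att,
                              unsigned long start, unsigned long end);

static irqreturn_t xpmem_inval_interrupt(int irq, void *dev_id)
{
        struct xpmem_inval_msg *msg = dev_id;   /* read from the shared buffer */
        struct xpmem_attachment *att = xpmem_find_attachment(msg->mm_token);

        /*
         * Tear down the importer's ptes for [start, end) and flush the
         * TLBs inside this coherence domain, then acknowledge so the
         * exporting side can stop waiting.
         */
        if (att)
                xpmem_zap_imported_range(att, msg->start, msg->end);

        return IRQ_HANDLED;
}
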
Post by Robin Holt
Post by Nick Piggin
So yes, I'd much rather rework such highly specialized system to fit in
closer with Linux than rework Linux to fit with these machines (and
apparently slow everyone else down).
But it isn't that we are having a problem adapting to just the hardware.
One of the limiting factors is Linux on the other partition.
In what way is the Linux limiting?
Post by Robin Holt
Post by Nick Piggin
Post by Robin Holt
Additionally, the call to zap_page_range expects to have the mmap_sem
held. I suppose we could use something other than zap_page_range and
atomically clear the process page tables.
zap_page_range does not expect to have mmap_sem held. I think for anon
pages it is always called with mmap_sem, however try_to_unmap_anon is
not (although it expects page lock to be held, I think we should be able
to avoid that).
zap_page_range calls unmap_vmas which walks to vma->next. Are you saying
that can be walked without grabbing the mmap_sem at least for read?
Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes we can easily zap pagetables without holding mmap_sem.
Post by Robin Holt
I feel my understanding of list management and locking completely
shifting.
FWIW, mmap_sem isn't held to protect vma->next there anyway, because at
that point the vmas are detached from the mm's rbtree and linked list.
But sure, in that particular path it is held for other reasons.
Post by Robin Holt
Post by Nick Piggin
Post by Robin Holt
Doing that will not alleviate
the need to sleep for the messaging to the other partitions.
No, but I'd venture to guess that is not impossible to implement even
on your current hardware (maybe a firmware update is needed)?
Are you suggesting the sending side would not need to sleep or the
receiving side? Assuming you meant the sender, it spins waiting for the
remote side to acknowledge the invalidate request? We place the data
into a previously agreed upon buffer and send an interrupt. At this
point, we would need to start spinning and waiting for completion.
Let's assume we never run out of buffer space.
How would you run out of buffer space if it is synchronous?
Post by Robin Holt
The receiving side receives an interrupt. The interrupt currently wakes
an XPC thread to do the work of transfering and delivering the message
to XPMEM. The transfer of the data which XPC does uses the BTE engine
which takes up to 28 seconds to timeout (hardware timeout before raising
and error) and the BTE code automatically does a retry for certain
types of failure. We currently need to grab semaphores which _MAY_
be able to be reworked into other types of locks.
Sure, you obviously would need to rework your code because it's been
written with the assumption that it can sleep.

What is XPMEM exactly anyway? I'd assumed it is a Linux driver.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Robin Holt
2008-05-15 11:02:12 UTC
Permalink
We are pursuing Linus' suggestion currently. This discussion is
completely unrelated to that work.
Post by Nick Piggin
I'm not sure if you're thinking about what I'm thinking of. With the
scheme I'm imagining, all you will need is some way to raise an IPI-like
interrupt on the target domain. The IPI target will have a driver to
handle the interrupt, which will determine the mm and virtual addresses
which are to be invalidated, and will then tear down those page tables
and issue hardware TLB flushes within its domain. On the Linux side,
I don't see why this can't be done.
We would need to deposit the payload into a central location to do the
invalidate, correct? That central location would either need to be
indexed by physical cpuid (65536 possible currently, UV will push that
up much higher) or by some sort of global id, which is difficult because
remote partitions can reboot, giving you a different view of the machine,
and running partitions would need to be updated. Alternatively, that
central location would need to be protected by a global lock or an
atomic-type operation, but a majority of the machine does not have coherent
access to other partitions, so they would need to use uncached operations.
Essentially, take away from this paragraph that it is going to be really
slow or really large.

Then we need to deposit the information needed to do the invalidate.

Lastly, we would need to interrupt. Unfortunately, here we have a
thundering herd. There could be up to 16256 processors interrupting the
same processor. That will be a lot of work. It will need to look up the
mm (without grabbing any sleeping locks in either xpmem or the kernel)
and do the tlb invalidates.

Unfortunately, the sending side is not free to continue (in most cases)
until it knows that the invalidate is completed. So it will need to spin
waiting for a completion signal which could be as simple as an uncached
word. But how will it handle the possible failure of the other partition?
How will it detect that failure and recover? A timeout value could be
difficult to gauge because the other side may be off doing a considerable
amount of work and may just be backed up.
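
The sender side being described would look roughly like this (names made up;
picking the timeout below is exactly the problem, so the value is arbitrary):

#include <linux/jiffies.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <asm/processor.h>

#define XPMEM_INVAL_TIMEOUT     (30 * HZ)       /* arbitrary, see above */

/* completion_word points at an uncached location the remote side writes */
static int xpmem_wait_remote_invalidate(volatile u64 *completion_word)
{
        unsigned long deadline = jiffies + XPMEM_INVAL_TIMEOUT;

        while (!*completion_word) {
                if (time_after(jiffies, deadline))
                        /* dead partition, or merely backed up?  we can't tell */
                        return -ETIMEDOUT;
                cpu_relax();
        }
        return 0;
}
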
Post by Nick Piggin
Sure, you obviously would need to rework your code because it's been
written with the assumption that it can sleep.
It is an assumption based upon some of the kernel functions we call
doing things like grabbing mutexes or rw_sems. That pushes back to us.
I think the kernel's locking is perfectly reasonable. The problem we run
into is that we are trying to get from one context in one kernel to a
different context in another, and the in-between piece needs to be sleepable.
Post by Nick Piggin
What is XPMEM exactly anyway? I'd assumed it is a Linux driver.
XPMEM allows one process to make a portion of its virtual address range
directly addressable by another process with the appropriate access.
The other process can be on another partition. As long as Numa-link
allows access to the memory, we can make it available. Userland has an
advantage in that the kernel entry/exit code can contain memory errors,
so we can limit hardware failures (in most cases) to terminating a user
program rather than losing the partition. The kernel enjoys no such
fault containment, so it cannot safely reference that memory directly.


Thanks,
Robin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Avi Kivity
2008-05-15 11:12:54 UTC
Permalink
Post by Robin Holt
Then we need to deposit the information needed to do the invalidate.
Lastly, we would need to interrupt. Unfortunately, here we have a
thundering herd. There could be up to 16256 processors interrupting the
same processor. That will be a lot of work. It will need to look up the
mm (without grabbing any sleeping locks in either xpmem or the kernel)
and do the tlb invalidates.
You don't need to interrupt every time. Place your data in a queue (you
do support rmw operations, right?) and interrupt. Invalidates from
other processors will see that the queue hasn't been processed yet and
skip the interrupt.
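
Something along these lines, with invented names; only the producer that finds
the queue idle raises the interrupt, the rest just enqueue:

#include <asm/atomic.h>

struct xpmem_inval_queue {
        atomic_t        armed;          /* 0: consumer idle, 1: interrupt sent */
        /* ring of (mm, start, end) invalidate requests, omitted */
};

void xpmem_send_interrupt(struct xpmem_inval_queue *q);         /* hypothetical */

static void xpmem_queue_invalidate(struct xpmem_inval_queue *q /* , payload */)
{
        /* append the payload to the ring with an rmw/cmpxchg here */

        /* only interrupt if nobody has done so since the last drain */
        if (atomic_cmpxchg(&q->armed, 0, 1) == 0)
                xpmem_send_interrupt(q);
}

static void xpmem_drain_invalidates(struct xpmem_inval_queue *q)
{
        /* disarm first, so a producer that races with us re-interrupts */
        atomic_set(&q->armed, 0);
        /* drain the ring and do the local shoot-downs */
}
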
--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Christoph Lameter
2008-05-15 17:34:20 UTC
Permalink
Post by Nick Piggin
Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes we can easily zap pagetables without holding mmap_sem.
How is that synchronized with code that walks the same pagetable? These
walks may not hold mmap_sem either. I would expect that one could only
remove a portion of the pagetable where we have some sort of guarantee
that no accesses occur. So does the prior removal of the vma ensure that?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Nick Piggin
2008-05-15 23:52:53 UTC
Permalink
Post by Christoph Lameter
Post by Nick Piggin
Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes we can easily zap pagetables without holding mmap_sem.
How is that synchronized with code that walks the same pagetable. These
walks may not hold mmap_sem either. I would expect that one could only
remove a portion of the pagetable where we have some sort of guarantee
that no accesses occur. So the removal of the vma prior ensures that?
I don't really understand the question. If you remove the pte and invalidate
the TLBs on the remote image's process (importing the page), then it can
of course try to refault the page in because its vma is still there. But
you catch that refault in your driver, which can prevent the page from
being faulted back in.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Robin Holt
2008-05-16 11:23:27 UTC
Permalink
Post by Nick Piggin
Post by Christoph Lameter
Post by Nick Piggin
Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes we can easily zap pagetables without holding mmap_sem.
How is that synchronized with code that walks the same pagetable. These
walks may not hold mmap_sem either. I would expect that one could only
remove a portion of the pagetable where we have some sort of guarantee
that no accesses occur. So the removal of the vma prior ensures that?
I don't really understand the question. If you remove the pte and invalidate
the TLBS on the remote image's process (importing the page), then it can
of course try to refault the page in because it's vma is still there. But
you catch that refault in your driver , which can prevent the page from
being faulted back in.
I think Christoph's question has more to do with faults that are
in flight. A recently requested fault could have just released the
last lock that was holding up the invalidate callout. It would then
begin messaging back the response PFN which could still be in flight.
The invalidate callout would then fire and do the interrupt shoot-down
while that response was still active (essentially beating the inflight
response). The invalidate would clear up nothing and then the response
would insert the PFN after it is no longer the correct PFN.

Thanks,
Robin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Robin Holt
2008-05-16 11:50:30 UTC
Permalink
Post by Robin Holt
Post by Nick Piggin
Post by Christoph Lameter
Post by Nick Piggin
Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes we can easily zap pagetables without holding mmap_sem.
How is that synchronized with code that walks the same pagetable. These
walks may not hold mmap_sem either. I would expect that one could only
remove a portion of the pagetable where we have some sort of guarantee
that no accesses occur. So the removal of the vma prior ensures that?
I don't really understand the question. If you remove the pte and invalidate
the TLBS on the remote image's process (importing the page), then it can
of course try to refault the page in because it's vma is still there. But
you catch that refault in your driver , which can prevent the page from
being faulted back in.
I think Christoph's question has more to do with faults that are
in flight. A recently requested fault could have just released the
last lock that was holding up the invalidate callout. It would then
begin messaging back the response PFN which could still be in flight.
The invalidate callout would then fire and do the interrupt shoot-down
while that response was still active (essentially beating the inflight
response). The invalidate would clear up nothing and then the response
would insert the PFN after it is no longer the correct PFN.
I just looked over XPMEM. I think we could make this work. We already
have a list of active faults which is protected by a simple spinlock.
I would need to nest this lock within another lock protecting our PFN
table (currently it is a mutex) and then the invalidate interrupt handler
would need to mark the fault as invalid (which is also currently there).
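
As a sketch of that nesting (all names invented): the invalidate interrupt
flags any overlapping in-flight fault under the fault-list spinlock, and the
fault path rechecks that flag under the same lock before inserting the PFN.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/errno.h>

struct xpmem_fault {
        struct list_head        list;
        unsigned long           vaddr;
        int                     invalid;        /* set by the invalidate handler */
};

static LIST_HEAD(active_faults);
static DEFINE_SPINLOCK(active_faults_lock);

void xpmem_install_pfn(unsigned long vaddr, unsigned long pfn);  /* hypothetical */

/* called from the (non-sleeping) invalidate interrupt */
static void xpmem_invalidate_inflight(unsigned long start, unsigned long end)
{
        struct xpmem_fault *f;
        unsigned long flags;

        spin_lock_irqsave(&active_faults_lock, flags);
        list_for_each_entry(f, &active_faults, list)
                if (f->vaddr >= start && f->vaddr < end)
                        f->invalid = 1;
        spin_unlock_irqrestore(&active_faults_lock, flags);
}

/* fault path: only install the PFN if no invalidate beat the response */
static int xpmem_finish_fault(struct xpmem_fault *f, unsigned long pfn)
{
        unsigned long flags;
        int ret = 0;

        spin_lock_irqsave(&active_faults_lock, flags);
        if (!f->invalid)
                xpmem_install_pfn(f->vaddr, pfn);
        else
                ret = -EAGAIN;
        list_del(&f->list);
        spin_unlock_irqrestore(&active_faults_lock, flags);
        return ret;
}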

I think my sticking points with the interrupt method remain at fault
containment and timeout. The inability of the ia64 processor to
provide predictive failures for the read/write of memory on other
partitions prevents us from being able to contain the failure. I don't
think we can get the information we would need to do the invalidate
without introducing fault containment issues, which have been a continuous
area of concern for our customers.


Thanks,
Robin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Nick Piggin
2008-05-20 05:32:04 UTC
Permalink
Post by Robin Holt
Post by Robin Holt
Post by Nick Piggin
Post by Christoph Lameter
Post by Nick Piggin
Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes we can easily zap pagetables without holding mmap_sem.
How is that synchronized with code that walks the same pagetable. These
walks may not hold mmap_sem either. I would expect that one could only
remove a portion of the pagetable where we have some sort of guarantee
that no accesses occur. So the removal of the vma prior ensures that?
I don't really understand the question. If you remove the pte and invalidate
the TLBS on the remote image's process (importing the page), then it can
of course try to refault the page in because it's vma is still there. But
you catch that refault in your driver , which can prevent the page from
being faulted back in.
I think Christoph's question has more to do with faults that are
in flight. A recently requested fault could have just released the
last lock that was holding up the invalidate callout. It would then
begin messaging back the response PFN which could still be in flight.
The invalidate callout would then fire and do the interrupt shoot-down
while that response was still active (essentially beating the inflight
response). The invalidate would clear up nothing and then the response
would insert the PFN after it is no longer the correct PFN.
I just looked over XPMEM. I think we could make this work. We already
have a list of active faults which is protected by a simple spinlock.
I would need to nest this lock within another lock protected our PFN
table (currently it is a mutex) and then the invalidate interrupt handler
would need to mark the fault as invalid (which is also currently there).
I think my sticking points with the interrupt method remain at fault
containment and timeout. The inability of the ia64 processor to handle
provide predictive failures for the read/write of memory on other
partitions prevents us from being able to contain the failure. I don't
think we can get the information we would need to do the invalidate
without introducing fault containment issues which has been a continous
area of concern for our customers.
Really? You can get the information through via a sleeping messaging API,
but not a non-sleeping one? What is the difference from the hardware POV?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Jack Steiner
2008-05-07 22:43:19 UTC
Permalink
Post by Andrea Arcangeli
And I don't see a problem in making the conversion from
spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on
anything but ia64.
That is currently true but we are also working on XPMEM for x86_64.
The new XPMEM code should be posted within a few weeks.

--- jack
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andrea Arcangeli
2008-05-07 14:42:26 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115508 -7200
# Node ID 94eaa1515369e8ef183e2457f6f25a7f36473d70
# Parent 6b384bb988786aa78ef07440180e4b2948c4c6a2
mm_lock-rwsem

Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.

Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc
unsigned long flags, struct page **pages);

struct mm_lock_data {
- spinlock_t **i_mmap_locks;
- spinlock_t **anon_vma_locks;
- size_t nr_i_mmap_locks;
- size_t nr_anon_vma_locks;
+ struct rw_semaphore **i_mmap_sems;
+ struct rw_semaphore **anon_vma_sems;
+ size_t nr_i_mmap_sems;
+ size_t nr_anon_vma_sems;
};
extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st

static int mm_lock_cmp(const void *a, const void *b)
{
- unsigned long _a = (unsigned long)*(spinlock_t **)a;
- unsigned long _b = (unsigned long)*(spinlock_t **)b;
+ unsigned long _a = (unsigned long)*(struct rw_semaphore **)a;
+ unsigned long _b = (unsigned long)*(struct rw_semaphore **)b;

cond_resched();
if (_a < _b)
@@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co
return 0;
}

-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
int anon)
{
struct vm_area_struct *vma;
@@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (anon) {
if (vma->anon_vma)
- locks[i++] = &vma->anon_vma->lock;
+ sems[i++] = &vma->anon_vma->sem;
} else {
if (vma->vm_file && vma->vm_file->f_mapping)
- locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+ sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem;
}
}

if (!i)
goto out;

- sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+ sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL);

out:
return i;
}

static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
{
- return mm_lock_sort(mm, locks, 1);
+ return mm_lock_sort(mm, sems, 1);
}

static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
{
- return mm_lock_sort(mm, locks, 0);
+ return mm_lock_sort(mm, sems, 0);
}

-static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock)
{
- spinlock_t *last = NULL;
+ struct rw_semaphore *last = NULL;
size_t i;

for (i = 0; i < nr; i++)
/* Multiple vmas may use the same lock. */
- if (locks[i] != last) {
- BUG_ON((unsigned long) last > (unsigned long) locks[i]);
- last = locks[i];
+ if (sems[i] != last) {
+ BUG_ON((unsigned long) last > (unsigned long) sems[i]);
+ last = sems[i];
if (lock)
- spin_lock(last);
+ down_write(last);
else
- spin_unlock(last);
+ up_write(last);
}
}

-static inline void __mm_lock(spinlock_t **locks, size_t nr)
+static inline void __mm_lock(struct rw_semaphore **sems, size_t nr)
{
- mm_lock_unlock(locks, nr, 1);
+ mm_lock_unlock(sems, nr, 1);
}

-static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr)
{
- mm_lock_unlock(locks, nr, 0);
+ mm_lock_unlock(sems, nr, 0);
}

/*
@@ -2358,10 +2358,10 @@ static inline void __mm_unlock(spinlock_
* of vmas is defined in /proc/sys/vm/max_map_count.
*
* mm_lock() can fail if memory allocation fails. The worst case
- * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *),
- * so around 1Mbyte, but in practice it'll be much less because
- * normally there won't be max_map_count vmas allocated in the task
- * that runs mm_lock().
+ * vmalloc allocation required is 2*max_map_count*sizeof(struct
+ * rw_semaphore *), so around 1Mbyte, but in practice it'll be much
+ * less because normally there won't be max_map_count vmas allocated
+ * in the task that runs mm_lock().
*
* The vmalloc memory allocated by mm_lock is stored in the
* mm_lock_data structure that must be allocated by the caller and it
@@ -2375,16 +2375,16 @@ static inline void __mm_unlock(spinlock_
*/
int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
{
- spinlock_t **anon_vma_locks, **i_mmap_locks;
+ struct rw_semaphore **anon_vma_sems, **i_mmap_sems;

if (mm->map_count) {
- anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
- if (unlikely(!anon_vma_locks))
+ anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count);
+ if (unlikely(!anon_vma_sems))
return -ENOMEM;

- i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
- if (unlikely(!i_mmap_locks)) {
- vfree(anon_vma_locks);
+ i_mmap_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count);
+ if (unlikely(!i_mmap_sems)) {
+ vfree(anon_vma_sems);
return -ENOMEM;
}

@@ -2392,31 +2392,31 @@ int mm_lock(struct mm_struct *mm, struct
* When mm_lock_sort_anon_vma/i_mmap returns zero it
* means there's no lock to take and so we can free
* the array here without waiting mm_unlock. mm_unlock
- * will do nothing if nr_i_mmap/anon_vma_locks is
+ * will do nothing if nr_i_mmap/anon_vma_sems is
* zero.
*/
- data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks);
- data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks);
+ data->nr_anon_vma_sems = mm_lock_sort_anon_vma(mm, anon_vma_sems);
+ data->nr_i_mmap_sems = mm_lock_sort_i_mmap(mm, i_mmap_sems);

- if (data->nr_anon_vma_locks) {
- __mm_lock(anon_vma_locks, data->nr_anon_vma_locks);
- data->anon_vma_locks = anon_vma_locks;
+ if (data->nr_anon_vma_sems) {
+ __mm_lock(anon_vma_sems, data->nr_anon_vma_sems);
+ data->anon_vma_sems = anon_vma_sems;
} else
- vfree(anon_vma_locks);
+ vfree(anon_vma_sems);

- if (data->nr_i_mmap_locks) {
- __mm_lock(i_mmap_locks, data->nr_i_mmap_locks);
- data->i_mmap_locks = i_mmap_locks;
+ if (data->nr_i_mmap_sems) {
+ __mm_lock(i_mmap_sems, data->nr_i_mmap_sems);
+ data->i_mmap_sems = i_mmap_sems;
} else
- vfree(i_mmap_locks);
+ vfree(i_mmap_sems);
}
return 0;
}

-static void mm_unlock_vfree(spinlock_t **locks, size_t nr)
+static void mm_unlock_vfree(struct rw_semaphore **sems, size_t nr)
{
- __mm_unlock(locks, nr);
- vfree(locks);
+ __mm_unlock(sems, nr);
+ vfree(sems);
}

/*
@@ -2435,11 +2435,11 @@ void mm_unlock(struct mm_struct *mm, str
void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
{
if (mm->map_count) {
- if (data->nr_anon_vma_locks)
- mm_unlock_vfree(data->anon_vma_locks,
- data->nr_anon_vma_locks);
- if (data->nr_i_mmap_locks)
- mm_unlock_vfree(data->i_mmap_locks,
- data->nr_i_mmap_locks);
+ if (data->nr_anon_vma_sems)
+ mm_unlock_vfree(data->anon_vma_sems,
+ data->nr_anon_vma_sems);
+ if (data->nr_i_mmap_sems)
+ mm_unlock_vfree(data->i_mmap_sems,
+ data->nr_i_mmap_sems);
}
}
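
For illustration, the intended calling pattern after this conversion, assuming
(as elsewhere in the series) that registration runs with mmap_sem held for
write; "example_register_notifier" is not part of the patch:

#include <linux/mm.h>
#include <linux/sched.h>

static int example_register_notifier(struct mm_struct *mm)
{
        struct mm_lock_data data;
        int ret;

        down_write(&mm->mmap_sem);
        ret = mm_lock(mm, &data);       /* write-locks every i_mmap/anon_vma sem */
        if (!ret) {
                /*
                 * Nothing protected by those semaphores can run
                 * concurrently here, so it is safe to hook a new
                 * notifier into the mm.
                 */
                mm_unlock(mm, &data);
        }
        up_write(&mm->mmap_sem);
        return ret;
}
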
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andrea Arcangeli
2008-05-07 14:42:59 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115798 -7200
# Node ID eb924315351f6b056428e35c983ad28040420fea
# Parent 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b
mmap sems

This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson <***@sgi.com>
Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki
*
* ->i_mutex (generic_file_buffered_write)
* ->mmap_sem (fault_in_pages_readable->do_page_fault)
+ *
+ * When taking multiple mmap_sems, one should lock the lowest-addressed
+ * one first, proceeding up to the highest-addressed one.
*
* ->i_mutex
* ->i_alloc_sem (various)
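
A minimal sketch of the rule (the lockdep annotation is only a guess at how
the nesting would be expressed):

#include <linux/sched.h>
#include <linux/lockdep.h>

static void lock_two_mmap_sems(struct mm_struct *mm1, struct mm_struct *mm2)
{
        if (mm1 == mm2) {
                down_write(&mm1->mmap_sem);
        } else if ((unsigned long)mm1 < (unsigned long)mm2) {
                down_write(&mm1->mmap_sem);
                down_write_nested(&mm2->mmap_sem, SINGLE_DEPTH_NESTING);
        } else {
                down_write(&mm2->mmap_sem);
                down_write_nested(&mm1->mmap_sem, SINGLE_DEPTH_NESTING);
        }
}
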
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Andrea Arcangeli
2008-05-07 14:43:34 UTC
Permalink
# HG changeset patch
# User Andrea Arcangeli <***@qumranet.com>
# Date 1210115797 -7200
# Node ID 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b
# Parent 94eaa1515369e8ef183e2457f6f25a7f36473d70
export zap_page_range for XPMEM

XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator. XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson <***@sgi.com>
Signed-off-by: Andrea Arcangeli <***@qumranet.com>

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a

return unmap_vmas(vma, address, end, &nr_accounted, details);
}
+EXPORT_SYMBOL_GPL(zap_page_range);

/*
* Do a quick page-table lookup for a single page.
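
As an illustration of what the export is for, roughly what an XPMEM-style user
would do with it ("xpmem_att" and its fields are hypothetical driver state):

#include <linux/mm.h>
#include <linux/sched.h>

struct xpmem_att {
        struct vm_area_struct   *vma;           /* VM_PFNMAP vma of the import */
        unsigned long           vaddr;
        unsigned long           size;
};

static void xpmem_zap_attachment(struct xpmem_att *att)
{
        struct mm_struct *mm = att->vma->vm_mm;

        /* madvise(MADV_DONTNEED) refuses VM_PFNMAP vmas; zap_page_range does not */
        down_read(&mm->mmap_sem);
        zap_page_range(att->vma, att->vaddr, att->size, NULL);
        up_read(&mm->mmap_sem);
}
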
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/