Discussion:
[patch 00/52] vfs scalability patches updated
n***@suse.de
2010-06-24 03:02:12 UTC
http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

Update to vfs scalability patches:

- Lots of fixes, particularly RCU inode stuff
- Lots of cleanups and aesthetic improvements to the code, ifdef reduction, etc.
- Use bit locks for inode and dentry hashes
- Small improvements to single-threaded performance
- Split inode LRU and writeback list locking
- Per-bdi inode writeback list locking
- Per-zone mm shrinker
- Per-zone dentry and inode LRU lists
- Several fixes brought in from -rt tree testing
- No global locks remain in any fastpaths (arguably, rename is the exception)

I have not included the store-free path walk patches in this posting. They
require a bit more work and will need to be reworked after the
->d_revalidate/->follow_mount changes that Al wants to do. I prefer to
concentrate on these locking patches first.

Autofs4 is sadly missing. It's a bit tricky; those patches have to be reworked.

Performance:
Last time I was testing on a 32-node Altix, which is arguably not a sweet spot
for Linux performance targets (ie. improvements there may not justify the
complexity). So recently I've been testing with a tightly interconnected
4-socket Nehalem (4s/32c/64t). Linux needs to perform well on systems of this
size.

*** Single-thread microbenchmark (simple syscall loops, lower is better):
Test Difference at 95.0% confidence (50 runs)
open/close -6.07% +/- 1.075%
creat/unlink 27.83% +/- 0.522%
Open/close is a little faster, which should be due to one less atomic in the
dput common case. Creat/unlink is significantly slower, which is due to RCU
freeing inodes. We made a performance regression tradeoff of the same magnitude
when moving to RCU-freed dentries and files as well. Inode RCU is required to
reduce inode hash lookup locking and improve lock ordering, and also for
store-free path walking.

*** Let's take a look at this creat/unlink regression more closely. If we call
rdtsc around the creat/unlink loop and run it just once (so as to avoid much
of the RCU-induced problems):
vanilla: 5328 cycles
vfs: 5960 cycles (+11.8%)
Not so bad when RCU is not being stressed.
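
For reference, a minimal sketch of this kind of timing loop (my own
reconstruction, assuming x86 and gcc; the filename and iteration handling are
arbitrary and not the exact benchmark source):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Read the TSC directly; assumes x86 with a stable TSC. */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char **argv)
{
	/* A single pass avoids queueing up much RCU-deferred freeing. */
	long i, iters = argc > 1 ? atol(argv[1]) : 1;
	uint64_t start, end;

	start = rdtsc();
	for (i = 0; i < iters; i++) {
		int fd = open("tmpfile", O_CREAT | O_RDWR, 0600);
		close(fd);
		unlink("tmpfile");
	}
	end = rdtsc();
	printf("%llu cycles per creat/unlink\n",
	       (unsigned long long)((end - start) / iters));
	return 0;
}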

*** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
vanilla vfs
real 0m4.911s 0m0.183s
user 0m1.920s 0m1.610s
sys 4m58.670s 0m5.770s
After the vfs patches there is a 26x increase in throughput; however,
parallelism is limited by the test spawning and exit phases. The sys time
improvement is closer to 50x. Vanilla is bottlenecked on dcache_lock.

*** Google sockets (http://marc.info/?l=linux-kernel&m=123215942507568&w=2):
vanilla vfs
real 1m 7.774s 0m 3.245s
user 0m19.230s 0m36.750s
sys 71m41.310s 2m47.320s
do_exit path for the run took 24.755s 1.219s
After the vfs patches, there is a 20x increase in throughput for both the total
duration and the do_exit (teardown) time.

*** file-ops test (people.redhat.com/mingo/file-ops-test/file-ops-test.c)
Parallel open/close or creat/unlink in the same or different cwds within the
same ramfs mount. Relative throughput percentages are given at each parallelism
point, higher is better (a sketch of the test loop follows the discussion
below):

open/close vanilla vfs
same cwd
1 100.0 119.1
2 74.2 187.4
4 38.4 40.9
8 18.7 27.0
16 9.0 24.9
32 5.9 24.2
64 6.0 27.7
different cwd
1 100.0 119.1
2 133.0 238.6
4 21.2 488.6
8 19.7 932.6
16 18.8 1784.1
32 18.7 3469.5
64 19.0 2858.0

creat/unlink vanilla vfs
same cwd
1 100.0 75.0
2 44.1 41.8
4 28.7 24.6
8 16.5 14.2
16 8.7 8.9
32 5.5 7.8
64 5.9 7.4
different cwd
1 100.0 75.0
2 89.8 137.2
4 20.1 267.4
8 17.2 513.0
16 16.2 901.9
32 15.8 1724.0
64 17.3 1161.8

Note that at 64, we start using sibling threads on the CPU, making results jump
around a bit. The drop at 64 in different-cwd cases seems to be hitting an RCU
or slab allocator issue (or maybe it's just the SMT).

The scalability regression I was seeing in same-cwd tests is no longer there
(it has even improved now). It may still be present in some workloads doing
common-element path lookups. That could be solved by making d_count atomic
again, at the cost of more atomic ops in some cases, but scalability would
still be limited. So I prefer to do store-free path walking, which is much
more scalable.

In the different-cwd open/close case, the cost of bouncing cachelines over the
interconnect puts an absolute upper limit of 162K open/closes per second over
the entire machine in the vanilla kernel. After the vfs patches, it is around
30M. On larger and less well connected machines, that upper limit will only
get lower, while the vfs case should continue to go up (assuming the mm
subsystem can keep up).
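
For reference, a minimal sketch of the sort of per-process loop being measured
here (my own reconstruction, not the actual file-ops-test.c; the directory
naming and iteration count are arbitrary assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/wait.h>

/*
 * Fork nproc workers on a ramfs mount; each either stays in a shared cwd
 * ("same cwd" case) or chdirs into its own directory ("different cwd" case),
 * then open/closes a file in its cwd as fast as it can.
 */
int main(int argc, char **argv)
{
	int i, nproc = argc > 1 ? atoi(argv[1]) : 4;
	int same_cwd = argc > 2 && atoi(argv[2]);
	long n, iters = 1000000;

	for (i = 0; i < nproc; i++) {
		if (fork() == 0) {
			if (!same_cwd) {
				char dir[32];
				snprintf(dir, sizeof(dir), "d%d", i);
				mkdir(dir, 0700);
				chdir(dir);
			}
			for (n = 0; n < iters; n++) {
				int fd = open("f", O_CREAT | O_RDWR, 0600);
				close(fd);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}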

*** Reclaim
I have not done much reclaim testing yet. It should be more scalable and lower
latency due to a significant reduction in LRU locks interfering with other
critical sections in the inode/dentry code, and because we have per-zone locks.
Per-zone LRUs mean that reclaim is targeted at the correct zone, and that
kswapd will operate on lists of node-local memory objects.
n***@suse.de
2010-06-24 03:02:22 UTC
Add a new lock, dcache_lru_lock, to protect the dcache LRU lists and counters
from concurrent modification. d_lru is also protected by d_lock.

Move lru scanning out from underneath dcache_lock.
XXX Maybe do that in another patch

Signed-off-by: Nick Piggin <***@suse.de>

---
fs/dcache.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 85 insertions(+), 20 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -37,11 +37,19 @@

/*
* Usage:
- * dcache_hash_lock protects dcache hash table
+ * dcache_hash_lock protects:
+ * - the dcache hash table
+ * dcache_lru_lock protects:
+ * - the dcache lru lists and counters
+ * d_lock protects:
+ * - d_flags
+ * - d_name
+ * - d_lru
*
* Ordering:
* dcache_lock
* dentry->d_lock
+ * dcache_lru_lock
* dcache_hash_lock
*
* if (dentry1 < dentry2)
@@ -52,6 +60,7 @@ int sysctl_vfs_cache_pressure __read_mos
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

@@ -138,37 +147,56 @@ static void dentry_iput(struct dentry *
}

/*
- * dentry_lru_(add|add_tail|del|del_init) must be called with dcache_lock held.
+ * dentry_lru_(add|add_tail|del|del_init) must be called with d_lock held
+ * to protect list_empty(d_lru) condition.
*/
static void dentry_lru_add(struct dentry *dentry)
{
+ spin_lock(&dcache_lru_lock);
list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
dentry->d_sb->s_nr_dentry_unused++;
dentry_stat.nr_unused++;
+ spin_unlock(&dcache_lru_lock);
}

static void dentry_lru_add_tail(struct dentry *dentry)
{
+ spin_lock(&dcache_lru_lock);
list_add_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
dentry->d_sb->s_nr_dentry_unused++;
dentry_stat.nr_unused++;
+ spin_unlock(&dcache_lru_lock);
+}
+
+static void __dentry_lru_del(struct dentry *dentry)
+{
+ list_del(&dentry->d_lru);
+ dentry->d_sb->s_nr_dentry_unused--;
+ dentry_stat.nr_unused--;
+}
+
+static void __dentry_lru_del_init(struct dentry *dentry)
+{
+ list_del_init(&dentry->d_lru);
+ dentry->d_sb->s_nr_dentry_unused--;
+ dentry_stat.nr_unused--;
}

static void dentry_lru_del(struct dentry *dentry)
{
if (!list_empty(&dentry->d_lru)) {
- list_del(&dentry->d_lru);
- dentry->d_sb->s_nr_dentry_unused--;
- dentry_stat.nr_unused--;
+ spin_lock(&dcache_lru_lock);
+ __dentry_lru_del(dentry);
+ spin_unlock(&dcache_lru_lock);
}
}

static void dentry_lru_del_init(struct dentry *dentry)
{
if (likely(!list_empty(&dentry->d_lru))) {
- list_del_init(&dentry->d_lru);
- dentry->d_sb->s_nr_dentry_unused--;
- dentry_stat.nr_unused--;
+ spin_lock(&dcache_lru_lock);
+ __dentry_lru_del_init(dentry);
+ spin_unlock(&dcache_lru_lock);
}
}

@@ -179,6 +207,8 @@ static void dentry_lru_del_init(struct d
* The dentry must already be unhashed and removed from the LRU.
*
* If this is the root of the dentry tree, return NULL.
+ *
+ * dcache_lock and d_lock must be held by caller, are dropped by d_kill.
*/
static struct dentry *d_kill(struct dentry *dentry)
__releases(dentry->d_lock)
@@ -333,11 +363,19 @@ int d_invalidate(struct dentry * dentry)
EXPORT_SYMBOL(d_invalidate);

/* This should be called _only_ with dcache_lock held */
+static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
+{
+ atomic_inc(&dentry->d_count);
+ dentry_lru_del_init(dentry);
+ return dentry;
+}

static inline struct dentry * __dget_locked(struct dentry *dentry)
{
atomic_inc(&dentry->d_count);
+ spin_lock(&dentry->d_lock);
dentry_lru_del_init(dentry);
+ spin_unlock(&dentry->d_lock);
return dentry;
}

@@ -416,7 +454,7 @@ restart:
list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
spin_lock(&dentry->d_lock);
if (!atomic_read(&dentry->d_count)) {
- __dget_locked(dentry);
+ __dget_locked_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
@@ -449,17 +487,18 @@ static void prune_one_dentry(struct dent
* Prune ancestors. Locking is simpler than in dput(),
* because dcache_lock needs to be taken anyway.
*/
- spin_lock(&dcache_lock);
while (dentry) {
- if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock))
+ spin_lock(&dcache_lock);
+ if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock)) {
+ spin_unlock(&dcache_lock);
return;
+ }

if (dentry->d_op && dentry->d_op->d_delete)
dentry->d_op->d_delete(dentry);
dentry_lru_del_init(dentry);
__d_drop(dentry);
dentry = d_kill(dentry);
- spin_lock(&dcache_lock);
}
}

@@ -480,10 +519,11 @@ static void __shrink_dcache_sb(struct su

BUG_ON(!sb);
BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
- spin_lock(&dcache_lock);
if (count != NULL)
/* called from prune_dcache() and shrink_dcache_parent() */
cnt = *count;
+relock:
+ spin_lock(&dcache_lru_lock);
restart:
if (count == NULL)
list_splice_init(&sb->s_dentry_lru, &tmp);
@@ -493,7 +533,10 @@ restart:
struct dentry, d_lru);
BUG_ON(dentry->d_sb != sb);

- spin_lock(&dentry->d_lock);
+ if (!spin_trylock(&dentry->d_lock)) {
+ spin_unlock(&dcache_lru_lock);
+ goto relock;
+ }
/*
* If we are honouring the DCACHE_REFERENCED flag and
* the dentry has this flag set, don't free it. Clear
@@ -511,13 +554,22 @@ restart:
if (!cnt)
break;
}
- cond_resched_lock(&dcache_lock);
+ cond_resched_lock(&dcache_lru_lock);
}
}
+ spin_unlock(&dcache_lru_lock);
+
+ spin_lock(&dcache_lock);
+again:
+ spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
while (!list_empty(&tmp)) {
dentry = list_entry(tmp.prev, struct dentry, d_lru);
- dentry_lru_del_init(dentry);
- spin_lock(&dentry->d_lock);
+
+ if (!spin_trylock(&dentry->d_lock)) {
+ spin_unlock(&dcache_lru_lock);
+ goto again;
+ }
+ __dentry_lru_del_init(dentry);
/*
* We found an inuse dentry which was not removed from
* the LRU because of laziness during lookup. Do not free
@@ -527,17 +579,22 @@ restart:
spin_unlock(&dentry->d_lock);
continue;
}
+
+ spin_unlock(&dcache_lru_lock);
prune_one_dentry(dentry);
- /* dentry->d_lock was dropped in prune_one_dentry() */
- cond_resched_lock(&dcache_lock);
+ /* dcache_lock and dentry->d_lock dropped */
+ spin_lock(&dcache_lock);
+ spin_lock(&dcache_lru_lock);
}
+ spin_unlock(&dcache_lock);
+
if (count == NULL && !list_empty(&sb->s_dentry_lru))
goto restart;
if (count != NULL)
*count = cnt;
if (!list_empty(&referenced))
list_splice(&referenced, &sb->s_dentry_lru);
- spin_unlock(&dcache_lock);
+ spin_unlock(&dcache_lru_lock);
}

/**
@@ -645,7 +702,9 @@ static void shrink_dcache_for_umount_sub

/* detach this root from the system */
spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
dentry_lru_del_init(dentry);
+ spin_unlock(&dentry->d_lock);
__d_drop(dentry);
spin_unlock(&dcache_lock);

@@ -659,7 +718,9 @@ static void shrink_dcache_for_umount_sub
spin_lock(&dcache_lock);
list_for_each_entry(loop, &dentry->d_subdirs,
d_u.d_child) {
+ spin_lock(&loop->d_lock);
dentry_lru_del_init(loop);
+ spin_unlock(&loop->d_lock);
__d_drop(loop);
cond_resched_lock(&dcache_lock);
}
@@ -843,13 +904,17 @@ resume:
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;

+ spin_lock(&dentry->d_lock);
dentry_lru_del_init(dentry);
+ spin_unlock(&dentry->d_lock);
/*
* move only zero ref count dentries to the end
* of the unused list for prune_dcache
*/
if (!atomic_read(&dentry->d_count)) {
+ spin_lock(&dentry->d_lock);
dentry_lru_add_tail(dentry);
+ spin_unlock(&dentry->d_lock);
found++;
}



n***@suse.de
2010-06-24 03:02:21 UTC
Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 35 ++++++++++++++++++++++++-----------
include/linux/dcache.h | 3 +++
2 files changed, 27 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -35,12 +35,27 @@
#include <linux/hardirq.h>
#include "internal.h"

+/*
+ * Usage:
+ * dcache_hash_lock protects dcache hash table
+ *
+ * Ordering:
+ * dcache_lock
+ * dentry->d_lock
+ * dcache_hash_lock
+ *
+ * if (dentry1 < dentry2)
+ * dentry1->d_lock
+ * dentry2->d_lock
+ */
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

- __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

+EXPORT_SYMBOL(dcache_hash_lock);
EXPORT_SYMBOL(dcache_lock);

static struct kmem_cache *dentry_cache __read_mostly;
@@ -1480,17 +1495,20 @@ int d_validate(struct dentry *dentry, st
goto out;

spin_lock(&dcache_lock);
+ spin_lock(&dcache_hash_lock);
base = d_hash(dparent, dentry->d_name.hash);
hlist_for_each(lhp,base) {
/* hlist_for_each_entry_rcu() not required for d_hash list
* as it is parsed under dcache_lock
*/
if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
+ spin_unlock(&dcache_hash_lock);
__dget_locked(dentry);
spin_unlock(&dcache_lock);
return 1;
}
}
+ spin_unlock(&dcache_hash_lock);
spin_unlock(&dcache_lock);
out:
return 0;
@@ -1567,7 +1585,9 @@ void d_rehash(struct dentry * entry)
{
spin_lock(&dcache_lock);
spin_lock(&entry->d_lock);
+ spin_lock(&dcache_hash_lock);
_d_rehash(entry);
+ spin_unlock(&dcache_hash_lock);
spin_unlock(&entry->d_lock);
spin_unlock(&dcache_lock);
}
@@ -1647,8 +1667,6 @@ static void switch_names(struct dentry *
*/
static void d_move_locked(struct dentry * dentry, struct dentry * target)
{
- struct hlist_head *list;
-
if (!dentry->d_inode)
printk(KERN_WARNING "VFS: moving negative dcache entry\n");

@@ -1665,14 +1683,11 @@ static void d_move_locked(struct dentry
}

/* Move the dentry to the target hash queue, if on different bucket */
- if (d_unhashed(dentry))
- goto already_unhashed;
-
- hlist_del_rcu(&dentry->d_hash);
-
-already_unhashed:
- list = d_hash(target->d_parent, target->d_name.hash);
- __d_rehash(dentry, list);
+ spin_lock(&dcache_hash_lock);
+ if (!d_unhashed(dentry))
+ hlist_del_rcu(&dentry->d_hash);
+ __d_rehash(dentry, d_hash(target->d_parent, target->d_name.hash));
+ spin_unlock(&dcache_hash_lock);

/* Unhash the target: dput() will then get rid of it */
__d_drop(target);
@@ -1869,7 +1884,9 @@ struct dentry *d_materialise_unique(stru
found_lock:
spin_lock(&actual->d_lock);
found:
+ spin_lock(&dcache_hash_lock);
_d_rehash(actual);
+ spin_unlock(&dcache_hash_lock);
spin_unlock(&actual->d_lock);
spin_unlock(&dcache_lock);
out_nolock:
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -188,6 +188,7 @@ d_iput: no no no yes

#define DCACHE_CANT_MOUNT 0x0100

+extern spinlock_t dcache_hash_lock;
extern spinlock_t dcache_lock;
extern seqlock_t rename_lock;

@@ -211,7 +212,9 @@ static inline void __d_drop(struct dentr
{
if (!(dentry->d_flags & DCACHE_UNHASHED)) {
dentry->d_flags |= DCACHE_UNHASHED;
+ spin_lock(&dcache_hash_lock);
hlist_del_rcu(&dentry->d_hash);
+ spin_unlock(&dcache_hash_lock);
}
}



n***@suse.de
2010-06-24 03:02:24 UTC
Make d_count non-atomic and protect it with d_lock. This allows us to ensure
that a dentry with a zero refcount remains at zero without holding dcache_lock.
It is also fairly natural once we start protecting many other dentry members
with d_lock.
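
The reference-taking pattern this introduces looks like the following (a
condensed illustration of the dget() change further down in this patch, not
extra code):

/* Take a reference: d_count is now bumped under d_lock. */
static inline struct dentry *dget(struct dentry *dentry)
{
	if (dentry) {
		spin_lock(&dentry->d_lock);
		BUG_ON(!dentry->d_count);
		dentry->d_count++;
		spin_unlock(&dentry->d_lock);
	}
	return dentry;
}

Callers that already hold d_lock use the dget_dlock() variant instead.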

Signed-off-by: Nick Piggin <***@suse.de>
---
arch/powerpc/platforms/cell/spufs/inode.c | 2
drivers/infiniband/hw/ipath/ipath_fs.c | 2
fs/autofs4/expire.c | 8 +-
fs/autofs4/root.c | 6 -
fs/coda/dir.c | 2
fs/configfs/dir.c | 3
fs/configfs/inode.c | 2
fs/dcache.c | 107 ++++++++++++++++++++++--------
fs/ecryptfs/inode.c | 2
fs/exportfs/expfs.c | 9 ++
fs/hpfs/namei.c | 2
fs/locks.c | 2
fs/namei.c | 2
fs/nfs/dir.c | 12 +--
fs/nfsd/vfs.c | 5 -
fs/notify/fsnotify.c | 11 ++-
fs/notify/inotify/inotify.c | 14 +++
fs/smbfs/dir.c | 8 +-
fs/smbfs/proc.c | 8 +-
include/linux/dcache.h | 29 ++++----
kernel/cgroup.c | 2
21 files changed, 162 insertions(+), 76 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -45,6 +45,7 @@
* - d_flags
* - d_name
* - d_lru
+ * - d_count
*
* Ordering:
* dcache_lock
@@ -112,6 +113,7 @@ static void d_callback(struct rcu_head *
static void d_free(struct dentry *dentry)
{
atomic_dec(&dentry_stat.nr_dentry);
+ BUG_ON(dentry->d_count);
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
/* if dentry was never inserted into hash, immediate free is OK */
@@ -263,13 +265,23 @@ void dput(struct dentry *dentry)
return;

repeat:
- if (atomic_read(&dentry->d_count) == 1)
+ if (dentry->d_count == 1)
might_sleep();
- if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
- return;
-
spin_lock(&dentry->d_lock);
- if (atomic_read(&dentry->d_count)) {
+ if (dentry->d_count == 1) {
+ if (!spin_trylock(&dcache_lock)) {
+ /*
+ * Something of a livelock possibility we could avoid
+ * by taking dcache_lock and trying again, but we
+ * want to reduce dcache_lock anyway so this will
+ * get improved.
+ */
+ spin_unlock(&dentry->d_lock);
+ goto repeat;
+ }
+ }
+ dentry->d_count--;
+ if (dentry->d_count) {
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return;
@@ -347,7 +359,7 @@ int d_invalidate(struct dentry * dentry)
* working directory or similar).
*/
spin_lock(&dentry->d_lock);
- if (atomic_read(&dentry->d_count) > 1) {
+ if (dentry->d_count > 1) {
if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
@@ -362,29 +374,60 @@ int d_invalidate(struct dentry * dentry)
}
EXPORT_SYMBOL(d_invalidate);

-/* This should be called _only_ with dcache_lock held */
+/* This must be called with dcache_lock and d_lock held */
static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
{
- atomic_inc(&dentry->d_count);
+ dentry->d_count++;
dentry_lru_del_init(dentry);
return dentry;
}

+/* This should be called _only_ with dcache_lock held */
static inline struct dentry * __dget_locked(struct dentry *dentry)
{
- atomic_inc(&dentry->d_count);
spin_lock(&dentry->d_lock);
- dentry_lru_del_init(dentry);
+ __dget_locked_dlock(dentry);
spin_unlock(&dentry->d_lock);
return dentry;
}

+struct dentry * dget_locked_dlock(struct dentry *dentry)
+{
+ return __dget_locked_dlock(dentry);
+}
+
struct dentry * dget_locked(struct dentry *dentry)
{
return __dget_locked(dentry);
}
EXPORT_SYMBOL(dget_locked);

+struct dentry *dget_parent(struct dentry *dentry)
+{
+ struct dentry *ret;
+
+repeat:
+ spin_lock(&dentry->d_lock);
+ ret = dentry->d_parent;
+ if (!ret)
+ goto out;
+ if (dentry == ret) {
+ ret->d_count++;
+ goto out;
+ }
+ if (!spin_trylock(&ret->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto repeat;
+ }
+ BUG_ON(!ret->d_count);
+ ret->d_count++;
+ spin_unlock(&ret->d_lock);
+out:
+ spin_unlock(&dentry->d_lock);
+ return ret;
+}
+EXPORT_SYMBOL(dget_parent);
+
/**
* d_find_alias - grab a hashed alias of inode
* @inode: inode in question
@@ -453,7 +496,7 @@ restart:
spin_lock(&dcache_lock);
list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
spin_lock(&dentry->d_lock);
- if (!atomic_read(&dentry->d_count)) {
+ if (!dentry->d_count) {
__dget_locked_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
@@ -489,7 +532,10 @@ static void prune_one_dentry(struct dent
*/
while (dentry) {
spin_lock(&dcache_lock);
- if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock)) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_count--;
+ if (dentry->d_count) {
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return;
}
@@ -575,7 +621,7 @@ again:
* the LRU because of laziness during lookup. Do not free
* it - just keep it off the LRU list.
*/
- if (atomic_read(&dentry->d_count)) {
+ if (dentry->d_count) {
spin_unlock(&dentry->d_lock);
continue;
}
@@ -736,7 +782,7 @@ static void shrink_dcache_for_umount_sub
do {
struct inode *inode;

- if (atomic_read(&dentry->d_count) != 0) {
+ if (dentry->d_count != 0) {
printk(KERN_ERR
"BUG: Dentry %p{i=%lx,n=%s}"
" still in use (%d)"
@@ -745,7 +791,7 @@ static void shrink_dcache_for_umount_sub
dentry->d_inode ?
dentry->d_inode->i_ino : 0UL,
dentry->d_name.name,
- atomic_read(&dentry->d_count),
+ dentry->d_count,
dentry->d_sb->s_type->name,
dentry->d_sb->s_id);
BUG();
@@ -755,7 +801,9 @@ static void shrink_dcache_for_umount_sub
parent = NULL;
else {
parent = dentry->d_parent;
- atomic_dec(&parent->d_count);
+ spin_lock(&parent->d_lock);
+ parent->d_count--;
+ spin_unlock(&parent->d_lock);
}

list_del(&dentry->d_u.d_child);
@@ -810,7 +858,9 @@ void shrink_dcache_for_umount(struct sup

dentry = sb->s_root;
sb->s_root = NULL;
- atomic_dec(&dentry->d_count);
+ spin_lock(&dentry->d_lock);
+ dentry->d_count--;
+ spin_unlock(&dentry->d_lock);
shrink_dcache_for_umount_subtree(dentry);

while (!hlist_empty(&sb->s_anon)) {
@@ -903,17 +953,15 @@ resume:

spin_lock(&dentry->d_lock);
dentry_lru_del_init(dentry);
- spin_unlock(&dentry->d_lock);
/*
* move only zero ref count dentries to the end
* of the unused list for prune_dcache
*/
- if (!atomic_read(&dentry->d_count)) {
- spin_lock(&dentry->d_lock);
+ if (!dentry->d_count) {
dentry_lru_add_tail(dentry);
- spin_unlock(&dentry->d_lock);
found++;
}
+ spin_unlock(&dentry->d_lock);

/*
* We can return to the caller if we have found some (this
@@ -1023,7 +1071,7 @@ struct dentry *d_alloc(struct dentry * p
memcpy(dname, name->name, name->len);
dname[name->len] = 0;

- atomic_set(&dentry->d_count, 1);
+ dentry->d_count = 1;
dentry->d_flags = DCACHE_UNHASHED;
spin_lock_init(&dentry->d_lock);
dentry->d_inode = NULL;
@@ -1497,7 +1545,7 @@ struct dentry * __d_lookup(struct dentry
goto next;
}

- atomic_inc(&dentry->d_count);
+ dentry->d_count++;
found = dentry;
spin_unlock(&dentry->d_lock);
break;
@@ -1558,6 +1606,7 @@ int d_validate(struct dentry *dentry, st
goto out;

spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
spin_lock(&dcache_hash_lock);
base = d_hash(dparent, dentry->d_name.hash);
hlist_for_each(lhp,base) {
@@ -1566,12 +1615,14 @@ int d_validate(struct dentry *dentry, st
*/
if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
spin_unlock(&dcache_hash_lock);
- __dget_locked(dentry);
+ __dget_locked_dlock(dentry);
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return 1;
}
}
spin_unlock(&dcache_hash_lock);
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
out:
return 0;
@@ -1608,7 +1659,7 @@ void d_delete(struct dentry * dentry)
spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
isdir = S_ISDIR(dentry->d_inode->i_mode);
- if (atomic_read(&dentry->d_count) == 1) {
+ if (dentry->d_count == 1) {
dentry->d_flags &= ~DCACHE_CANT_MOUNT;
dentry_iput(dentry);
fsnotify_nameremove(dentry, isdir);
@@ -2314,11 +2365,15 @@ resume:
this_parent = dentry;
goto repeat;
}
- atomic_dec(&dentry->d_count);
+ spin_lock(&dentry->d_lock);
+ dentry->d_count--;
+ spin_unlock(&dentry->d_lock);
}
if (this_parent != root) {
next = this_parent->d_u.d_child.next;
- atomic_dec(&this_parent->d_count);
+ spin_lock(&this_parent->d_lock);
+ this_parent->d_count--;
+ spin_unlock(&this_parent->d_lock);
this_parent = this_parent->d_parent;
goto resume;
}
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -87,7 +87,7 @@ full_name_hash(const unsigned char *name
#endif

struct dentry {
- atomic_t d_count;
+ unsigned int d_count; /* protected by d_lock */
unsigned int d_flags; /* protected by d_lock */
spinlock_t d_lock; /* per dentry lock */
int d_mounted;
@@ -334,17 +334,28 @@ extern char *dentry_path(struct dentry *
* needs and they take necessary precautions) you should hold dcache_lock
* and call dget_locked() instead of dget().
*/
-
+static inline struct dentry *dget_dlock(struct dentry *dentry)
+{
+ if (dentry) {
+ BUG_ON(!dentry->d_count);
+ dentry->d_count++;
+ }
+ return dentry;
+}
static inline struct dentry *dget(struct dentry *dentry)
{
if (dentry) {
- BUG_ON(!atomic_read(&dentry->d_count));
- atomic_inc(&dentry->d_count);
+ spin_lock(&dentry->d_lock);
+ dget_dlock(dentry);
+ spin_unlock(&dentry->d_lock);
}
return dentry;
}

extern struct dentry * dget_locked(struct dentry *);
+extern struct dentry * dget_locked_dlock(struct dentry *);
+
+extern struct dentry *dget_parent(struct dentry *dentry);

/**
* d_unhashed - is dentry hashed
@@ -375,16 +386,6 @@ static inline void dont_mount(struct den
spin_unlock(&dentry->d_lock);
}

-static inline struct dentry *dget_parent(struct dentry *dentry)
-{
- struct dentry *ret;
-
- spin_lock(&dentry->d_lock);
- ret = dget(dentry->d_parent);
- spin_unlock(&dentry->d_lock);
- return ret;
-}
-
extern void dput(struct dentry *);

static inline int d_mountpoint(struct dentry *dentry)
Index: linux-2.6/fs/configfs/dir.c
===================================================================
--- linux-2.6.orig/fs/configfs/dir.c
+++ linux-2.6/fs/configfs/dir.c
@@ -399,8 +399,7 @@ static void remove_dir(struct dentry * d
if (d->d_inode)
simple_rmdir(parent->d_inode,d);

- pr_debug(" o %s removing done (%d)\n",d->d_name.name,
- atomic_read(&d->d_count));
+ pr_debug(" o %s removing done (%d)\n",d->d_name.name, d->d_count);

dput(parent);
}
Index: linux-2.6/fs/locks.c
===================================================================
--- linux-2.6.orig/fs/locks.c
+++ linux-2.6/fs/locks.c
@@ -1375,7 +1375,7 @@ int generic_setlease(struct file *filp,
if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
goto out;
if ((arg == F_WRLCK)
- && ((atomic_read(&dentry->d_count) > 1)
+ && (dentry->d_count > 1
|| (atomic_read(&inode->i_count) > 1)))
goto out;
}
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2154,7 +2154,7 @@ void dentry_unhash(struct dentry *dentry
shrink_dcache_parent(dentry);
spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
- if (atomic_read(&dentry->d_count) == 2)
+ if (dentry->d_count == 2)
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
Index: linux-2.6/fs/autofs4/expire.c
===================================================================
--- linux-2.6.orig/fs/autofs4/expire.c
+++ linux-2.6/fs/autofs4/expire.c
@@ -198,7 +198,7 @@ static int autofs4_tree_busy(struct vfsm
else
ino_count++;

- if (atomic_read(&p->d_count) > ino_count) {
+ if (p->d_count > ino_count) {
top_ino->last_used = jiffies;
dput(p);
return 1;
@@ -347,7 +347,7 @@ struct dentry *autofs4_expire_indirect(s

/* Path walk currently on this dentry? */
ino_count = atomic_read(&ino->count) + 2;
- if (atomic_read(&dentry->d_count) > ino_count)
+ if (dentry->d_count > ino_count)
goto next;

/* Can we umount this guy */
@@ -369,7 +369,7 @@ struct dentry *autofs4_expire_indirect(s
if (!exp_leaves) {
/* Path walk currently on this dentry? */
ino_count = atomic_read(&ino->count) + 1;
- if (atomic_read(&dentry->d_count) > ino_count)
+ if (dentry->d_count > ino_count)
goto next;

if (!autofs4_tree_busy(mnt, dentry, timeout, do_now)) {
@@ -383,7 +383,7 @@ struct dentry *autofs4_expire_indirect(s
} else {
/* Path walk currently on this dentry? */
ino_count = atomic_read(&ino->count) + 1;
- if (atomic_read(&dentry->d_count) > ino_count)
+ if (dentry->d_count > ino_count)
goto next;

expired = autofs4_check_leaves(mnt, dentry, timeout, do_now);
Index: linux-2.6/fs/coda/dir.c
===================================================================
--- linux-2.6.orig/fs/coda/dir.c
+++ linux-2.6/fs/coda/dir.c
@@ -612,7 +612,7 @@ static int coda_dentry_revalidate(struct
if (cii->c_flags & C_FLUSH)
coda_flag_inode_children(inode, C_FLUSH);

- if (atomic_read(&de->d_count) > 1)
+ if (de->d_count > 1)
/* pretend it's valid, but don't change the flags */
goto out;

Index: linux-2.6/fs/ecryptfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/inode.c
+++ linux-2.6/fs/ecryptfs/inode.c
@@ -255,7 +255,7 @@ int ecryptfs_lookup_and_interpose_lower(
ecryptfs_dentry->d_parent));
lower_inode = lower_dentry->d_inode;
fsstack_copy_attr_atime(ecryptfs_dir_inode, lower_dir_dentry->d_inode);
- BUG_ON(!atomic_read(&lower_dentry->d_count));
+ BUG_ON(!lower_dentry->d_count);
ecryptfs_set_dentry_private(ecryptfs_dentry,
kmem_cache_alloc(ecryptfs_dentry_info_cache,
GFP_KERNEL));
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -1357,7 +1357,7 @@ static int nfs_sillyrename(struct inode

dfprintk(VFS, "NFS: silly-rename(%s/%s, ct=%d)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
- atomic_read(&dentry->d_count));
+ dentry->d_count);
nfs_inc_stats(dir, NFSIOS_SILLYRENAME);

/*
@@ -1466,7 +1466,7 @@ static int nfs_unlink(struct inode *dir,

spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
- if (atomic_read(&dentry->d_count) > 1) {
+ if (dentry->d_count > 1) {
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
/* Start asynchronous writeout of the inode */
@@ -1614,7 +1614,7 @@ static int nfs_rename(struct inode *old_
dfprintk(VFS, "NFS: rename(%s/%s -> %s/%s, ct=%d)\n",
old_dentry->d_parent->d_name.name, old_dentry->d_name.name,
new_dentry->d_parent->d_name.name, new_dentry->d_name.name,
- atomic_read(&new_dentry->d_count));
+ new_dentry->d_count);

/*
* For non-directories, check whether the target is busy and if so,
@@ -1632,7 +1632,7 @@ static int nfs_rename(struct inode *old_
rehash = new_dentry;
}

- if (atomic_read(&new_dentry->d_count) > 2) {
+ if (new_dentry->d_count > 2) {
int err;

/* copy the target dentry's name */
@@ -1655,7 +1655,7 @@ static int nfs_rename(struct inode *old_
/*
* ... prune child dentries and writebacks if needed.
*/
- if (atomic_read(&old_dentry->d_count) > 1) {
+ if (old_dentry->d_count > 1) {
if (S_ISREG(old_inode->i_mode))
nfs_wb_all(old_inode);
shrink_dcache_parent(old_dentry);
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c
+++ linux-2.6/fs/nfsd/vfs.c
@@ -1752,8 +1752,7 @@ nfsd_rename(struct svc_rqst *rqstp, stru
goto out_dput_new;

if (svc_msnfs(ffhp) &&
- ((atomic_read(&odentry->d_count) > 1)
- || (atomic_read(&ndentry->d_count) > 1))) {
+ ((odentry->d_count > 1) || (ndentry->d_count > 1))) {
host_err = -EPERM;
goto out_dput_new;
}
@@ -1839,7 +1838,7 @@ nfsd_unlink(struct svc_rqst *rqstp, stru
if (type != S_IFDIR) { /* It's UNLINK */
#ifdef MSNFS
if ((fhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
- (atomic_read(&rdentry->d_count) > 1)) {
+ (rdentry->d_count > 1)) {
host_err = -EPERM;
} else
#endif
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -74,12 +74,19 @@ static struct dentry *
find_disconnected_root(struct dentry *dentry)
{
dget(dentry);
+again:
spin_lock(&dentry->d_lock);
while (!IS_ROOT(dentry) &&
(dentry->d_parent->d_flags & DCACHE_DISCONNECTED)) {
struct dentry *parent = dentry->d_parent;
- dget(parent);
+
+ if (!spin_trylock(&parent->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto again;
+ }
+ dget_dlock(parent);
spin_unlock(&dentry->d_lock);
+ spin_unlock(&parent->d_lock);
dput(dentry);
dentry = parent;
spin_lock(&dentry->d_lock);
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -335,18 +335,28 @@ void inotify_dentry_parent_queue_event(s
if (!(dentry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED))
return;

+again:
spin_lock(&dentry->d_lock);
parent = dentry->d_parent;
+ if (parent != dentry && !spin_trylock(&parent->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto again;
+ }
inode = parent->d_inode;

if (inotify_inode_watched(inode)) {
- dget(parent);
+ dget_dlock(parent);
spin_unlock(&dentry->d_lock);
+ if (parent != dentry)
+ spin_unlock(&parent->d_lock);
inotify_inode_queue_event(inode, mask, cookie, name,
dentry->d_inode);
dput(parent);
- } else
+ } else {
spin_unlock(&dentry->d_lock);
+ if (parent != dentry)
+ spin_unlock(&parent->d_lock);
+ }
}
EXPORT_SYMBOL_GPL(inotify_dentry_parent_queue_event);

Index: linux-2.6/fs/smbfs/dir.c
===================================================================
--- linux-2.6.orig/fs/smbfs/dir.c
+++ linux-2.6/fs/smbfs/dir.c
@@ -406,6 +406,7 @@ void
smb_renew_times(struct dentry * dentry)
{
dget(dentry);
+again:
spin_lock(&dentry->d_lock);
for (;;) {
struct dentry *parent;
@@ -414,8 +415,13 @@ smb_renew_times(struct dentry * dentry)
if (IS_ROOT(dentry))
break;
parent = dentry->d_parent;
- dget(parent);
+ if (!spin_trylock(&parent->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto again;
+ }
+ dget_dlock(parent);
spin_unlock(&dentry->d_lock);
+ spin_unlock(&parent->d_lock);
dput(dentry);
dentry = parent;
spin_lock(&dentry->d_lock);
Index: linux-2.6/fs/smbfs/proc.c
===================================================================
--- linux-2.6.orig/fs/smbfs/proc.c
+++ linux-2.6/fs/smbfs/proc.c
@@ -332,6 +332,7 @@ static int smb_build_path(struct smb_sb_
* and store it in reversed order [see reverse_string()]
*/
dget(entry);
+again:
spin_lock(&entry->d_lock);
while (!IS_ROOT(entry)) {
struct dentry *parent;
@@ -350,6 +351,7 @@ static int smb_build_path(struct smb_sb_
dput(entry);
return len;
}
+
reverse_string(path, len);
path += len;
if (unicode) {
@@ -361,7 +363,11 @@ static int smb_build_path(struct smb_sb_
maxlen -= len+1;

parent = entry->d_parent;
- dget(parent);
+ if (!spin_trylock(&parent->d_lock)) {
+ spin_unlock(&entry->d_lock);
+ goto again;
+ }
+ dget_dlock(parent);
spin_unlock(&entry->d_lock);
dput(entry);
entry = parent;
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -3545,9 +3545,7 @@ again:
list_del(&cgrp->sibling);
cgroup_unlock_hierarchy(cgrp->root);

- spin_lock(&cgrp->dentry->d_lock);
d = dget(cgrp->dentry);
- spin_unlock(&d->d_lock);

cgroup_d_remove_dir(d);
dput(d);
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
@@ -161,7 +161,7 @@ static void spufs_prune_dir(struct dentr
spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
if (!(d_unhashed(dentry)) && dentry->d_inode) {
- dget_locked(dentry);
+ dget_locked_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
simple_unlink(dir->d_inode, dentry);
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_fs.c
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -276,7 +276,7 @@ static int remove_file(struct dentry *pa
spin_lock(&dcache_lock);
spin_lock(&tmp->d_lock);
if (!(d_unhashed(tmp) && tmp->d_inode)) {
- dget_locked(tmp);
+ dget_locked_dlock(tmp);
__d_drop(tmp);
spin_unlock(&tmp->d_lock);
spin_unlock(&dcache_lock);
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -252,7 +252,7 @@ void configfs_drop_dentry(struct configf
spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
if (!(d_unhashed(dentry) && dentry->d_inode)) {
- dget_locked(dentry);
+ dget_locked_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
Index: linux-2.6/fs/notify/fsnotify.c
===================================================================
--- linux-2.6.orig/fs/notify/fsnotify.c
+++ linux-2.6/fs/notify/fsnotify.c
@@ -88,13 +88,18 @@ void __fsnotify_parent(struct dentry *de
if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
return;

+again:
spin_lock(&dentry->d_lock);
parent = dentry->d_parent;
+ if (parent != dentry && !spin_trylock(&parent->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto again;
+ }
p_inode = parent->d_inode;

if (fsnotify_inode_watches_children(p_inode)) {
if (p_inode->i_fsnotify_mask & mask) {
- dget(parent);
+ dget_dlock(parent);
send = true;
}
} else {
@@ -104,11 +109,13 @@ void __fsnotify_parent(struct dentry *de
* children and update their d_flags to let them know p_inode
* doesn't care about them any more.
*/
- dget(parent);
+ dget_dlock(parent);
should_update_children = true;
}

spin_unlock(&dentry->d_lock);
+ if (parent != dentry)
+ spin_unlock(&parent->d_lock);

if (send) {
/* we are notifying a parent so come up with the new mask which
Index: linux-2.6/fs/ceph/dir.c
===================================================================
--- linux-2.6.orig/fs/ceph/dir.c
+++ linux-2.6/fs/ceph/dir.c
@@ -149,7 +149,9 @@ more:
di = ceph_dentry(dentry);
}

- atomic_inc(&dentry->d_count);
+ spin_lock(&dentry->d_lock);
+ dentry->d_count++;
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
spin_unlock(&inode->i_lock);

Index: linux-2.6/fs/ceph/inode.c
===================================================================
--- linux-2.6.orig/fs/ceph/inode.c
+++ linux-2.6/fs/ceph/inode.c
@@ -863,8 +863,8 @@ static struct dentry *splice_dentry(stru
} else if (realdn) {
dout("dn %p (%d) spliced with %p (%d) "
"inode %p ino %llx.%llx\n",
- dn, atomic_read(&dn->d_count),
- realdn, atomic_read(&realdn->d_count),
+ dn, dn->d_count,
+ realdn, realdn->d_count,
realdn->d_inode, ceph_vinop(realdn->d_inode));
dput(dn);
dn = realdn;
Index: linux-2.6/fs/ceph/mds_client.c
===================================================================
--- linux-2.6.orig/fs/ceph/mds_client.c
+++ linux-2.6/fs/ceph/mds_client.c
@@ -1371,7 +1371,7 @@ retry:
*base = ceph_ino(temp->d_inode);
*plen = len;
dout("build_path on %p %d built %llx '%.*s'\n",
- dentry, atomic_read(&dentry->d_count), *base, len, path);
+ dentry, dentry->d_count, *base, len, path);
return path;
}

Index: linux-2.6/drivers/infiniband/hw/qib/qib_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/qib/qib_fs.c
+++ linux-2.6/drivers/infiniband/hw/qib/qib_fs.c
@@ -454,7 +454,7 @@ static int remove_file(struct dentry *pa
spin_lock(&dcache_lock);
spin_lock(&tmp->d_lock);
if (!(d_unhashed(tmp) && tmp->d_inode)) {
- dget_locked(tmp);
+ dget_locked_dlock(tmp);
__d_drop(tmp);
spin_unlock(&tmp->d_lock);
spin_unlock(&dcache_lock);


n***@suse.de
2010-06-24 03:03:02 UTC
Allow shrinkers to do per-zone shrinking. This means the shrinker is called for
each zone scanned. The shrinker is now completely responsible for calculating
and batching (given helpers), which provides better flexibility.

The number of objects to scan is found by scaling by the ratio of pagecache
objects scanned. By passing down both the per-zone and the global reclaimable
page counts, both per-zone caches and global caches can be scaled correctly.

Finally, add some fixed-point scaling to the ratio, which helps the
calculations.
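
To illustrate the new contract, here is a condensed sketch of a shrinker under
this interface, modelled on the shrink_icache_memory() conversion below (the
my_cache_* names are placeholders, not real functions):

static int my_cache_shrink(struct zone *zone, unsigned long scanned,
		unsigned long total, unsigned long global, gfp_t gfp_mask)
{
	static unsigned long nr_to_scan;
	unsigned long nr;

	/* Scale the scan target by the pagecache scan ratio (fixed-point). */
	shrinker_add_scan(&nr_to_scan, scanned, global, my_cache_nr_unused(),
			DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
	if (!(gfp_mask & __GFP_FS))
		return 0;

	/* The shrinker itself batches the work in SHRINK_BATCH chunks. */
	while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
		my_cache_prune(zone, nr);
		cond_resched();
	}
	return 0;
}

static struct shrinker my_cache_shrinker = {
	.shrink = my_cache_shrink,
};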

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 2
fs/drop_caches.c | 2
fs/inode.c | 2
fs/mbcache.c | 4 -
fs/nfs/dir.c | 2
fs/nfs/internal.h | 2
fs/quota/dquot.c | 2
include/linux/mm.h | 6 +-
mm/vmscan.c | 131 ++++++++++++++---------------------------------------
9 files changed, 47 insertions(+), 106 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -999,16 +999,19 @@ static inline void sync_mm_rss(struct ta
* querying the cache size, so a fastpath for that case is appropriate.
*/
struct shrinker {
- int (*shrink)(int nr_to_scan, gfp_t gfp_mask);
- int seeks; /* seeks to recreate an obj */
-
+ int (*shrink)(struct zone *zone, unsigned long scanned, unsigned long total,
+ unsigned long global, gfp_t gfp_mask);
/* These are for internal use */
struct list_head list;
- long nr; /* objs pending delete */
};
-#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
+#define DEFAULT_SEEKS (128UL*2) /* A good number if you don't know better. */
+#define SHRINK_BATCH 128 /* A good number if you don't know better */
extern void register_shrinker(struct shrinker *);
extern void unregister_shrinker(struct shrinker *);
+extern void shrinker_add_scan(unsigned long *dst,
+ unsigned long scanned, unsigned long total,
+ unsigned long objects, unsigned int ratio);
+extern unsigned long shrinker_do_scan(unsigned long *dst, unsigned long batch);

int vma_wants_writenotify(struct vm_area_struct *vma);

@@ -1422,8 +1425,7 @@ int in_gate_area_no_task(unsigned long a

int drop_caches_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
-unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages);
+void shrink_all_slab(void);

#ifndef CONFIG_MMU
#define randomize_va_space 0
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -160,7 +160,6 @@ static unsigned long zone_nr_lru_pages(s
*/
void register_shrinker(struct shrinker *shrinker)
{
- shrinker->nr = 0;
down_write(&shrinker_rwsem);
list_add_tail(&shrinker->list, &shrinker_list);
up_write(&shrinker_rwsem);
@@ -178,7 +177,38 @@ void unregister_shrinker(struct shrinker
}
EXPORT_SYMBOL(unregister_shrinker);

-#define SHRINK_BATCH 128
+void shrinker_add_scan(unsigned long *dst,
+ unsigned long scanned, unsigned long total,
+ unsigned long objects, unsigned int ratio)
+{
+ unsigned long long delta;
+
+ delta = (unsigned long long)scanned * objects * ratio;
+ do_div(delta, total + 1);
+ delta /= (128ULL / 4ULL);
+
+ /*
+ * Avoid risking looping forever due to too large nr value:
+ * never try to free more than twice the estimate number of
+ * freeable entries.
+ */
+ *dst += delta;
+
+ if (*dst > objects)
+ *dst = objects;
+}
+EXPORT_SYMBOL(shrinker_add_scan);
+
+unsigned long shrinker_do_scan(unsigned long *dst, unsigned long batch)
+{
+ unsigned long nr = ACCESS_ONCE(*dst);
+ if (nr < batch)
+ return 0;
+ *dst = nr - batch;
+ return batch;
+}
+EXPORT_SYMBOL(shrinker_do_scan);
+
/*
* Call the shrink functions to age shrinkable caches
*
@@ -198,8 +228,8 @@ EXPORT_SYMBOL(unregister_shrinker);
*
* Returns the number of slab objects which we shrunk.
*/
-unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages)
+static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsigned long total,
+ unsigned long global, gfp_t gfp_mask)
{
struct shrinker *shrinker;
unsigned long ret = 0;
@@ -211,55 +241,25 @@ unsigned long shrink_slab(unsigned long
return 1; /* Assume we'll be able to shrink next time */

list_for_each_entry(shrinker, &shrinker_list, list) {
- unsigned long long delta;
- unsigned long total_scan;
- unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask);
-
- delta = (4 * scanned) / shrinker->seeks;
- delta *= max_pass;
- do_div(delta, lru_pages + 1);
- shrinker->nr += delta;
- if (shrinker->nr < 0) {
- printk(KERN_ERR "shrink_slab: %pF negative objects to "
- "delete nr=%ld\n",
- shrinker->shrink, shrinker->nr);
- shrinker->nr = max_pass;
- }
-
- /*
- * Avoid risking looping forever due to too large nr value:
- * never try to free more than twice the estimate number of
- * freeable entries.
- */
- if (shrinker->nr > max_pass * 2)
- shrinker->nr = max_pass * 2;
-
- total_scan = shrinker->nr;
- shrinker->nr = 0;
-
- while (total_scan >= SHRINK_BATCH) {
- long this_scan = SHRINK_BATCH;
- int shrink_ret;
- int nr_before;
-
- nr_before = (*shrinker->shrink)(0, gfp_mask);
- shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
- if (shrink_ret == -1)
- break;
- if (shrink_ret < nr_before)
- ret += nr_before - shrink_ret;
- count_vm_events(SLABS_SCANNED, this_scan);
- total_scan -= this_scan;
-
- cond_resched();
- }
-
- shrinker->nr += total_scan;
+ (*shrinker->shrink)(zone, scanned, total, global, gfp_mask);
}
up_read(&shrinker_rwsem);
return ret;
}

+void shrink_all_slab(void)
+{
+ struct zone *zone;
+ unsigned long nr;
+
+again:
+ nr = 0;
+ for_each_zone(zone)
+ nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
+ if (nr >= 10)
+ goto again;
+}
+
static inline int is_page_cache_freeable(struct page *page)
{
/*
@@ -1660,18 +1660,23 @@ static void set_lumpy_reclaim_mode(int p
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
static void shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+ struct scan_control *sc, unsigned long global_lru_pages)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
enum lru_list l;
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+ unsigned long nr_scanned = sc->nr_scanned;
+ unsigned long lru_pages = 0;

get_scan_count(zone, sc, nr, priority);

set_lumpy_reclaim_mode(priority, sc);

+ if (scanning_global_lru(sc))
+ lru_pages = zone_reclaimable_pages(zone);
+
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
@@ -1696,8 +1701,6 @@ static void shrink_zone(int priority, st
break;
}

- sc->nr_reclaimed = nr_reclaimed;
-
/*
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
@@ -1705,6 +1708,23 @@ static void shrink_zone(int priority, st
if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

+ /*
+ * Don't shrink slabs when reclaiming memory from
+ * over limit cgroups
+ */
+ if (scanning_global_lru(sc)) {
+ struct reclaim_state *reclaim_state = current->reclaim_state;
+
+ shrink_slab(zone, sc->nr_scanned - nr_scanned,
+ lru_pages, global_lru_pages, sc->gfp_mask);
+ if (reclaim_state) {
+ nr_reclaimed += reclaim_state->reclaimed_slab;
+ reclaim_state->reclaimed_slab = 0;
+ }
+ }
+
+ sc->nr_reclaimed = nr_reclaimed;
+
throttle_vm_writeout(sc->gfp_mask);
}

@@ -1725,7 +1745,7 @@ static void shrink_zone(int priority, st
* scan then give up on it.
*/
static bool shrink_zones(int priority, struct zonelist *zonelist,
- struct scan_control *sc)
+ struct scan_control *sc, unsigned long global_lru_pages)
{
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
struct zoneref *z;
@@ -1756,7 +1776,7 @@ static bool shrink_zones(int priority, s
priority);
}

- shrink_zone(priority, zone, sc);
+ shrink_zone(priority, zone, sc, global_lru_pages);
all_unreclaimable = false;
}
return all_unreclaimable;
@@ -1784,7 +1804,6 @@ static unsigned long do_try_to_free_page
int priority;
bool all_unreclaimable;
unsigned long total_scanned = 0;
- struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
struct zoneref *z;
struct zone *zone;
@@ -1796,6 +1815,7 @@ static unsigned long do_try_to_free_page

if (scanning_global_lru(sc))
count_vm_event(ALLOCSTALL);
+
/*
* mem_cgroup will not do shrink_slab.
*/
@@ -1813,18 +1833,8 @@ static unsigned long do_try_to_free_page
sc->nr_scanned = 0;
if (!priority)
disable_swap_token();
- all_unreclaimable = shrink_zones(priority, zonelist, sc);
- /*
- * Don't shrink slabs when reclaiming memory from
- * over limit cgroups
- */
- if (scanning_global_lru(sc)) {
- shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
- if (reclaim_state) {
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
- reclaim_state->reclaimed_slab = 0;
- }
- }
+ all_unreclaimable = shrink_zones(priority, zonelist,
+ sc, lru_pages);
total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
goto out;
@@ -1930,7 +1940,7 @@ unsigned long mem_cgroup_shrink_node_zon
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_zone(0, zone, &sc);
+ shrink_zone(0, zone, &sc, zone_reclaimable_pages(zone));
return sc.nr_reclaimed;
}

@@ -2012,7 +2022,6 @@ static unsigned long balance_pgdat(pg_da
int priority;
int i;
unsigned long total_scanned;
- struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
@@ -2100,7 +2109,6 @@ loop_again:
*/
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- int nr_slab;
int nid, zid;

if (!populated_zone(zone))
@@ -2127,16 +2135,11 @@ loop_again:
*/
if (!zone_watermark_ok(zone, order,
8*high_wmark_pages(zone), end_zone, 0))
- shrink_zone(priority, zone, &sc);
- reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
- lru_pages);
- sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+ shrink_zone(priority, zone, &sc, lru_pages);
total_scanned += sc.nr_scanned;
if (zone->all_unreclaimable)
continue;
- if (nr_slab == 0 &&
- zone->pages_scanned >= (zone_reclaimable_pages(zone) * 6))
+ if (zone->pages_scanned >= (zone_reclaimable_pages(zone) * 6))
zone->all_unreclaimable = 1;
/*
* If we've done a decent amount of scanning and
@@ -2610,34 +2613,15 @@ static int __zone_reclaim(struct zone *z
priority = ZONE_RECLAIM_PRIORITY;
do {
note_zone_scanning_priority(zone, priority);
- shrink_zone(priority, zone, &sc);
+ shrink_zone(priority, zone, &sc,
+ zone_reclaimable_pages(zone));
priority--;
} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
}

slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (slab_reclaimable > zone->min_slab_pages) {
- /*
- * shrink_slab() does not currently allow us to determine how
- * many pages were freed in this zone. So we take the current
- * number of slab pages and shake the slab until it is reduced
- * by the same nr_pages that we used for reclaiming unmapped
- * pages.
- *
- * Note that shrink_slab will free memory on all zones and may
- * take a long time.
- */
- while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
- zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
- slab_reclaimable - nr_pages)
- ;
-
- /*
- * Update nr_reclaimed by the number of slab pages we
- * reclaimed from this zone.
- */
- sc.nr_reclaimed += slab_reclaimable -
- zone_page_state(zone, NR_SLAB_RECLAIMABLE);
+ /* XXX: don't shrink slab in shrink_zone if we're under this */
}

p->reclaim_state = NULL;
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -748,20 +748,26 @@ again2:
*
* This function may fail to free any resources if all the dentries are in use.
*/
-static void prune_dcache(int count)
+static void prune_dcache(struct zone *zone, unsigned long scanned,
+ unsigned long total, gfp_t gfp_mask)
+
{
+ unsigned long nr_to_scan;
struct super_block *sb, *n;
int w_count;
- int unused = dentry_stat.nr_unused;
int prune_ratio;
- int pruned;
+ int count, pruned;

- if (unused == 0 || count == 0)
+ shrinker_add_scan(&nr_to_scan, scanned, total, dentry_stat.nr_unused,
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
+done:
+ count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ if (dentry_stat.nr_unused == 0 || count == 0)
return;
- if (count >= unused)
+ if (count >= dentry_stat.nr_unused)
prune_ratio = 1;
else
- prune_ratio = unused / count;
+ prune_ratio = dentry_stat.nr_unused / count;
spin_lock(&sb_lock);
list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
if (list_empty(&sb->s_instances))
@@ -810,6 +816,10 @@ static void prune_dcache(int count)
break;
}
spin_unlock(&sb_lock);
+ if (count <= 0) {
+ cond_resched();
+ goto done;
+ }
}

/**
@@ -1176,19 +1186,15 @@ EXPORT_SYMBOL(shrink_dcache_parent);
*
* In this case we return -1 to tell the caller that we baled.
*/
-static int shrink_dcache_memory(int nr, gfp_t gfp_mask)
+static int shrink_dcache_memory(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
- if (nr) {
- if (!(gfp_mask & __GFP_FS))
- return -1;
- prune_dcache(nr);
- }
- return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
+ prune_dcache(zone, scanned, global, gfp_mask);
+ return 0;
}

static struct shrinker dcache_shrinker = {
.shrink = shrink_dcache_memory,
- .seeks = DEFAULT_SEEKS,
};

/**
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -527,7 +527,7 @@ EXPORT_SYMBOL(invalidate_inodes);
* If the inode has metadata buffers attached to mapping->private_list then
* try to remove them.
*/
-static void prune_icache(int nr_to_scan)
+static void prune_icache(struct zone *zone, unsigned long nr_to_scan)
{
LIST_HEAD(freeable);
unsigned long reap = 0;
@@ -597,24 +597,28 @@ again:
* This function is passed the number of inodes to scan, and it returns the
* total number of remaining possibly-reclaimable inodes.
*/
-static int shrink_icache_memory(int nr, gfp_t gfp_mask)
+static int shrink_icache_memory(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
- if (nr) {
- /*
- * Nasty deadlock avoidance. We may hold various FS locks,
- * and we don't want to recurse into the FS that called us
- * in clear_inode() and friends..
- */
- if (!(gfp_mask & __GFP_FS))
- return -1;
- prune_icache(nr);
+ static unsigned long nr_to_scan;
+ unsigned long nr;
+
+ shrinker_add_scan(&nr_to_scan, scanned, global,
+ inodes_stat.nr_unused,
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
+ if (!(gfp_mask & __GFP_FS))
+ return 0;
+
+ while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
+ prune_icache(zone, nr);
+ cond_resched();
}
- return inodes_stat.nr_unused / 100 * sysctl_vfs_cache_pressure;
+
+ return 0;
}

static struct shrinker icache_shrinker = {
.shrink = shrink_icache_memory,
- .seeks = DEFAULT_SEEKS,
};

static void __wait_on_freeing_inode(struct inode *inode);
Index: linux-2.6/fs/mbcache.c
===================================================================
--- linux-2.6.orig/fs/mbcache.c
+++ linux-2.6/fs/mbcache.c
@@ -115,11 +115,12 @@ mb_cache_indexes(struct mb_cache *cache)
* What the mbcache registers as to get shrunk dynamically.
*/

-static int mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask);
+static int
+mb_cache_shrink_fn(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask);

static struct shrinker mb_cache_shrinker = {
.shrink = mb_cache_shrink_fn,
- .seeks = DEFAULT_SEEKS,
};

static inline int
@@ -197,11 +198,14 @@ forget:
* Returns the number of objects which are present in the cache.
*/
static int
-mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask)
+mb_cache_shrink_fn(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
LIST_HEAD(free_list);
struct list_head *l, *ltmp;
- int count = 0;
+ unsigned long count = 0;
+ unsigned long nr;

spin_lock(&mb_cache_spinlock);
list_for_each(l, &mb_cache_list) {
@@ -211,28 +215,38 @@ mb_cache_shrink_fn(int nr_to_scan, gfp_t
atomic_read(&cache->c_entry_count));
count += atomic_read(&cache->c_entry_count);
}
+ shrinker_add_scan(&nr_to_scan, scanned, global, count,
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
mb_debug("trying to free %d entries", nr_to_scan);
- if (nr_to_scan == 0) {
+
+again:
+ nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ if (!nr) {
spin_unlock(&mb_cache_spinlock);
- goto out;
+ return 0;
}
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
+ while (!list_empty(&mb_cache_lru_list)) {
struct mb_cache_entry *ce =
list_entry(mb_cache_lru_list.next,
struct mb_cache_entry, e_lru_list);
list_move_tail(&ce->e_lru_list, &free_list);
__mb_cache_entry_unhash(ce);
+ cond_resched_lock(&mb_cache_spinlock);
+ if (!--nr)
+ break;
}
spin_unlock(&mb_cache_spinlock);
list_for_each_safe(l, ltmp, &free_list) {
__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
e_lru_list), gfp_mask);
}
-out:
- return (count / 100) * sysctl_vfs_cache_pressure;
+ if (!nr) {
+ spin_lock(&mb_cache_spinlock);
+ goto again;
+ }
+ return 0;
}

-
/*
* mb_cache_create() create a new cache
*
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -1709,21 +1709,31 @@ static void nfs_access_free_list(struct
}
}

-int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask)
+int nfs_access_cache_shrinker(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
LIST_HEAD(head);
- struct nfs_inode *nfsi;
struct nfs_access_entry *cache;
-
- if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
- return (nr_to_scan == 0) ? 0 : -1;
+ unsigned long nr;

spin_lock(&nfs_access_lru_lock);
- list_for_each_entry(nfsi, &nfs_access_lru_list, access_cache_inode_lru) {
+ shrinker_add_scan(&nr_to_scan, scanned, global,
+ atomic_long_read(&nfs_access_nr_entries),
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
+ if (!(gfp_mask & __GFP_FS) || nr_to_scan < SHRINK_BATCH) {
+ spin_unlock(&nfs_access_lru_lock);
+ return 0;
+ }
+ nr = ACCESS_ONCE(nr_to_scan);
+ nr_to_scan = 0;
+
+ while (nr-- && !list_empty(&nfs_access_lru_list)) {
+ struct nfs_inode *nfsi;
struct inode *inode;

- if (nr_to_scan-- == 0)
- break;
+ nfsi = list_entry(nfs_access_lru_list.next,
+ struct nfs_inode, access_cache_inode_lru);
inode = &nfsi->vfs_inode;
spin_lock(&inode->i_lock);
if (list_empty(&nfsi->access_cache_entry_lru))
@@ -1743,10 +1753,11 @@ remove_lru_entry:
smp_mb__after_clear_bit();
}
spin_unlock(&inode->i_lock);
+ cond_resched_lock(&nfs_access_lru_lock);
}
spin_unlock(&nfs_access_lru_lock);
nfs_access_free_list(&head);
- return (atomic_long_read(&nfs_access_nr_entries) / 100) * sysctl_vfs_cache_pressure;
+ return 0;
}

static void __nfs_access_zap_cache(struct nfs_inode *nfsi, struct list_head *head)
Index: linux-2.6/fs/nfs/internal.h
===================================================================
--- linux-2.6.orig/fs/nfs/internal.h
+++ linux-2.6/fs/nfs/internal.h
@@ -205,7 +205,8 @@ extern struct rpc_procinfo nfs4_procedur
void nfs_close_context(struct nfs_open_context *ctx, int is_sync);

/* dir.c */
-extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask);
+extern int nfs_access_cache_shrinker(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask);

/* inode.c */
extern struct workqueue_struct *nfsiod_workqueue;
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -655,7 +655,7 @@ int dquot_quota_sync(struct super_block
EXPORT_SYMBOL(dquot_quota_sync);

/* Free unused dquots from cache */
-static void prune_dqcache(int count)
+static void prune_dqcache(unsigned long count)
{
struct list_head *head;
struct dquot *dquot;
@@ -676,21 +676,28 @@ static void prune_dqcache(int count)
* This is called from kswapd when we think we need some
* more memory
*/
-static int shrink_dqcache_memory(int nr, gfp_t gfp_mask)
+static int shrink_dqcache_memory(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
- if (nr) {
+ static unsigned long nr_to_scan;
+ unsigned long nr;
+
+ shrinker_add_scan(&nr_to_scan, scanned, total,
+ percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]),
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
+
+ while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
spin_lock(&dq_list_lock);
prune_dqcache(nr);
spin_unlock(&dq_list_lock);
+ cond_resched();
}
- return ((unsigned)
- percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS])
- /100) * sysctl_vfs_cache_pressure;
+
+ return 0;
}

static struct shrinker dqcache_shrinker = {
.shrink = shrink_dqcache_memory,
- .seeks = DEFAULT_SEEKS,
};

/*
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -38,11 +38,7 @@ static void drop_pagecache_sb(struct sup

static void drop_slab(void)
{
- int nr_objects;
-
- do {
- nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
- } while (nr_objects > 10);
+ shrink_all_slab();
}

int drop_caches_sysctl_handler(ctl_table *table, int write,
Index: linux-2.6/mm/memory-failure.c
===================================================================
--- linux-2.6.orig/mm/memory-failure.c
+++ linux-2.6/mm/memory-failure.c
@@ -229,14 +229,8 @@ void shake_page(struct page *p, int acce
* Only call shrink_slab here (which would also
* shrink other caches) if access is not potentially fatal.
*/
- if (access) {
- int nr;
- do {
- nr = shrink_slab(1000, GFP_KERNEL, 1000);
- if (page_count(p) == 0)
- break;
- } while (nr > 10);
- }
+ if (access)
+ shrink_all_slab();
}
EXPORT_SYMBOL_GPL(shake_page);

Index: linux-2.6/arch/x86/kvm/mmu.c
===================================================================
--- linux-2.6.orig/arch/x86/kvm/mmu.c
+++ linux-2.6/arch/x86/kvm/mmu.c
@@ -2924,14 +2924,29 @@ static int kvm_mmu_remove_some_alloc_mmu
return kvm_mmu_zap_page(kvm, page) + 1;
}

-static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
+static int mmu_shrink(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
struct kvm *kvm;
struct kvm *kvm_freed = NULL;
- int cache_count = 0;
+ unsigned long cache_count = 0;

spin_lock(&kvm_lock);
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ cache_count += kvm->arch.n_alloc_mmu_pages -
+ kvm->arch.n_free_mmu_pages;
+ }

+ shrinker_add_scan(&nr_to_scan, scanned, global, cache_count,
+ DEFAULT_SEEKS*10);
+
+done:
+ cache_count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ if (!cache_count) {
+ spin_unlock(&kvm_lock);
+ return 0;
+ }
list_for_each_entry(kvm, &vm_list, vm_list) {
int npages, idx, freed_pages;

@@ -2939,28 +2954,24 @@ static int mmu_shrink(int nr_to_scan, gf
spin_lock(&kvm->mmu_lock);
npages = kvm->arch.n_alloc_mmu_pages -
kvm->arch.n_free_mmu_pages;
- cache_count += npages;
- if (!kvm_freed && nr_to_scan > 0 && npages > 0) {
+ if (!kvm_freed && npages > 0) {
freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
- cache_count -= freed_pages;
kvm_freed = kvm;
}
- nr_to_scan--;
-
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
+
+ if (!--cache_count)
+ break;
}
if (kvm_freed)
list_move_tail(&kvm_freed->vm_list, &vm_list);
-
- spin_unlock(&kvm_lock);
-
- return cache_count;
+ cond_resched_lock(&kvm_lock);
+ goto done;
}

static struct shrinker mmu_shrinker = {
.shrink = mmu_shrink,
- .seeks = DEFAULT_SEEKS * 10,
};

static void mmu_destroy_caches(void)
Index: linux-2.6/drivers/gpu/drm/i915/i915_gem.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/i915/i915_gem.c
+++ linux-2.6/drivers/gpu/drm/i915/i915_gem.c
@@ -4977,41 +4977,46 @@ i915_gpu_is_active(struct drm_device *de
}

static int
-i915_gem_shrink(int nr_to_scan, gfp_t gfp_mask)
+i915_gem_shrink(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
+ unsigned long cnt = 0;
drm_i915_private_t *dev_priv, *next_dev;
struct drm_i915_gem_object *obj_priv, *next_obj;
- int cnt = 0;
int would_deadlock = 1;

/* "fast-path" to count number of available objects */
- if (nr_to_scan == 0) {
- spin_lock(&shrink_list_lock);
- list_for_each_entry(dev_priv, &shrink_list, mm.shrink_list) {
- struct drm_device *dev = dev_priv->dev;
+ spin_lock(&shrink_list_lock);
+ list_for_each_entry(dev_priv, &shrink_list, mm.shrink_list) {
+ struct drm_device *dev = dev_priv->dev;

- if (mutex_trylock(&dev->struct_mutex)) {
- list_for_each_entry(obj_priv,
- &dev_priv->mm.inactive_list,
- list)
- cnt++;
- mutex_unlock(&dev->struct_mutex);
- }
+ if (mutex_trylock(&dev->struct_mutex)) {
+ list_for_each_entry(obj_priv,
+ &dev_priv->mm.inactive_list,
+ list)
+ cnt++;
+ mutex_unlock(&dev->struct_mutex);
}
- spin_unlock(&shrink_list_lock);
-
- return (cnt / 100) * sysctl_vfs_cache_pressure;
}
+ shrinker_add_scan(&nr_to_scan, scanned, global, cnt,
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);

- spin_lock(&shrink_list_lock);
-
+done:
+ cnt = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
rescan:
+ if (!cnt) {
+ spin_unlock(&shrink_list_lock);
+ return 0;
+ }
+
/* first scan for clean buffers */
+ /* XXX: this probably needs list_safe_reset_next */
list_for_each_entry_safe(dev_priv, next_dev,
&shrink_list, mm.shrink_list) {
struct drm_device *dev = dev_priv->dev;

- if (! mutex_trylock(&dev->struct_mutex))
+ if (!mutex_trylock(&dev->struct_mutex))
continue;

spin_unlock(&shrink_list_lock);
@@ -5025,8 +5030,8 @@ rescan:
list) {
if (i915_gem_object_is_purgeable(obj_priv)) {
i915_gem_object_unbind(&obj_priv->base);
- if (--nr_to_scan <= 0)
- break;
+ if (!--cnt)
+ goto done;
}
}

@@ -5034,9 +5039,6 @@ rescan:
mutex_unlock(&dev->struct_mutex);

would_deadlock = 0;
-
- if (nr_to_scan <= 0)
- break;
}

/* second pass, evict/count anything still on the inactive list */
@@ -5052,11 +5054,9 @@ rescan:
list_for_each_entry_safe(obj_priv, next_obj,
&dev_priv->mm.inactive_list,
list) {
- if (nr_to_scan > 0) {
- i915_gem_object_unbind(&obj_priv->base);
- nr_to_scan--;
- } else
- cnt++;
+ i915_gem_object_unbind(&obj_priv->base);
+ if (!--cnt)
+ goto done;
}

spin_lock(&shrink_list_lock);
@@ -5065,7 +5065,7 @@ rescan:
would_deadlock = 0;
}

- if (nr_to_scan) {
+ if (cnt) {
int active = 0;

/*
@@ -5096,18 +5096,11 @@ rescan:
}

spin_unlock(&shrink_list_lock);
-
- if (would_deadlock)
- return -1;
- else if (cnt > 0)
- return (cnt / 100) * sysctl_vfs_cache_pressure;
- else
- return 0;
+ return 0;
}

static struct shrinker shrinker = {
.shrink = i915_gem_shrink,
- .seeks = DEFAULT_SEEKS,
};

__init void
Index: linux-2.6/drivers/gpu/drm/ttm/ttm_page_alloc.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ linux-2.6/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -395,30 +395,38 @@ static int ttm_pool_get_num_unused_pages
/**
* Callback for mm to request pool to reduce number of page held.
*/
-static int ttm_pool_mm_shrink(int shrink_pages, gfp_t gfp_mask)
+static int ttm_pool_mm_shrink(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
- static atomic_t start_pool = ATOMIC_INIT(0);
- unsigned i;
- unsigned pool_offset = atomic_add_return(1, &start_pool);
- struct ttm_page_pool *pool;
+ static unsigned long nr_to_scan;
+ unsigned long shrink_pages;

- pool_offset = pool_offset % NUM_POOLS;
- /* select start pool in round robin fashion */
- for (i = 0; i < NUM_POOLS; ++i) {
- unsigned nr_free = shrink_pages;
- if (shrink_pages == 0)
- break;
- pool = &_manager.pools[(i + pool_offset)%NUM_POOLS];
- shrink_pages = ttm_page_pool_free(pool, nr_free);
+ shrinker_add_scan(&nr_to_scan, scanned, global,
+ ttm_pool_get_num_unused_pages(),
+ SHRINK_FIXED);
+
+ while ((shrink_pages = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
+ static atomic_t start_pool = ATOMIC_INIT(0);
+ unsigned pool_offset = atomic_add_return(1, &start_pool);
+ struct ttm_page_pool *pool;
+ unsigned i;
+
+ pool_offset = pool_offset % NUM_POOLS;
+ /* select start pool in round robin fashion */
+ for (i = 0; i < NUM_POOLS; ++i) {
+ unsigned nr_free = shrink_pages;
+ if (shrink_pages == 0)
+ break;
+ pool = &_manager.pools[(i + pool_offset)%NUM_POOLS];
+ shrink_pages = ttm_page_pool_free(pool, nr_free);
+ }
}
- /* return estimated number of unused pages in pool */
- return ttm_pool_get_num_unused_pages();
+ return 0;
}

static void ttm_pool_mm_shrink_init(struct ttm_pool_manager *manager)
{
manager->mm_shrink.shrink = &ttm_pool_mm_shrink;
- manager->mm_shrink.seeks = 1;
register_shrinker(&manager->mm_shrink);
}

Index: linux-2.6/fs/gfs2/glock.c
===================================================================
--- linux-2.6.orig/fs/gfs2/glock.c
+++ linux-2.6/fs/gfs2/glock.c
@@ -1349,18 +1349,27 @@ void gfs2_glock_complete(struct gfs2_glo
}


-static int gfs2_shrink_glock_memory(int nr, gfp_t gfp_mask)
+static int gfs2_shrink_glock_memory(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
+ unsigned long nr;
struct gfs2_glock *gl;
int may_demote;
int nr_skipped = 0;
LIST_HEAD(skipped);

- if (nr == 0)
- goto out;
+ shrinker_add_scan(&nr_to_scan, scanned, global,
+ atomic_read(&lru_count),
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);

if (!(gfp_mask & __GFP_FS))
- return -1;
+ return 0;
+
+done:
+ nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ if (!nr)
+ return 0;

spin_lock(&lru_lock);
while(nr && !list_empty(&lru_list)) {
@@ -1392,13 +1401,13 @@ static int gfs2_shrink_glock_memory(int
list_splice(&skipped, &lru_list);
atomic_add(nr_skipped, &lru_count);
spin_unlock(&lru_lock);
-out:
- return (atomic_read(&lru_count) / 100) * sysctl_vfs_cache_pressure;
+ if (!nr)
+ goto done;
+ return 0;
}

static struct shrinker glock_shrinker = {
.shrink = gfs2_shrink_glock_memory,
- .seeks = DEFAULT_SEEKS,
};

/**
Index: linux-2.6/fs/gfs2/main.c
===================================================================
--- linux-2.6.orig/fs/gfs2/main.c
+++ linux-2.6/fs/gfs2/main.c
@@ -27,7 +27,6 @@

static struct shrinker qd_shrinker = {
.shrink = gfs2_shrink_qd_memory,
- .seeks = DEFAULT_SEEKS,
};

static void gfs2_init_inode_once(void *foo)
Index: linux-2.6/fs/gfs2/quota.c
===================================================================
--- linux-2.6.orig/fs/gfs2/quota.c
+++ linux-2.6/fs/gfs2/quota.c
@@ -43,6 +43,7 @@
#include <linux/buffer_head.h>
#include <linux/sort.h>
#include <linux/fs.h>
+#include <linux/mm.h>
#include <linux/bio.h>
#include <linux/gfs2_ondisk.h>
#include <linux/kthread.h>
@@ -77,16 +78,25 @@ static LIST_HEAD(qd_lru_list);
static atomic_t qd_lru_count = ATOMIC_INIT(0);
static DEFINE_SPINLOCK(qd_lru_lock);

-int gfs2_shrink_qd_memory(int nr, gfp_t gfp_mask)
+int gfs2_shrink_qd_memory(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
+ unsigned long nr;
struct gfs2_quota_data *qd;
struct gfs2_sbd *sdp;

- if (nr == 0)
- goto out;
+ shrinker_add_scan(&nr_to_scan, scanned, global,
+ atomic_read(&qd_lru_count),
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);

if (!(gfp_mask & __GFP_FS))
- return -1;
+ return 0;
+
+done:
+ nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ if (!nr)
+ return 0;

spin_lock(&qd_lru_lock);
while (nr && !list_empty(&qd_lru_list)) {
@@ -113,9 +123,9 @@ int gfs2_shrink_qd_memory(int nr, gfp_t
nr--;
}
spin_unlock(&qd_lru_lock);
-
-out:
- return (atomic_read(&qd_lru_count) * sysctl_vfs_cache_pressure) / 100;
+ if (!nr)
+ goto done;
+ return 0;
}

static u64 qd2offset(struct gfs2_quota_data *qd)
Index: linux-2.6/fs/gfs2/quota.h
===================================================================
--- linux-2.6.orig/fs/gfs2/quota.h
+++ linux-2.6/fs/gfs2/quota.h
@@ -51,7 +51,8 @@ static inline int gfs2_quota_lock_check(
return ret;
}

-extern int gfs2_shrink_qd_memory(int nr, gfp_t gfp_mask);
+extern int gfs2_shrink_qd_memory(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask);
extern const struct quotactl_ops gfs2_quotactl_ops;

#endif /* __QUOTA_DOT_H__ */
Index: linux-2.6/fs/nfs/super.c
===================================================================
--- linux-2.6.orig/fs/nfs/super.c
+++ linux-2.6/fs/nfs/super.c
@@ -350,7 +350,6 @@ static const struct super_operations nfs

static struct shrinker acl_shrinker = {
.shrink = nfs_access_cache_shrinker,
- .seeks = DEFAULT_SEEKS,
};

/*
Index: linux-2.6/fs/ubifs/shrinker.c
===================================================================
--- linux-2.6.orig/fs/ubifs/shrinker.c
+++ linux-2.6/fs/ubifs/shrinker.c
@@ -277,13 +277,16 @@ static int kick_a_thread(void)
return 0;
}

-int ubifs_shrinker(int nr, gfp_t gfp_mask)
+int ubifs_shrinker(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
+ unsigned long nr;
int freed, contention = 0;
long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);

- if (nr == 0)
- return clean_zn_cnt;
+ shrinker_add_scan(&nr_to_scan, scanned, global, clean_zn_cnt,
+ DEFAULT_SEEKS);

if (!clean_zn_cnt) {
/*
@@ -297,24 +300,28 @@ int ubifs_shrinker(int nr, gfp_t gfp_mas
return kick_a_thread();
}

+done:
+ nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ if (!nr)
+ return 0;
+
freed = shrink_tnc_trees(nr, OLD_ZNODE_AGE, &contention);
if (freed >= nr)
- goto out;
+ goto done;

dbg_tnc("not enough old znodes, try to free young ones");
freed += shrink_tnc_trees(nr - freed, YOUNG_ZNODE_AGE, &contention);
if (freed >= nr)
- goto out;
+ goto done;

dbg_tnc("not enough young znodes, free all");
freed += shrink_tnc_trees(nr - freed, 0, &contention);
+ if (freed >= nr)
+ goto done;

- if (!freed && contention) {
- dbg_tnc("freed nothing, but contention");
- return -1;
- }
+ if (!freed && contention)
+ nr_to_scan += nr;

-out:
- dbg_tnc("%d znodes were freed, requested %d", freed, nr);
+ dbg_tnc("%d znodes were freed, requested %lu", freed, nr);
return freed;
}
Index: linux-2.6/fs/ubifs/super.c
===================================================================
--- linux-2.6.orig/fs/ubifs/super.c
+++ linux-2.6/fs/ubifs/super.c
@@ -50,7 +50,6 @@ struct kmem_cache *ubifs_inode_slab;
/* UBIFS TNC shrinker description */
static struct shrinker ubifs_shrinker_info = {
.shrink = ubifs_shrinker,
- .seeks = DEFAULT_SEEKS,
};

/**
Index: linux-2.6/fs/ubifs/ubifs.h
===================================================================
--- linux-2.6.orig/fs/ubifs/ubifs.h
+++ linux-2.6/fs/ubifs/ubifs.h
@@ -1575,7 +1575,8 @@ int ubifs_tnc_start_commit(struct ubifs_
int ubifs_tnc_end_commit(struct ubifs_info *c);

/* shrinker.c */
-int ubifs_shrinker(int nr_to_scan, gfp_t gfp_mask);
+int ubifs_shrinker(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask);

/* commit.c */
int ubifs_bg_thread(void *info);
Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
@@ -45,11 +45,11 @@

static kmem_zone_t *xfs_buf_zone;
STATIC int xfsbufd(void *);
-STATIC int xfsbufd_wakeup(int, gfp_t);
+STATIC int xfsbufd_wakeup(struct zone *,
+ unsigned long, unsigned long, unsigned long, gfp_t);
STATIC void xfs_buf_delwri_queue(xfs_buf_t *, int);
static struct shrinker xfs_buf_shake = {
.shrink = xfsbufd_wakeup,
- .seeks = DEFAULT_SEEKS,
};

static struct workqueue_struct *xfslogd_workqueue;
@@ -340,7 +340,7 @@ _xfs_buf_lookup_pages(
__func__, gfp_mask);

XFS_STATS_INC(xb_page_retries);
- xfsbufd_wakeup(0, gfp_mask);
+ xfsbufd_wakeup(NULL, 0, 0, 0, gfp_mask);
congestion_wait(BLK_RW_ASYNC, HZ/50);
goto retry;
}
@@ -1762,8 +1762,11 @@ xfs_buf_runall_queues(

STATIC int
xfsbufd_wakeup(
- int priority,
- gfp_t mask)
+ struct zone *zone,
+ unsigned long scanned,
+ unsigned long total,
+ unsigned long global,
+ gfp_t gfp_mask)
{
xfs_buftarg_t *btp;

Index: linux-2.6/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_sync.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_sync.c
@@ -838,43 +838,52 @@ static struct rw_semaphore xfs_mount_lis

static int
xfs_reclaim_inode_shrink(
- int nr_to_scan,
+ struct zone *zone,
+ unsigned long scanned,
+ unsigned long total,
+ unsigned long global,
gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
+ int nr;
struct xfs_mount *mp;
struct xfs_perag *pag;
xfs_agnumber_t ag;
- int reclaimable = 0;
-
- if (nr_to_scan) {
- if (!(gfp_mask & __GFP_FS))
- return -1;
-
- down_read(&xfs_mount_list_lock);
- list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
- xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
- XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan);
- if (nr_to_scan <= 0)
- break;
- }
- up_read(&xfs_mount_list_lock);
- }
+ unsigned long nr_reclaimable = 0;

down_read(&xfs_mount_list_lock);
list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
pag = xfs_perag_get(mp, ag);
- reclaimable += pag->pag_ici_reclaimable;
+ nr_reclaimable += pag->pag_ici_reclaimable;
xfs_perag_put(pag);
}
}
+ shrinker_add_scan(&nr_to_scan, scanned, global, nr_reclaimable,
+ DEFAULT_SEEKS);
+ if (!(gfp_mask & __GFP_FS)) {
+ up_read(&xfs_mount_list_lock);
+ return 0;
+ }
+
+done:
+ nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ if (!nr) {
+ up_read(&xfs_mount_list_lock);
+ return 0;
+ }
+ list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
+ xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
+ XFS_ICI_RECLAIM_TAG, 1, &nr);
+ if (nr <= 0)
+ goto done;
+ }
up_read(&xfs_mount_list_lock);
- return reclaimable;
+ return 0;
}

static struct shrinker xfs_inode_shrinker = {
.shrink = xfs_reclaim_inode_shrink,
- .seeks = DEFAULT_SEEKS,
};

void __init
Index: linux-2.6/fs/xfs/quota/xfs_qm.c
===================================================================
--- linux-2.6.orig/fs/xfs/quota/xfs_qm.c
+++ linux-2.6/fs/xfs/quota/xfs_qm.c
@@ -69,11 +69,11 @@ STATIC void xfs_qm_list_destroy(xfs_dqli

STATIC int xfs_qm_init_quotainos(xfs_mount_t *);
STATIC int xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int xfs_qm_shake(int, gfp_t);
+STATIC int xfs_qm_shake(struct zone *, unsigned long, unsigned long,
+ unsigned long, gfp_t);

static struct shrinker xfs_qm_shaker = {
.shrink = xfs_qm_shake,
- .seeks = DEFAULT_SEEKS,
};

#ifdef DEBUG
@@ -2119,7 +2119,12 @@ xfs_qm_shake_freelist(
*/
/* ARGSUSED */
STATIC int
-xfs_qm_shake(int nr_to_scan, gfp_t gfp_mask)
+xfs_qm_shake(
+ struct zone *zone,
+ unsigned long scanned,
+ unsigned long total,
+ unsigned long global,
+ gfp_t gfp_mask)
{
int ndqused, nfree, n;

@@ -2140,7 +2145,9 @@ xfs_qm_shake(int nr_to_scan, gfp_t gfp_m
ndqused *= xfs_Gqm->qm_dqfree_ratio; /* target # of free dquots */
n = nfree - ndqused - ndquot; /* # over target */

- return xfs_qm_shake_freelist(MAX(nfree, n));
+ xfs_qm_shake_freelist(MAX(nfree, n));
+
+ return 0;
}


Index: linux-2.6/net/sunrpc/auth.c
===================================================================
--- linux-2.6.orig/net/sunrpc/auth.c
+++ linux-2.6/net/sunrpc/auth.c
@@ -227,8 +227,8 @@ EXPORT_SYMBOL_GPL(rpcauth_destroy_credca
/*
* Remove stale credentials. Avoid sleeping inside the loop.
*/
-static int
-rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
+static void
+rpcauth_prune_expired(struct list_head *free, unsigned long nr_to_scan)
{
spinlock_t *cache_lock;
struct rpc_cred *cred, *next;
@@ -244,7 +244,7 @@ rpcauth_prune_expired(struct list_head *
*/
if (time_in_range(cred->cr_expire, expired, jiffies) &&
test_bit(RPCAUTH_CRED_HASHED, &cred->cr_flags) != 0)
- return 0;
+ break;

list_del_init(&cred->cr_lru);
number_cred_unused--;
@@ -260,27 +260,36 @@ rpcauth_prune_expired(struct list_head *
}
spin_unlock(cache_lock);
}
- return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
}

/*
* Run memory cache shrinker.
*/
static int
-rpcauth_cache_shrinker(int nr_to_scan, gfp_t gfp_mask)
+rpcauth_cache_shrinker(struct zone *zone, unsigned long scanned,
+ unsigned long total, unsigned long global, gfp_t gfp_mask)
{
+ static unsigned long nr_to_scan;
+ unsigned long nr;
LIST_HEAD(free);
- int res;

+ shrinker_add_scan(&nr_to_scan, scanned, global,
+ number_cred_unused,
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
- return (nr_to_scan == 0) ? 0 : -1;
+ return 0;
+again:
if (list_empty(&cred_unused))
return 0;
+ nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+ if (!nr)
+ return 0;
spin_lock(&rpc_credcache_lock);
- res = rpcauth_prune_expired(&free, nr_to_scan);
+ rpcauth_prune_expired(&free, nr);
spin_unlock(&rpc_credcache_lock);
rpcauth_destroy_credlist(&free);
- return res;
+ cond_resched();
+ goto again;
}

/*
@@ -584,7 +593,6 @@ rpcauth_uptodatecred(struct rpc_task *ta

static struct shrinker rpc_cred_shrinker = {
.shrink = rpcauth_cache_shrinker,
- .seeks = DEFAULT_SEEKS,
};

void __init rpcauth_init_module(void)


Andi Kleen
2010-06-24 10:06:50 UTC
Permalink
Post by n***@suse.de
Allow the shrinker to do per-zone shrinking. This means it is called for
each zone scanned. The shrinker is now completely responsible for calculating
and batching (given helpers), which provides better flexibility.
Beyond the scope of this patch, but at some point this probably needs
to be even more fine grained. With large number of cores/threads in
each socket a "zone" is actually shared by quite a large number
of CPUs now and this can cause problems.
Post by n***@suse.de
+void shrinker_add_scan(unsigned long *dst,
+ unsigned long scanned, unsigned long total,
+ unsigned long objects, unsigned int ratio)
+{
+ unsigned long long delta;
+
+ delta = (unsigned long long)scanned * objects * ratio;
+ do_div(delta, total + 1);
+ delta /= (128ULL / 4ULL);
Again I object to the magic numbers ...
Post by n***@suse.de
+ nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
+ if (nr >= 10)
+ goto again;
And here.

Overall it seems good, but I have not read all the shrinker callback
changes in all subsystems.
-Andi
--
***@linux.intel.com -- Speaking for myself only.

Nick Piggin
2010-06-24 16:00:52 UTC
Permalink
Post by Andi Kleen
Post by n***@suse.de
Allow the shrinker to do per-zone shrinking. This means it is called for
each zone scanned. The shrinker is now completely responsible for calculating
and batching (given helpers), which provides better flexibility.
Beyond the scope of this patch, but at some point this probably needs
to be even more fine grained. With large number of cores/threads in
each socket a "zone" is actually shared by quite a large number
of CPUs now and this can cause problems.
Yes, possibly. At least it is a much better step than the big dumb
global list.
Post by Andi Kleen
Post by n***@suse.de
+void shrinker_add_scan(unsigned long *dst,
+ unsigned long scanned, unsigned long total,
+ unsigned long objects, unsigned int ratio)
+{
+ unsigned long long delta;
+
+ delta = (unsigned long long)scanned * objects * ratio;
+ do_div(delta, total + 1);
+ delta /= (128ULL / 4ULL);
Again I object to the magic numbers ...
Post by n***@suse.de
+ nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
+ if (nr >= 10)
+ goto again;
And here.
I don't like them either -- problem is they were inherited from the
old code (actually 128 is the fixed point scale, I do have a define
for it just forgot to use it).

I don't know where 4 came from. And 10 is just a random number someone
picked out of a hat :P
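
For illustration only -- not part of the patch as posted -- the same arithmetic
with the magic numbers named might look like the sketch below. The two defines
and the final accumulation into *dst are assumptions made for the sketch:

#define SHRINK_FIXED_POINT	128	/* fixed-point scale of 'ratio' */
#define SHRINK_SCAN_DIVISOR	4	/* inherited fudge factor */

void shrinker_add_scan(unsigned long *dst,
		unsigned long scanned, unsigned long total,
		unsigned long objects, unsigned int ratio)
{
	unsigned long long delta;

	/*
	 * Work is proportional to the fraction of 'total' that was scanned:
	 * delta ~= objects * (scanned / total) * ratio / (128 / 4)
	 */
	delta = (unsigned long long)scanned * objects * ratio;
	do_div(delta, total + 1);
	delta /= (SHRINK_FIXED_POINT / SHRINK_SCAN_DIVISOR);

	/* accumulated here, consumed in batches by shrinker_do_scan() */
	*dst += (unsigned long)delta;
}
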
Post by Andi Kleen
Overall it seems good, but I have not read all the shrinker callback
changes in all subsystems.
Thanks for looking over it Andi.

Andi Kleen
2010-06-24 16:27:02 UTC
Permalink
Post by Nick Piggin
Post by Andi Kleen
Overall it seems good, but I have not read all the shrinker callback
changes in all subsystems.
Thanks for looking over it Andi.
FWIW i skimmed over most of the patches and nothing stood out that
I really disliked. But I have gone over the code in very deep detail.

-Andi
--
***@linux.intel.com -- Speaking for myself only.

Andi Kleen
2010-06-24 16:32:26 UTC
Permalink
Post by Andi Kleen
Post by Nick Piggin
Post by Andi Kleen
Overall it seems good, but I have not read all the shrinker callback
changes in all subsystems.
Thanks for looking over it Andi.
FWIW i skimmed over most of the patches and nothing stood out that
I really disliked. But I have gone over the code in very deep detail.
haven't

-Andi
--
***@linux.intel.com -- Speaking for myself only.

Andi Kleen
2010-06-24 16:37:09 UTC
Permalink
Post by Andi Kleen
Post by Nick Piggin
Post by Andi Kleen
Overall it seems good, but I have not read all the shrinker callback
changes in all subsystems.
Thanks for looking over it Andi.
FWIW i skimmed over most of the patches and nothing stood out that
I really disliked. But I have gone over the code in very deep detail.
s/have/haven't/

-Andi
--
***@linux.intel.com -- Speaking for myself only.

n***@suse.de
2010-06-24 03:02:20 UTC
Permalink
Improve scalability of mntget/mntput by using per-cpu counters protected by the
reader side of the brlock vfsmount_lock. If the mnt_hash field of the vfsmount
structure is attached to a list, then it is mounted, which itself contributes to
its refcount, so the per-cpu counters need not be summed to know it is non-zero.

MNT_PSEUDO keeps track of whether the vfsmount is actually a pseudo filesystem
that will never be attached (such as sockfs).

No extra atomics in the common case, because the atomic mnt refcount is replaced
by a per-cpu counter taken under the per-cpu (brlock) read lock. The code will be
bigger and more complex, however. Together with the previous per-cpu locking
patch, mount lookups and common-case refcounting are now per-cpu and should be
ideally scalable. Path lookups (and hence path_get/path_put) within the same
vfsmount should now be more scalable, however this will often be hidden by
dcache_lock on the final dput, and by d_lock on common path elements (eg. cwd or
root dentry).

Signed-off-by: Nick Piggin <***@suse.de>

[Note: this is not for merging. Un-attached operation (lazy umount) may not be
uncommon and will be slowed down and actually have worse scalability after
this patch. I need to think about how to do fast refcounting with unattached
mounts.]
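
For illustration only (not part of the patch): the intended common-case /
slow-case split in mntput looks roughly as below. mnt_is_attached_or_pseudo()
is an invented name standing in for the mnt_hash / MNT_PSEUDO checks, and the
unlocked-recheck retry done by the real code is omitted here:

void mntput_sketch(struct vfsmount *mnt)
{
	if (likely(mnt_is_attached_or_pseudo(mnt))) {
		/*
		 * An attached (or pseudo) mount holds a reference of its
		 * own, so the count cannot reach zero here; a per-cpu
		 * decrement under the brlock read side is enough and no
		 * summing of the counters is needed.
		 */
		br_read_lock(vfsmount_lock);
		dec_mnt_count(mnt);
		br_read_unlock(vfsmount_lock);
		return;
	}
	/*
	 * Unattached: take the write side so the per-cpu counters can be
	 * summed safely, and tear the mount down on the last reference.
	 */
	br_write_lock(vfsmount_lock);
	dec_mnt_count(mnt);
	if (count_mnt_count(mnt) == 0) {
		br_write_unlock(vfsmount_lock);
		__mntput(mnt);
		return;
	}
	br_write_unlock(vfsmount_lock);
}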

---
drivers/mtd/mtdchar.c | 1
fs/internal.h | 1
fs/libfs.c | 1
fs/namespace.c | 155 +++++++++++++++++++++++++++++++++++++++++++-------
fs/pnode.c | 4 -
include/linux/mount.h | 26 +-------
6 files changed, 144 insertions(+), 44 deletions(-)

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -138,6 +138,64 @@ void mnt_release_group_id(struct vfsmoun
mnt->mnt_group_id = 0;
}

+/*
+ * vfsmount lock must be held for read
+ */
+static inline void add_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+ (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n;
+#else
+ mnt->mnt_count += n;
+#endif
+}
+
+static inline void set_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+ preempt_disable();
+ (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n;
+ preempt_enable();
+#else
+ mnt->mnt_count = n;
+#endif
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void inc_mnt_count(struct vfsmount *mnt)
+{
+ add_mnt_count(mnt, 1);
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void dec_mnt_count(struct vfsmount *mnt)
+{
+ add_mnt_count(mnt, -1);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
+unsigned int count_mnt_count(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+ unsigned int count = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ count += *per_cpu_ptr(mnt->mnt_count, cpu);
+ }
+
+ return count;
+#else
+ return mnt->mnt_count;
+#endif
+}
+
struct vfsmount *alloc_vfsmnt(const char *name)
{
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -154,7 +212,15 @@ struct vfsmount *alloc_vfsmnt(const char
goto out_free_id;
}

- atomic_set(&mnt->mnt_count, 1);
+#ifdef CONFIG_SMP
+ mnt->mnt_count = alloc_percpu(int);
+ if (!mnt->mnt_count)
+ goto out_free_devname;
+#else
+ mnt->mnt_count = 0;
+#endif
+ set_mnt_count(mnt, 1);
+
INIT_LIST_HEAD(&mnt->mnt_hash);
INIT_LIST_HEAD(&mnt->mnt_child);
INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -166,7 +232,7 @@ struct vfsmount *alloc_vfsmnt(const char
#ifdef CONFIG_SMP
mnt->mnt_writers = alloc_percpu(int);
if (!mnt->mnt_writers)
- goto out_free_devname;
+ goto out_free_mntcount;
#else
mnt->mnt_writers = 0;
#endif
@@ -174,6 +240,8 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;

#ifdef CONFIG_SMP
+out_free_mntcount:
+ free_percpu(mnt->mnt_count);
out_free_devname:
kfree(mnt->mnt_devname);
#endif
@@ -591,7 +659,7 @@ static struct vfsmount *clone_mnt(struct
goto out_free;
}

- mnt->mnt_flags = old->mnt_flags;
+ WARN_ON(mnt->mnt_flags & MNT_WRITE_HOLD);
atomic_inc(&sb->s_active);
mnt->mnt_sb = sb;
mnt->mnt_root = dget(root);
@@ -638,6 +706,11 @@ static inline void __mntput(struct vfsmo
/*
* atomic_dec_and_lock() used to deal with ->mnt_count decrements
* provides barriers, so count_mnt_writers() below is safe. AV
+ * XXX: We no longer have an atomic_dec_and_lock, so load of
+ * mnt_writers may be moved up into the vfsmount lock critical section?
+ * Do we need an smp_mb()? I don't see how it is possible because an
+ * elevated write count should also have elevated ref count so we'd
+ * never get here.
*/
WARN_ON(count_mnt_writers(mnt));
dput(mnt->mnt_root);
@@ -648,45 +721,76 @@ static inline void __mntput(struct vfsmo
void mntput_no_expire(struct vfsmount *mnt)
{
repeat:
- if (atomic_add_unless(&mnt->mnt_count, -1, 1))
+ if (likely(!list_empty(&mnt->mnt_hash) ||
+ mnt->mnt_flags & MNT_PSEUDO)) {
+ br_read_lock(vfsmount_lock);
+ if (unlikely(list_empty(&mnt->mnt_hash) &&
+ (!(mnt->mnt_flags & MNT_PSEUDO)))) {
+ br_read_unlock(vfsmount_lock);
+ goto repeat;
+ }
+ dec_mnt_count(mnt);
+ br_read_unlock(vfsmount_lock);
return;
+ }
+
br_write_lock(vfsmount_lock);
- if (!atomic_dec_and_test(&mnt->mnt_count)) {
+ dec_mnt_count(mnt);
+ if (count_mnt_count(mnt)) {
br_write_unlock(vfsmount_lock);
return;
}
- if (likely(!mnt->mnt_pinned)) {
+ if (unlikely(mnt->mnt_pinned)) {
+ add_mnt_count(mnt, mnt->mnt_pinned + 1);
+ mnt->mnt_pinned = 0;
br_write_unlock(vfsmount_lock);
- __mntput(mnt);
- return;
+ acct_auto_close_mnt(mnt);
+ goto repeat;
}
- atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
- mnt->mnt_pinned = 0;
br_write_unlock(vfsmount_lock);
- acct_auto_close_mnt(mnt);
- goto repeat;
+ __mntput(mnt);
}
EXPORT_SYMBOL(mntput_no_expire);

+void mntput(struct vfsmount *mnt)
+{
+ if (mnt) {
+ /* avoid cacheline pingpong, hope gcc doesn't get "smart" */
+ if (unlikely(mnt->mnt_expiry_mark))
+ mnt->mnt_expiry_mark = 0;
+ mntput_no_expire(mnt);
+ }
+}
+EXPORT_SYMBOL(mntput);
+
+struct vfsmount *mntget(struct vfsmount *mnt)
+{
+ if (mnt) {
+ preempt_disable();
+ inc_mnt_count(mnt);
+ preempt_enable();
+ }
+ return mnt;
+}
+EXPORT_SYMBOL(mntget);
+
void mnt_pin(struct vfsmount *mnt)
{
br_write_lock(vfsmount_lock);
mnt->mnt_pinned++;
br_write_unlock(vfsmount_lock);
}
-
EXPORT_SYMBOL(mnt_pin);

void mnt_unpin(struct vfsmount *mnt)
{
br_write_lock(vfsmount_lock);
if (mnt->mnt_pinned) {
- atomic_inc(&mnt->mnt_count);
+ inc_mnt_count(mnt);
mnt->mnt_pinned--;
}
br_write_unlock(vfsmount_lock);
}
-
EXPORT_SYMBOL(mnt_unpin);

static inline void mangle(struct seq_file *m, const char *s)
@@ -982,12 +1086,13 @@ int may_umount_tree(struct vfsmount *mnt
int minimum_refs = 0;
struct vfsmount *p;

- br_read_lock(vfsmount_lock);
+ /* write lock needed for count_mnt_count */
+ br_write_lock(vfsmount_lock);
for (p = mnt; p; p = next_mnt(p, mnt)) {
- actual_refs += atomic_read(&p->mnt_count);
+ actual_refs += count_mnt_count(p);
minimum_refs += 2;
}
- br_read_unlock(vfsmount_lock);
+ br_write_unlock(vfsmount_lock);

if (actual_refs > minimum_refs)
return 0;
@@ -1014,10 +1119,10 @@ int may_umount(struct vfsmount *mnt)
{
int ret = 1;
down_read(&namespace_sem);
- br_read_lock(vfsmount_lock);
+ br_write_lock(vfsmount_lock);
if (propagate_mount_busy(mnt, 2))
ret = 0;
- br_read_unlock(vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_read(&namespace_sem);
return ret;
}
@@ -1099,8 +1204,16 @@ static int do_umount(struct vfsmount *mn
flags & (MNT_FORCE | MNT_DETACH))
return -EINVAL;

- if (atomic_read(&mnt->mnt_count) != 2)
+ /*
+ * probably don't strictly need the lock here if we examined
+ * all race cases, but it's a slowpath.
+ */
+ br_write_lock(vfsmount_lock);
+ if (count_mnt_count(mnt) != 2) {
+		br_write_unlock(vfsmount_lock);
return -EBUSY;
+ }
+ br_write_unlock(vfsmount_lock);

if (!xchg(&mnt->mnt_expiry_mark, 1))
return -EAGAIN;
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h
+++ linux-2.6/include/linux/mount.h
@@ -31,6 +31,7 @@ struct mnt_namespace;

#define MNT_SHRINKABLE 0x100
#define MNT_WRITE_HOLD 0x200
+#define MNT_PSEUDO 0x400

#define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
@@ -67,19 +68,15 @@ struct vfsmount {
struct mnt_namespace *mnt_ns; /* containing namespace */
int mnt_id; /* mount identifier */
int mnt_group_id; /* peer group identifier */
- /*
- * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
- * to let these frequently modified fields in a separate cache line
- * (so that reads of mnt_flags wont ping-pong on SMP machines)
- */
- atomic_t mnt_count;
int mnt_expiry_mark; /* true if marked for expiry */
int mnt_pinned;
int mnt_ghosts;
#ifdef CONFIG_SMP
int __percpu *mnt_writers;
+ int __percpu *mnt_count;
#else
int mnt_writers;
+ int mnt_count;
#endif
};

@@ -92,13 +89,6 @@ static inline int *get_mnt_writers_ptr(s
#endif
}

-static inline struct vfsmount *mntget(struct vfsmount *mnt)
-{
- if (mnt)
- atomic_inc(&mnt->mnt_count);
- return mnt;
-}
-
struct file; /* forward dec */

extern int mnt_want_write(struct vfsmount *mnt);
@@ -106,18 +96,12 @@ extern int mnt_want_write_file(struct fi
extern int mnt_clone_write(struct vfsmount *mnt);
extern void mnt_drop_write(struct vfsmount *mnt);
extern void mntput_no_expire(struct vfsmount *mnt);
+extern void mntput(struct vfsmount *mnt);
+extern struct vfsmount *mntget(struct vfsmount *mnt);
extern void mnt_pin(struct vfsmount *mnt);
extern void mnt_unpin(struct vfsmount *mnt);
extern int __mnt_is_readonly(struct vfsmount *mnt);

-static inline void mntput(struct vfsmount *mnt)
-{
- if (mnt) {
- mnt->mnt_expiry_mark = 0;
- mntput_no_expire(mnt);
- }
-}
-
extern struct vfsmount *do_kern_mount(const char *fstype, int flags,
const char *name, void *data);

Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c
+++ linux-2.6/fs/pnode.c
@@ -288,7 +288,7 @@ out:
*/
static inline int do_refcount_check(struct vfsmount *mnt, int count)
{
- int mycount = atomic_read(&mnt->mnt_count) - mnt->mnt_ghosts;
+ int mycount = count_mnt_count(mnt) - mnt->mnt_ghosts;
return (mycount > count);
}

@@ -300,7 +300,7 @@ static inline int do_refcount_check(stru
* Check if any of these mounts that **do not have submounts**
* have more references than 'refcnt'. If so return busy.
*
- * vfsmount lock must be held for read or write
+ * vfsmount lock must be held for write
*/
int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
{
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h
+++ linux-2.6/fs/internal.h
@@ -63,6 +63,7 @@ extern int copy_mount_string(const void

extern void free_vfsmnt(struct vfsmount *);
extern struct vfsmount *alloc_vfsmnt(const char *);
+extern unsigned int count_mnt_count(struct vfsmount *mnt);
extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
struct vfsmount *);
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -241,6 +241,7 @@ int get_sb_pseudo(struct file_system_typ
d_instantiate(dentry, root);
s->s_root = dentry;
s->s_flags |= MS_ACTIVE;
+ mnt->mnt_flags |= MNT_PSEUDO;
simple_set_mnt(mnt, s);
return 0;

Index: linux-2.6/drivers/mtd/mtdchar.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdchar.c
+++ linux-2.6/drivers/mtd/mtdchar.c
@@ -1048,6 +1048,7 @@ err_unregister_chdev:
static void __exit cleanup_mtdchar(void)
{
unregister_mtd_user(&mtdchar_notifier);
+	mtd_inode_mnt->mnt_flags &= ~MNT_PSEUDO;
mntput(mtd_inode_mnt);
unregister_filesystem(&mtd_inodefs_type);
__unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd");
n***@suse.de
2010-06-24 03:02:45 UTC
Permalink
Make last_ino atomic in preparation for removing inode_lock.
Make a new lock for iunique counter, for removing inode_lock.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/inode.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -742,7 +742,7 @@ struct inode *new_inode(struct super_blo
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
- static unsigned int last_ino;
+ static atomic_t last_ino = ATOMIC_INIT(0);
struct inode *inode;

spin_lock_prefetch(&inode_lock);
@@ -752,7 +752,7 @@ struct inode *new_inode(struct super_blo
spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
- inode->i_ino = ++last_ino;
+ inode->i_ino = atomic_inc_return(&last_ino);
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode->i_lock);
@@ -903,6 +903,22 @@ static struct inode *get_new_inode_fast(
return inode;
}

+static int test_inode_iunique(struct super_block * sb, struct hlist_head *head, unsigned long ino)
+{
+ struct hlist_node *node;
+ struct inode * inode = NULL;
+
+ spin_lock(&inode_hash_lock);
+ hlist_for_each_entry(inode, node, head, i_hash) {
+ if (inode->i_ino == ino && inode->i_sb == sb) {
+ spin_unlock(&inode_hash_lock);
+ return 0;
+ }
+ }
+ spin_unlock(&inode_hash_lock);
+ return 1;
+}
+
/**
* iunique - get a unique inode number
* @sb: superblock
@@ -924,20 +940,20 @@ ino_t iunique(struct super_block *sb, in
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
+ static DEFINE_SPINLOCK(unique_lock);
static unsigned int counter;
- struct inode *inode;
struct hlist_head *head;
ino_t res;

spin_lock(&inode_lock);
+ spin_lock(&unique_lock);
do {
if (counter <= max_reserved)
counter = max_reserved + 1;
res = counter++;
head = inode_hashtable + hash(sb, res);
- inode = find_inode_fast(sb, head, res);
- spin_unlock(&inode->i_lock);
- } while (inode != NULL);
+ } while (!test_inode_iunique(sb, head, res));
+ spin_unlock(&unique_lock);
spin_unlock(&inode_lock);

return res;


n***@suse.de
2010-06-24 03:02:36 UTC
Permalink
Use RCU property of dcache to simplify locking in some places where we
take d_parent and d_lock.

Comment: don't need rcu_deref because we take the spinlock and recheck it.

Signed-off-by: Nick Piggin <***@suse.de>
--

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -311,23 +311,18 @@ struct dentry *dget_parent(struct dentry
struct dentry *ret;

repeat:
- spin_lock(&dentry->d_lock);
+ rcu_read_lock();
ret = dentry->d_parent;
- if (!ret)
- goto out;
- if (dentry == ret) {
- ret->d_count++;
- goto out;
- }
- if (!spin_trylock(&ret->d_lock)) {
- spin_unlock(&dentry->d_lock);
+ spin_lock(&ret->d_lock);
+ if (unlikely(ret != dentry->d_parent)) {
+ spin_unlock(&ret->d_lock);
+ rcu_read_unlock();
goto repeat;
}
+ rcu_read_unlock();
BUG_ON(!ret->d_count);
ret->d_count++;
spin_unlock(&ret->d_lock);
-out:
- spin_unlock(&dentry->d_lock);
return ret;
}
EXPORT_SYMBOL(dget_parent);
@@ -601,14 +596,22 @@ static void prune_one_dentry(struct dent
if (inode)
spin_lock(&inode->i_lock);
again:
- spin_lock(&dentry->d_lock);
- if (dentry->d_parent && dentry != dentry->d_parent) {
- if (!spin_trylock(&dentry->d_parent->d_lock)) {
- spin_unlock(&dentry->d_lock);
+ rcu_read_lock();
+ parent = dentry->d_parent;
+ if (parent) {
+ spin_lock(&parent->d_lock);
+ if (unlikely(parent != dentry->d_parent)) {
+ spin_unlock(&parent->d_lock);
+ rcu_read_unlock();
goto again;
}
- parent = dentry->d_parent;
- }
+ if (parent != dentry)
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+ else
+ parent = NULL;
+ } else
+ spin_lock(&dentry->d_lock);
+ rcu_read_unlock();
dentry->d_count--;
if (dentry->d_count) {
if (parent)


Peter Zijlstra
2010-06-24 08:44:22 UTC
Permalink
Post by n***@suse.de
Use RCU property of dcache to simplify locking in some places where we
take d_parent and d_lock.
Comment: don't need rcu_deref because we take the spinlock and recheck it.
But does the LOCK barrier imply a DATA DEPENDENCY barrier? (It does on
x86, and the compiler barrier implied by spin_lock() suffices to replace
ACCESS_ONCE()).
Nick Piggin
2010-06-24 15:07:06 UTC
Permalink
Post by Peter Zijlstra
Post by n***@suse.de
Use RCU property of dcache to simplify locking in some places where we
take d_parent and d_lock.
Comment: don't need rcu_deref because we take the spinlock and recheck it.
But does the LOCK barrier imply a DATA DEPENDENCY barrier? (It does on
x86, and the compiler barrier implied by spin_lock() suffices to replace
ACCESS_ONCE()).
Well the dependency we care about is from loading the parent pointer
to acquiring its spinlock. But we can't possibly have stale data given
to the spin lock operation itself because it is a RMW.
Paul E. McKenney
2010-06-24 15:32:18 UTC
Permalink
Post by Nick Piggin
Post by Peter Zijlstra
Post by n***@suse.de
Use RCU property of dcache to simplify locking in some places where we
take d_parent and d_lock.
Comment: don't need rcu_deref because we take the spinlock and recheck it.
But does the LOCK barrier imply a DATA DEPENDENCY barrier? (It does on
x86, and the compiler barrier implied by spin_lock() suffices to replace
ACCESS_ONCE()).
Well the dependency we care about is from loading the parent pointer
to acquiring its spinlock. But we can't possibly have stale data given
to the spin lock operation itself because it is a RMW.
As long as you check for the structure being valid after acquiring the
lock, I agree. Otherwise, I would be concerned about the following
sequence of events:

1. CPU 0 picks up a pointer to a given data element.

2. CPU 1 removes this element from the list, drops any locks that
it might have, and starts waiting for a grace period to
elapse.

3. CPU 0 acquires the lock, does some operation that would
be appropriate had the element not been removed, then
releases the lock.

4. After the grace period, CPU 1 frees the element, negating
CPU 0's hard work.

The usual approach is to have a "deleted" flag or some such in the
element that CPU 1 would set when removing the element and that CPU 0
would check after acquiring the lock. Which you might well already
be doing! ;-)
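
For illustration only (not the dcache code), the general lock-and-revalidate
pattern looks something like the fragment below; 'deleted', 'head->ptr' and the
retry label are placeholders, not real fields:

retry:
	rcu_read_lock();
	obj = rcu_dereference(head->ptr);	/* step 1: pick up a pointer */
	if (!obj) {
		rcu_read_unlock();
		return NULL;
	}
	spin_lock(&obj->lock);			/* step 3: acquire the lock */
	if (obj->deleted || head->ptr != obj) {
		/* raced with removal (step 2): back off and retry */
		spin_unlock(&obj->lock);
		rcu_read_unlock();
		goto retry;
	}
	rcu_read_unlock();
	/* obj is locked and still reachable: safe to use until unlock */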

Thanx, Paul
Nick Piggin
2010-06-24 16:05:24 UTC
Permalink
Post by Paul E. McKenney
Post by Nick Piggin
Post by Peter Zijlstra
Post by n***@suse.de
Use RCU property of dcache to simplify locking in some places where we
take d_parent and d_lock.
Comment: don't need rcu_deref because we take the spinlock and recheck it.
But does the LOCK barrier imply a DATA DEPENDENCY barrier? (It does on
x86, and the compiler barrier implied by spin_lock() suffices to replace
ACCESS_ONCE()).
Well the dependency we care about is from loading the parent pointer
to acquiring its spinlock. But we can't possibly have stale data given
to the spin lock operation itself because it is a RMW.
As long as you check for the structure being valid after acquiring the
lock, I agree. Otherwise, I would be concerned about the following
1. CPU 0 picks up a pointer to a given data element.
2. CPU 1 removes this element from the list, drops any locks that
it might have, and starts waiting for a grace period to
elapse.
3. CPU 0 acquires the lock, does some operation that would
be appropriate had the element not been removed, then
releases the lock.
4. After the grace period, CPU 1 frees the element, negating
CPU 0's hard work.
The usual approach is to have a "deleted" flag or some such in the
element that CPU 1 would set when removing the element and that CPU 0
would check after acquiring the lock. Which you might well already
be doing! ;-)
Thanks, yep it's done under RCU, and after taking the lock it rechecks
to see that it is still reachable by the same pointer (and if not,
unlocks and retries) so it should be fine.

Thanks,
Nick

Paul E. McKenney
2010-06-24 16:41:35 UTC
Permalink
Post by Nick Piggin
Post by Paul E. McKenney
Post by Nick Piggin
Post by Peter Zijlstra
Post by n***@suse.de
Use RCU property of dcache to simplify locking in some places where we
take d_parent and d_lock.
Comment: don't need rcu_deref because we take the spinlock and recheck it.
But does the LOCK barrier imply a DATA DEPENDENCY barrier? (It does on
x86, and the compiler barrier implied by spin_lock() suffices to replace
ACCESS_ONCE()).
Well the dependency we care about is from loading the parent pointer
to acquiring its spinlock. But we can't possibly have stale data given
to the spin lock operation itself because it is a RMW.
As long as you check for the structure being valid after acquiring the
lock, I agree. Otherwise, I would be concerned about the following
1. CPU 0 picks up a pointer to a given data element.
2. CPU 1 removes this element from the list, drops any locks that
it might have, and starts waiting for a grace period to
elapse.
3. CPU 0 acquires the lock, does some operation that would
be appropriate had the element not been removed, then
releases the lock.
4. After the grace period, CPU 1 frees the element, negating
CPU 0's hard work.
The usual approach is to have a "deleted" flag or some such in the
element that CPU 1 would set when removing the element and that CPU 0
would check after acquiring the lock. Which you might well already
be doing! ;-)
Thanks, yep it's done under RCU, and after taking the lock it rechecks
to see that it is still reachable by the same pointer (and if not,
unlocks and retries) so it should be fine.
Very good!!! ;-)

Thanx, Paul
Paul E. McKenney
2010-06-28 21:50:13 UTC
Permalink
Post by n***@suse.de
Use RCU property of dcache to simplify locking in some places where we
take d_parent and d_lock.
Comment: don't need rcu_deref because we take the spinlock and recheck it.
Looks good other than one question below.

Thanx, Paul
Post by n***@suse.de
--
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -311,23 +311,18 @@ struct dentry *dget_parent(struct dentry
struct dentry *ret;
- spin_lock(&dentry->d_lock);
+ rcu_read_lock();
ret = dentry->d_parent;
Doesn't this need to be as follows?

ret = rcu_dereference(dentry)->d_parent;

Otherwise, couldn't we end up seeing pre-initialization value for
->d_parent for a newly inserted dentry?
Post by n***@suse.de
- if (!ret)
- goto out;
- if (dentry == ret) {
- ret->d_count++;
- goto out;
- }
- if (!spin_trylock(&ret->d_lock)) {
- spin_unlock(&dentry->d_lock);
+ spin_lock(&ret->d_lock);
Once we do this, however, we are golden, at least for all dentry
fields protected by ->d_lock. This does assume that the compiler does not
speculate the fetch that initialized the argument dentry into the critical
section, which I would sure hope would be a reasonable assumption.
Post by n***@suse.de
+ if (unlikely(ret != dentry->d_parent)) {
+ spin_unlock(&ret->d_lock);
+ rcu_read_unlock();
goto repeat;
}
+ rcu_read_unlock();
BUG_ON(!ret->d_count);
ret->d_count++;
spin_unlock(&ret->d_lock);
- spin_unlock(&dentry->d_lock);
return ret;
}
EXPORT_SYMBOL(dget_parent);
@@ -601,14 +596,22 @@ static void prune_one_dentry(struct dent
if (inode)
spin_lock(&inode->i_lock);
- spin_lock(&dentry->d_lock);
- if (dentry->d_parent && dentry != dentry->d_parent) {
- if (!spin_trylock(&dentry->d_parent->d_lock)) {
- spin_unlock(&dentry->d_lock);
+ rcu_read_lock();
+ parent = dentry->d_parent;
+ if (parent) {
+ spin_lock(&parent->d_lock);
+ if (unlikely(parent != dentry->d_parent)) {
+ spin_unlock(&parent->d_lock);
+ rcu_read_unlock();
goto again;
}
- parent = dentry->d_parent;
- }
+ if (parent != dentry)
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+ else
+ parent = NULL;
+ } else
+ spin_lock(&dentry->d_lock);
+ rcu_read_unlock();
dentry->d_count--;
if (dentry->d_count) {
if (parent)
n***@suse.de
2010-06-24 03:02:14 UTC
Permalink
list_for_each_entry_safe is not suitable to protect against concurrent
modification of the list. 6754af6 introduced a race in sb walking.

list_for_each_entry can use the trick of pinning the current entry in
the list before we drop and retake the lock because it subsequently
follows cur->next. However list_for_each_entry_safe saves n=cur->next
for following before entering the loop body, so when the lock is
dropped, n may be deleted.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 2 ++
fs/super.c | 6 ++++++
include/linux/list.h | 15 +++++++++++++++
3 files changed, 23 insertions(+)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -590,6 +590,8 @@ static void prune_dcache(int count)
up_read(&sb->s_umount);
}
spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
count -= pruned;
__put_super(sb);
/* more work left to do? */
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -374,6 +374,8 @@ void sync_supers(void)
up_read(&sb->s_umount);

spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
}
@@ -405,6 +407,8 @@ void iterate_supers(void (*f)(struct sup
up_read(&sb->s_umount);

spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
spin_unlock(&sb_lock);
@@ -585,6 +589,8 @@ static void do_emergency_remount(struct
}
up_write(&sb->s_umount);
spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
spin_unlock(&sb_lock);
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h
+++ linux-2.6/include/linux/list.h
@@ -544,6 +544,21 @@ static inline void list_splice_tail_init
&pos->member != (head); \
pos = n, n = list_entry(n->member.prev, typeof(*n), member))

+/**
+ * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
+ * @pos: the loop cursor used in the list_for_each_entry_safe loop
+ * @n: temporary storage used in list_for_each_entry_safe
+ * @member: the name of the list_struct within the struct.
+ *
+ * list_safe_reset_next is not safe to use in general if the list may be
+ * modified concurrently (eg. the lock is dropped in the loop body). An
+ * exception to this is if the cursor element (pos) is pinned in the list,
+ * and list_safe_reset_next is called after re-taking the lock and before
+ * completing the current iteration of the loop body.
+ */
+#define list_safe_reset_next(pos, n, member) \
+ n = list_entry(pos->member.next, typeof(*pos), member)
+
/*
* Double linked lists with a single pointer list head.
* Mostly useful for hash tables where the two pointer list head is


Christoph Hellwig
2010-06-29 13:02:14 UTC
Permalink
This should actually be on its way to Linus for .35, shouldn't it?
Post by n***@suse.de
list_for_each_entry_safe is not suitable to protect against concurrent
modification of the list. 6754af6 introduced a race in sb walking.
list_for_each_entry can use the trick of pinning the current entry in
the list before we drop and retake the lock because it subsequently
follows cur->next. However list_for_each_entry_safe saves n=cur->next
for following before entering the loop body, so when the lock is
dropped, n may be deleted.
---
fs/dcache.c | 2 ++
fs/super.c | 6 ++++++
include/linux/list.h | 15 +++++++++++++++
3 files changed, 23 insertions(+)
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -590,6 +590,8 @@ static void prune_dcache(int count)
up_read(&sb->s_umount);
}
spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
count -= pruned;
__put_super(sb);
/* more work left to do? */
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -374,6 +374,8 @@ void sync_supers(void)
up_read(&sb->s_umount);
spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
}
@@ -405,6 +407,8 @@ void iterate_supers(void (*f)(struct sup
up_read(&sb->s_umount);
spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
spin_unlock(&sb_lock);
@@ -585,6 +589,8 @@ static void do_emergency_remount(struct
}
up_write(&sb->s_umount);
spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
spin_unlock(&sb_lock);
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h
+++ linux-2.6/include/linux/list.h
@@ -544,6 +544,21 @@ static inline void list_splice_tail_init
&pos->member != (head); \
pos = n, n = list_entry(n->member.prev, typeof(*n), member))
+/**
+ * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
+ *
+ * list_safe_reset_next is not safe to use in general if the list may be
+ * modified concurrently (eg. the lock is dropped in the loop body). An
+ * exception to this is if the cursor element (pos) is pinned in the list,
+ * and list_safe_reset_next is called after re-taking the lock and before
+ * completing the current iteration of the loop body.
+ */
+#define list_safe_reset_next(pos, n, member) \
+ n = list_entry(pos->member.next, typeof(*pos), member)
+
/*
* Double linked lists with a single pointer list head.
* Mostly useful for hash tables where the two pointer list head is
Nick Piggin
2010-06-29 14:56:17 UTC
Permalink
Post by Christoph Hellwig
This should actually be on its way to Linus for .35, shouldn't it?
Yeah, I was waiting for Al to reappear, but I think this is
probably the nicest way to solve the problem. Linus?
--
fs: fix superblock iteration race

list_for_each_entry_safe is not suitable to protect against concurrent
modification of the list. 6754af6 introduced a race in sb walking.

list_for_each_entry can use the trick of pinning the current entry in
the list before we drop and retake the lock because it subsequently
follows cur->next. However list_for_each_entry_safe saves n=cur->next
for following before entering the loop body, so when the lock is
dropped, n may be deleted.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 2 ++
fs/super.c | 6 ++++++
include/linux/list.h | 15 +++++++++++++++
3 files changed, 23 insertions(+)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -590,6 +590,8 @@ static void prune_dcache(int count)
up_read(&sb->s_umount);
}
spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
count -= pruned;
__put_super(sb);
/* more work left to do? */
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -374,6 +374,8 @@ void sync_supers(void)
up_read(&sb->s_umount);

spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
}
@@ -405,6 +407,8 @@ void iterate_supers(void (*f)(struct sup
up_read(&sb->s_umount);

spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
spin_unlock(&sb_lock);
@@ -585,6 +589,8 @@ static void do_emergency_remount(struct
}
up_write(&sb->s_umount);
spin_lock(&sb_lock);
+ /* lock was dropped, must reset next */
+ list_safe_reset_next(sb, n, s_list);
__put_super(sb);
}
spin_unlock(&sb_lock);
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h
+++ linux-2.6/include/linux/list.h
@@ -544,6 +544,21 @@ static inline void list_splice_tail_init
&pos->member != (head); \
pos = n, n = list_entry(n->member.prev, typeof(*n), member))

+/**
+ * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
+ * @pos: the loop cursor used in the list_for_each_entry_safe loop
+ * @n: temporary storage used in list_for_each_entry_safe
+ * @member: the name of the list_struct within the struct.
+ *
+ * list_safe_reset_next is not safe to use in general if the list may be
+ * modified concurrently (eg. the lock is dropped in the loop body). An
+ * exception to this is if the cursor element (pos) is pinned in the list,
+ * and list_safe_reset_next is called after re-taking the lock and before
+ * completing the current iteration of the loop body.
+ */
+#define list_safe_reset_next(pos, n, member) \
+ n = list_entry(pos->member.next, typeof(*pos), member)
+
/*
* Double linked lists with a single pointer list head.
* Mostly useful for hash tables where the two pointer list head is
Linus Torvalds
2010-06-29 17:35:47 UTC
Permalink
Post by Nick Piggin
Post by Christoph Hellwig
This should actually be on its way to Linus for .35, shouldn't it?
Yeah, I was waiting for Al to reappear, but I think this is
probably the nicest way to solve the problem. Linus?
I'll apply it. We have a couple of oopses listed for the superblock
iterator, and I haven't heard from Al. And the patch looks obviously
fine, whether it's actually the cause of some of the bugs or not.

Linus
Nick Piggin
2010-06-29 17:41:22 UTC
Permalink
Post by Linus Torvalds
Post by Nick Piggin
Post by Christoph Hellwig
This should actually be on its way to Linus for .35, shouldn't it?
Yeah, I was waiting for Al to reappear, but I think this is
probably the nicest way to solve the problem. Linus?
I'll apply it. We have a couple of oopses listed for the superblock
iterator, and I haven't heard from Al. And the patch looks obviously
fine, whether it's actually the cause of some of the bugs or not.
OK. I have only managed to get it into an infinite loop, but I think
it would surely be possible to oops it, because the next pointer can
be uninitialised memory at that point.

Linus Torvalds
2010-06-29 17:52:03 UTC
Permalink
Post by Nick Piggin
Post by Linus Torvalds
I'll apply it. We have a couple of oopses listed for the superblock
iterator, and I haven't heard from Al. And the patch looks obviously
fine, whether it's actually the cause of some of the bugs or not.
OK. I have only managed to get it into an infinite loop, but I think
it would surely be possible to oops it, because the next pointer can
be uninitialised memory at that point.
Look for "2.6.35-rc3 oops trying to suspend" on lkml, for example. No
guarantee that it's the same thing, but it's "iterate_supers()"
getting an oops when it does "down_read(&sb->s_umount)". Which really
looks suspiciously like "sb" just being totally bogus, most likely
because of this same issue.

So I dunno, but I asked Al to look at it, and haven't heard back.

Regardless, I think your patch is the right thing to do (modulo any
syntactic issues - and I think your final version was the best of the
lot).

Linus
Linus Torvalds
2010-06-29 17:58:14 UTC
Permalink
On Tue, Jun 29, 2010 at 10:52 AM, Linus Torvalds
Post by Linus Torvalds
Look for "2.6.35-rc3 oops trying to suspend" on lkml, for example. No
guarantee that it's the same thing, but it's "iterate_supers()"
getting an oops [..]
Also, "Oops during closedown with 2.6.35-rc3-git3" is an
iterate_supers oops (in the jpg) and Chris says it's repeatable for
him.

Chris - you could try testing current -git now that I've merged Nick's
patch. It's commit 57439f878af ("fs: fix superblock iteration race"),
and I just pushed it out (so it might take a few minutes to mirror out
to the public git trees, but it should be there shortly).

Linus
n***@suse.de
2010-06-24 03:02:51 UTC
Permalink
Walk the per-sb inode list (s_inodes) under RCU instead of sb_inode_list_lock.
This enables locking to be reduced and the lock ordering simplified.
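The resulting walk pattern looks roughly like this (an illustrative sketch only;
the concrete instances are drop_pagecache_sb() and add_dquot_ref() below):
traverse s_inodes under rcu_read_lock(), validate and pin each candidate inode
under its i_lock, and drop RCU before doing anything that can block.

	struct inode *inode, *prev = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);			/* pin the inode */
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();

		/* blocking work on the pinned inode goes here */

		iput(prev);			/* drop the previous pin outside RCU */
		prev = inode;
		rcu_read_lock();		/* the pinned inode keeps our place */
	}
	rcu_read_unlock();
	iput(prev);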

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/drop_caches.c | 10 ++++----
fs/inode.c | 53 +++++++++++++++-----------------------------
fs/notify/inode_mark.c | 10 --------
fs/notify/inotify/inotify.c | 10 --------
fs/quota/dquot.c | 16 ++++++-------
5 files changed, 32 insertions(+), 67 deletions(-)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -16,8 +16,8 @@ static void drop_pagecache_sb(struct sup
{
struct inode *inode, *toput_inode = NULL;

- spin_lock(&sb_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)
|| inode->i_mapping->nrpages == 0) {
@@ -26,13 +26,13 @@ static void drop_pagecache_sb(struct sup
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&sb_inode_list_lock);
+ rcu_read_unlock();
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
- spin_lock(&sb_inode_list_lock);
+ rcu_read_lock();
}
- spin_unlock(&sb_inode_list_lock);
+ rcu_read_unlock();
iput(toput_inode);
}

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -44,10 +44,10 @@
*
* Ordering:
* inode_lock
- * sb_inode_list_lock
- * inode->i_lock
- * wb_inode_list_lock
- * inode_hash_bucket lock
+ * inode->i_lock
+ * sb_inode_list_lock
+ * wb_inode_list_lock
+ * inode_hash_bucket lock
*/
/*
* This is needed for the following functions:
@@ -379,12 +379,12 @@ static void dispose_list(struct list_hea
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);

- spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
__remove_inode_hash(inode);
- list_del_init(&inode->i_sb_list);
- spin_unlock(&inode->i_lock);
+ spin_lock(&sb_inode_list_lock);
+ list_del_rcu(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
+ spin_unlock(&inode->i_lock);

wake_up_inode(inode);
destroy_inode(inode);
@@ -406,14 +406,6 @@ static int invalidate_list(struct list_h
struct list_head *tmp = next;
struct inode *inode;

- /*
- * We can reschedule here without worrying about the list's
- * consistency because the per-sb list of inodes must not
- * change during umount anymore, and because iprune_sem keeps
- * shrink_icache_memory() away.
- */
- cond_resched_lock(&sb_inode_list_lock);
-
next = next->next;
if (tmp == head)
break;
@@ -456,12 +448,17 @@ int invalidate_inodes(struct super_block
int busy;
LIST_HEAD(throw_away);

+ /*
+ * Don't need to worry about the list's consistency because the per-sb
+ * list of inodes must not change during umount anymore, and because
+ * iprune_sem keeps shrink_icache_memory() away.
+ */
down_write(&iprune_sem);
- spin_lock(&sb_inode_list_lock);
+// spin_lock(&sb_inode_list_lock); XXX: is this safe?
inotify_unmount_inodes(&sb->s_inodes);
fsnotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(&sb->s_inodes, &throw_away);
- spin_unlock(&sb_inode_list_lock);
+// spin_unlock(&sb_inode_list_lock);

dispose_list(&throw_away);
up_write(&iprune_sem);
@@ -665,7 +662,8 @@ __inode_add_to_lists(struct super_block
struct inode *inode)
{
atomic_inc(&inodes_stat.nr_inodes);
- list_add(&inode->i_sb_list, &sb->s_inodes);
+ spin_lock(&sb_inode_list_lock);
+ list_add_rcu(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
if (b) {
spin_lock_bucket(b);
@@ -690,7 +688,6 @@ void inode_add_to_lists(struct super_blo
{
struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);

- spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
__inode_add_to_lists(sb, b, inode);
spin_unlock(&inode->i_lock);
@@ -722,7 +719,6 @@ struct inode *new_inode(struct super_blo
inode = alloc_inode(sb);
if (inode) {
/* XXX: init as locked for speedup */
- spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
inode->i_ino = atomic_inc_return(&last_ino);
inode->i_state = 0;
@@ -789,7 +785,6 @@ static struct inode *get_new_inode(struc
/* We released the lock, so.. */
old = find_inode(sb, b, test, data);
if (!old) {
- spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
if (set(inode, data))
goto set_failed;
@@ -819,7 +814,6 @@ static struct inode *get_new_inode(struc

set_failed:
spin_unlock(&inode->i_lock);
- spin_unlock(&sb_inode_list_lock);
destroy_inode(inode);
return NULL;
}
@@ -840,7 +834,6 @@ static struct inode *get_new_inode_fast(
/* We released the lock, so.. */
old = find_inode_fast(sb, b, ino);
if (!old) {
- spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
inode->i_ino = ino;
inode->i_state = I_NEW;
@@ -1320,7 +1313,8 @@ void generic_delete_inode(struct inode *
if (!inode->i_state)
atomic_dec(&inodes_stat.nr_unused);
}
- list_del_init(&inode->i_sb_list);
+ spin_lock(&sb_inode_list_lock);
+ list_del_rcu(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
@@ -1377,15 +1371,12 @@ int generic_detach_inode(struct inode *i
atomic_inc(&inodes_stat.nr_unused);
}
spin_unlock(&inode->i_lock);
- spin_unlock(&sb_inode_list_lock);
return 0;
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode->i_lock);
- spin_unlock(&sb_inode_list_lock);
write_inode_now(inode, 1);
- spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
@@ -1398,7 +1389,8 @@ int generic_detach_inode(struct inode *i
if (!inode->i_state)
atomic_dec(&inodes_stat.nr_unused);
}
- list_del_init(&inode->i_sb_list);
+ spin_lock(&sb_inode_list_lock);
+ list_del_rcu(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
@@ -1468,19 +1460,12 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state == I_CLEAR);

-retry:
spin_lock(&inode->i_lock);
- if (inode->i_count == 1) {
- if (!spin_trylock(&sb_inode_list_lock)) {
- spin_unlock(&inode->i_lock);
- goto retry;
- }
- inode->i_count--;
+ inode->i_count--;
+ if (inode->i_count == 0)
iput_final(inode);
- } else {
- inode->i_count--;
+ else
spin_unlock(&inode->i_lock);
- }
}
}
EXPORT_SYMBOL(iput);
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c
+++ linux-2.6/fs/notify/inode_mark.c
@@ -412,14 +412,6 @@ void fsnotify_unmount_inodes(struct list
spin_unlock(&next_i->i_lock);
}

- /*
- * We can safely drop inode_lock here because we hold
- * references on both inode and next_i. Also no new inodes
- * will be added since the umount has begun. Finally,
- * iprune_mutex keeps shrink_icache_memory() away.
- */
- spin_unlock(&sb_inode_list_lock);
-
if (need_iput_tmp)
iput(need_iput_tmp);

@@ -429,7 +421,5 @@ void fsnotify_unmount_inodes(struct list
fsnotify_inode_delete(inode);

iput(inode);
-
- spin_lock(&sb_inode_list_lock);
}
}
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -434,14 +434,6 @@ void inotify_unmount_inodes(struct list_
spin_unlock(&next_i->i_lock);
}

- /*
- * We can safely drop inode_lock here because we hold
- * references on both inode and next_i. Also no new inodes
- * will be added since the umount has begun. Finally,
- * iprune_mutex keeps shrink_icache_memory() away.
- */
- spin_unlock(&sb_inode_list_lock);
-
if (need_iput_tmp)
iput(need_iput_tmp);

@@ -459,8 +451,6 @@ void inotify_unmount_inodes(struct list_
}
mutex_unlock(&inode->inotify_mutex);
iput(inode);
-
- spin_lock(&sb_inode_list_lock);
}
}
EXPORT_SYMBOL_GPL(inotify_unmount_inodes);
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -883,8 +883,8 @@ static void add_dquot_ref(struct super_b
int reserved = 0;
#endif

- spin_lock(&sb_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
@@ -905,7 +905,7 @@ static void add_dquot_ref(struct super_b

__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&sb_inode_list_lock);
+ rcu_read_unlock();

iput(old_inode);
__dquot_initialize(inode, type);
@@ -915,9 +915,9 @@ static void add_dquot_ref(struct super_b
* reference and we cannot iput it under inode_lock. So we
* keep the reference and iput it later. */
old_inode = inode;
- spin_lock(&sb_inode_list_lock);
+ rcu_read_lock();
}
- spin_unlock(&sb_inode_list_lock);
+ rcu_read_unlock();
iput(old_inode);

#ifdef CONFIG_QUOTA_DEBUG
@@ -995,8 +995,8 @@ static void remove_dquot_ref(struct supe
{
struct inode *inode;

- spin_lock(&sb_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
/*
* We have to scan also I_NEW inodes because they can already
* have quota pointer initialized. Luckily, we need to touch
@@ -1006,7 +1006,7 @@ static void remove_dquot_ref(struct supe
if (!IS_NOQUOTA(inode))
remove_inode_dquot_ref(inode, type, tofree_head);
}
- spin_unlock(&sb_inode_list_lock);
+ rcu_read_unlock();
}

/* Gather all references from inodes and drop them */


n***@suse.de
2010-06-24 03:02:48 UTC
Permalink
Remove the global inode_hash_lock and replace it with per-hash-bucket locks.
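In short (a simplified sketch, not one of the hunks below): the hash table
becomes an array of hlist_bl_heads, and bit 0 of each head pointer doubles as
that bucket's lock via bit_spin_lock(), so the finer-grained locking costs no
memory over the plain hlist_head table. A lookup then only serializes against
users of the same bucket:

	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
	struct hlist_bl_node *node;
	struct inode *inode;

	spin_lock_bucket(b);		/* bit_spin_lock(0, ...) on the head pointer */
	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
		if (inode->i_ino == ino && inode->i_sb == sb)
			break;		/* found; node is non-NULL */
	}
	spin_unlock_bucket(b);
	/* node ? inode : NULL is the lookup result */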

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/inode.c | 167 ++++++++++++++++++++++++++++++++-----------------------------
1 file changed, 90 insertions(+), 77 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -25,12 +25,13 @@
#include <linux/mount.h>
#include <linux/async.h>
#include <linux/posix_acl.h>
+#include <linux/bit_spinlock.h>

/*
* Usage:
* sb_inode_list_lock protects:
* s_inodes, i_sb_list
- * inode_hash_lock protects:
+ * inode_hash_bucket lock protects:
* inode hash table, i_hash
* wb_inode_list_lock protects:
* inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
@@ -46,7 +47,7 @@
* sb_inode_list_lock
* inode->i_lock
* wb_inode_list_lock
- * inode_hash_lock
+ * inode_hash_bucket lock
*/
/*
* This is needed for the following functions:
@@ -97,7 +98,22 @@ static unsigned int i_hash_shift __read_

LIST_HEAD(inode_in_use);
LIST_HEAD(inode_unused);
-static struct hlist_head *inode_hashtable __read_mostly;
+
+struct inode_hash_bucket {
+ struct hlist_bl_head head;
+};
+
+static inline void spin_lock_bucket(struct inode_hash_bucket *b)
+{
+ bit_spin_lock(0, (unsigned long *)b);
+}
+
+static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
+{
+ __bit_spin_unlock(0, (unsigned long *)b);
+}
+
+static struct inode_hash_bucket *inode_hashtable __read_mostly;

/*
* A simple spinlock to protect the list manipulations.
@@ -107,7 +123,6 @@ static struct hlist_head *inode_hashtabl
*/
DEFINE_SPINLOCK(sb_inode_list_lock);
DEFINE_SPINLOCK(wb_inode_list_lock);
-static DEFINE_SPINLOCK(inode_hash_lock);

/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -280,7 +295,7 @@ void destroy_inode(struct inode *inode)
void inode_init_once(struct inode *inode)
{
memset(inode, 0, sizeof(*inode));
- INIT_HLIST_NODE(&inode->i_hash);
+ INIT_HLIST_BL_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
@@ -596,20 +611,21 @@ static void __wait_on_freeing_inode(stru
* add any additional branch in the common code.
*/
static struct inode *find_inode(struct super_block *sb,
- struct hlist_head *head,
+ struct inode_hash_bucket *b,
int (*test)(struct inode *, void *),
void *data)
{
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *inode = NULL;

repeat:
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_sb != sb)
continue;
if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
+ cpu_relax();
goto repeat;
}
if (!test(inode, data)) {
@@ -617,13 +633,13 @@ repeat:
continue;
}
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
return node ? inode : NULL;
}

@@ -632,30 +648,32 @@ repeat:
* iget_locked for details.
*/
static struct inode *find_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b,
+ unsigned long ino)
{
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *inode = NULL;

repeat:
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
+ cpu_relax();
goto repeat;
}
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
return node ? inode : NULL;
}

@@ -670,7 +688,7 @@ static unsigned long hash(struct super_b
}

static inline void
-__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
+__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
struct inode *inode)
{
atomic_inc(&inodes_stat.nr_inodes);
@@ -679,10 +697,10 @@ __inode_add_to_lists(struct super_block
spin_lock(&wb_inode_list_lock);
list_add(&inode->i_list, &inode_in_use);
spin_unlock(&wb_inode_list_lock);
- if (head) {
- spin_lock(&inode_hash_lock);
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_hash_lock);
+ if (b) {
+ spin_lock_bucket(b);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
}
}

@@ -700,11 +718,11 @@ __inode_add_to_lists(struct super_block
*/
void inode_add_to_lists(struct super_block *sb, struct inode *inode)
{
- struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);

spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -786,7 +804,7 @@ EXPORT_SYMBOL(unlock_new_inode);
* -- ***@arm.uk.linux.org
*/
static struct inode *get_new_inode(struct super_block *sb,
- struct hlist_head *head,
+ struct inode_hash_bucket *b,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *),
void *data)
@@ -798,7 +816,7 @@ static struct inode *get_new_inode(struc
struct inode *old;

/* We released the lock, so.. */
- old = find_inode(sb, head, test, data);
+ old = find_inode(sb, b, test, data);
if (!old) {
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
@@ -806,7 +824,7 @@ static struct inode *get_new_inode(struc
goto set_failed;

inode->i_state = I_NEW;
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
spin_unlock(&inode->i_lock);

/* Return the locked inode with I_NEW set, the
@@ -840,7 +858,7 @@ set_failed:
* comment at iget_locked for details.
*/
static struct inode *get_new_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b, unsigned long ino)
{
struct inode *inode;

@@ -849,13 +867,13 @@ static struct inode *get_new_inode_fast(
struct inode *old;

/* We released the lock, so.. */
- old = find_inode_fast(sb, head, ino);
+ old = find_inode_fast(sb, b, ino);
if (!old) {
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
inode->i_ino = ino;
inode->i_state = I_NEW;
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
spin_unlock(&inode->i_lock);

/* Return the locked inode with I_NEW set, the
@@ -878,19 +896,20 @@ static struct inode *get_new_inode_fast(
return inode;
}

-static int test_inode_iunique(struct super_block * sb, struct hlist_head *head, unsigned long ino)
+static int test_inode_iunique(struct super_block *sb,
+ struct inode_hash_bucket *b, unsigned long ino)
{
- struct hlist_node *node;
- struct inode * inode = NULL;
+ struct hlist_bl_node *node;
+ struct inode *inode = NULL;

- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_ino == ino && inode->i_sb == sb) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
return 0;
}
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
return 1;
}

@@ -917,7 +936,7 @@ ino_t iunique(struct super_block *sb, in
*/
static DEFINE_SPINLOCK(unique_lock);
static unsigned int counter;
- struct hlist_head *head;
+ struct inode_hash_bucket *b;
ino_t res;

spin_lock(&unique_lock);
@@ -925,8 +944,8 @@ ino_t iunique(struct super_block *sb, in
if (counter <= max_reserved)
counter = max_reserved + 1;
res = counter++;
- head = inode_hashtable + hash(sb, res);
- } while (!test_inode_iunique(sb, head, res));
+ b = inode_hashtable + hash(sb, res);
+ } while (!test_inode_iunique(sb, b, res));
spin_unlock(&unique_lock);

return res;
@@ -973,12 +992,13 @@ EXPORT_SYMBOL(igrab);
* Note, @test is called with the inode_lock held, so can't sleep.
*/
static struct inode *ifind(struct super_block *sb,
- struct hlist_head *head, int (*test)(struct inode *, void *),
+ struct inode_hash_bucket *b,
+ int (*test)(struct inode *, void *),
void *data, const int wait)
{
struct inode *inode;

- inode = find_inode(sb, head, test, data);
+ inode = find_inode(sb, b, test, data);
if (inode) {
__iget(inode);
spin_unlock(&inode->i_lock);
@@ -1005,11 +1025,12 @@ static struct inode *ifind(struct super_
* Otherwise NULL is returned.
*/
static struct inode *ifind_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b,
+ unsigned long ino)
{
struct inode *inode;

- inode = find_inode_fast(sb, head, ino);
+ inode = find_inode_fast(sb, b, ino);
if (inode) {
__iget(inode);
spin_unlock(&inode->i_lock);
@@ -1043,9 +1064,9 @@ static struct inode *ifind_fast(struct s
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);

- return ifind(sb, head, test, data, 0);
+ return ifind(sb, b, test, data, 0);
}
EXPORT_SYMBOL(ilookup5_nowait);

@@ -1071,9 +1092,9 @@ EXPORT_SYMBOL(ilookup5_nowait);
struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);

- return ifind(sb, head, test, data, 1);
+ return ifind(sb, b, test, data, 1);
}
EXPORT_SYMBOL(ilookup5);

@@ -1093,9 +1114,9 @@ EXPORT_SYMBOL(ilookup5);
*/
struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);

- return ifind_fast(sb, head, ino);
+ return ifind_fast(sb, b, ino);
}
EXPORT_SYMBOL(ilookup);

@@ -1123,17 +1144,17 @@ struct inode *iget5_locked(struct super_
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
struct inode *inode;

- inode = ifind(sb, head, test, data, 1);
+ inode = ifind(sb, b, test, data, 1);
if (inode)
return inode;
/*
* get_new_inode() will do the right thing, re-trying the search
* in case it had to block at any point.
*/
- return get_new_inode(sb, head, test, set, data);
+ return get_new_inode(sb, b, test, set, data);
}
EXPORT_SYMBOL(iget5_locked);

@@ -1154,17 +1175,17 @@ EXPORT_SYMBOL(iget5_locked);
*/
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
struct inode *inode;

- inode = ifind_fast(sb, head, ino);
+ inode = ifind_fast(sb, b, ino);
if (inode)
return inode;
/*
* get_new_inode_fast() will do the right thing, re-trying the search
* in case it had to block at any point.
*/
- return get_new_inode_fast(sb, head, ino);
+ return get_new_inode_fast(sb, b, ino);
}
EXPORT_SYMBOL(iget_locked);

@@ -1172,16 +1193,16 @@ int insert_inode_locked(struct inode *in
{
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);

inode->i_state |= I_NEW;
while (1) {
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *old = NULL;

repeat:
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(old, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
if (old->i_ino != ino)
continue;
if (old->i_sb != sb)
@@ -1189,21 +1210,21 @@ repeat:
if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
continue;
if (!spin_trylock(&old->i_lock)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
goto repeat;
}
break;
}
if (likely(!node)) {
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
return 0;
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
__iget(old);
spin_unlock(&old->i_lock);
wait_on_inode(old);
- if (unlikely(!hlist_unhashed(&old->i_hash))) {
+ if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
return -EBUSY;
}
@@ -1216,17 +1237,17 @@ int insert_inode_locked4(struct inode *i
int (*test)(struct inode *, void *), void *data)
{
struct super_block *sb = inode->i_sb;
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);

inode->i_state |= I_NEW;

while (1) {
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *old = NULL;

repeat:
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(old, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
if (old->i_sb != sb)
continue;
if (!test(old, data))
@@ -1234,21 +1255,21 @@ repeat:
if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
continue;
if (!spin_trylock(&old->i_lock)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
goto repeat;
}
break;
}
if (likely(!node)) {
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
return 0;
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
__iget(old);
spin_unlock(&old->i_lock);
wait_on_inode(old);
- if (unlikely(!hlist_unhashed(&old->i_hash))) {
+ if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
return -EBUSY;
}
@@ -1267,12 +1288,12 @@ EXPORT_SYMBOL(insert_inode_locked4);
*/
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
- struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, hashval);

spin_lock(&inode->i_lock);
- spin_lock(&inode_hash_lock);
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_hash_lock);
+ spin_lock_bucket(b);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -1286,9 +1307,10 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
void __remove_inode_hash(struct inode *inode)
{
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
+ struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
+ spin_lock_bucket(b);
+ hlist_bl_del_init(&inode->i_hash);
+ spin_unlock_bucket(b);
}

/**
@@ -1370,7 +1392,7 @@ int generic_detach_inode(struct inode *i
{
struct super_block *sb = inode->i_sb;

- if (!hlist_unhashed(&inode->i_hash)) {
+ if (!hlist_bl_unhashed(&inode->i_hash)) {
if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, &inode_unused);
@@ -1699,7 +1721,7 @@ void __init inode_init_early(void)

inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct inode_hash_bucket),
ihash_entries,
14,
HASH_EARLY,
@@ -1708,7 +1730,7 @@ void __init inode_init_early(void)
0);

for (loop = 0; loop < (1 << i_hash_shift); loop++)
- INIT_HLIST_HEAD(&inode_hashtable[loop]);
+ INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
}

void __init inode_init(void)
@@ -1730,7 +1752,7 @@ void __init inode_init(void)

inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct inode_hash_bucket),
ihash_entries,
14,
0,
@@ -1739,7 +1761,7 @@ void __init inode_init(void)
0);

for (loop = 0; loop < (1 << i_hash_shift); loop++)
- INIT_HLIST_HEAD(&inode_hashtable[loop]);
+ INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
}

void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -1147,7 +1147,7 @@ void __mark_inode_dirty(struct inode *in
* dirty list. Add blockdev inodes as well.
*/
if (!S_ISBLK(inode->i_mode)) {
- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
goto out;
}
if (inode->i_state & (I_FREEING|I_CLEAR))
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -723,7 +723,7 @@ struct posix_acl;
#define ACL_NOT_CACHED ((void *)(-1))

struct inode {
- struct hlist_node i_hash;
+ struct hlist_bl_node i_hash;
struct list_head i_list; /* backing dev IO list */
struct list_head i_sb_list;
struct list_head i_dentry;
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -2122,7 +2122,7 @@ static int shmem_encode_fh(struct dentry
if (*len < 3)
return 255;

- if (hlist_unhashed(&inode->i_hash)) {
+ if (hlist_bl_unhashed(&inode->i_hash)) {
/* Unfortunately insert_inode_hash is not idempotent,
* so as we hash inodes here rather than at creation
* time, we need a lock to ensure we only try
@@ -2130,7 +2130,7 @@ static int shmem_encode_fh(struct dentry
*/
static DEFINE_SPINLOCK(lock);
spin_lock(&lock);
- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
__insert_inode_hash(inode,
inode->i_ino + inode->i_generation);
spin_unlock(&lock);
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c
+++ linux-2.6/fs/btrfs/inode.c
@@ -3849,7 +3849,7 @@ again:
p = &root->inode_tree.rb_node;
parent = NULL;

- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
return;

spin_lock(&root->inode_lock);
Index: linux-2.6/fs/reiserfs/xattr.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/xattr.c
+++ linux-2.6/fs/reiserfs/xattr.c
@@ -424,7 +424,7 @@ int reiserfs_prepare_write(struct file *
static void update_ctime(struct inode *inode)
{
struct timespec now = current_fs_time(inode->i_sb);
- if (hlist_unhashed(&inode->i_hash) || !inode->i_nlink ||
+ if (hlist_bl_unhashed(&inode->i_hash) || !inode->i_nlink ||
timespec_equal(&inode->i_ctime, &now))
return;

Index: linux-2.6/fs/hfs/hfs_fs.h
===================================================================
--- linux-2.6.orig/fs/hfs/hfs_fs.h
+++ linux-2.6/fs/hfs/hfs_fs.h
@@ -148,7 +148,7 @@ struct hfs_sb_info {

int fs_div;

- struct hlist_head rsrc_inodes;
+ struct hlist_bl_head rsrc_inodes;
};

#define HFS_FLG_BITMAP_DIRTY 0
Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c
+++ linux-2.6/fs/hfs/inode.c
@@ -500,7 +500,7 @@ static struct dentry *hfs_file_lookup(st
HFS_I(inode)->rsrc_inode = dir;
HFS_I(dir)->rsrc_inode = inode;
igrab(dir);
- hlist_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
+ hlist_bl_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
mark_inode_dirty(inode);
out:
d_add(dentry, inode);
Index: linux-2.6/fs/hfsplus/hfsplus_fs.h
===================================================================
--- linux-2.6.orig/fs/hfsplus/hfsplus_fs.h
+++ linux-2.6/fs/hfsplus/hfsplus_fs.h
@@ -144,7 +144,7 @@ struct hfsplus_sb_info {

unsigned long flags;

- struct hlist_head rsrc_inodes;
+ struct hlist_bl_head rsrc_inodes;
};

#define HFSPLUS_SB_WRITEBACKUP 0x0001
Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c
+++ linux-2.6/fs/hfsplus/inode.c
@@ -178,7 +178,7 @@ static struct dentry *hfsplus_file_looku
HFSPLUS_I(inode).rsrc_inode = dir;
HFSPLUS_I(dir).rsrc_inode = inode;
igrab(dir);
- hlist_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
+ hlist_bl_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
mark_inode_dirty(inode);
out:
d_add(dentry, inode);
Index: linux-2.6/fs/nilfs2/gcinode.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/gcinode.c
+++ linux-2.6/fs/nilfs2/gcinode.c
@@ -187,13 +187,13 @@ int nilfs_init_gccache(struct the_nilfs
INIT_LIST_HEAD(&nilfs->ns_gc_inodes);

nilfs->ns_gc_inodes_h =
- kmalloc(sizeof(struct hlist_head) * NILFS_GCINODE_HASH_SIZE,
+ kmalloc(sizeof(struct hlist_bl_head) * NILFS_GCINODE_HASH_SIZE,
GFP_NOFS);
if (nilfs->ns_gc_inodes_h == NULL)
return -ENOMEM;

for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++)
- INIT_HLIST_HEAD(&nilfs->ns_gc_inodes_h[loop]);
+ INIT_HLIST_BL_HEAD(&nilfs->ns_gc_inodes_h[loop]);
return 0;
}

@@ -245,18 +245,18 @@ static unsigned long ihash(ino_t ino, __
*/
struct inode *nilfs_gc_iget(struct the_nilfs *nilfs, ino_t ino, __u64 cno)
{
- struct hlist_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
- struct hlist_node *node;
+ struct hlist_bl_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
+ struct hlist_bl_node *node;
struct inode *inode;

- hlist_for_each_entry(inode, node, head, i_hash) {
+ hlist_bl_for_each_entry(inode, node, head, i_hash) {
if (inode->i_ino == ino && NILFS_I(inode)->i_cno == cno)
return inode;
}

inode = alloc_gcinode(nilfs, ino, cno);
if (likely(inode)) {
- hlist_add_head(&inode->i_hash, head);
+ hlist_bl_add_head(&inode->i_hash, head);
list_add(&NILFS_I(inode)->i_dirty, &nilfs->ns_gc_inodes);
}
return inode;
@@ -275,14 +275,14 @@ void nilfs_clear_gcinode(struct inode *i
*/
void nilfs_remove_all_gcinode(struct the_nilfs *nilfs)
{
- struct hlist_head *head = nilfs->ns_gc_inodes_h;
- struct hlist_node *node, *n;
+ struct hlist_bl_head *head = nilfs->ns_gc_inodes_h;
+ struct hlist_bl_node *node, *n;
struct inode *inode;
int loop;

for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++, head++) {
- hlist_for_each_entry_safe(inode, node, n, head, i_hash) {
- hlist_del_init(&inode->i_hash);
+ hlist_bl_for_each_entry_safe(inode, node, n, head, i_hash) {
+ hlist_bl_del_init(&inode->i_hash);
list_del_init(&NILFS_I(inode)->i_dirty);
nilfs_clear_gcinode(inode); /* might sleep */
}
Index: linux-2.6/fs/nilfs2/segment.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/segment.c
+++ linux-2.6/fs/nilfs2/segment.c
@@ -2447,7 +2447,7 @@ nilfs_remove_written_gcinodes(struct the
list_for_each_entry_safe(ii, n, head, i_dirty) {
if (!test_bit(NILFS_I_UPDATED, &ii->i_state))
continue;
- hlist_del_init(&ii->vfs_inode.i_hash);
+ hlist_bl_del_init(&ii->vfs_inode.i_hash);
list_del_init(&ii->i_dirty);
nilfs_clear_gcinode(&ii->vfs_inode);
}
Index: linux-2.6/fs/nilfs2/the_nilfs.h
===================================================================
--- linux-2.6.orig/fs/nilfs2/the_nilfs.h
+++ linux-2.6/fs/nilfs2/the_nilfs.h
@@ -164,7 +164,7 @@ struct the_nilfs {

/* GC inode list and hash table head */
struct list_head ns_gc_inodes;
- struct hlist_head *ns_gc_inodes_h;
+ struct hlist_bl_head *ns_gc_inodes_h;

/* Disk layout information (static) */
unsigned int ns_blocksize_bits;


n***@suse.de
2010-06-24 03:02:34 UTC
Permalink
dget_locked was a shortcut to avoid the lazy LRU manipulation when we
already held dcache_lock (LRU manipulation was relatively cheap at that
point). However, now that the LRU lock is an innermost one, we never
hold it at any caller, so the lock cost can now be avoided. We already
have a well-working lazy dcache LRU, so it should be fine to defer LRU
manipulations to scan time.

Signed-off-by: Nick Piggin <***@suse.de>
---
arch/powerpc/platforms/cell/spufs/inode.c | 2 -
drivers/infiniband/hw/ipath/ipath_fs.c | 2 -
fs/autofs4/root.c | 4 +--
fs/configfs/inode.c | 2 -
fs/dcache.c | 34 ++++++++----------------------
fs/exportfs/expfs.c | 2 -
fs/ncpfs/dir.c | 2 -
fs/ocfs2/dcache.c | 2 -
fs/smbfs/cache.c | 2 -
fs/sysfs/dir.c | 2 -
include/linux/dcache.h | 15 ++-----------
kernel/cgroup.c | 2 -
security/selinux/selinuxfs.c | 2 -
13 files changed, 25 insertions(+), 48 deletions(-)

Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
@@ -160,7 +160,7 @@ static void spufs_prune_dir(struct dentr
list_for_each_entry_safe(dentry, tmp, &dir->d_subdirs, d_u.d_child) {
spin_lock(&dentry->d_lock);
if (!(d_unhashed(dentry)) && dentry->d_inode) {
- dget_locked_dlock(dentry);
+ dget_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
simple_unlink(dir->d_inode, dentry);
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_fs.c
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -275,7 +275,7 @@ static int remove_file(struct dentry *pa

spin_lock(&tmp->d_lock);
if (!(d_unhashed(tmp) && tmp->d_inode)) {
- dget_locked_dlock(tmp);
+ dget_dlock(tmp);
__d_drop(tmp);
spin_unlock(&tmp->d_lock);
simple_unlink(parent->d_inode, tmp);
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -251,7 +251,7 @@ void configfs_drop_dentry(struct configf
if (dentry) {
spin_lock(&dentry->d_lock);
if (!(d_unhashed(dentry) && dentry->d_inode)) {
- dget_locked_dlock(dentry);
+ dget_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
simple_unlink(parent->d_inode, dentry);
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -287,29 +287,23 @@ void d_drop(struct dentry *dentry)
EXPORT_SYMBOL(d_drop);

/* This must be called with d_lock held */
-static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
+static inline struct dentry *__dget_dlock(struct dentry *dentry)
{
dentry->d_count++;
- dentry_lru_del_init(dentry);
return dentry;
}

-static inline struct dentry * __dget_locked(struct dentry *dentry)
+static inline struct dentry *__dget(struct dentry *dentry)
{
spin_lock(&dentry->d_lock);
- __dget_locked_dlock(dentry);
+ __dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
return dentry;
}

-struct dentry * dget_locked_dlock(struct dentry *dentry)
-{
- return __dget_locked_dlock(dentry);
-}
-
struct dentry * dget_locked(struct dentry *dentry)
{
- return __dget_locked(dentry);
+ return __dget(dentry);
}
EXPORT_SYMBOL(dget_locked);

@@ -534,7 +528,7 @@ static struct dentry * __d_find_alias(st
(alias->d_flags & DCACHE_DISCONNECTED))
discon_alias = alias;
else if (!want_discon) {
- __dget_locked_dlock(alias);
+ __dget_dlock(alias);
spin_unlock(&alias->d_lock);
return alias;
}
@@ -542,7 +536,7 @@ static struct dentry * __d_find_alias(st
spin_unlock(&alias->d_lock);
}
if (discon_alias)
- __dget_locked(discon_alias);
+ __dget(discon_alias);
return discon_alias;
}

@@ -571,7 +565,7 @@ restart:
list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
spin_lock(&dentry->d_lock);
if (!dentry->d_count) {
- __dget_locked_dlock(dentry);
+ __dget_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
@@ -1349,7 +1343,7 @@ static struct dentry *__d_instantiate_un
continue;
if (memcmp(qstr->name, name, len))
continue;
- dget_locked(alias);
+ dget(alias);
return alias;
}

@@ -1593,7 +1587,7 @@ struct dentry *d_add_ci(struct dentry *d
* reference to it, move it in place and use it.
*/
new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
- dget_locked(new);
+ dget(new);
spin_unlock(&inode->i_lock);
security_d_instantiate(found, inode);
d_move(new, found);
@@ -1629,8 +1623,7 @@ EXPORT_SYMBOL(d_add_ci);
*
* The dentry unused LRU is not updated even if lookup finds the required dentry
* in there. It is updated in places such as prune_dcache, shrink_dcache_sb,
- * select_parent and __dget_locked. This laziness saves lookup from LRU lock
- * acquisition.
+ * select_parent. This laziness saves lookup from LRU lock acquisition.
*
* d_lookup() is protected against the concurrent renames in some unrelated
* directory using the seqlockt_t rename_lock.
@@ -1767,7 +1760,7 @@ int d_validate(struct dentry *dentry, st
hlist_bl_for_each_entry_rcu(lhp, node, &b->head, d_hash) {
if (dentry == lhp) {
rcu_read_unlock();
- __dget_locked_dlock(dentry);
+ __dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
return 1;
}
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -51,7 +51,7 @@ find_acceptable_alias(struct dentry *res
inode = result->d_inode;
spin_lock(&inode->i_lock);
list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
- dget_locked(dentry);
+ dget(dentry);
spin_unlock(&inode->i_lock);
if (toput)
dput(toput);
Index: linux-2.6/fs/ncpfs/dir.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/dir.c
+++ linux-2.6/fs/ncpfs/dir.c
@@ -370,7 +370,7 @@ ncp_dget_fpos(struct dentry *dentry, str
dent = list_entry(next, struct dentry, d_u.d_child);
if ((unsigned long)dent->d_fsdata == fpos) {
if (dent->d_inode)
- dget_locked(dent);
+ dget(dent);
else
dent = NULL;
spin_unlock(&parent->d_lock);
Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -160,7 +160,7 @@ struct dentry *ocfs2_find_local_alias(st
mlog(0, "dentry found: %.*s\n",
dentry->d_name.len, dentry->d_name.name);

- dget_locked_dlock(dentry);
+ dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
break;
}
Index: linux-2.6/fs/smbfs/cache.c
===================================================================
--- linux-2.6.orig/fs/smbfs/cache.c
+++ linux-2.6/fs/smbfs/cache.c
@@ -102,7 +102,7 @@ smb_dget_fpos(struct dentry *dentry, str
dent = list_entry(next, struct dentry, d_u.d_child);
if ((unsigned long)dent->d_fsdata == fpos) {
if (dent->d_inode)
- dget_locked(dent);
+ dget(dent);
else
dent = NULL;
goto out_unlock;
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -306,23 +306,17 @@ extern char *dentry_path(struct dentry *
/* Allocation counts.. */

/**
- * dget, dget_locked - get a reference to a dentry
+ * dget, dget_dlock - get a reference to a dentry
* @dentry: dentry to get a reference to
*
* Given a dentry or %NULL pointer increment the reference count
* if appropriate and return the dentry. A dentry will not be
- * destroyed when it has references. dget() should never be
- * called for dentries with zero reference counter. For these cases
- * (preferably none, functions in dcache.c are sufficient for normal
- * needs and they take necessary precautions) you should hold dcache_lock
- * and call dget_locked() instead of dget().
+ * destroyed when it has references.
*/
static inline struct dentry *dget_dlock(struct dentry *dentry)
{
- if (dentry) {
- BUG_ON(!dentry->d_count);
+ if (dentry)
dentry->d_count++;
- }
return dentry;
}

@@ -336,9 +330,6 @@ static inline struct dentry *dget(struct
return dentry;
}

-extern struct dentry * dget_locked(struct dentry *);
-extern struct dentry * dget_locked_dlock(struct dentry *);
-
extern struct dentry *dget_parent(struct dentry *dentry);

/**
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -880,7 +880,7 @@ static void cgroup_clear_directory(struc
/* This should never be called on a cgroup
* directory with child cgroups */
BUG_ON(d->d_inode->i_mode & S_IFDIR);
- dget_locked_dlock(d);
+ dget_dlock(d);
spin_unlock(&d->d_lock);
spin_unlock(&dentry->d_lock);
d_delete(d);
Index: linux-2.6/security/selinux/selinuxfs.c
===================================================================
--- linux-2.6.orig/security/selinux/selinuxfs.c
+++ linux-2.6/security/selinux/selinuxfs.c
@@ -950,7 +950,7 @@ static void sel_remove_entries(struct de
list_del_init(node);

if (d->d_inode) {
- dget_locked_dlock(d);
+ dget_dlock(d);
spin_unlock(&de->d_lock);
spin_unlock(&d->d_lock);
d_delete(d);
Index: linux-2.6/drivers/infiniband/hw/qib/qib_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/qib/qib_fs.c
+++ linux-2.6/drivers/infiniband/hw/qib/qib_fs.c
@@ -453,7 +453,7 @@ static int remove_file(struct dentry *pa

spin_lock(&tmp->d_lock);
if (!(d_unhashed(tmp) && tmp->d_inode)) {
- dget_locked_dlock(tmp);
+ dget_dlock(tmp);
__d_drop(tmp);
spin_unlock(&tmp->d_lock);
simple_unlink(parent->d_inode, tmp);


n***@suse.de
2010-06-24 03:02:37 UTC
Permalink
The dentry referenced bit is only set when installing the dentry back
onto the LRU. However, with the lazy LRU, the dentry can already be on
the LRU list at dput time, so the referenced bit never gets set. Fix
this.

Signed-off-by: Nick Piggin <***@suse.de>

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -387,10 +387,10 @@ repeat:
/* Unreachable? Get rid of it */
if (d_unhashed(dentry))
goto kill_it;
- if (list_empty(&dentry->d_lru)) {
- dentry->d_flags |= DCACHE_REFERENCED;
+ /* Otherwise leave it cached and ensure it's on the LRU */
+ if (list_empty(&dentry->d_lru))
dentry_lru_add(dentry);
- }
+ dentry->d_flags |= DCACHE_REFERENCED;
dentry->d_count--;
spin_unlock(&dentry->d_lock);
return;
n***@suse.de
2010-06-24 03:03:00 UTC
Permalink
Split the inode reclaim and writeback lists in preparation for scaling them up
(per-bdi locking for i_io and per-zone locking for i_lru).

Signed-off-by: Nick Piggin <***@suse.de>
--
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -291,11 +291,11 @@ static void redirty_tail(struct inode *i
if (!list_empty(&wb->b_dirty)) {
struct inode *tail;

- tail = list_entry(wb->b_dirty.next, struct inode, i_list);
+ tail = list_entry(wb->b_dirty.next, struct inode, i_io);
if (time_before(inode->dirtied_when, tail->dirtied_when))
inode->dirtied_when = jiffies;
}
- list_move(&inode->i_list, &wb->b_dirty);
+ list_move(&inode->i_io, &wb->b_dirty);
}

/*
@@ -306,7 +306,7 @@ static void requeue_io(struct inode *ino
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;

assert_spin_locked(&wb_inode_list_lock);
- list_move(&inode->i_list, &wb->b_more_io);
+ list_move(&inode->i_io, &wb->b_more_io);
}

static void inode_sync_complete(struct inode *inode)
@@ -348,14 +348,14 @@ static void move_expired_inodes(struct l

assert_spin_locked(&wb_inode_list_lock);
while (!list_empty(delaying_queue)) {
- inode = list_entry(delaying_queue->prev, struct inode, i_list);
+ inode = list_entry(delaying_queue->prev, struct inode, i_io);
if (older_than_this &&
inode_dirtied_after(inode, *older_than_this))
break;
if (sb && sb != inode->i_sb)
do_sb_sort = 1;
sb = inode->i_sb;
- list_move(&inode->i_list, &tmp);
+ list_move(&inode->i_io, &tmp);
}

/* just one sb in list, splice to dispatch_queue and we're done */
@@ -366,12 +366,12 @@ static void move_expired_inodes(struct l

/* Move inodes from one superblock together */
while (!list_empty(&tmp)) {
- inode = list_entry(tmp.prev, struct inode, i_list);
+ inode = list_entry(tmp.prev, struct inode, i_io);
sb = inode->i_sb;
list_for_each_prev_safe(pos, node, &tmp) {
- inode = list_entry(pos, struct inode, i_list);
+ inode = list_entry(pos, struct inode, i_io);
if (inode->i_sb == sb)
- list_move(&inode->i_list, dispatch_queue);
+ list_move(&inode->i_io, dispatch_queue);
}
}
}
@@ -556,7 +556,11 @@ select_queue:
}
} else {
/* The inode is clean */
- list_move(&inode->i_list, &inode_unused);
+ list_del_init(&inode->i_io);
+ if (list_empty(&inode->i_lru)) {
+ list_add(&inode->i_lru, &inode_unused);
+ inodes_stat.nr_unused++;
+ }
}
}
inode_sync_complete(inode);
@@ -623,7 +627,7 @@ again:
while (!list_empty(&wb->b_io)) {
long pages_skipped;
struct inode *inode = list_entry(wb->b_io.prev,
- struct inode, i_list);
+ struct inode, i_io);
if (!spin_trylock(&inode->i_lock)) {
spin_unlock(&wb_inode_list_lock);
spin_lock(&wb_inode_list_lock);
@@ -696,7 +700,7 @@ again:

while (!list_empty(&wb->b_io)) {
struct inode *inode = list_entry(wb->b_io.prev,
- struct inode, i_list);
+ struct inode, i_io);
struct super_block *sb = inode->i_sb;
enum sb_pin_state state;

@@ -845,7 +849,7 @@ retry:
spin_lock(&wb_inode_list_lock);
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
- struct inode, i_list);
+ struct inode, i_io);
if (!spin_trylock(&inode->i_lock)) {
spin_unlock(&wb_inode_list_lock);
goto retry;
@@ -1164,7 +1168,7 @@ void __mark_inode_dirty(struct inode *in

inode->dirtied_when = jiffies;
spin_lock(&wb_inode_list_lock);
- list_move(&inode->i_list, &wb->b_dirty);
+ list_move(&inode->i_io, &wb->b_dirty);
spin_unlock(&wb_inode_list_lock);
}
}
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -726,7 +726,8 @@ struct posix_acl;

struct inode {
struct hlist_bl_node i_hash;
- struct list_head i_list; /* backing dev IO list */
+ struct list_head i_io; /* backing dev IO list */
+ struct list_head i_lru;
struct list_head i_sb_list;
union {
struct list_head i_dentry;
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -80,11 +80,11 @@ static int bdi_debug_stats_show(struct s
spin_lock(&wb_inode_list_lock);
list_for_each_entry(wb, &bdi->wb_list, list) {
nr_wb++;
- list_for_each_entry(inode, &wb->b_dirty, i_list)
+ list_for_each_entry(inode, &wb->b_dirty, i_io)
nr_dirty++;
- list_for_each_entry(inode, &wb->b_io, i_list)
+ list_for_each_entry(inode, &wb->b_io, i_io)
nr_io++;
- list_for_each_entry(inode, &wb->b_more_io, i_list)
+ list_for_each_entry(inode, &wb->b_more_io, i_io)
nr_more_io++;
}
spin_unlock(&wb_inode_list_lock);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -35,12 +35,13 @@
* inode_hash_bucket lock protects:
* inode hash table, i_hash
* wb_inode_list_lock protects:
- * inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
+ * inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_io, i_lru
* inode->i_lock protects:
* i_state
* i_count
* i_hash
- * i_list
+ * i_io
+ * i_lru
* i_sb_list
*
* Ordering:
@@ -313,6 +314,7 @@ static void i_callback(struct rcu_head *

void destroy_inode(struct inode *inode)
{
+ BUG_ON(!list_empty(&inode->i_io));
__destroy_inode(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
@@ -331,7 +333,8 @@ void inode_init_once(struct inode *inode
INIT_HLIST_BL_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
- INIT_LIST_HEAD(&inode->i_list);
+ INIT_LIST_HEAD(&inode->i_io);
+ INIT_LIST_HEAD(&inode->i_lru);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -401,8 +404,8 @@ static void dispose_list(struct list_hea
while (!list_empty(head)) {
struct inode *inode;

- inode = list_first_entry(head, struct inode, i_list);
- list_del_init(&inode->i_list);
+ inode = list_first_entry(head, struct inode, i_lru);
+ list_del_init(&inode->i_lru);

if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
@@ -436,13 +439,14 @@ static int invalidate_sb_inodes(struct s
invalidate_inode_buffers(inode);
if (!inode->i_count) {
spin_lock(&wb_inode_list_lock);
- list_del(&inode->i_list);
+ list_del_init(&inode->i_io);
+ list_del(&inode->i_lru);
inodes_stat.nr_unused--;
spin_unlock(&wb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- list_add(&inode->i_list, dispose);
+ list_add(&inode->i_lru, dispose);
continue;
}
spin_unlock(&inode->i_lock);
@@ -511,26 +515,26 @@ again:
if (list_empty(&inode_unused))
break;

- inode = list_entry(inode_unused.prev, struct inode, i_list);
+ inode = list_entry(inode_unused.prev, struct inode, i_lru);

if (!spin_trylock(&inode->i_lock)) {
spin_unlock(&wb_inode_list_lock);
goto again;
}
if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
- list_del_init(&inode->i_list);
+ list_del_init(&inode->i_lru);
spin_unlock(&inode->i_lock);
inodes_stat.nr_unused--;
continue;
}
if (inode->i_state) {
- list_move(&inode->i_list, &inode_unused);
+ list_move(&inode->i_lru, &inode_unused);
inode->i_state &= ~I_REFERENCED;
spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
- list_move(&inode->i_list, &inode_unused);
+ list_move(&inode->i_lru, &inode_unused);
spin_unlock(&wb_inode_list_lock);
__iget(inode);
spin_unlock(&inode->i_lock);
@@ -542,7 +546,7 @@ again:
spin_lock(&wb_inode_list_lock);
continue;
}
- list_move(&inode->i_list, &freeable);
+ list_move(&inode->i_lru, &freeable);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
@@ -1395,11 +1399,15 @@ void generic_delete_inode(struct inode *
{
const struct super_operations *op = inode->i_sb->s_op;

- if (!list_empty(&inode->i_list)) {
+ if (!list_empty(&inode->i_lru)) {
spin_lock(&wb_inode_list_lock);
- list_del_init(&inode->i_list);
- if (!inode->i_state)
- inodes_stat.nr_unused--;
+ list_del_init(&inode->i_lru);
+ inodes_stat.nr_unused--;
+ spin_unlock(&wb_inode_list_lock);
+ }
+ if (!list_empty(&inode->i_io)) {
+ spin_lock(&wb_inode_list_lock);
+ list_del_init(&inode->i_io);
spin_unlock(&wb_inode_list_lock);
}
inode_sb_list_del(inode);
@@ -1451,9 +1459,9 @@ int generic_detach_inode(struct inode *i
if (sb->s_flags & MS_ACTIVE) {
inode->i_state |= I_REFERENCED;
if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
- list_empty(&inode->i_list)) {
+ list_empty(&inode->i_lru)) {
spin_lock(&wb_inode_list_lock);
- list_add(&inode->i_list, &inode_unused);
+ list_add(&inode->i_lru, &inode_unused);
inodes_stat.nr_unused++;
spin_unlock(&wb_inode_list_lock);
}
@@ -1469,11 +1477,15 @@ int generic_detach_inode(struct inode *i
inode->i_state &= ~I_WILL_FREE;
__remove_inode_hash(inode);
}
- if (!list_empty(&inode->i_list)) {
+ if (!list_empty(&inode->i_lru)) {
spin_lock(&wb_inode_list_lock);
- list_del_init(&inode->i_list);
- if (!inode->i_state)
- inodes_stat.nr_unused--;
+ list_del_init(&inode->i_lru);
+ inodes_stat.nr_unused--;
+ spin_unlock(&wb_inode_list_lock);
+ }
+ if (!list_empty(&inode->i_io)) {
+ spin_lock(&wb_inode_list_lock);
+ list_del_init(&inode->i_io);
spin_unlock(&wb_inode_list_lock);
}
inode_sb_list_del(inode);
n***@suse.de
2010-06-24 03:02:44 UTC
Permalink
Protect the i_hash, i_sb_list, etc. inode members with i_lock.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/hugetlbfs/inode.c | 1 +
fs/inode.c | 29 ++++++++++++++++++++++++++---
2 files changed, 27 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -35,7 +35,11 @@
* wb_inode_list_lock protects:
* inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
* inode->i_lock protects:
- * i_state, i_count
+ * i_state
+ * i_count
+ * i_hash
+ * i_list
+ * i_sb_list
*
* Ordering:
* inode_lock
@@ -373,12 +377,14 @@ static void dispose_list(struct list_hea
clear_inode(inode);

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
- spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);

wake_up_inode(inode);
@@ -680,7 +686,6 @@ __inode_add_to_lists(struct super_block
struct inode *inode)
{
atomic_inc(&inodes_stat.nr_inodes);
- spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
spin_lock(&wb_inode_list_lock);
@@ -710,7 +715,10 @@ void inode_add_to_lists(struct super_blo
struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
__inode_add_to_lists(sb, head, inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -742,9 +750,12 @@ struct inode *new_inode(struct super_blo
inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
inode->i_ino = ++last_ino;
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
return inode;
@@ -808,11 +819,14 @@ static struct inode *get_new_inode(struc
/* We released the lock, so.. */
old = find_inode(sb, head, test, data);
if (!old) {
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
if (set(inode, data))
goto set_failed;

inode->i_state = I_NEW;
__inode_add_to_lists(sb, head, inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);

/* Return the locked inode with I_NEW set, the
@@ -837,6 +851,7 @@ static struct inode *get_new_inode(struc

set_failed:
spin_unlock(&inode->i_lock);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
return NULL;
@@ -859,9 +874,12 @@ static struct inode *get_new_inode_fast(
/* We released the lock, so.. */
old = find_inode_fast(sb, head, ino);
if (!old) {
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
inode->i_ino = ino;
inode->i_state = I_NEW;
__inode_add_to_lists(sb, head, inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);

/* Return the locked inode with I_NEW set, the
@@ -1275,10 +1293,13 @@ EXPORT_SYMBOL(insert_inode_locked4);
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_hash_lock);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -1292,9 +1313,11 @@ EXPORT_SYMBOL(__insert_inode_hash);
void remove_inode_hash(struct inode *inode)
{
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -1338,9 +1361,11 @@ void generic_delete_inode(struct inode *
clear_inode(inode);
}
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
wake_up_inode(inode);
BUG_ON(inode->i_state != I_CLEAR);
@@ -1385,10 +1410,10 @@ int generic_detach_inode(struct inode *i
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- atomic_dec(&inodes_stat.nr_unused);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
+ atomic_dec(&inodes_stat.nr_unused);
}
spin_lock(&wb_inode_list_lock);
list_del_init(&inode->i_list);
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -382,6 +382,7 @@ static void hugetlbfs_forget_inode(struc
if (generic_detach_inode(inode)) {
truncate_hugepages(inode, 0);
clear_inode(inode);
+ /* XXX: why no wake_up_inode? */
destroy_inode(inode);
}
}
n***@suse.de
2010-06-24 03:02:50 UTC
Permalink
RCU free the struct inode. This will allow:

- sb_inode_list_lock to be moved inside i_lock, because sb list walkers that
want to take i_lock no longer need to take sb_inode_list_lock to walk the list
in the first place. This will simplify and optimize locking.
- eventually, completely write-free RCU path walking. The inode must be
consulted for permissions when walking, so a write-free reference (ie. RCU)
is helpful there.
- some potential simplification in VM land: we may not need to take the page
lock to get back to page->mapping.
- removal of some nested trylock loops in dcache code.
- removal of the 'wq' allocation from the socket code, now that the entire
inode is RCU freed.

todo: convert all filesystems
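
For the remaining filesystems (the todo above), the conversion is mechanical
and mirrors the hunks below; a template, using hypothetical foo_inode_cachep /
FOO_I() names. Note that i_dentry now shares storage with i_rcu in the union,
so the callback re-initialises the list head before handing the object back
to the slab:

static void foo_i_callback(struct rcu_head *head)
{
	struct inode *inode = container_of(head, struct inode, i_rcu);

	/* i_rcu overlays i_dentry in the union; restore the list head */
	INIT_LIST_HEAD(&inode->i_dentry);
	kmem_cache_free(foo_inode_cachep, FOO_I(inode));
}

static void foo_destroy_inode(struct inode *inode)
{
	/* the actual free is deferred until after an RCU grace period */
	call_rcu(&inode->i_rcu, foo_i_callback);
}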
---
fs/block_dev.c | 9 ++++++++-
fs/ext2/super.c | 9 ++++++++-
fs/ext3/super.c | 9 ++++++++-
fs/fat/inode.c | 9 ++++++++-
fs/hugetlbfs/inode.c | 9 ++++++++-
fs/inode.c | 9 ++++++++-
fs/nfs/inode.c | 9 ++++++++-
fs/proc/inode.c | 9 ++++++++-
include/linux/fs.h | 5 ++++-
ipc/mqueue.c | 9 ++++++++-
mm/shmem.c | 9 ++++++++-
net/socket.c | 9 ++++++++-
net/sunrpc/rpc_pipe.c | 10 +++++++++-
13 files changed, 101 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -161,11 +161,18 @@ static struct inode *ext2_alloc_inode(st
return &ei->vfs_inode;
}

-static void ext2_destroy_inode(struct inode *inode)
+static void ext2_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(ext2_inode_cachep, EXT2_I(inode));
}

+static void ext2_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, ext2_i_callback);
+}
+
static void init_once(void *foo)
{
struct ext2_inode_info *ei = (struct ext2_inode_info *) foo;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -277,13 +277,20 @@ void __destroy_inode(struct inode *inode
}
EXPORT_SYMBOL(__destroy_inode);

+static void i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(inode_cachep, inode);
+}
+
void destroy_inode(struct inode *inode)
{
__destroy_inode(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
else
- kmem_cache_free(inode_cachep, (inode));
+ call_rcu(&inode->i_rcu, i_callback);
}

/*
@@ -346,6 +353,7 @@ void clear_inode(struct inode *inode)
bd_forget(inode);
if (S_ISCHR(inode->i_mode) && inode->i_cdev)
cd_forget(inode);
+ /* don't need i_lock here */
inode->i_state = I_CLEAR;
}
EXPORT_SYMBOL(clear_inode);
@@ -661,7 +669,7 @@ __inode_add_to_lists(struct super_block
spin_unlock(&sb_inode_list_lock);
if (b) {
spin_lock_bucket(b);
- hlist_bl_add_head(&inode->i_hash, &b->head);
+ hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
spin_unlock_bucket(b);
}
}
@@ -713,6 +721,7 @@ struct inode *new_inode(struct super_blo

inode = alloc_inode(sb);
if (inode) {
+ /* XXX: init as locked for speedup */
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
inode->i_ino = atomic_inc_return(&last_ino);
@@ -870,6 +879,7 @@ static int test_inode_iunique(struct sup
spin_unlock_bucket(b);
return 0;
}
+ /* XXX: test for I_FREEING|I_CLEAR|etc? */
}
spin_unlock_bucket(b);
return 1;
@@ -1156,42 +1166,41 @@ int insert_inode_locked(struct inode *in
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
+ struct hlist_bl_node *node;
+ struct inode *old;

inode->i_state |= I_NEW;
- while (1) {
- struct hlist_bl_node *node;
- struct inode *old = NULL;

repeat:
- spin_lock_bucket(b);
- hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
- if (old->i_ino != ino)
- continue;
- if (old->i_sb != sb)
- continue;
- if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
- continue;
- if (!spin_trylock(&old->i_lock)) {
- spin_unlock_bucket(b);
- goto repeat;
- }
- break;
- }
- if (likely(!node)) {
- hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
+ if (old->i_ino != ino)
+ continue;
+ if (old->i_sb != sb)
+ continue;
+ if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ continue;
+ if (!spin_trylock(&old->i_lock)) {
spin_unlock_bucket(b);
- return 0;
- }
- spin_unlock_bucket(b);
- __iget(old);
- spin_unlock(&old->i_lock);
- wait_on_inode(old);
- if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
- iput(old);
- return -EBUSY;
+ goto repeat;
}
+ goto found_old;
+ }
+ hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
+ return 0;
+
+found_old:
+ spin_unlock_bucket(b);
+ __iget(old);
+ spin_unlock(&old->i_lock);
+ wait_on_inode(old);
+ if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
+ return -EBUSY;
}
+ iput(old);
+ goto repeat;
}
EXPORT_SYMBOL(insert_inode_locked);

@@ -1200,43 +1209,44 @@ int insert_inode_locked4(struct inode *i
{
struct super_block *sb = inode->i_sb;
struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
+ struct hlist_bl_node *node;
+ struct inode *old;

inode->i_state |= I_NEW;

- while (1) {
- struct hlist_bl_node *node;
- struct inode *old = NULL;
-
repeat:
- spin_lock_bucket(b);
- hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
- if (old->i_sb != sb)
- continue;
- if (!test(old, data))
- continue;
- if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
- continue;
- if (!spin_trylock(&old->i_lock)) {
- spin_unlock_bucket(b);
- goto repeat;
- }
- break;
- }
- if (likely(!node)) {
- hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
+ if (old->i_sb != sb)
+ continue;
+ /* XXX: audit put test outside i_lock? */
+ if (!test(old, data))
+ continue;
+ if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ continue;
+ if (!spin_trylock(&old->i_lock)) {
spin_unlock_bucket(b);
- return 0;
- }
- spin_unlock_bucket(b);
- __iget(old);
- spin_unlock(&old->i_lock);
- wait_on_inode(old);
- if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
- iput(old);
- return -EBUSY;
+ cpu_relax();
+ cpu_relax();
+ goto repeat;
}
+ goto found_old;
+ }
+ hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
+ return 0;
+
+found_old:
+ spin_unlock_bucket(b);
+ __iget(old);
+ spin_unlock(&old->i_lock);
+ wait_on_inode(old);
+ if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
+ return -EBUSY;
}
+ iput(old);
+ goto repeat;
}
EXPORT_SYMBOL(insert_inode_locked4);

@@ -1254,7 +1264,7 @@ void __insert_inode_hash(struct inode *i

spin_lock(&inode->i_lock);
spin_lock_bucket(b);
- hlist_bl_add_head(&inode->i_hash, &b->head);
+ hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
spin_unlock_bucket(b);
spin_unlock(&inode->i_lock);
}
@@ -1271,7 +1281,7 @@ void __remove_inode_hash(struct inode *i
{
struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
spin_lock_bucket(b);
- hlist_bl_del_init(&inode->i_hash);
+ hlist_bl_del_init_rcu(&inode->i_hash);
spin_unlock_bucket(b);
}

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -726,7 +726,10 @@ struct inode {
struct hlist_bl_node i_hash;
struct list_head i_list; /* backing dev IO list */
struct list_head i_sb_list;
- struct list_head i_dentry;
+ union {
+ struct list_head i_dentry;
+ struct rcu_head i_rcu;
+ };
unsigned long i_ino;
unsigned int i_count;
unsigned int i_nlink;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -397,13 +397,20 @@ static struct inode *bdev_alloc_inode(st
return &ei->vfs_inode;
}

-static void bdev_destroy_inode(struct inode *inode)
+static void bdev_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
struct bdev_inode *bdi = BDEV_I(inode);

+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(bdev_cachep, bdi);
}

+static void bdev_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, bdev_i_callback);
+}
+
static void init_once(void *foo)
{
struct bdev_inode *ei = (struct bdev_inode *) foo;
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c
+++ linux-2.6/fs/ext3/super.c
@@ -485,6 +485,13 @@ static struct inode *ext3_alloc_inode(st
return &ei->vfs_inode;
}

+static void ext3_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
+}
+
static void ext3_destroy_inode(struct inode *inode)
{
if (!list_empty(&(EXT3_I(inode)->i_orphan))) {
@@ -495,7 +502,7 @@ static void ext3_destroy_inode(struct in
false);
dump_stack();
}
- kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
+ call_rcu(&inode->i_rcu, ext3_i_callback);
}

static void init_once(void *foo)
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -665,11 +665,18 @@ static struct inode *hugetlbfs_alloc_ino
return &p->vfs_inode;
}

+static void hugetlbfs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+}
+
static void hugetlbfs_destroy_inode(struct inode *inode)
{
hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
- kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+ call_rcu(&inode->i_rcu, hugetlbfs_i_callback);
}

static const struct address_space_operations hugetlbfs_aops = {
Index: linux-2.6/fs/proc/inode.c
===================================================================
--- linux-2.6.orig/fs/proc/inode.c
+++ linux-2.6/fs/proc/inode.c
@@ -66,11 +66,18 @@ static struct inode *proc_alloc_inode(st
return inode;
}

-static void proc_destroy_inode(struct inode *inode)
+static void proc_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(proc_inode_cachep, PROC_I(inode));
}

+static void proc_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, proc_i_callback);
+}
+
static void init_once(void *foo)
{
struct proc_inode *ei = (struct proc_inode *) foo;
Index: linux-2.6/ipc/mqueue.c
===================================================================
--- linux-2.6.orig/ipc/mqueue.c
+++ linux-2.6/ipc/mqueue.c
@@ -236,11 +236,18 @@ static struct inode *mqueue_alloc_inode(
return &ei->vfs_inode;
}

-static void mqueue_destroy_inode(struct inode *inode)
+static void mqueue_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(mqueue_inode_cachep, MQUEUE_I(inode));
}

+static void mqueue_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, mqueue_i_callback);
+}
+
static void mqueue_delete_inode(struct inode *inode)
{
struct mqueue_inode_info *info;
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c
+++ linux-2.6/net/socket.c
@@ -271,20 +271,20 @@ static struct inode *sock_alloc_inode(st
}


-static void wq_free_rcu(struct rcu_head *head)
+static void sock_free_rcu(struct rcu_head *head)
{
- struct socket_wq *wq = container_of(head, struct socket_wq, rcu);
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct socket_alloc *ei = container_of(inode, struct socket_alloc,
+ vfs_inode);

- kfree(wq);
+ kfree(ei->socket.wq);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(sock_inode_cachep, ei);
}

static void sock_destroy_inode(struct inode *inode)
{
- struct socket_alloc *ei;
-
- ei = container_of(inode, struct socket_alloc, vfs_inode);
- call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
- kmem_cache_free(sock_inode_cachep, ei);
+ call_rcu(&inode->i_rcu, sock_free_rcu);
}

static void init_once(void *foo)
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c
+++ linux-2.6/fs/fat/inode.c
@@ -520,11 +520,18 @@ static struct inode *fat_alloc_inode(str
return &ei->vfs_inode;
}

-static void fat_destroy_inode(struct inode *inode)
+static void fat_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(fat_inode_cachep, MSDOS_I(inode));
}

+static void fat_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, fat_i_callback);
+}
+
static void init_once(void *foo)
{
struct msdos_inode_info *ei = (struct msdos_inode_info *)foo;
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -1365,11 +1365,18 @@ struct inode *nfs_alloc_inode(struct sup
return &nfsi->vfs_inode;
}

-void nfs_destroy_inode(struct inode *inode)
+static void nfs_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
}

+void nfs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, nfs_i_callback);
+}
+
static inline void nfs4_init_once(struct nfs_inode *nfsi)
{
#ifdef CONFIG_NFS_V4
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -2389,13 +2389,20 @@ static struct inode *shmem_alloc_inode(s
return &p->vfs_inode;
}

+static void shmem_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
+}
+
static void shmem_destroy_inode(struct inode *inode)
{
if ((inode->i_mode & S_IFMT) == S_IFREG) {
/* only struct inode is valid if it's an inline symlink */
mpol_free_shared_policy(&SHMEM_I(inode)->policy);
}
- kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
+ call_rcu(&inode->i_rcu, shmem_i_callback);
}

static void init_once(void *foo)
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c
+++ linux-2.6/net/sunrpc/rpc_pipe.c
@@ -163,11 +163,19 @@ rpc_alloc_inode(struct super_block *sb)
}

static void
-rpc_destroy_inode(struct inode *inode)
+rpc_i_callback(struct rcu_head *head)
{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ INIT_LIST_HEAD(&inode->i_dentry);
kmem_cache_free(rpc_inode_cachep, RPC_I(inode));
}

+static void
+rpc_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, rpc_i_callback);
+}
+
static int
rpc_pipe_open(struct inode *inode, struct file *filp)
{
Index: linux-2.6/include/linux/net.h
===================================================================
--- linux-2.6.orig/include/linux/net.h
+++ linux-2.6/include/linux/net.h
@@ -120,7 +120,6 @@ enum sock_shutdown_cmd {
struct socket_wq {
wait_queue_head_t wait;
struct fasync_struct *fasync_list;
- struct rcu_head rcu;
} ____cacheline_aligned_in_smp;

/**
n***@suse.de
2010-06-24 03:02:39 UTC
Permalink
Add a new lock, inode_hash_lock, to protect the inode hash table lists.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/inode.c | 29 ++++++++++++++++++++++++++++-
include/linux/writeback.h | 1 +
2 files changed, 29 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -30,10 +30,14 @@
* Usage:
* sb_inode_list_lock protects:
* s_inodes, i_sb_list
+ * inode_hash_lock protects:
+ * inode hash table, i_hash
*
* Ordering:
* inode_lock
* sb_inode_list_lock
+ * inode_lock
+ * inode_hash_lock
*/
/*
* This is needed for the following functions:
@@ -94,6 +98,7 @@ static struct hlist_head *inode_hashtabl
*/
DEFINE_SPINLOCK(inode_lock);
DEFINE_SPINLOCK(sb_inode_list_lock);
+DEFINE_SPINLOCK(inode_hash_lock);

/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -353,7 +358,9 @@ static void dispose_list(struct list_hea
clear_inode(inode);

spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
@@ -563,17 +570,20 @@ static struct inode *find_inode(struct s
struct inode *inode = NULL;

repeat:
+ spin_lock(&inode_hash_lock);
hlist_for_each_entry(inode, node, head, i_hash) {
if (inode->i_sb != sb)
continue;
if (!test(inode, data))
continue;
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
+ spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
+ spin_unlock(&inode_hash_lock);
return node ? inode : NULL;
}

@@ -588,17 +598,20 @@ static struct inode *find_inode_fast(str
struct inode *inode = NULL;

repeat:
+ spin_lock(&inode_hash_lock);
hlist_for_each_entry(inode, node, head, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
+ spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
+ spin_unlock(&inode_hash_lock);
return node ? inode : NULL;
}

@@ -621,8 +634,11 @@ __inode_add_to_lists(struct super_block
spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
- if (head)
+ if (head) {
+ spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_hash_lock);
+ }
}

/**
@@ -1100,7 +1116,9 @@ int insert_inode_locked(struct inode *in
while (1) {
struct hlist_node *node;
struct inode *old = NULL;
+
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
if (old->i_ino != ino)
continue;
@@ -1112,9 +1130,11 @@ int insert_inode_locked(struct inode *in
}
if (likely(!node)) {
hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
return 0;
}
+ spin_unlock(&inode_hash_lock);
__iget(old);
spin_unlock(&inode_lock);
wait_on_inode(old);
@@ -1140,6 +1160,7 @@ int insert_inode_locked4(struct inode *i
struct inode *old = NULL;

spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
if (old->i_sb != sb)
continue;
@@ -1151,9 +1172,11 @@ int insert_inode_locked4(struct inode *i
}
if (likely(!node)) {
hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
return 0;
}
+ spin_unlock(&inode_hash_lock);
__iget(old);
spin_unlock(&inode_lock);
wait_on_inode(old);
@@ -1178,7 +1201,9 @@ void __insert_inode_hash(struct inode *i
{
struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -1192,7 +1217,9 @@ EXPORT_SYMBOL(__insert_inode_hash);
void remove_inode_hash(struct inode *inode)
{
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -1234,7 +1261,9 @@ void generic_delete_inode(struct inode *
clear_inode(inode);
}
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
wake_up_inode(inode);
BUG_ON(inode->i_state != I_CLEAR);
@@ -1271,7 +1300,9 @@ int generic_detach_inode(struct inode *i
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
inodes_stat.nr_unused--;
+ spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
}
list_del_init(&inode->i_list);
spin_lock(&sb_inode_list_lock);
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -11,6 +11,7 @@ struct backing_dev_info;

extern spinlock_t inode_lock;
extern spinlock_t sb_inode_list_lock;
+extern spinlock_t inode_hash_lock;
extern struct list_head inode_in_use;
extern struct list_head inode_unused;



n***@suse.de
2010-06-24 03:02:29 UTC
Permalink
dcache_lock no longer protects anything. Remove it.

Signed-off-by: Nick Piggin <***@suse.de>
---
Documentation/filesystems/Locking | 2
arch/powerpc/platforms/cell/spufs/inode.c | 5 -
drivers/infiniband/hw/ipath/ipath_fs.c | 6 -
drivers/infiniband/hw/qib/qib_fs.c | 3
drivers/staging/pohmelfs/path_entry.c | 2
drivers/usb/core/inode.c | 3
fs/affs/amigaffs.c | 2
fs/autofs4/autofs_i.h | 3
fs/autofs4/expire.c | 24 ++---
fs/autofs4/root.c | 14 +--
fs/autofs4/waitq.c | 7 -
fs/ceph/dir.c | 6 -
fs/ceph/inode.c | 4
fs/coda/cache.c | 2
fs/configfs/configfs_internal.h | 2
fs/configfs/inode.c | 6 -
fs/dcache.c | 135 ++++--------------------------
fs/exportfs/expfs.c | 4
fs/namei.c | 9 --
fs/ncpfs/dir.c | 3
fs/ncpfs/ncplib_kernel.h | 4
fs/nfs/dir.c | 3
fs/nfs/getroot.c | 2
fs/nfs/namespace.c | 3
fs/notify/fsnotify.c | 2
fs/notify/inotify/inotify.c | 4
fs/ocfs2/dcache.c | 2
fs/seq_file.c | 2
fs/smbfs/cache.c | 4
include/linux/dcache.h | 17 +--
include/linux/fs.h | 4
include/linux/fsnotify_backend.h | 5 -
kernel/cgroup.c | 6 -
security/selinux/selinuxfs.c | 4
security/tomoyo/realpath.c | 2
35 files changed, 67 insertions(+), 239 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -54,11 +54,10 @@
* - d_alias, d_inode
*
* Ordering:
- * dcache_lock
- * dcache_inode_lock
- * dentry->d_lock
- * dcache_lru_lock
- * dcache_hash_lock
+ * dcache_inode_lock
+ * dentry->d_lock
+ * dcache_lru_lock
+ * dcache_hash_lock
*
* If there is an ancestor relationship:
* dentry->d_parent->...->d_parent->d_lock
@@ -77,13 +76,11 @@ EXPORT_SYMBOL_GPL(sysctl_vfs_cache_press
__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

EXPORT_SYMBOL(rename_lock);
EXPORT_SYMBOL(dcache_inode_lock);
EXPORT_SYMBOL(dcache_hash_lock);
-EXPORT_SYMBOL(dcache_lock);

static struct kmem_cache *dentry_cache __read_mostly;

@@ -125,7 +122,7 @@ static void d_callback(struct rcu_head *
}

/*
- * no dcache_lock, please.
+ * no locks, please.
*/
static void d_free(struct dentry *dentry)
{
@@ -147,7 +144,6 @@ static void d_free(struct dentry *dentry
static void dentry_iput(struct dentry * dentry)
__releases(dentry->d_lock)
__releases(dcache_inode_lock)
- __releases(dcache_lock)
{
struct inode *inode = dentry->d_inode;
if (inode) {
@@ -155,7 +151,6 @@ static void dentry_iput(struct dentry *
list_del_init(&dentry->d_alias);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
if (!inode->i_nlink)
fsnotify_inoderemove(inode);
if (dentry->d_op && dentry->d_op->d_iput)
@@ -165,7 +160,6 @@ static void dentry_iput(struct dentry *
} else {
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
}
}

@@ -231,13 +225,12 @@ static void dentry_lru_del_init(struct d
*
* If this is the root of the dentry tree, return NULL.
*
- * dcache_lock and d_lock and d_parent->d_lock must be held by caller, and
+ * d_lock and d_parent->d_lock must be held by caller, and
* are dropped by d_kill.
*/
static struct dentry *d_kill(struct dentry *dentry)
__releases(dentry->d_lock)
__releases(dcache_inode_lock)
- __releases(dcache_lock)
{
struct dentry *parent;

@@ -294,21 +287,10 @@ repeat:
might_sleep();
spin_lock(&dentry->d_lock);
if (dentry->d_count == 1) {
- if (!spin_trylock(&dcache_lock)) {
- /*
- * Something of a livelock possibility we could avoid
- * by taking dcache_lock and trying again, but we
- * want to reduce dcache_lock anyway so this will
- * get improved.
- */
-drop1:
- spin_unlock(&dentry->d_lock);
- goto repeat;
- }
if (!spin_trylock(&dcache_inode_lock)) {
drop2:
- spin_unlock(&dcache_lock);
- goto drop1;
+ spin_unlock(&dentry->d_lock);
+ goto repeat;
}
parent = dentry->d_parent;
if (parent && parent != dentry) {
@@ -323,7 +305,6 @@ drop2:
spin_unlock(&dentry->d_lock);
if (parent && parent != dentry)
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
return;
}

@@ -345,7 +326,6 @@ drop2:
if (parent && parent != dentry)
spin_unlock(&parent->d_lock);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
return;

unhash_it:
@@ -376,11 +356,9 @@ int d_invalidate(struct dentry * dentry)
/*
* If it's already been dropped, return OK.
*/
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
if (d_unhashed(dentry)) {
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
return 0;
}
/*
@@ -389,9 +367,7 @@ int d_invalidate(struct dentry * dentry)
*/
if (!list_empty(&dentry->d_subdirs)) {
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
shrink_dcache_parent(dentry);
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
}

@@ -408,19 +384,17 @@ int d_invalidate(struct dentry * dentry)
if (dentry->d_count > 1) {
if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
return -EBUSY;
}
}

__d_drop(dentry);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
return 0;
}
EXPORT_SYMBOL(d_invalidate);

-/* This must be called with dcache_lock and d_lock held */
+/* This must be called with d_lock held */
static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
{
dentry->d_count++;
@@ -428,7 +402,7 @@ static inline struct dentry * __dget_loc
return dentry;
}

-/* This should be called _only_ with dcache_lock held */
+/* This must be called with d_lock held */
static inline struct dentry * __dget_locked(struct dentry *dentry)
{
spin_lock(&dentry->d_lock);
@@ -526,11 +500,9 @@ struct dentry * d_find_alias(struct inod
struct dentry *de = NULL;

if (!list_empty(&inode->i_dentry)) {
- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
de = __d_find_alias(inode, 0);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
}
return de;
}
@@ -544,7 +516,6 @@ void d_prune_aliases(struct inode *inode
{
struct dentry *dentry;
restart:
- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
spin_lock(&dentry->d_lock);
@@ -553,14 +524,12 @@ restart:
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
dput(dentry);
goto restart;
}
spin_unlock(&dentry->d_lock);
}
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
}
EXPORT_SYMBOL(d_prune_aliases);

@@ -574,20 +543,16 @@ EXPORT_SYMBOL(d_prune_aliases);
*/
static void prune_one_dentry(struct dentry * dentry)
__releases(dentry->d_lock)
- __releases(dcache_lock)
- __acquires(dcache_lock)
{
__d_drop(dentry);
dentry = d_kill(dentry);

/*
- * Prune ancestors. Locking is simpler than in dput(),
- * because dcache_lock needs to be taken anyway.
+ * Prune ancestors.
*/
while (dentry) {
struct dentry *parent = NULL;

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
again:
spin_lock(&dentry->d_lock);
@@ -604,7 +569,6 @@ again:
spin_unlock(&parent->d_lock);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
return;
}

@@ -673,7 +637,6 @@ restart:
}
spin_unlock(&dcache_lru_lock);

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
again:
spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
@@ -703,14 +666,13 @@ again1:
}
__dentry_lru_del_init(dentry);
spin_unlock(&dcache_lru_lock);
+
prune_one_dentry(dentry);
- /* dcache_lock and dentry->d_lock dropped */
- spin_lock(&dcache_lock);
+ /* dentry->d_lock dropped */
spin_lock(&dcache_inode_lock);
spin_lock(&dcache_lru_lock);
}
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);

if (count == NULL && !list_empty(&sb->s_dentry_lru))
goto restart;
@@ -740,7 +702,6 @@ static void prune_dcache(int count)

if (unused == 0 || count == 0)
return;
- spin_lock(&dcache_lock);
if (count >= unused)
prune_ratio = 1;
else
@@ -777,11 +738,9 @@ static void prune_dcache(int count)
if (down_read_trylock(&sb->s_umount)) {
if ((sb->s_root != NULL) &&
(!list_empty(&sb->s_dentry_lru))) {
- spin_unlock(&dcache_lock);
__shrink_dcache_sb(sb, &w_count,
DCACHE_REFERENCED);
pruned -= w_count;
- spin_lock(&dcache_lock);
}
up_read(&sb->s_umount);
}
@@ -795,7 +754,6 @@ static void prune_dcache(int count)
break;
}
spin_unlock(&sb_lock);
- spin_unlock(&dcache_lock);
}

/**
@@ -825,12 +783,10 @@ static void shrink_dcache_for_umount_sub
BUG_ON(!IS_ROOT(dentry));

/* detach this root from the system */
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
dentry_lru_del_init(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);

for (;;) {
/* descend to the first leaf in the current subtree */
@@ -839,7 +795,6 @@ static void shrink_dcache_for_umount_sub

/* this is a branch with children - detach all of them
* from the system in one go */
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
list_for_each_entry(loop, &dentry->d_subdirs,
d_u.d_child) {
@@ -849,7 +804,6 @@ static void shrink_dcache_for_umount_sub
spin_unlock(&loop->d_lock);
}
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);

/* move to the first child */
dentry = list_entry(dentry->d_subdirs.next,
@@ -920,8 +874,7 @@ out:

/*
* destroy the dentries attached to a superblock on unmounting
- * - we don't need to use dentry->d_lock, and only need dcache_lock when
- * removing the dentry from the system lists and hashes because:
+ * - we don't need to use dentry->d_lock because:
* - the superblock is detached from all mountings and open files, so the
* dentry trees will not be rearranged by the VFS
* - s_umount is write-locked, so the memory pressure shrinker will ignore
@@ -972,7 +925,6 @@ rename_retry:
this_parent = parent;
seq = read_seqbegin(&rename_lock);

- spin_lock(&dcache_lock);
if (d_mountpoint(parent))
goto positive;
spin_lock(&this_parent->d_lock);
@@ -1019,7 +971,6 @@ resume:
// d_unlinked(this_parent) || XXX
read_seqretry(&rename_lock, seq)) {
spin_unlock(&this_parent->d_lock);
- spin_unlock(&dcache_lock);
rcu_read_unlock();
goto rename_retry;
}
@@ -1028,12 +979,10 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- spin_unlock(&dcache_lock);
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
return 0; /* No mount points found in tree */
positive:
- spin_unlock(&dcache_lock);
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
return 1;
@@ -1066,7 +1015,6 @@ rename_retry:
this_parent = parent;
seq = read_seqbegin(&rename_lock);

- spin_lock(&dcache_lock);
spin_lock(&this_parent->d_lock);
repeat:
next = this_parent->d_subdirs.next;
@@ -1129,7 +1077,6 @@ resume:
// d_unlinked(this_parent) || XXX
read_seqretry(&rename_lock, seq)) {
spin_unlock(&this_parent->d_lock);
- spin_unlock(&dcache_lock);
rcu_read_unlock();
goto rename_retry;
}
@@ -1139,7 +1086,6 @@ resume:
}
out:
spin_unlock(&this_parent->d_lock);
- spin_unlock(&dcache_lock);
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
return found;
@@ -1240,7 +1186,6 @@ struct dentry *d_alloc(struct dentry * p
INIT_LIST_HEAD(&dentry->d_u.d_child);

if (parent) {
- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
dentry->d_parent = dget_dlock(parent);
@@ -1248,7 +1193,6 @@ struct dentry *d_alloc(struct dentry * p
list_add(&dentry->d_u.d_child, &parent->d_subdirs);
spin_unlock(&dentry->d_lock);
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
}

atomic_inc(&dentry_stat.nr_dentry);
@@ -1268,7 +1212,6 @@ struct dentry *d_alloc_name(struct dentr
}
EXPORT_SYMBOL(d_alloc_name);

-/* the caller must hold dcache_lock */
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
spin_lock(&dentry->d_lock);
@@ -1297,11 +1240,9 @@ static void __d_instantiate(struct dentr
void d_instantiate(struct dentry *entry, struct inode * inode)
{
BUG_ON(!list_empty(&entry->d_alias));
- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
__d_instantiate(entry, inode);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
security_d_instantiate(entry, inode);
}
EXPORT_SYMBOL(d_instantiate);
@@ -1360,11 +1301,9 @@ struct dentry *d_instantiate_unique(stru

BUG_ON(!list_empty(&entry->d_alias));

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
result = __d_instantiate_unique(entry, inode);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);

if (!result) {
security_d_instantiate(entry, inode);
@@ -1453,12 +1392,10 @@ struct dentry *d_obtain_alias(struct ino
}
tmp->d_parent = tmp; /* make sure dput doesn't croak */

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
res = __d_find_alias(inode, 0);
if (res) {
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
dput(tmp);
goto out_iput;
}
@@ -1474,7 +1411,6 @@ struct dentry *d_obtain_alias(struct ino
spin_unlock(&tmp->d_lock);
spin_unlock(&dcache_inode_lock);

- spin_unlock(&dcache_lock);
return tmp;

out_iput:
@@ -1504,21 +1440,18 @@ struct dentry *d_splice_alias(struct ino
struct dentry *new = NULL;

if (inode && S_ISDIR(inode->i_mode)) {
- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
new = __d_find_alias(inode, 1);
if (new) {
BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
security_d_instantiate(new, inode);
d_move(new, dentry);
iput(inode);
} else {
- /* already taking dcache_lock, so d_add() by hand */
+ /* already got dcache_inode_lock, so d_add() by hand */
__d_instantiate(dentry, inode);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
security_d_instantiate(dentry, inode);
d_rehash(dentry);
}
@@ -1591,12 +1524,10 @@ struct dentry *d_add_ci(struct dentry *d
* Negative dentry: instantiate it unless the inode is a directory and
* already has a dentry.
*/
- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
__d_instantiate(found, inode);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
security_d_instantiate(found, inode);
return found;
}
@@ -1608,7 +1539,6 @@ struct dentry *d_add_ci(struct dentry *d
new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
dget_locked(new);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
security_d_instantiate(found, inode);
d_move(new, found);
iput(inode);
@@ -1631,7 +1561,7 @@ EXPORT_SYMBOL(d_add_ci);
* is returned. The caller must use dput to free the entry when it has
* finished using it. %NULL is returned on failure.
*
- * __d_lookup is dcache_lock free. The hash list is protected using RCU.
+ * __d_lookup is global lock free. The hash list is protected using RCU.
* Memory barriers are used while updating and doing lockless traversal.
* To avoid races with d_move while rename is happening, d_lock is used.
*
@@ -1643,7 +1573,7 @@ EXPORT_SYMBOL(d_add_ci);
*
* The dentry unused LRU is not updated even if lookup finds the required dentry
* in there. It is updated in places such as prune_dcache, shrink_dcache_sb,
- * select_parent and __dget_locked. This laziness saves lookup from dcache_lock
+ * select_parent and __dget_locked. This laziness saves lookup from LRU lock
* acquisition.
*
* d_lookup() is protected against the concurrent renames in some unrelated
@@ -1774,25 +1704,22 @@ int d_validate(struct dentry *dentry, st
if (dentry->d_parent != dparent)
goto out;

- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
spin_lock(&dcache_hash_lock);
base = d_hash(dparent, dentry->d_name.hash);
hlist_for_each(lhp,base) {
/* hlist_for_each_entry_rcu() not required for d_hash list
- * as it is parsed under dcache_lock
+ * as it is parsed under dcache_hash_lock
*/
if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
spin_unlock(&dcache_hash_lock);
__dget_locked_dlock(dentry);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
return 1;
}
}
spin_unlock(&dcache_hash_lock);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
out:
return 0;
}
@@ -1825,7 +1752,6 @@ void d_delete(struct dentry * dentry)
/*
* Are we the only user?
*/
- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
spin_lock(&dentry->d_lock);
isdir = S_ISDIR(dentry->d_inode->i_mode);
@@ -1841,7 +1767,6 @@ void d_delete(struct dentry * dentry)

spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);

fsnotify_nameremove(dentry, isdir);
}
@@ -1868,13 +1793,11 @@ static void _d_rehash(struct dentry * en

void d_rehash(struct dentry * entry)
{
- spin_lock(&dcache_lock);
spin_lock(&entry->d_lock);
spin_lock(&dcache_hash_lock);
_d_rehash(entry);
spin_unlock(&dcache_hash_lock);
spin_unlock(&entry->d_lock);
- spin_unlock(&dcache_lock);
}
EXPORT_SYMBOL(d_rehash);

@@ -2032,9 +1955,7 @@ static void d_move_locked(struct dentry

void d_move(struct dentry * dentry, struct dentry * target)
{
- spin_lock(&dcache_lock);
d_move_locked(dentry, target);
- spin_unlock(&dcache_lock);
}
EXPORT_SYMBOL(d_move);

@@ -2061,13 +1982,12 @@ struct dentry *d_ancestor(struct dentry
* This helper attempts to cope with remotely renamed directories
*
* It assumes that the caller is already holding
- * dentry->d_parent->d_inode->i_mutex and the dcache_lock
+ * dentry->d_parent->d_inode->i_mutex
*
* Note: If ever the locking in lock_rename() changes, then please
* remember to update this too...
*/
static struct dentry *__d_unalias(struct dentry *dentry, struct dentry *alias)
- __releases(dcache_lock)
{
struct mutex *m1 = NULL, *m2 = NULL;
struct dentry *ret;
@@ -2094,7 +2014,6 @@ out_unalias:
ret = alias;
out_err:
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
if (m2)
mutex_unlock(m2);
if (m1)
@@ -2158,7 +2077,6 @@ struct dentry *d_materialise_unique(stru

BUG_ON(!d_unhashed(dentry));

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);

if (!inode) {
@@ -2204,7 +2122,6 @@ found:
spin_unlock(&dcache_hash_lock);
spin_unlock(&actual->d_lock);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
out_nolock:
if (actual == dentry) {
security_d_instantiate(dentry, inode);
@@ -2216,7 +2133,6 @@ out_nolock:

shouldnt_be_hashed:
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
BUG();
}
EXPORT_SYMBOL_GPL(d_materialise_unique);
@@ -2249,8 +2165,7 @@ static int prepend_name(char **buffer, i
* Returns a pointer into the buffer or an error code if the
* path was too long.
*
- * "buflen" should be positive. Caller holds the dcache_lock and
- * path->dentry->d_lock.
+ * "buflen" should be positive. Caller holds the path->dentry->d_lock.
*
* If path is not reachable from the supplied root, then the value of
* root is changed (without modifying refcounts).
@@ -2371,12 +2286,10 @@ char *d_path(const struct path *path, ch
path_get(&root);
spin_unlock(&current->fs->lock);

- spin_lock(&dcache_lock);
br_read_lock(vfsmount_lock);
tmp = root;
res = __d_path(path, &tmp, buf, buflen);
br_read_unlock(vfsmount_lock);
- spin_unlock(&dcache_lock);

path_put(&root);
return res;
@@ -2418,7 +2331,6 @@ rename_retry:
prepend(&end, &buflen, "\0", 1);

seq = read_seqbegin(&rename_lock);
- spin_lock(&dcache_lock);
br_read_lock(vfsmount_lock);
rcu_read_lock(); /* protect parent */
spin_lock(&dentry->d_lock);
@@ -2451,7 +2363,6 @@ out:
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
br_read_unlock(vfsmount_lock);
- spin_unlock(&dcache_lock);
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
return retval;
@@ -2495,7 +2406,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
spin_unlock(&current->fs->lock);

error = -ENOENT;
- spin_lock(&dcache_lock);
br_read_lock(vfsmount_lock);
spin_lock(&pwd.dentry->d_lock);
if (!d_unlinked(pwd.dentry)) {
@@ -2507,7 +2417,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
/* XXX: race here, have to close (eg. return unlinked from __d_path) */
cwd = __d_path(&pwd, &tmp, page, PAGE_SIZE);
br_read_unlock(vfsmount_lock);
- spin_unlock(&dcache_lock);

error = PTR_ERR(cwd);
if (IS_ERR(cwd))
@@ -2523,7 +2432,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
} else {
spin_unlock(&pwd.dentry->d_lock);
br_read_unlock(vfsmount_lock);
- spin_unlock(&dcache_lock);
}

out:
@@ -2609,7 +2517,6 @@ void d_genocide(struct dentry *root)
rename_retry:
this_parent = root;
seq = read_seqbegin(&rename_lock);
- spin_lock(&dcache_lock);
spin_lock(&this_parent->d_lock);
repeat:
next = this_parent->d_subdirs.next;
@@ -2657,7 +2564,6 @@ resume:
// d_unlinked(this_parent) || XXX
read_seqretry(&rename_lock, seq)) {
spin_unlock(&this_parent->d_lock);
- spin_unlock(&dcache_lock);
rcu_read_unlock();
goto rename_retry;
}
@@ -2666,7 +2572,6 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- spin_unlock(&dcache_lock);
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
}
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -618,8 +618,8 @@ int follow_up(struct path *path)
return 1;
}

-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
+/*
+ * serialization is taken care of in namespace.c
*/
static int __follow_mount(struct path *path)
{
@@ -651,9 +651,6 @@ static void follow_mount(struct path *pa
}
}

-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
- */
int follow_down(struct path *path)
{
struct vfsmount *mounted;
@@ -2152,12 +2149,10 @@ void dentry_unhash(struct dentry *dentry
{
dget(dentry);
shrink_dcache_parent(dentry);
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
if (dentry->d_count == 2)
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
}

int vfs_rmdir(struct inode *dir, struct dentry *dentry)
Index: linux-2.6/fs/seq_file.c
===================================================================
--- linux-2.6.orig/fs/seq_file.c
+++ linux-2.6/fs/seq_file.c
@@ -465,11 +465,9 @@ int seq_path_root(struct seq_file *m, st
if (size) {
char *p;

- spin_lock(&dcache_lock);
br_read_lock(vfsmount_lock);
p = __d_path(path, root, buf, size);
br_read_unlock(vfsmount_lock);
- spin_unlock(&dcache_lock);

res = PTR_ERR(p);
if (!IS_ERR(p)) {
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -150,13 +150,13 @@ struct dentry_operations {

/*
locking rules:
- big lock dcache_lock d_lock may block
-d_revalidate: no no no yes
-d_hash no no no yes
-d_compare: no yes yes no
-d_delete: no yes no no
-d_release: no no no yes
-d_iput: no no no yes
+ big lock d_lock may block
+d_revalidate: no no yes
+d_hash no no yes
+d_compare: no yes no
+d_delete: no no no
+d_release: no no yes
+d_iput: no no yes
*/

/* d_flags entries */
@@ -191,7 +191,6 @@ d_iput: no no no yes

extern spinlock_t dcache_inode_lock;
extern spinlock_t dcache_hash_lock;
-extern spinlock_t dcache_lock;
extern seqlock_t rename_lock;

/**
@@ -222,11 +221,9 @@ static inline void __d_drop(struct dentr

static inline void d_drop(struct dentry *dentry)
{
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
}

static inline int dname_external(struct dentry *dentry)
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -181,7 +181,6 @@ static void set_dentry_child_flags(struc
{
struct dentry *alias;

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
list_for_each_entry(alias, &inode->i_dentry, d_alias) {
struct dentry *child;
@@ -201,7 +200,6 @@ static void set_dentry_child_flags(struc
spin_unlock(&alias->d_lock);
}
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
}

/*
@@ -269,6 +267,7 @@ void inotify_d_instantiate(struct dentry
if (!inode)
return;

+ /* XXX: need parent lock in place of dcache_lock? */
spin_lock(&entry->d_lock);
parent = entry->d_parent;
if (parent->d_inode && inotify_inode_watched(parent->d_inode))
@@ -283,6 +282,7 @@ void inotify_d_move(struct dentry *entry
{
struct dentry *parent;

+ /* XXX: need parent lock in place of dcache_lock? */
parent = entry->d_parent;
if (inotify_inode_watched(parent->d_inode))
entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -47,24 +47,20 @@ find_acceptable_alias(struct dentry *res
if (acceptable(context, result))
return result;

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
dget_locked(dentry);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
if (toput)
dput(toput);
if (dentry != result && acceptable(context, dentry)) {
dput(result);
return dentry;
}
- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
toput = dentry;
}
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);

if (toput)
dput(toput);
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -17,7 +17,7 @@ prototypes:
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);

-locking rules:
+locking rules: XXX: update these!!
none have BKL
dcache_lock rename_lock ->d_lock may block
d_revalidate: no no no yes
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
@@ -158,21 +158,18 @@ static void spufs_prune_dir(struct dentr

mutex_lock(&dir->d_inode->i_mutex);
list_for_each_entry_safe(dentry, tmp, &dir->d_subdirs, d_u.d_child) {
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
if (!(d_unhashed(dentry)) && dentry->d_inode) {
dget_locked_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
simple_unlink(dir->d_inode, dentry);
- /* XXX: what is dcache_lock protecting here? Other
+ /* XXX: what was dcache_lock protecting here? Other
* filesystems (IB, configfs) release dcache_lock
* before unlink */
- spin_unlock(&dcache_lock);
dput(dentry);
} else {
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
}
}
shrink_dcache_parent(dir);
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_fs.c
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -273,18 +273,14 @@ static int remove_file(struct dentry *pa
goto bail;
}

- spin_lock(&dcache_lock);
spin_lock(&tmp->d_lock);
if (!(d_unhashed(tmp) && tmp->d_inode)) {
dget_locked_dlock(tmp);
__d_drop(tmp);
spin_unlock(&tmp->d_lock);
- spin_unlock(&dcache_lock);
simple_unlink(parent->d_inode, tmp);
- } else {
+ } else
spin_unlock(&tmp->d_lock);
- spin_unlock(&dcache_lock);
- }

ret = 0;
bail:
Index: linux-2.6/drivers/usb/core/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -347,7 +347,6 @@ static int usbfs_empty (struct dentry *d
{
struct list_head *list;

- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
list_for_each(list, &dentry->d_subdirs) {
struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
@@ -356,13 +355,11 @@ static int usbfs_empty (struct dentry *d
if (usbfs_positive(de)) {
spin_unlock(&de->d_lock);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
return 0;
}
spin_unlock(&de->d_lock);
}
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
return 1;
}

Index: linux-2.6/fs/affs/amigaffs.c
===================================================================
--- linux-2.6.orig/fs/affs/amigaffs.c
+++ linux-2.6/fs/affs/amigaffs.c
@@ -128,7 +128,6 @@ affs_fix_dcache(struct dentry *dentry, u
void *data = dentry->d_fsdata;
struct list_head *head, *next;

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
head = &inode->i_dentry;
next = head->next;
@@ -141,7 +140,6 @@ affs_fix_dcache(struct dentry *dentry, u
next = next->next;
}
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
}


Index: linux-2.6/fs/coda/cache.c
===================================================================
--- linux-2.6.orig/fs/coda/cache.c
+++ linux-2.6/fs/coda/cache.c
@@ -86,7 +86,6 @@ static void coda_flag_children(struct de
struct list_head *child;
struct dentry *de;

- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);
list_for_each(child, &parent->d_subdirs)
{
@@ -97,7 +96,6 @@ static void coda_flag_children(struct de
coda_flag_inode(de->d_inode, flag);
}
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
return;
}

Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h
+++ linux-2.6/fs/configfs/configfs_internal.h
@@ -120,7 +120,6 @@ static inline struct config_item *config
{
struct config_item * item = NULL;

- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
if (!d_unhashed(dentry)) {
struct configfs_dirent * sd = dentry->d_fsdata;
@@ -131,7 +130,6 @@ static inline struct config_item *config
item = config_item_get(sd->s_element);
}
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);

return item;
}
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -249,18 +249,14 @@ void configfs_drop_dentry(struct configf
struct dentry * dentry = sd->s_dentry;

if (dentry) {
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
if (!(d_unhashed(dentry) && dentry->d_inode)) {
dget_locked_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
simple_unlink(parent->d_inode, dentry);
- } else {
+ } else
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
- }
}
}

Index: linux-2.6/fs/ncpfs/dir.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/dir.c
+++ linux-2.6/fs/ncpfs/dir.c
@@ -364,7 +364,6 @@ ncp_dget_fpos(struct dentry *dentry, str
}

/* If a pointer is invalid, we search the dentry. */
- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
@@ -375,13 +374,11 @@ ncp_dget_fpos(struct dentry *dentry, str
else
dent = NULL;
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
goto out;
}
next = next->next;
}
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
return NULL;

out:
Index: linux-2.6/fs/ncpfs/ncplib_kernel.h
===================================================================
--- linux-2.6.orig/fs/ncpfs/ncplib_kernel.h
+++ linux-2.6/fs/ncpfs/ncplib_kernel.h
@@ -192,7 +192,6 @@ ncp_renew_dentries(struct dentry *parent
struct list_head *next;
struct dentry *dentry;

- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
@@ -206,7 +205,6 @@ ncp_renew_dentries(struct dentry *parent
next = next->next;
}
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
}

static inline void
@@ -216,7 +214,6 @@ ncp_invalidate_dircache_entries(struct d
struct list_head *next;
struct dentry *dentry;

- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
@@ -226,7 +223,6 @@ ncp_invalidate_dircache_entries(struct d
next = next->next;
}
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
}

struct ncp_cache_head {
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -1464,11 +1464,9 @@ static int nfs_unlink(struct inode *dir,
dfprintk(VFS, "NFS: unlink(%s/%ld, %s)\n", dir->i_sb->s_id,
dir->i_ino, dentry->d_name.name);

- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
/* Start asynchronous writeout of the inode */
write_inode_now(dentry->d_inode, 0);
error = nfs_sillyrename(dir, dentry);
@@ -1479,7 +1477,6 @@ static int nfs_unlink(struct inode *dir,
need_rehash = 1;
}
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
error = nfs_safe_remove(dentry);
if (!error || error == -ENOENT) {
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -151,7 +151,6 @@ struct dentry *ocfs2_find_local_alias(st
struct list_head *p;
struct dentry *dentry = NULL;

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
list_for_each(p, &inode->i_dentry) {
dentry = list_entry(p, struct dentry, d_alias);
@@ -171,7 +170,6 @@ struct dentry *ocfs2_find_local_alias(st
}

spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);

return dentry;
}
Index: linux-2.6/fs/smbfs/cache.c
===================================================================
--- linux-2.6.orig/fs/smbfs/cache.c
+++ linux-2.6/fs/smbfs/cache.c
@@ -62,7 +62,6 @@ smb_invalidate_dircache_entries(struct d
struct list_head *next;
struct dentry *dentry;

- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
@@ -72,7 +71,6 @@ smb_invalidate_dircache_entries(struct d
next = next->next;
}
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
}

/*
@@ -98,7 +96,6 @@ smb_dget_fpos(struct dentry *dentry, str
}

/* If a pointer is invalid, we search the dentry. */
- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
@@ -115,7 +112,6 @@ smb_dget_fpos(struct dentry *dentry, str
dent = NULL;
out_unlock:
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
return dent;
}

Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -869,7 +869,6 @@ static void cgroup_clear_directory(struc
struct list_head *node;

BUG_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
node = dentry->d_subdirs.next;
while (node != &dentry->d_subdirs) {
@@ -884,18 +883,15 @@ static void cgroup_clear_directory(struc
dget_locked_dlock(d);
spin_unlock(&d->d_lock);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
d_delete(d);
simple_unlink(dentry->d_inode, d);
dput(d);
- spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
} else
spin_unlock(&d->d_lock);
node = dentry->d_subdirs.next;
}
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
}

/*
@@ -907,14 +903,12 @@ static void cgroup_d_remove_dir(struct d

cgroup_clear_directory(dentry);

- spin_lock(&dcache_lock);
parent = dentry->d_parent;
spin_lock(&parent->d_lock);
spin_lock(&dentry->d_lock);
list_del_init(&dentry->d_u.d_child);
spin_unlock(&dentry->d_lock);
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
remove_dir(dentry);
}

Index: linux-2.6/security/selinux/selinuxfs.c
===================================================================
--- linux-2.6.orig/security/selinux/selinuxfs.c
+++ linux-2.6/security/selinux/selinuxfs.c
@@ -941,7 +941,6 @@ static void sel_remove_entries(struct de
{
struct list_head *node;

- spin_lock(&dcache_lock);
spin_lock(&de->d_lock);
node = de->d_subdirs.next;
while (node != &de->d_subdirs) {
@@ -954,11 +953,9 @@ static void sel_remove_entries(struct de
dget_locked_dlock(d);
spin_unlock(&de->d_lock);
spin_unlock(&d->d_lock);
- spin_unlock(&dcache_lock);
d_delete(d);
simple_unlink(de->d_inode, d);
dput(d);
- spin_lock(&dcache_lock);
spin_lock(&de->d_lock);
} else
spin_unlock(&d->d_lock);
@@ -966,7 +963,6 @@ static void sel_remove_entries(struct de
}

spin_unlock(&de->d_lock);
- spin_unlock(&dcache_lock);
}

#define BOOL_DIR_NAME "booleans"
Index: linux-2.6/security/tomoyo/realpath.c
===================================================================
--- linux-2.6.orig/security/tomoyo/realpath.c
+++ linux-2.6/security/tomoyo/realpath.c
@@ -92,12 +92,10 @@ int tomoyo_realpath_from_path2(struct pa
} else {
struct path ns_root = {.mnt = NULL, .dentry = NULL};

- spin_lock(&dcache_lock);
br_read_lock(vfsmount_lock);
/* go to whatever namespace root we are under */
sp = __d_path(path, &ns_root, newname, newname_len);
br_read_unlock(vfsmount_lock);
- spin_unlock(&dcache_lock);
/* Prepend "/proc" prefix if using internal proc vfs mount. */
if (!IS_ERR(sp) && (path->mnt->mnt_flags & MNT_INTERNAL) &&
(path->mnt->mnt_sb->s_magic == PROC_SUPER_MAGIC)) {
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c
+++ linux-2.6/fs/nfs/getroot.c
@@ -64,13 +64,11 @@ static int nfs_superblock_set_dummy_root
* This again causes shrink_dcache_for_umount_subtree() to
* Oops, since the test for IS_ROOT() will fail.
*/
- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
spin_lock(&sb->s_root->d_lock);
list_del_init(&sb->s_root->d_alias);
spin_unlock(&sb->s_root->d_lock);
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
}
return 0;
}
Index: linux-2.6/drivers/staging/pohmelfs/path_entry.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/path_entry.c
+++ linux-2.6/drivers/staging/pohmelfs/path_entry.c
@@ -101,7 +101,6 @@ rename_retry:
d = first;
seq = read_seqbegin(&rename_lock);
rcu_read_lock();
- spin_lock(&dcache_lock);

if (!IS_ROOT(d) && d_unhashed(d))
len += UNHASHED_OBSCURE_STRING_SIZE; /* Obscure " (deleted)" string */
@@ -110,7 +109,6 @@ rename_retry:
len += d->d_name.len + 1; /* Plus slash */
d = d->d_parent;
}
- spin_unlock(&dcache_lock);
rcu_read_unlock();
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
Index: linux-2.6/fs/nfs/namespace.c
===================================================================
--- linux-2.6.orig/fs/nfs/namespace.c
+++ linux-2.6/fs/nfs/namespace.c
@@ -60,7 +60,6 @@ rename_retry:

seq = read_seqbegin(&rename_lock);
rcu_read_lock();
- spin_lock(&dcache_lock);
while (!IS_ROOT(dentry) && dentry != droot) {
namelen = dentry->d_name.len;
buflen -= namelen + 1;
@@ -71,7 +70,6 @@ rename_retry:
*--end = '/';
dentry = dentry->d_parent;
}
- spin_unlock(&dcache_lock);
rcu_read_unlock();
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
@@ -91,7 +89,6 @@ rename_retry:
memcpy(end, base, namelen);
return end;
Elong_unlock:
- spin_unlock(&dcache_lock);
rcu_read_unlock();
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
Index: linux-2.6/include/linux/fsnotify_backend.h
===================================================================
--- linux-2.6.orig/include/linux/fsnotify_backend.h
+++ linux-2.6/include/linux/fsnotify_backend.h
@@ -276,10 +276,10 @@ static inline void __fsnotify_update_dca
{
struct dentry *parent;

- assert_spin_locked(&dcache_lock);
assert_spin_locked(&dentry->d_lock);

parent = dentry->d_parent;
+ /* XXX: after dcache_lock removal, there is a race with parent->d_inode and fsnotify_inode_watches_children. must fix */
if (parent->d_inode && fsnotify_inode_watches_children(parent->d_inode))
dentry->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
else
@@ -288,15 +288,12 @@ static inline void __fsnotify_update_dca

/*
* fsnotify_d_instantiate - instantiate a dentry for inode
- * Called with dcache_lock held.
*/
static inline void __fsnotify_d_instantiate(struct dentry *dentry, struct inode *inode)
{
if (!inode)
return;

- assert_spin_locked(&dcache_lock);
-
spin_lock(&dentry->d_lock);
__fsnotify_update_dcache_flags(dentry);
spin_unlock(&dentry->d_lock);
Index: linux-2.6/fs/notify/fsnotify.c
===================================================================
--- linux-2.6.orig/fs/notify/fsnotify.c
+++ linux-2.6/fs/notify/fsnotify.c
@@ -53,7 +53,6 @@ void __fsnotify_update_child_dentry_flag
/* determine if the children should tell inode about their events */
watched = fsnotify_inode_watches_children(inode);

- spin_lock(&dcache_lock);
spin_lock(&dcache_inode_lock);
/* run all of the dentries associated with this inode. Since this is a
* directory, there damn well better only be one item on this list */
@@ -78,7 +77,6 @@ void __fsnotify_update_child_dentry_flag
spin_unlock(&alias->d_lock);
}
spin_unlock(&dcache_inode_lock);
- spin_unlock(&dcache_lock);
}

/* Notify this dentry's parent about a child's events. */
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -2425,6 +2425,10 @@ static inline ino_t parent_ino(struct de
{
ino_t res;

+ /*
+ * Don't strictly need d_lock here? If the parent ino could change
+ * then surely we'd have a deeper race in the caller?
+ */
spin_lock(&dentry->d_lock);
res = dentry->d_parent->d_inode->i_ino;
spin_unlock(&dentry->d_lock);
Index: linux-2.6/fs/autofs4/autofs_i.h
===================================================================
--- linux-2.6.orig/fs/autofs4/autofs_i.h
+++ linux-2.6/fs/autofs4/autofs_i.h
@@ -16,6 +16,7 @@
#include <linux/auto_fs4.h>
#include <linux/auto_dev-ioctl.h>
#include <linux/mutex.h>
+#include <linux/spinlock.h>
#include <linux/list.h>

/* This is the range of ioctl() numbers we claim as ours */
@@ -60,6 +61,8 @@ do { \
current->pid, __func__, ##args); \
} while (0)

+extern spinlock_t autofs4_lock;
+
/* Unified info structure. This is pointed to by both the dentry and
inode structures. Each file in the filesystem has an instance of this
structure. It holds a reference to the dentry, so dentries are never
Index: linux-2.6/fs/autofs4/expire.c
===================================================================
--- linux-2.6.orig/fs/autofs4/expire.c
+++ linux-2.6/fs/autofs4/expire.c
@@ -94,9 +94,9 @@ done:
* Calculate next entry in top down tree traversal.
* From next_mnt in namespace.c - elegant.
*
- * How is this supposed to work if we drop dcache_lock between calls anyway?
+ * How is this supposed to work if we drop autofs4_lock between calls anyway?
* How does it cope with renames?
- * And also callers dput the returned dentry before taking dcache_lock again
+ * And also callers dput the returned dentry before taking autofs4_lock again
* so what prevents it from being freed??
*/
static struct dentry *get_next_positive_dentry(struct dentry *p,
@@ -105,7 +105,7 @@ static struct dentry *get_next_positive_
struct list_head *next;
struct dentry *ret;

- spin_lock(&dcache_lock);
+ spin_lock(&autofs4_lock);
again:
spin_lock(&p->d_lock);
next = p->d_subdirs.next;
@@ -115,7 +115,7 @@ again:

if (p == root) {
spin_unlock(&p->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);
return NULL;
}

@@ -143,7 +143,7 @@ again:
dget_dlock(ret);
spin_unlock(&ret->d_lock);
spin_unlock(&p->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);

return ret;
}
@@ -313,7 +313,7 @@ struct dentry *autofs4_expire_direct(str
* A tree is eligible if :-
* - it is unused by any user process
* - it has been unused for exp_timeout time
- * This seems to be racy dropping dcache_lock and asking for next->next after
+ * This seems to be racy dropping autofs4_lock and asking for next->next after
* the lock has been dropped.
*/
struct dentry *autofs4_expire_indirect(struct super_block *sb,
@@ -336,7 +336,7 @@ struct dentry *autofs4_expire_indirect(s
now = jiffies;
timeout = sbi->exp_timeout;

- spin_lock(&dcache_lock);
+ spin_lock(&autofs4_lock);
spin_lock(&root->d_lock);
next = root->d_subdirs.next;

@@ -355,7 +355,7 @@ struct dentry *autofs4_expire_indirect(s
dentry = dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&root->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);

spin_lock(&sbi->fs_lock);
ino = autofs4_dentry_ino(dentry);
@@ -420,12 +420,12 @@ struct dentry *autofs4_expire_indirect(s
next:
spin_unlock(&sbi->fs_lock);
dput(dentry);
- spin_lock(&dcache_lock);
+ spin_lock(&autofs4_lock);
spin_lock(&root->d_lock);
next = next->next;
}
spin_unlock(&root->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);
return NULL;

found:
@@ -435,13 +435,13 @@ found:
ino->flags |= AUTOFS_INF_EXPIRING;
init_completion(&ino->expire_complete);
spin_unlock(&sbi->fs_lock);
- spin_lock(&dcache_lock);
+ spin_lock(&autofs4_lock);
spin_lock(&expired->d_parent->d_lock);
spin_lock_nested(&expired->d_lock, DENTRY_D_LOCK_NESTED);
list_move(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
spin_unlock(&expired->d_lock);
spin_unlock(&expired->d_parent->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);
return expired;
}

Index: linux-2.6/fs/autofs4/root.c
===================================================================
--- linux-2.6.orig/fs/autofs4/root.c
+++ linux-2.6/fs/autofs4/root.c
@@ -134,15 +134,15 @@ static int autofs4_dir_open(struct inode
* autofs file system so just let the libfs routines handle
* it.
*/
- spin_lock(&dcache_lock);
+ spin_lock(&autofs4_lock);
spin_lock(&dentry->d_lock);
if (!d_mountpoint(dentry) && list_empty(&dentry->d_subdirs)) {
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);
return -ENOENT;
}
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);

out:
return dcache_dir_open(inode, file);
@@ -248,11 +248,11 @@ static void *autofs4_follow_link(struct
/* We trigger a mount for almost all flags */
lookup_type = autofs4_need_mount(nd->flags);
spin_lock(&sbi->fs_lock);
- spin_lock(&dcache_lock);
+ spin_lock(&autofs4_lock);
spin_lock(&dentry->d_lock);
if (!(lookup_type || ino->flags & AUTOFS_INF_PENDING)) {
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);
spin_unlock(&sbi->fs_lock);
goto follow;
}
@@ -274,7 +274,7 @@ static void *autofs4_follow_link(struct
goto follow;
}
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);
spin_unlock(&sbi->fs_lock);
follow:
/*
@@ -476,7 +476,7 @@ static struct dentry *autofs4_lookup_exp
const unsigned char *str = name->name;
struct list_head *p, *head;

- spin_lock(&dcache_lock);
+ spin_lock(&autofs4_lock);
spin_lock(&sbi->lookup_lock);
head = &sbi->expiring_list;
list_for_each(p, head) {
Index: linux-2.6/fs/autofs4/waitq.c
===================================================================
--- linux-2.6.orig/fs/autofs4/waitq.c
+++ linux-2.6/fs/autofs4/waitq.c
@@ -194,14 +194,15 @@ static int autofs4_getpath(struct autofs
rename_retry:
buf = *name;
len = 0;
+
seq = read_seqbegin(&rename_lock);
rcu_read_lock();
- spin_lock(&dcache_lock);
+ spin_lock(&autofs4_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
len += tmp->d_name.len + 1;

if (!len || --len > NAME_MAX) {
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);
rcu_read_unlock();
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
@@ -217,7 +218,7 @@ rename_retry:
p -= tmp->d_name.len;
strncpy(p, tmp->d_name.name, tmp->d_name.len);
}
- spin_unlock(&dcache_lock);
+ spin_unlock(&autofs4_lock);
rcu_read_unlock();
if (read_seqretry(&rename_lock, seq))
goto rename_retry;
Index: linux-2.6/drivers/infiniband/hw/qib/qib_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/qib/qib_fs.c
+++ linux-2.6/drivers/infiniband/hw/qib/qib_fs.c
@@ -451,17 +451,14 @@ static int remove_file(struct dentry *pa
goto bail;
}

- spin_lock(&dcache_lock);
spin_lock(&tmp->d_lock);
if (!(d_unhashed(tmp) && tmp->d_inode)) {
dget_locked_dlock(tmp);
__d_drop(tmp);
spin_unlock(&tmp->d_lock);
- spin_unlock(&dcache_lock);
simple_unlink(parent->d_inode, tmp);
} else {
spin_unlock(&tmp->d_lock);
- spin_unlock(&dcache_lock);
}

ret = 0;
Index: linux-2.6/fs/ceph/dir.c
===================================================================
--- linux-2.6.orig/fs/ceph/dir.c
+++ linux-2.6/fs/ceph/dir.c
@@ -111,7 +111,6 @@ static int __dcache_readdir(struct file
dout("__dcache_readdir %p at %llu (last %p)\n", dir, filp->f_pos,
last);

- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);

/* start at beginning? */
@@ -155,7 +154,6 @@ more:
dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
spin_unlock(&inode->i_lock);

dout(" %llu (%llu) dentry %p %.*s %p\n", di->offset, filp->f_pos,
@@ -178,7 +176,6 @@ more:
}

spin_lock(&inode->i_lock);
- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);

last = dentry;
@@ -189,7 +186,7 @@ more:
p = p->prev;
filp->f_pos++;

- /* make sure a dentry wasn't dropped while we didn't have dcache_lock */
+ /* make sure a dentry wasn't dropped while we didn't have parent->d_lock */
if ((ceph_inode(dir)->i_ceph_flags & CEPH_I_COMPLETE))
goto more;
dout(" lost I_COMPLETE on %p; falling back to mds\n", dir);
@@ -197,7 +194,6 @@ more:

out_unlock:
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);

if (last) {
spin_unlock(&inode->i_lock);
Index: linux-2.6/fs/ceph/inode.c
===================================================================
--- linux-2.6.orig/fs/ceph/inode.c
+++ linux-2.6/fs/ceph/inode.c
@@ -825,7 +825,6 @@ static void ceph_set_dentry_offset(struc
di->offset = ceph_inode(inode)->i_max_offset++;
spin_unlock(&inode->i_lock);

- spin_lock(&dcache_lock);
spin_lock(&dir->d_lock);
spin_lock_nested(&dn->d_lock, DENTRY_D_LOCK_NESTED);
list_move(&dn->d_u.d_child, &dir->d_subdirs);
@@ -833,7 +832,6 @@ static void ceph_set_dentry_offset(struc
dn->d_u.d_child.prev, dn->d_u.d_child.next);
spin_unlock(&dn->d_lock);
spin_unlock(&dir->d_lock);
- spin_unlock(&dcache_lock);
}

/*
@@ -1213,13 +1211,11 @@ retry_lookup:
goto retry_lookup;
} else {
/* reorder parent's d_subdirs */
- spin_lock(&dcache_lock);
spin_lock(&parent->d_lock);
spin_lock_nested(&dn->d_lock, DENTRY_D_LOCK_NESTED);
list_move(&dn->d_u.d_child, &parent->d_subdirs);
spin_unlock(&dn->d_lock);
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_lock);
}

di = dn->d_fsdata;
n***@suse.de
2010-06-24 03:02:53 UTC
Permalink
Put the inode counters (nr_inodes, nr_unused) under the existing inode list
locks instead of using atomics, to reduce atomic operations on the inode
creation and teardown paths.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/fs-writeback.c | 6 ++----
fs/inode.c | 30 ++++++++++++------------------
include/linux/fs.h | 4 ++--
3 files changed, 16 insertions(+), 24 deletions(-)
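
The idea is simple: when a counter is only ever modified in paths that
already hold a spinlock, it can be a plain integer updated under that lock
instead of an atomic. A minimal userspace sketch of the pattern, with
invented names and pthread spinlocks standing in for the kernel ones:

#include <pthread.h>
#include <stdio.h>

/* A list protected by a lock, with a counter that piggy-backs on the
 * same lock instead of being an atomic. */
struct item {
	struct item *next;
};

static pthread_spinlock_t list_lock;	/* plays the role of the inode list lock */
static struct item *list_head;
static long nr_items;			/* plain integer: only touched under list_lock */

static void add_item(struct item *it)
{
	pthread_spin_lock(&list_lock);
	it->next = list_head;
	list_head = it;
	nr_items++;			/* no atomic needed, list_lock is held */
	pthread_spin_unlock(&list_lock);
}

static struct item *del_first(void)
{
	struct item *it;

	pthread_spin_lock(&list_lock);
	it = list_head;
	if (it) {
		list_head = it->next;
		nr_items--;
	}
	pthread_spin_unlock(&list_lock);
	return it;
}

int main(void)
{
	static struct item a, b;

	pthread_spin_init(&list_lock, PTHREAD_PROCESS_PRIVATE);
	add_item(&a);
	add_item(&b);
	del_first();
	printf("nr_items = %ld\n", nr_items);
	return 0;
}

Readers of the counters that do not take the lock (the shrinker return value,
the writeback estimates below) may see a slightly stale value, which is
presumably acceptable here since the counters only feed heuristics.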

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -141,8 +141,8 @@ static DECLARE_RWSEM(iprune_sem);
* Statistics gathering..
*/
struct inodes_stat_t inodes_stat = {
- .nr_inodes = ATOMIC_INIT(0),
- .nr_unused = ATOMIC_INIT(0),
+ .nr_inodes = 0,
+ .nr_unused = 0,
};

static struct kmem_cache *inode_cachep __read_mostly;
@@ -390,7 +390,6 @@ static void dispose_list(struct list_hea
destroy_inode(inode);
nr_disposed++;
}
- atomic_sub(nr_disposed, &inodes_stat.nr_inodes);
}

/*
@@ -399,7 +398,7 @@ static void dispose_list(struct list_hea
static int invalidate_list(struct list_head *head, struct list_head *dispose)
{
struct list_head *next;
- int busy = 0, count = 0;
+ int busy = 0;

next = head->next;
for (;;) {
@@ -419,19 +418,17 @@ static int invalidate_list(struct list_h
if (!inode->i_count) {
spin_lock(&wb_inode_list_lock);
list_del(&inode->i_list);
+ inodes_stat.nr_unused--;
spin_unlock(&wb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
list_add(&inode->i_list, dispose);
- count++;
continue;
}
spin_unlock(&inode->i_lock);
busy = 1;
}
- /* only unused inodes may be cached with i_count zero */
- atomic_sub(count, &inodes_stat.nr_unused);
return busy;
}

@@ -483,7 +480,6 @@ EXPORT_SYMBOL(invalidate_inodes);
static void prune_icache(int nr_to_scan)
{
LIST_HEAD(freeable);
- int nr_pruned = 0;
unsigned long reap = 0;

down_read(&iprune_sem);
@@ -504,7 +500,7 @@ again:
if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
list_del_init(&inode->i_list);
spin_unlock(&inode->i_lock);
- atomic_dec(&inodes_stat.nr_unused);
+ inodes_stat.nr_unused--;
continue;
}
if (inode->i_state) {
@@ -530,9 +526,8 @@ again:
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- nr_pruned++;
+ inodes_stat.nr_unused--;
}
- atomic_sub(nr_pruned, &inodes_stat.nr_unused);
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
@@ -564,8 +559,7 @@ static int shrink_icache_memory(int nr,
return -1;
prune_icache(nr);
}
- return (atomic_read(&inodes_stat.nr_unused) / 100) *
- sysctl_vfs_cache_pressure;
+ return inodes_stat.nr_unused / 100 * sysctl_vfs_cache_pressure;
}

static struct shrinker icache_shrinker = {
@@ -661,9 +655,9 @@ static inline void
__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
struct inode *inode)
{
- atomic_inc(&inodes_stat.nr_inodes);
spin_lock(&sb_inode_list_lock);
list_add_rcu(&inode->i_sb_list, &sb->s_inodes);
+ inodes_stat.nr_inodes++;
spin_unlock(&sb_inode_list_lock);
if (b) {
spin_lock_bucket(b);
@@ -1309,17 +1303,17 @@ void generic_delete_inode(struct inode *
if (!list_empty(&inode->i_list)) {
spin_lock(&wb_inode_list_lock);
list_del_init(&inode->i_list);
- spin_unlock(&wb_inode_list_lock);
if (!inode->i_state)
- atomic_dec(&inodes_stat.nr_unused);
+ inodes_stat.nr_unused--;
+ spin_unlock(&wb_inode_list_lock);
}
spin_lock(&sb_inode_list_lock);
list_del_rcu(&inode->i_sb_list);
+ inodes_stat.nr_inodes--;
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- atomic_dec(&inodes_stat.nr_inodes);

if (op->delete_inode) {
void (*delete)(struct inode *) = op->delete_inode;
@@ -1367,8 +1361,8 @@ int generic_detach_inode(struct inode *i
list_empty(&inode->i_list)) {
spin_lock(&wb_inode_list_lock);
list_add(&inode->i_list, &inode_unused);
+ inodes_stat.nr_unused++;
spin_unlock(&wb_inode_list_lock);
- atomic_inc(&inodes_stat.nr_unused);
}
spin_unlock(&inode->i_lock);
return 0;
@@ -1385,17 +1379,17 @@ int generic_detach_inode(struct inode *i
if (!list_empty(&inode->i_list)) {
spin_lock(&wb_inode_list_lock);
list_del_init(&inode->i_list);
- spin_unlock(&wb_inode_list_lock);
if (!inode->i_state)
- atomic_dec(&inodes_stat.nr_unused);
+ inodes_stat.nr_unused--;
+ spin_unlock(&wb_inode_list_lock);
}
spin_lock(&sb_inode_list_lock);
list_del_rcu(&inode->i_sb_list);
+ inodes_stat.nr_inodes--;
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- atomic_dec(&inodes_stat.nr_inodes);
return 1;
}
EXPORT_SYMBOL_GPL(generic_detach_inode);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -31,6 +31,12 @@
#define SEEK_END 2 /* seek relative to end of file */
#define SEEK_MAX SEEK_END

+struct inodes_stat_t {
+ int nr_inodes;
+ int nr_unused;
+ int dummy[5]; /* padding for sysctl ABI compatibility */
+};
+
/* And dynamically-tunable limits and defaults: */
struct files_stat_struct {
int nr_files; /* read only */
@@ -411,12 +417,6 @@ typedef int (get_block_t)(struct inode *
typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
ssize_t bytes, void *private);

-struct inodes_stat_t {
- atomic_t nr_inodes;
- atomic_t nr_unused;
- int dummy[5]; /* padding for sysctl ABI compatibility */
-};
-
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -904,8 +904,7 @@ static long wb_check_old_data_flush(stru
wb->last_old_flush = jiffies;
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (atomic_read(&inodes_stat.nr_inodes) -
- atomic_read(&inodes_stat.nr_unused));
+ inodes_stat.nr_inodes - inodes_stat.nr_unused;

if (nr_pages) {
struct wb_writeback_args args = {
@@ -1258,8 +1257,7 @@ void writeback_inodes_sb(struct super_bl
long nr_to_write;

nr_to_write = nr_dirty + nr_unstable +
- (atomic_read(&inodes_stat.nr_inodes) -
- atomic_read(&inodes_stat.nr_unused));
+ inodes_stat.nr_inodes - inodes_stat.nr_unused;

bdi_start_writeback(sb->s_bdi, sb, nr_to_write);
}
n***@suse.de
2010-06-24 03:02:58 UTC
Permalink
Signed-off-by: Nick Piggin <***@suse.de>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -484,15 +484,17 @@ writeback_single_inode(struct inode *ino
spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY;
inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
- spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
- int err = write_inode(inode, wbc);
+ int err;
+
+ spin_unlock(&inode->i_lock);
+ err = write_inode(inode, wbc);
if (ret == 0)
ret = err;
+ spin_lock(&inode->i_lock);
}

- spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -810,12 +810,9 @@ struct inode *new_inode(struct super_blo

inode = alloc_inode(sb);
if (inode) {
- /* XXX: init as locked for speedup */
- spin_lock(&inode->i_lock);
inode->i_ino = last_ino_get();
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
- spin_unlock(&inode->i_lock);
}
return inode;
}
n***@suse.de
2010-06-24 03:02:32 UTC
Permalink
dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in the cases where it is not actually needed: d_delete() now takes it (with
a trylock) only when dropping the last reference, and d_materialise_unique()
skips it entirely when there is no inode.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
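
The d_delete() hunk relies on a standard pattern for taking a lock that is
ordered before the one already held: trylock it, and on failure back out,
relax and retry from the top. A standalone userspace sketch of that shape
(invented names; pthread spinlocks and sched_yield() stand in for the kernel
primitives):

#include <pthread.h>
#include <sched.h>

/* Lock order is outer before inner. A path that already holds inner and
 * discovers it also needs outer must not block on it, so it trylocks. */
static pthread_spinlock_t outer;	/* plays the role of dcache_inode_lock */
static pthread_spinlock_t inner;	/* plays the role of dentry->d_lock */
static int refcount = 1;

static void delete_object(void)
{
again:
	pthread_spin_lock(&inner);
	if (refcount == 1) {
		if (pthread_spin_trylock(&outer) != 0) {
			/* blocking here would invert the lock order:
			 * drop inner and retry from the top */
			pthread_spin_unlock(&inner);
			sched_yield();		/* cpu_relax() stand-in */
			goto again;
		}
		/* ... final teardown that needs both locks ... */
		pthread_spin_unlock(&outer);
	}
	refcount--;
	pthread_spin_unlock(&inner);
}

int main(void)
{
	pthread_spin_init(&outer, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&inner, PTHREAD_PROCESS_PRIVATE);
	delete_object();
	return 0;
}

The retry is confined to the d_count == 1 case, where the inode-side lock is
genuinely needed for the final teardown.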

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1789,10 +1789,15 @@ void d_delete(struct dentry * dentry)
/*
* Are we the only user?
*/
- spin_lock(&dcache_inode_lock);
+again:
spin_lock(&dentry->d_lock);
isdir = S_ISDIR(dentry->d_inode->i_mode);
if (dentry->d_count == 1) {
+ if (!spin_trylock(&dcache_inode_lock)) {
+ spin_unlock(&dentry->d_lock);
+ cpu_relax();
+ goto again;
+ }
dentry->d_flags &= ~DCACHE_CANT_MOUNT;
dentry_iput(dentry);
fsnotify_nameremove(dentry, isdir);
@@ -1803,7 +1808,6 @@ void d_delete(struct dentry * dentry)
__d_drop(dentry);

spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_inode_lock);

fsnotify_nameremove(dentry, isdir);
}
@@ -2116,14 +2120,15 @@ struct dentry *d_materialise_unique(stru

BUG_ON(!d_unhashed(dentry));

- spin_lock(&dcache_inode_lock);
-
if (!inode) {
actual = dentry;
__d_instantiate(dentry, NULL);
- goto found_lock;
+ d_rehash(actual);
+ goto out_nolock;
}

+ spin_lock(&dcache_inode_lock);
+
if (S_ISDIR(inode->i_mode)) {
struct dentry *alias;

@@ -2150,10 +2155,9 @@ struct dentry *d_materialise_unique(stru
actual = __d_instantiate_unique(dentry, inode);
if (!actual)
actual = dentry;
- else if (unlikely(!d_unhashed(actual)))
- goto shouldnt_be_hashed;
+ else
+ BUG_ON(!d_unhashed(actual));

-found_lock:
spin_lock(&actual->d_lock);
found:
_d_rehash(actual);
@@ -2167,10 +2171,6 @@ out_nolock:

iput(inode);
return actual;
-
-shouldnt_be_hashed:
- spin_unlock(&dcache_inode_lock);
- BUG();
}
EXPORT_SYMBOL_GPL(d_materialise_unique);
n***@suse.de
2010-06-24 03:02:38 UTC
Permalink
Protect sb->s_inodes with a new lock, sb_inode_list_lock.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/drop_caches.c | 4 ++++
fs/fs-writeback.c | 4 ++++
fs/inode.c | 12 ++++++++++++
fs/notify/inode_mark.c | 2 ++
fs/notify/inotify/inotify.c | 2 ++
fs/quota/dquot.c | 6 ++++++
include/linux/writeback.h | 1 +
7 files changed, 31 insertions(+)
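
Most of the hunks below repeat one walk pattern: while iterating s_inodes
under the new lock, pin the current inode with __iget(), drop the locks to
do work that may sleep, then retake them and continue from the pinned inode.
A userspace sketch of that walk, with invented names and a pthread mutex
standing in for the lock:

#include <pthread.h>
#include <stdio.h>

struct obj {
	struct obj *next;
	int refcount;
	int data;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj *head;

/* May sleep, so it must be called without list_lock held. */
static void slow_work(struct obj *o)
{
	printf("processing %d\n", o->data);
}

static void walk_all(void)
{
	struct obj *o;

	pthread_mutex_lock(&list_lock);
	for (o = head; o; o = o->next) {
		o->refcount++;			/* pin it, like __iget() */
		pthread_mutex_unlock(&list_lock);

		slow_work(o);			/* safe: the object cannot go away */

		pthread_mutex_lock(&list_lock);
		o->refcount--;			/* the kernel defers this drop, since iput() may sleep */
	}
	pthread_mutex_unlock(&list_lock);
}

int main(void)
{
	static struct obj a = { .data = 1 }, b = { .data = 2 };

	a.next = &b;
	head = &a;
	walk_all();
	return 0;
}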

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -17,18 +17,22 @@ static void drop_pagecache_sb(struct sup
struct inode *inode, *toput_inode = NULL;

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
continue;
if (inode->i_mapping->nrpages == 0)
continue;
__iget(inode);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
iput(toput_inode);
}
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -1166,6 +1166,7 @@ static void wait_sb_inodes(struct super_
WARN_ON(!rwsem_is_locked(&sb->s_umount));

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);

/*
* Data integrity sync. Must wait for all pages under writeback,
@@ -1183,6 +1184,7 @@ static void wait_sb_inodes(struct super_
if (mapping->nrpages == 0)
continue;
__iget(inode);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
/*
* We hold a reference to 'inode' so it couldn't have
@@ -1200,7 +1202,9 @@ static void wait_sb_inodes(struct super_
cond_resched();

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
iput(old_inode);
}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -27,6 +27,15 @@
#include <linux/posix_acl.h>

/*
+ * Usage:
+ * sb_inode_list_lock protects:
+ * s_inodes, i_sb_list
+ *
+ * Ordering:
+ * inode_lock
+ * sb_inode_list_lock
+ */
+/*
* This is needed for the following functions:
* - inode_has_buffers
* - invalidate_inode_buffers
@@ -84,6 +93,7 @@ static struct hlist_head *inode_hashtabl
* the i_state of an inode while it is in use..
*/
DEFINE_SPINLOCK(inode_lock);
+DEFINE_SPINLOCK(sb_inode_list_lock);

/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -344,7 +354,9 @@ static void dispose_list(struct list_hea

spin_lock(&inode_lock);
hlist_del_init(&inode->i_hash);
+ spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);

wake_up_inode(inode);
@@ -376,6 +388,7 @@ static int invalidate_list(struct list_h
* shrink_icache_memory() away.
*/
cond_resched_lock(&inode_lock);
+ cond_resched_lock(&sb_inode_list_lock);

next = next->next;
if (tmp == head)
@@ -413,9 +426,11 @@ int invalidate_inodes(struct super_block

down_write(&iprune_sem);
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
inotify_unmount_inodes(&sb->s_inodes);
fsnotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(&sb->s_inodes, &throw_away);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);

dispose_list(&throw_away);
@@ -603,7 +618,9 @@ __inode_add_to_lists(struct super_block
{
inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
+ spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
+ spin_unlock(&sb_inode_list_lock);
if (head)
hlist_add_head(&inode->i_hash, head);
}
@@ -1197,7 +1214,9 @@ void generic_delete_inode(struct inode *
const struct super_operations *op = inode->i_sb->s_op;

list_del_init(&inode->i_list);
+ spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
+ spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
@@ -1255,7 +1274,9 @@ int generic_detach_inode(struct inode *i
hlist_del_init(&inode->i_hash);
}
list_del_init(&inode->i_list);
+ spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
+ spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -429,6 +429,7 @@ void inotify_unmount_inodes(struct list_
* will be added since the umount has begun. Finally,
* iprune_mutex keeps shrink_icache_memory() away.
*/
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);

if (need_iput_tmp)
@@ -451,6 +452,7 @@ void inotify_unmount_inodes(struct list_
iput(inode);

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
}
EXPORT_SYMBOL_GPL(inotify_unmount_inodes);
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -884,6 +884,7 @@ static void add_dquot_ref(struct super_b
#endif

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
continue;
@@ -897,6 +898,7 @@ static void add_dquot_ref(struct super_b
continue;

__iget(inode);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);

iput(old_inode);
@@ -908,7 +910,9 @@ static void add_dquot_ref(struct super_b
* keep the reference and iput it later. */
old_inode = inode;
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
iput(old_inode);

@@ -988,6 +992,7 @@ static void remove_dquot_ref(struct supe
struct inode *inode;

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
/*
* We have to scan also I_NEW inodes because they can already
@@ -998,6 +1003,7 @@ static void remove_dquot_ref(struct supe
if (!IS_NOQUOTA(inode))
remove_inode_dquot_ref(inode, type, tofree_head);
}
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
}

Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -10,6 +10,7 @@
struct backing_dev_info;

extern spinlock_t inode_lock;
+extern spinlock_t sb_inode_list_lock;
extern struct list_head inode_in_use;
extern struct list_head inode_unused;

Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c
+++ linux-2.6/fs/notify/inode_mark.c
@@ -408,6 +408,7 @@ void fsnotify_unmount_inodes(struct list
* will be added since the umount has begun. Finally,
* iprune_mutex keeps shrink_icache_memory() away.
*/
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);

if (need_iput_tmp)
@@ -421,5 +422,6 @@ void fsnotify_unmount_inodes(struct list
iput(inode);

spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
}
n***@suse.de
2010-06-24 03:02:57 UTC
Permalink
Signed-off-by: Nick Piggin <***@suse.de>
---
fs/inode.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -603,27 +603,27 @@ static struct inode *find_inode(struct s
struct inode *inode = NULL;

repeat:
- spin_lock_bucket(b);
- hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
+ rcu_read_lock();
+ hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
if (inode->i_sb != sb)
continue;
- if (!spin_trylock(&inode->i_lock)) {
- spin_unlock_bucket(b);
- cpu_relax();
- goto repeat;
+ spin_lock(&inode->i_lock);
+ if (hlist_bl_unhashed(&inode->i_hash)) {
+ spin_unlock(&inode->i_lock);
+ continue;
}
if (!test(inode, data)) {
spin_unlock(&inode->i_lock);
continue;
}
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
- spin_unlock_bucket(b);
+ rcu_read_unlock();
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
- spin_unlock_bucket(b);
+ rcu_read_unlock();
return node ? inode : NULL;
}

@@ -639,25 +639,25 @@ static struct inode *find_inode_fast(str
struct inode *inode = NULL;

repeat:
- spin_lock_bucket(b);
- hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
+ rcu_read_lock();
+ hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
- if (!spin_trylock(&inode->i_lock)) {
- spin_unlock_bucket(b);
- cpu_relax();
- goto repeat;
+ spin_lock(&inode->i_lock);
+ if (hlist_bl_unhashed(&inode->i_hash)) {
+ spin_unlock(&inode->i_lock);
+ continue;
}
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
- spin_unlock_bucket(b);
+ rcu_read_unlock();
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
- spin_unlock_bucket(b);
+ rcu_read_unlock();
return node ? inode : NULL;
}
n***@suse.de
2010-06-24 03:02:33 UTC
Permalink
dcache_inode_lock can be replaced with per-inode locking. Use the existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/affs/amigaffs.c | 4 -
fs/dcache.c | 117 +++++++++++++++++++++++++-------------------
fs/exportfs/expfs.c | 12 ++--
fs/nfs/getroot.c | 4 -
fs/notify/fsnotify.c | 4 -
fs/notify/inotify/inotify.c | 4 -
fs/ocfs2/dcache.c | 4 -
fs/sysfs/dir.c | 6 +-
include/linux/dcache.h | 1
9 files changed, 89 insertions(+), 67 deletions(-)
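
The non-trivial part mentioned above shows up in dput() and the pruning
paths: d_inode may only be read while it is stable (under d_lock or with a
reference held), and since the order is i_lock before d_lock, the inode lock
then has to be taken with a trylock and a retry. A compact userspace sketch
of that shape, with invented names and pthread spinlocks standing in for the
kernel ones:

#include <pthread.h>
#include <sched.h>

struct inode_s {
	pthread_spinlock_t i_lock;
};

struct dentry_s {
	pthread_spinlock_t d_lock;
	struct inode_s *d_inode;	/* only stable under d_lock (or with a ref) */
};

/* Lock order: i_lock, then d_lock. We arrive needing d_lock first, so the
 * inode lock can only be taken opportunistically. */
static void kill_dentry(struct dentry_s *d)
{
	struct inode_s *inode;

relock:
	pthread_spin_lock(&d->d_lock);
	inode = d->d_inode;		/* read under d_lock: stable */
	if (inode && pthread_spin_trylock(&inode->i_lock) != 0) {
		pthread_spin_unlock(&d->d_lock);
		sched_yield();		/* cpu_relax() stand-in */
		goto relock;		/* re-read d_inode, it may have changed */
	}
	/* ... unhash, remove from the alias list, etc. ... */
	pthread_spin_unlock(&d->d_lock);
	if (inode)
		pthread_spin_unlock(&inode->i_lock);
}

int main(void)
{
	static struct inode_s i;
	static struct dentry_s d;

	pthread_spin_init(&i.i_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&d.d_lock, PTHREAD_PROCESS_PRIVATE);
	d.d_inode = &i;
	kill_dentry(&d);
	return 0;
}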

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -39,8 +39,8 @@

/*
* Usage:
- * dcache_inode_lock protects:
- * - i_dentry, d_alias, d_inode
+ * dcache->d_inode->i_lock protects:
+ * - i_dentry, d_alias, d_inode of aliases
* dcache_hash_bucket lock protects:
* - the dcache hash table
* dcache_lru_lock protects:
@@ -56,7 +56,7 @@
* - d_alias, d_inode
*
* Ordering:
- * dcache_inode_lock
+ * dentry->d_inode->i_lock
* dentry->d_lock
* dcache_lru_lock
* dcache_hash_bucket lock
@@ -75,12 +75,10 @@
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

EXPORT_SYMBOL(rename_lock);
-EXPORT_SYMBOL(dcache_inode_lock);

static struct kmem_cache *dentry_cache __read_mostly;

@@ -165,14 +163,13 @@ static void d_free(struct dentry *dentry
*/
static void dentry_iput(struct dentry * dentry)
__releases(dentry->d_lock)
- __releases(dcache_inode_lock)
{
struct inode *inode = dentry->d_inode;
if (inode) {
dentry->d_inode = NULL;
list_del_init(&dentry->d_alias);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
if (!inode->i_nlink)
fsnotify_inoderemove(inode);
if (dentry->d_op && dentry->d_op->d_iput)
@@ -181,7 +178,6 @@ static void dentry_iput(struct dentry *
iput(inode);
} else {
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_inode_lock);
}
}

@@ -252,7 +248,6 @@ static void dentry_lru_del_init(struct d
*/
static struct dentry *d_kill(struct dentry *dentry)
__releases(dentry->d_lock)
- __releases(dcache_inode_lock)
{
struct dentry *parent;

@@ -376,6 +371,7 @@ EXPORT_SYMBOL(dget_parent);
void dput(struct dentry *dentry)
{
struct dentry *parent;
+ struct inode *inode;

if (!dentry)
return;
@@ -412,14 +408,21 @@ repeat:
return;

kill_it:
- spin_unlock(&dentry->d_lock);
- spin_lock(&dcache_inode_lock);
+ inode = dentry->d_inode;
+ if (inode) {
+ if (!spin_trylock(&inode->i_lock)) {
relock:
- spin_lock(&dentry->d_lock);
+ spin_unlock(&dentry->d_lock);
+ cpu_relax();
+ spin_lock(&dentry->d_lock);
+ goto kill_it;
+ }
+ }
parent = dentry->d_parent;
if (parent && parent != dentry) {
if (!spin_trylock(&parent->d_lock)) {
- spin_unlock(&dentry->d_lock);
+ if (inode)
+ spin_unlock(&inode->i_lock);
goto relock;
}
}
@@ -429,7 +432,8 @@ relock:
spin_unlock(&dentry->d_lock);
if (parent && parent != dentry)
spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_inode_lock);
+ if (inode)
+ spin_unlock(&inode->i_lock);
return;
}
/* if dentry was on the d_lru list delete it from there */
@@ -547,9 +551,9 @@ struct dentry * d_find_alias(struct inod
struct dentry *de = NULL;

if (!list_empty(&inode->i_dentry)) {
- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
de = __d_find_alias(inode, 0);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
}
return de;
}
@@ -563,20 +567,20 @@ void d_prune_aliases(struct inode *inode
{
struct dentry *dentry;
restart:
- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
spin_lock(&dentry->d_lock);
if (!dentry->d_count) {
__dget_locked_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
dput(dentry);
goto restart;
}
spin_unlock(&dentry->d_lock);
}
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(d_prune_aliases);

@@ -599,8 +603,10 @@ static void prune_one_dentry(struct dent
*/
while (dentry) {
struct dentry *parent = NULL;
+ struct inode *inode = dentry->d_inode;

- spin_lock(&dcache_inode_lock);
+ if (inode)
+ spin_lock(&inode->i_lock);
again:
spin_lock(&dentry->d_lock);
if (dentry->d_parent && dentry != dentry->d_parent) {
@@ -615,7 +621,8 @@ again:
if (parent)
spin_unlock(&parent->d_lock);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_inode_lock);
+ if (inode)
+ spin_unlock(&inode->i_lock);
return;
}

@@ -684,10 +691,11 @@ restart:
}
spin_unlock(&dcache_lru_lock);

- spin_lock(&dcache_inode_lock);
again:
spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
while (!list_empty(&tmp)) {
+ struct inode *inode;
+
dentry = list_entry(tmp.prev, struct dentry, d_lru);

if (!spin_trylock(&dentry->d_lock)) {
@@ -705,10 +713,17 @@ again1:
spin_unlock(&dentry->d_lock);
continue;
}
+ inode = dentry->d_inode;
+ if (inode && !spin_trylock(&inode->i_lock)) {
+again2:
+ spin_unlock(&dentry->d_lock);
+ goto again1;
+ }
if (dentry->d_parent && dentry->d_parent != dentry) {
if (!spin_trylock(&dentry->d_parent->d_lock)) {
- spin_unlock(&dentry->d_lock);
- goto again1;
+ if (inode)
+ spin_unlock(&inode->i_lock);
+ goto again2;
}
}
__dentry_lru_del_init(dentry);
@@ -716,10 +731,8 @@ again1:

prune_one_dentry(dentry);
/* dentry->d_lock dropped */
- spin_lock(&dcache_inode_lock);
spin_lock(&dcache_lru_lock);
}
- spin_unlock(&dcache_inode_lock);

if (count == NULL && !list_empty(&sb->s_dentry_lru))
goto restart;
@@ -1287,9 +1300,11 @@ static void __d_instantiate(struct dentr
void d_instantiate(struct dentry *entry, struct inode * inode)
{
BUG_ON(!list_empty(&entry->d_alias));
- spin_lock(&dcache_inode_lock);
+ if (inode)
+ spin_lock(&inode->i_lock);
__d_instantiate(entry, inode);
- spin_unlock(&dcache_inode_lock);
+ if (inode)
+ spin_unlock(&inode->i_lock);
security_d_instantiate(entry, inode);
}
EXPORT_SYMBOL(d_instantiate);
@@ -1348,9 +1363,11 @@ struct dentry *d_instantiate_unique(stru

BUG_ON(!list_empty(&entry->d_alias));

- spin_lock(&dcache_inode_lock);
+ if (inode)
+ spin_lock(&inode->i_lock);
result = __d_instantiate_unique(entry, inode);
- spin_unlock(&dcache_inode_lock);
+ if (inode)
+ spin_unlock(&inode->i_lock);

if (!result) {
security_d_instantiate(entry, inode);
@@ -1431,10 +1448,10 @@ struct dentry *d_obtain_alias(struct ino
}
tmp->d_parent = tmp; /* make sure dput doesn't croak */

- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
res = __d_find_alias(inode, 0);
if (res) {
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
dput(tmp);
goto out_iput;
}
@@ -1448,7 +1465,7 @@ struct dentry *d_obtain_alias(struct ino
list_add(&tmp->d_alias, &inode->i_dentry);
hlist_bl_add_head(&tmp->d_hash, &inode->i_sb->s_anon); /* XXX: make s_anon a bl list */
spin_unlock(&tmp->d_lock);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);

return tmp;

@@ -1479,18 +1496,18 @@ struct dentry *d_splice_alias(struct ino
struct dentry *new = NULL;

if (inode && S_ISDIR(inode->i_mode)) {
- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
new = __d_find_alias(inode, 1);
if (new) {
BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
security_d_instantiate(new, inode);
d_move(new, dentry);
iput(inode);
} else {
- /* already got dcache_inode_lock, so d_add() by hand */
+ /* already got inode->i_lock, so d_add() by hand */
__d_instantiate(dentry, inode);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
security_d_instantiate(dentry, inode);
d_rehash(dentry);
}
@@ -1563,10 +1580,10 @@ struct dentry *d_add_ci(struct dentry *d
* Negative dentry: instantiate it unless the inode is a directory and
* already has a dentry.
*/
- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
__d_instantiate(found, inode);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
security_d_instantiate(found, inode);
return found;
}
@@ -1577,7 +1594,7 @@ struct dentry *d_add_ci(struct dentry *d
*/
new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
dget_locked(new);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
security_d_instantiate(found, inode);
d_move(new, found);
iput(inode);
@@ -1785,15 +1802,17 @@ EXPORT_SYMBOL(d_validate);

void d_delete(struct dentry * dentry)
{
+ struct inode *inode;
int isdir = 0;
/*
* Are we the only user?
*/
again:
spin_lock(&dentry->d_lock);
- isdir = S_ISDIR(dentry->d_inode->i_mode);
+ inode = dentry->d_inode;
+ isdir = S_ISDIR(inode->i_mode);
if (dentry->d_count == 1) {
- if (!spin_trylock(&dcache_inode_lock)) {
+ if (inode && !spin_trylock(&inode->i_lock)) {
spin_unlock(&dentry->d_lock);
cpu_relax();
goto again;
@@ -2034,6 +2053,7 @@ static struct dentry *__d_unalias(struct
{
struct mutex *m1 = NULL, *m2 = NULL;
struct dentry *ret;
+ struct inode *inode;

/* If alias and dentry share a parent, then no extra locks required */
if (alias->d_parent == dentry->d_parent)
@@ -2049,14 +2069,15 @@ static struct dentry *__d_unalias(struct
if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
goto out_err;
m1 = &dentry->d_sb->s_vfs_rename_mutex;
- if (!mutex_trylock(&alias->d_parent->d_inode->i_mutex))
+ inode = alias->d_parent->d_inode;
+ if (!mutex_trylock(&inode->i_mutex))
goto out_err;
- m2 = &alias->d_parent->d_inode->i_mutex;
+ m2 = &inode->i_mutex;
out_unalias:
d_move_locked(alias, dentry);
ret = alias;
out_err:
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
if (m2)
mutex_unlock(m2);
if (m1)
@@ -2127,7 +2148,7 @@ struct dentry *d_materialise_unique(stru
goto out_nolock;
}

- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);

if (S_ISDIR(inode->i_mode)) {
struct dentry *alias;
@@ -2162,7 +2183,7 @@ struct dentry *d_materialise_unique(stru
found:
_d_rehash(actual);
spin_unlock(&actual->d_lock);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
out_nolock:
if (actual == dentry) {
security_d_instantiate(dentry, inode);
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -190,7 +190,6 @@ d_iput: no no yes
#define DCACHE_CANT_MOUNT 0x0100
#define DCACHE_GENOCIDE 0x0200

-extern spinlock_t dcache_inode_lock;
extern seqlock_t rename_lock;

/**
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -181,7 +181,7 @@ static void set_dentry_child_flags(struc
{
struct dentry *alias;

- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
list_for_each_entry(alias, &inode->i_dentry, d_alias) {
struct dentry *child;

@@ -199,7 +199,7 @@ static void set_dentry_child_flags(struc
}
spin_unlock(&alias->d_lock);
}
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
}

/*
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -43,24 +43,26 @@ find_acceptable_alias(struct dentry *res
void *context)
{
struct dentry *dentry, *toput = NULL;
+ struct inode *inode;

if (acceptable(context, result))
return result;

- spin_lock(&dcache_inode_lock);
- list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
+ inode = result->d_inode;
+ spin_lock(&inode->i_lock);
+ list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
dget_locked(dentry);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
if (toput)
dput(toput);
if (dentry != result && acceptable(context, dentry)) {
dput(result);
return dentry;
}
- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
toput = dentry;
}
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);

if (toput)
dput(toput);
Index: linux-2.6/fs/affs/amigaffs.c
===================================================================
--- linux-2.6.orig/fs/affs/amigaffs.c
+++ linux-2.6/fs/affs/amigaffs.c
@@ -128,7 +128,7 @@ affs_fix_dcache(struct dentry *dentry, u
void *data = dentry->d_fsdata;
struct list_head *head, *next;

- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
head = &inode->i_dentry;
next = head->next;
while (next != head) {
@@ -139,7 +139,7 @@ affs_fix_dcache(struct dentry *dentry, u
}
next = next->next;
}
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
}


Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c
+++ linux-2.6/fs/nfs/getroot.c
@@ -64,11 +64,11 @@ static int nfs_superblock_set_dummy_root
* This again causes shrink_dcache_for_umount_subtree() to
* Oops, since the test for IS_ROOT() will fail.
*/
- spin_lock(&dcache_inode_lock);
+ spin_lock(&sb->s_root->d_inode->i_lock);
spin_lock(&sb->s_root->d_lock);
list_del_init(&sb->s_root->d_alias);
spin_unlock(&sb->s_root->d_lock);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&sb->s_root->d_inode->i_lock);
}
return 0;
}
Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -151,7 +151,7 @@ struct dentry *ocfs2_find_local_alias(st
struct list_head *p;
struct dentry *dentry = NULL;

- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
list_for_each(p, &inode->i_dentry) {
dentry = list_entry(p, struct dentry, d_alias);

@@ -169,7 +169,7 @@ struct dentry *ocfs2_find_local_alias(st
dentry = NULL;
}

- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);

return dentry;
}
Index: linux-2.6/fs/notify/fsnotify.c
===================================================================
--- linux-2.6.orig/fs/notify/fsnotify.c
+++ linux-2.6/fs/notify/fsnotify.c
@@ -53,7 +53,7 @@ void __fsnotify_update_child_dentry_flag
/* determine if the children should tell inode about their events */
watched = fsnotify_inode_watches_children(inode);

- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
/* run all of the dentries associated with this inode. Since this is a
* directory, there damn well better only be one item on this list */
list_for_each_entry(alias, &inode->i_dentry, d_alias) {
@@ -76,7 +76,7 @@ void __fsnotify_update_child_dentry_flag
}
spin_unlock(&alias->d_lock);
}
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
}

/* Notify this dentry's parent about a child's events. */
n***@suse.de
2010-06-24 03:02:41 UTC
Permalink
Protect inode->i_count with i_lock, rather than having it atomic.
A next step should be to move related operations together (e.g. folding the
refcount increment into d_instantiate, which would remove a lock/unlock cycle
on i_lock).
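
For reference, the transformation applied at each call site is essentially the
following (a minimal before/after sketch of the pattern used throughout this
patch, not a new helper it introduces):

        /* Before: i_count is an atomic_t, no lock needed for a reference. */
        atomic_inc(&inode->i_count);

        /* After: i_count is a plain unsigned int protected by i_lock. */
        spin_lock(&inode->i_lock);
        inode->i_count++;
        spin_unlock(&inode->i_lock);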

Signed-off-by: Nick Piggin <***@suse.de>
---
arch/powerpc/platforms/cell/spufs/file.c | 2 -
fs/affs/inode.c | 4 ++-
fs/afs/dir.c | 4 ++-
fs/anon_inodes.c | 4 ++-
fs/bfs/dir.c | 4 ++-
fs/block_dev.c | 15 ++++++++++--
fs/btrfs/inode.c | 4 ++-
fs/cifs/inode.c | 2 -
fs/coda/dir.c | 4 ++-
fs/exofs/inode.c | 12 +++++++---
fs/exofs/namei.c | 4 ++-
fs/ext2/namei.c | 4 ++-
fs/ext3/ialloc.c | 4 +--
fs/ext3/namei.c | 4 ++-
fs/ext4/ialloc.c | 4 +--
fs/ext4/namei.c | 4 ++-
fs/fs-writeback.c | 4 +--
fs/gfs2/ops_inode.c | 4 ++-
fs/hfsplus/dir.c | 4 ++-
fs/hpfs/inode.c | 2 -
fs/inode.c | 37 ++++++++++++++++++++-----------
fs/jffs2/dir.c | 8 +++++-
fs/jfs/jfs_txnmgr.c | 4 ++-
fs/jfs/namei.c | 4 ++-
fs/libfs.c | 4 ++-
fs/locks.c | 3 --
fs/minix/namei.c | 4 ++-
fs/namei.c | 7 ++++-
fs/nfs/dir.c | 4 ++-
fs/nfs/getroot.c | 4 ++-
fs/nfs/inode.c | 4 +--
fs/nilfs2/mdt.c | 2 -
fs/nilfs2/namei.c | 4 ++-
fs/notify/inode_mark.c | 22 +++++++++++-------
fs/notify/inotify/inotify.c | 28 +++++++++++++----------
fs/ntfs/super.c | 4 ++-
fs/ocfs2/namei.c | 4 ++-
fs/reiserfs/file.c | 4 +--
fs/reiserfs/namei.c | 4 ++-
fs/reiserfs/stree.c | 2 -
fs/sysv/namei.c | 4 ++-
fs/ubifs/dir.c | 4 ++-
fs/ubifs/super.c | 2 -
fs/udf/namei.c | 4 ++-
fs/ufs/namei.c | 4 ++-
fs/xfs/linux-2.6/xfs_iops.c | 4 ++-
fs/xfs/xfs_iget.c | 2 -
fs/xfs/xfs_inode.h | 6 +++--
include/linux/fs.h | 2 -
ipc/mqueue.c | 7 ++++-
kernel/futex.c | 4 ++-
mm/shmem.c | 4 ++-
52 files changed, 201 insertions(+), 96 deletions(-)

Index: linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/file.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
@@ -1549,7 +1549,7 @@ static int spufs_mfc_open(struct inode *
if (ctx->owner != current->mm)
return -EINVAL;

- if (atomic_read(&inode->i_count) != 1)
+ if (inode->i_count != 1)
return -EBUSY;

mutex_lock(&ctx->mapping_lock);
Index: linux-2.6/fs/affs/inode.c
===================================================================
--- linux-2.6.orig/fs/affs/inode.c
+++ linux-2.6/fs/affs/inode.c
@@ -380,7 +380,9 @@ affs_add_entry(struct inode *dir, struct
affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
mark_buffer_dirty_inode(inode_bh, inode);
inode->i_nlink = 2;
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
}
affs_fix_checksum(sb, bh);
mark_buffer_dirty_inode(bh, inode);
Index: linux-2.6/fs/afs/dir.c
===================================================================
--- linux-2.6.orig/fs/afs/dir.c
+++ linux-2.6/fs/afs/dir.c
@@ -1002,7 +1002,9 @@ static int afs_link(struct dentry *from,
if (ret < 0)
goto link_error;

- atomic_inc(&vnode->vfs_inode.i_count);
+ spin_lock(&vnode->vfs_inode.i_lock);
+ vnode->vfs_inode.i_count++;
+ spin_unlock(&vnode->vfs_inode.i_lock);
d_instantiate(dentry, &vnode->vfs_inode);
key_put(key);
_leave(" = 0");
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c
+++ linux-2.6/fs/anon_inodes.c
@@ -114,7 +114,9 @@ struct file *anon_inode_getfile(const ch
* so we can avoid doing an igrab() and we can use an open-coded
* atomic_inc().
*/
- atomic_inc(&anon_inode_inode->i_count);
+ spin_lock(&anon_inode_inode->i_lock);
+ anon_inode_inode->i_count++;
+ spin_unlock(&anon_inode_inode->i_lock);

path.dentry->d_op = &anon_inodefs_dentry_operations;
d_instantiate(path.dentry, anon_inode_inode);
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -549,7 +549,12 @@ EXPORT_SYMBOL(bdget);
*/
struct block_device *bdgrab(struct block_device *bdev)
{
- atomic_inc(&bdev->bd_inode->i_count);
+ struct inode *inode = bdev->bd_inode;
+
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
+
return bdev;
}

@@ -579,7 +584,9 @@ static struct block_device *bd_acquire(s
spin_lock(&bdev_lock);
bdev = inode->i_bdev;
if (bdev) {
- atomic_inc(&bdev->bd_inode->i_count);
+ spin_lock(&inode->i_lock);
+ bdev->bd_inode->i_count++;
+ spin_unlock(&inode->i_lock);
spin_unlock(&bdev_lock);
return bdev;
}
@@ -595,7 +602,9 @@ static struct block_device *bd_acquire(s
* So, we can access it via ->i_mapping always
* without igrab().
*/
- atomic_inc(&bdev->bd_inode->i_count);
+ spin_lock(&inode->i_lock);
+ bdev->bd_inode->i_count++;
+ spin_unlock(&inode->i_lock);
inode->i_bdev = bdev;
inode->i_mapping = bdev->bd_inode->i_mapping;
list_add(&inode->i_devices, &bdev->bd_inodes);
Index: linux-2.6/fs/ext2/namei.c
===================================================================
--- linux-2.6.orig/fs/ext2/namei.c
+++ linux-2.6/fs/ext2/namei.c
@@ -206,7 +206,9 @@ static int ext2_link (struct dentry * ol

inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

err = ext2_add_link(dentry, inode);
if (!err) {
Index: linux-2.6/fs/ext3/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext3/ialloc.c
+++ linux-2.6/fs/ext3/ialloc.c
@@ -100,9 +100,9 @@ void ext3_free_inode (handle_t *handle,
struct ext3_sb_info *sbi;
int fatal = 0, err;

- if (atomic_read(&inode->i_count) > 1) {
+ if (inode->i_count > 1) {
printk ("ext3_free_inode: inode has count=%d\n",
- atomic_read(&inode->i_count));
+ inode->i_count);
return;
}
if (inode->i_nlink) {
Index: linux-2.6/fs/ext3/namei.c
===================================================================
--- linux-2.6.orig/fs/ext3/namei.c
+++ linux-2.6/fs/ext3/namei.c
@@ -2261,7 +2261,9 @@ retry:

inode->i_ctime = CURRENT_TIME_SEC;
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

err = ext3_add_entry(handle, dentry, inode);
if (!err) {
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -427,7 +427,7 @@ writeback_single_inode(struct inode *ino
unsigned dirty;
int ret;

- if (!atomic_read(&inode->i_count))
+ if (!inode->i_count)
WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
else
WARN_ON(inode->i_state & I_WILL_FREE);
@@ -551,7 +551,7 @@ select_queue:
inode->i_state |= I_DIRTY_PAGES;
redirty_tail(inode);
}
- } else if (atomic_read(&inode->i_count)) {
+ } else if (inode->i_count) {
/*
* The inode is clean, inuse
*/
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -33,14 +33,13 @@
* inode_hash_lock protects:
* inode hash table, i_hash
* inode->i_lock protects:
- * i_state
+ * i_state, i_count
*
* Ordering:
* inode_lock
* sb_inode_list_lock
* inode->i_lock
- * inode_lock
- * inode_hash_lock
+ * inode_hash_lock
*/
/*
* This is needed for the following functions:
@@ -151,7 +150,7 @@ int inode_init_always(struct super_block
inode->i_sb = sb;
inode->i_blkbits = sb->s_blocksize_bits;
inode->i_flags = 0;
- atomic_set(&inode->i_count, 1);
+ inode->i_count = 1;
inode->i_op = &empty_iops;
inode->i_fop = &empty_fops;
inode->i_nlink = 1;
@@ -306,7 +305,8 @@ void __iget(struct inode *inode)
{
assert_spin_locked(&inode->i_lock);

- if (atomic_inc_return(&inode->i_count) != 1)
+ inode->i_count++;
+ if (inode->i_count > 1)
return;

if (!(inode->i_state & (I_DIRTY|I_SYNC)))
@@ -412,7 +412,7 @@ static int invalidate_list(struct list_h
continue;
}
invalidate_inode_buffers(inode);
- if (!atomic_read(&inode->i_count)) {
+ if (!inode->i_count) {
list_move(&inode->i_list, dispose);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
@@ -463,7 +463,7 @@ static int can_unuse(struct inode *inode
return 0;
if (inode_has_buffers(inode))
return 0;
- if (atomic_read(&inode->i_count))
+ if (inode->i_count)
return 0;
if (inode->i_data.nrpages)
return 0;
@@ -501,7 +501,7 @@ static void prune_icache(int nr_to_scan)
inode = list_entry(inode_unused.prev, struct inode, i_list);

spin_lock(&inode->i_lock);
- if (inode->i_state || atomic_read(&inode->i_count)) {
+ if (inode->i_state || inode->i_count) {
list_move(&inode->i_list, &inode_unused);
spin_unlock(&inode->i_lock);
continue;
@@ -1290,8 +1290,6 @@ void generic_delete_inode(struct inode *
{
const struct super_operations *op = inode->i_sb->s_op;

- spin_lock(&sb_inode_list_lock);
- spin_lock(&inode->i_lock);
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
@@ -1336,8 +1334,6 @@ int generic_detach_inode(struct inode *i
{
struct super_block *sb = inode->i_sb;

- spin_lock(&sb_inode_list_lock);
- spin_lock(&inode->i_lock);
if (!hlist_unhashed(&inode->i_hash)) {
if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_unused);
@@ -1436,8 +1432,24 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state == I_CLEAR);

- if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+retry:
+ spin_lock(&inode->i_lock);
+ if (inode->i_count == 1) {
+ if (!spin_trylock(&inode_lock)) {
+ spin_unlock(&inode->i_lock);
+ goto retry;
+ }
+ if (!spin_trylock(&sb_inode_list_lock)) {
+ spin_unlock(&inode_lock);
+ spin_unlock(&inode->i_lock);
+ goto retry;
+ }
+ inode->i_count--;
iput_final(inode);
+ } else {
+ inode->i_count--;
+ spin_unlock(&inode->i_lock);
+ }
}
}
EXPORT_SYMBOL(iput);
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -277,7 +277,9 @@ int simple_link(struct dentry *old_dentr

inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
dget(dentry);
d_instantiate(dentry, inode);
return 0;
Index: linux-2.6/fs/locks.c
===================================================================
--- linux-2.6.orig/fs/locks.c
+++ linux-2.6/fs/locks.c
@@ -1375,8 +1375,7 @@ int generic_setlease(struct file *filp,
if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
goto out;
if ((arg == F_WRLCK)
- && (dentry->d_count > 1
- || (atomic_read(&inode->i_count) > 1)))
+ && (dentry->d_count > 1 || inode->i_count > 1))
goto out;
}

Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2306,8 +2306,11 @@ static long do_unlinkat(int dfd, const c
if (nd.last.name[nd.last.len])
goto slashes;
inode = dentry->d_inode;
- if (inode)
- atomic_inc(&inode->i_count);
+ if (inode) {
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
+ }
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -1570,7 +1570,9 @@ nfs_link(struct dentry *old_dentry, stru
d_drop(dentry);
error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
if (error == 0) {
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_add(dentry, inode);
}
return error;
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c
+++ linux-2.6/fs/nfs/getroot.c
@@ -55,7 +55,9 @@ static int nfs_superblock_set_dummy_root
return -ENOMEM;
}
/* Circumvent igrab(): we know the inode is not being freed */
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
/*
* Ensure that this dentry is invisible to d_find_alias().
* Otherwise, it may be spliced into the tree by
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -404,23 +404,28 @@ void inotify_unmount_inodes(struct list_
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- if (!atomic_read(&inode->i_count))
+ if (!inode->i_count)
continue;

need_iput_tmp = need_iput;
need_iput = NULL;
/* In case inotify_remove_watch_locked() drops a reference. */
- if (inode != need_iput_tmp)
+ if (inode != need_iput_tmp) {
+ spin_lock(&inode->i_lock);
__iget(inode);
- else
+ spin_unlock(&inode->i_lock);
+ } else
need_iput_tmp = NULL;
/* In case the dropping of a reference would nuke next_i. */
- if ((&next_i->i_sb_list != list) &&
- atomic_read(&next_i->i_count) &&
- !(next_i->i_state & (I_CLEAR | I_FREEING |
- I_WILL_FREE))) {
- __iget(next_i);
- need_iput = next_i;
+ if (&next_i->i_sb_list != list) {
+ spin_lock(&next_i->i_lock);
+ if (next_i->i_count &&
+ !(next_i->i_state &
+ (I_CLEAR|I_FREEING|I_WILL_FREE))) {
+ __iget(next_i);
+ need_iput = next_i;
+ }
+ spin_unlock(&next_i->i_lock);
}

/*
@@ -439,11 +444,10 @@ void inotify_unmount_inodes(struct list_
mutex_lock(&inode->inotify_mutex);
watches = &inode->inotify_watches;
list_for_each_entry_safe(watch, next_w, watches, i_list) {
- struct inotify_handle *ih= watch->ih;
+ struct inotify_handle *ih = watch->ih;
get_inotify_watch(watch);
mutex_lock(&ih->mutex);
- ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0,
- NULL, NULL);
+ ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0, NULL, NULL);
inotify_remove_watch_locked(ih, watch);
mutex_unlock(&ih->mutex);
put_inotify_watch(watch);
Index: linux-2.6/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_iops.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_iops.c
@@ -360,7 +360,9 @@ xfs_vn_link(
if (unlikely(error))
return -error;

- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(dentry, inode);
return 0;
}
Index: linux-2.6/fs/xfs/xfs_inode.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_inode.h
+++ linux-2.6/fs/xfs/xfs_inode.h
@@ -483,8 +483,10 @@ void xfs_mark_inode_dirty_sync(xfs_inod

#define IHOLD(ip) \
do { \
- ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
- atomic_inc(&(VFS_I(ip)->i_count)); \
+ spin_lock(&VFS_I(ip)->i_lock); \
+ ASSERT(VFS_I(ip)->i_count > 0) ; \
+ VFS_I(ip)->i_count++; \
+ spin_unlock(&VFS_I(ip)->i_lock); \
trace_xfs_ihold(ip, _THIS_IP_); \
} while (0)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -729,7 +729,7 @@ struct inode {
struct list_head i_sb_list;
struct list_head i_dentry;
unsigned long i_ino;
- atomic_t i_count;
+ unsigned int i_count;
unsigned int i_nlink;
uid_t i_uid;
gid_t i_gid;
Index: linux-2.6/ipc/mqueue.c
===================================================================
--- linux-2.6.orig/ipc/mqueue.c
+++ linux-2.6/ipc/mqueue.c
@@ -769,8 +769,11 @@ SYSCALL_DEFINE1(mq_unlink, const char __
}

inode = dentry->d_inode;
- if (inode)
- atomic_inc(&inode->i_count);
+ if (inode) {
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
+ }
err = mnt_want_write(ipc_ns->mq_mnt);
if (err)
goto out_err;
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -168,7 +168,9 @@ static void get_futex_key_refs(union fut

switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
- atomic_inc(&key->shared.inode->i_count);
+ spin_lock(&key->shared.inode->i_lock);
+ key->shared.inode->i_count++;
+ spin_unlock(&key->shared.inode->i_lock);
break;
case FUT_OFF_MMSHARED:
atomic_inc(&key->private.mm->mm_count);
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1877,7 +1877,9 @@ static int shmem_link(struct dentry *old
dir->i_size += BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
- atomic_inc(&inode->i_count); /* New dentry reference */
+ spin_lock(&inode->i_lock);
+ inode->i_count++; /* New dentry reference */
+ spin_unlock(&inode->i_lock);
dget(dentry); /* Extra pinning count for the created dentry */
d_instantiate(dentry, inode);
out:
Index: linux-2.6/fs/bfs/dir.c
===================================================================
--- linux-2.6.orig/fs/bfs/dir.c
+++ linux-2.6/fs/bfs/dir.c
@@ -176,7 +176,9 @@ static int bfs_link(struct dentry *old,
inc_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(new, inode);
mutex_unlock(&info->bfs_lock);
return 0;
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c
+++ linux-2.6/fs/btrfs/inode.c
@@ -4753,7 +4753,9 @@ static int btrfs_link(struct dentry *old
}

btrfs_set_trans_block_group(trans, dir);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

err = btrfs_add_nondir(trans, dentry, inode, 1, index);

Index: linux-2.6/fs/coda/dir.c
===================================================================
--- linux-2.6.orig/fs/coda/dir.c
+++ linux-2.6/fs/coda/dir.c
@@ -303,7 +303,9 @@ static int coda_link(struct dentry *sour
}

coda_dir_update_mtime(dir_inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(de, inode);
inc_nlink(inode);

Index: linux-2.6/fs/exofs/inode.c
===================================================================
--- linux-2.6.orig/fs/exofs/inode.c
+++ linux-2.6/fs/exofs/inode.c
@@ -1124,7 +1124,9 @@ static void create_done(struct exofs_io_

set_obj_created(oi);

- atomic_dec(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count--;
+ spin_unlock(&inode->i_lock);
wake_up(&oi->i_wq);
}

@@ -1177,14 +1179,18 @@ struct inode *exofs_new_inode(struct ino
/* increment the refcount so that the inode will still be around when we
* reach the callback
*/
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

ios->done = create_done;
ios->private = inode;
ios->cred = oi->i_cred;
ret = exofs_sbi_create(ios);
if (ret) {
- atomic_dec(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count--;
+ spin_unlock(&inode->i_lock);
exofs_put_io_state(ios);
return ERR_PTR(ret);
}
Index: linux-2.6/fs/exofs/namei.c
===================================================================
--- linux-2.6.orig/fs/exofs/namei.c
+++ linux-2.6/fs/exofs/namei.c
@@ -153,7 +153,9 @@ static int exofs_link(struct dentry *old

inode->i_ctime = CURRENT_TIME;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

return exofs_add_nondir(dentry, inode);
}
Index: linux-2.6/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/ialloc.c
+++ linux-2.6/fs/ext4/ialloc.c
@@ -189,9 +189,9 @@ void ext4_free_inode(handle_t *handle, s
struct ext4_sb_info *sbi;
int fatal = 0, err, count, cleared;

- if (atomic_read(&inode->i_count) > 1) {
+ if (inode->i_count > 1) {
printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
- atomic_read(&inode->i_count));
+ inode->i_count);
return;
}
if (inode->i_nlink) {
Index: linux-2.6/fs/ext4/namei.c
===================================================================
--- linux-2.6.orig/fs/ext4/namei.c
+++ linux-2.6/fs/ext4/namei.c
@@ -2340,7 +2340,9 @@ retry:

inode->i_ctime = ext4_current_time(inode);
ext4_inc_count(handle, inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

err = ext4_add_entry(handle, dentry, inode);
if (!err) {
Index: linux-2.6/fs/gfs2/ops_inode.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_inode.c
+++ linux-2.6/fs/gfs2/ops_inode.c
@@ -253,7 +253,9 @@ out_parent:
gfs2_holder_uninit(ghs);
gfs2_holder_uninit(ghs + 1);
if (!error) {
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(dentry, inode);
mark_inode_dirty(inode);
}
Index: linux-2.6/fs/hfsplus/dir.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/dir.c
+++ linux-2.6/fs/hfsplus/dir.c
@@ -301,7 +301,9 @@ static int hfsplus_link(struct dentry *s

inc_nlink(inode);
hfsplus_instantiate(dst_dentry, inode, cnid);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
HFSPLUS_SB(sb).file_count++;
Index: linux-2.6/fs/hpfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hpfs/inode.c
+++ linux-2.6/fs/hpfs/inode.c
@@ -183,7 +183,7 @@ void hpfs_write_inode(struct inode *i)
struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
struct inode *parent;
if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
- if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+ if (hpfs_inode->i_rddir_off && !i->i_count) {
if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
kfree(hpfs_inode->i_rddir_off);
hpfs_inode->i_rddir_off = NULL;
Index: linux-2.6/fs/jffs2/dir.c
===================================================================
--- linux-2.6.orig/fs/jffs2/dir.c
+++ linux-2.6/fs/jffs2/dir.c
@@ -290,7 +290,9 @@ static int jffs2_link (struct dentry *ol
mutex_unlock(&f->sem);
d_instantiate(dentry, old_dentry->d_inode);
dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
- atomic_inc(&old_dentry->d_inode->i_count);
+ spin_lock(&old_dentry->d_inode->i_lock);
+ old_dentry->d_inode->i_count++;
+ spin_unlock(&old_dentry->d_inode->i_lock);
}
return ret;
}
@@ -871,7 +873,9 @@ static int jffs2_rename (struct inode *o
printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
/* Might as well let the VFS know */
d_instantiate(new_dentry, old_dentry->d_inode);
- atomic_inc(&old_dentry->d_inode->i_count);
+ spin_lock(&old_dentry->d_inode->i_lock);
+ old_dentry->d_inode->i_count++;
+ spin_unlock(&old_dentry->d_inode->i_lock);
new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
return ret;
}
Index: linux-2.6/fs/jfs/jfs_txnmgr.c
===================================================================
--- linux-2.6.orig/fs/jfs/jfs_txnmgr.c
+++ linux-2.6/fs/jfs/jfs_txnmgr.c
@@ -1279,7 +1279,9 @@ int txCommit(tid_t tid, /* transaction
* lazy commit thread finishes processing
*/
if (tblk->xflag & COMMIT_DELETE) {
- atomic_inc(&tblk->u.ip->i_count);
+ spin_lock(&tblk->u.ip->i_lock);
+ tblk->u.ip->i_count++;
+ spin_unlock(&tblk->u.ip->i_lock);
/*
* Avoid a rare deadlock
*
Index: linux-2.6/fs/jfs/namei.c
===================================================================
--- linux-2.6.orig/fs/jfs/namei.c
+++ linux-2.6/fs/jfs/namei.c
@@ -839,7 +839,9 @@ static int jfs_link(struct dentry *old_d
ip->i_ctime = CURRENT_TIME;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
mark_inode_dirty(dir);
- atomic_inc(&ip->i_count);
+ spin_lock(&ip->i_lock);
+ ip->i_count++;
+ spin_unlock(&ip->i_lock);

iplist[0] = ip;
iplist[1] = dir;
Index: linux-2.6/fs/minix/namei.c
===================================================================
--- linux-2.6.orig/fs/minix/namei.c
+++ linux-2.6/fs/minix/namei.c
@@ -101,7 +101,9 @@ static int minix_link(struct dentry * ol

inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
return add_nondir(dentry, inode);
}

Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -377,7 +377,7 @@ nfs_fhget(struct super_block *sb, struct
dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
inode->i_sb->s_id,
(long long)NFS_FILEID(inode),
- atomic_read(&inode->i_count));
+ inode->i_count);

out:
return inode;
@@ -1123,7 +1123,7 @@ static int nfs_update_inode(struct inode

dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
__func__, inode->i_sb->s_id, inode->i_ino,
- atomic_read(&inode->i_count), fattr->valid);
+ inode->i_count, fattr->valid);

if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
goto out_fileid;
Index: linux-2.6/fs/nilfs2/mdt.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/mdt.c
+++ linux-2.6/fs/nilfs2/mdt.c
@@ -479,7 +479,7 @@ nilfs_mdt_new_common(struct the_nilfs *n
inode->i_sb = sb; /* sb may be NULL for some meta data files */
inode->i_blkbits = nilfs->ns_blocksize_bits;
inode->i_flags = 0;
- atomic_set(&inode->i_count, 1);
+ inode->i_count = 1;
inode->i_nlink = 1;
inode->i_ino = ino;
inode->i_mode = S_IFREG;
Index: linux-2.6/fs/nilfs2/namei.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/namei.c
+++ linux-2.6/fs/nilfs2/namei.c
@@ -219,7 +219,9 @@ static int nilfs_link(struct dentry *old

inode->i_ctime = CURRENT_TIME;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

err = nilfs_add_nondir(dentry, inode);
if (!err)
Index: linux-2.6/fs/ocfs2/namei.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/namei.c
+++ linux-2.6/fs/ocfs2/namei.c
@@ -722,7 +722,9 @@ static int ocfs2_link(struct dentry *old
goto out_commit;
}

- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
dentry->d_op = &ocfs2_dentry_ops;
d_instantiate(dentry, inode);

Index: linux-2.6/fs/reiserfs/file.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/file.c
+++ linux-2.6/fs/reiserfs/file.c
@@ -39,7 +39,7 @@ static int reiserfs_file_release(struct
BUG_ON(!S_ISREG(inode->i_mode));

/* fast out for when nothing needs to be done */
- if ((atomic_read(&inode->i_count) > 1 ||
+ if ((inode->i_count > 1 ||
!(REISERFS_I(inode)->i_flags & i_pack_on_close_mask) ||
!tail_has_to_be_packed(inode)) &&
REISERFS_I(inode)->i_prealloc_count <= 0) {
@@ -94,7 +94,7 @@ static int reiserfs_file_release(struct
if (!err)
err = jbegin_failure;

- if (!err && atomic_read(&inode->i_count) <= 1 &&
+ if (!err && inode->i_count <= 1 &&
(REISERFS_I(inode)->i_flags & i_pack_on_close_mask) &&
tail_has_to_be_packed(inode)) {
/* if regular file is released by last holder and it has been
Index: linux-2.6/fs/reiserfs/namei.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/namei.c
+++ linux-2.6/fs/reiserfs/namei.c
@@ -1156,7 +1156,9 @@ static int reiserfs_link(struct dentry *
inode->i_ctime = CURRENT_TIME_SEC;
reiserfs_update_sd(&th, inode);

- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(dentry, inode);
retval = journal_end(&th, dir->i_sb, jbegin_count);
reiserfs_write_unlock(dir->i_sb);
Index: linux-2.6/fs/reiserfs/stree.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/stree.c
+++ linux-2.6/fs/reiserfs/stree.c
@@ -1477,7 +1477,7 @@ static int maybe_indirect_to_direct(stru
** reading in the last block. The user will hit problems trying to
** read the file, but for now we just skip the indirect2direct
*/
- if (atomic_read(&inode->i_count) > 1 ||
+ if (inode->i_count > 1 ||
!tail_has_to_be_packed(inode) ||
!page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
/* leave tail in an unformatted node */
Index: linux-2.6/fs/sysv/namei.c
===================================================================
--- linux-2.6.orig/fs/sysv/namei.c
+++ linux-2.6/fs/sysv/namei.c
@@ -126,7 +126,9 @@ static int sysv_link(struct dentry * old

inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

return add_nondir(dentry, inode);
}
Index: linux-2.6/fs/ubifs/dir.c
===================================================================
--- linux-2.6.orig/fs/ubifs/dir.c
+++ linux-2.6/fs/ubifs/dir.c
@@ -550,7 +550,9 @@ static int ubifs_link(struct dentry *old

lock_2_inodes(dir, inode);
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
inode->i_ctime = ubifs_current_time(inode);
dir->i_size += sz_change;
dir_ui->ui_size = dir->i_size;
Index: linux-2.6/fs/ubifs/super.c
===================================================================
--- linux-2.6.orig/fs/ubifs/super.c
+++ linux-2.6/fs/ubifs/super.c
@@ -342,7 +342,7 @@ static void ubifs_delete_inode(struct in
goto out;

dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
- ubifs_assert(!atomic_read(&inode->i_count));
+ ubifs_assert(!inode->i_count);
ubifs_assert(inode->i_nlink == 0);

truncate_inode_pages(&inode->i_data, 0);
Index: linux-2.6/fs/udf/namei.c
===================================================================
--- linux-2.6.orig/fs/udf/namei.c
+++ linux-2.6/fs/udf/namei.c
@@ -1101,7 +1101,9 @@ static int udf_link(struct dentry *old_d
inc_nlink(inode);
inode->i_ctime = current_fs_time(inode->i_sb);
mark_inode_dirty(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(dentry, inode);
unlock_kernel();

Index: linux-2.6/fs/ufs/namei.c
===================================================================
--- linux-2.6.orig/fs/ufs/namei.c
+++ linux-2.6/fs/ufs/namei.c
@@ -180,7 +180,9 @@ static int ufs_link (struct dentry * old

inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);

error = ufs_add_nondir(dentry, inode);
unlock_kernel();
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c
+++ linux-2.6/fs/notify/inode_mark.c
@@ -382,24 +382,30 @@ void fsnotify_unmount_inodes(struct list
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- if (!atomic_read(&inode->i_count))
+ if (!inode->i_count)
continue;

need_iput_tmp = need_iput;
need_iput = NULL;

/* In case fsnotify_inode_delete() drops a reference. */
- if (inode != need_iput_tmp)
+ if (inode != need_iput_tmp) {
+ spin_lock(&inode->i_lock);
__iget(inode);
- else
+ spin_unlock(&inode->i_lock);
+ } else
need_iput_tmp = NULL;

/* In case the dropping of a reference would nuke next_i. */
- if ((&next_i->i_sb_list != list) &&
- atomic_read(&next_i->i_count) &&
- !(next_i->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE))) {
- __iget(next_i);
- need_iput = next_i;
+ if (&next_i->i_sb_list != list) {
+ spin_lock(&next_i->i_lock);
+ if (next_i->i_count &&
+ !(next_i->i_state &
+ (I_CLEAR | I_FREEING | I_WILL_FREE))) {
+ __iget(next_i);
+ need_iput = next_i;
+ }
+ spin_unlock(&next_i->i_lock);
}

/*
Index: linux-2.6/fs/ntfs/super.c
===================================================================
--- linux-2.6.orig/fs/ntfs/super.c
+++ linux-2.6/fs/ntfs/super.c
@@ -2930,7 +2930,9 @@ static int ntfs_fill_super(struct super_
}
if ((sb->s_root = d_alloc_root(vol->root_ino))) {
/* We increment i_count simulating an ntfs_iget(). */
- atomic_inc(&vol->root_ino->i_count);
+ spin_lock(&vol->root_ino->i_lock);
+ vol->root_ino->i_count++;
+ spin_unlock(&vol->root_ino->i_lock);
ntfs_debug("Exiting, status successful.");
/* Release the default upcase if it has no users. */
mutex_lock(&ntfs_lock);
Index: linux-2.6/fs/cifs/inode.c
===================================================================
--- linux-2.6.orig/fs/cifs/inode.c
+++ linux-2.6/fs/cifs/inode.c
@@ -1612,7 +1612,7 @@ int cifs_revalidate_dentry(struct dentry
}

cFYI(1, "Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
- "jiffies %ld", full_path, inode, inode->i_count.counter,
+ "jiffies %ld", full_path, inode, inode->i_count,
dentry, dentry->d_time, jiffies);

if (CIFS_SB(sb)->tcon->unix_ext)
Index: linux-2.6/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_trace.h
+++ linux-2.6/fs/xfs/linux-2.6/xfs_trace.h
@@ -576,7 +576,7 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev;
__entry->ino = ip->i_ino;
- __entry->count = atomic_read(&VFS_I(ip)->i_count);
+ __entry->count = VFS_I(ip)->i_count;
__entry->pincount = atomic_read(&ip->i_pincount);
__entry->caller_ip = caller_ip;
),
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c
+++ linux-2.6/net/socket.c
@@ -386,7 +386,9 @@ static int sock_alloc_file(struct socket
&socket_file_ops);
if (unlikely(!file)) {
/* drop dentry, keep inode */
- atomic_inc(&path.dentry->d_inode->i_count);
+ spin_lock(&path.dentry->d_inode->i_lock);
+ path.dentry->d_inode->i_count++;
+ spin_unlock(&path.dentry->d_inode->i_lock);
path_put(&path);
put_unused_fd(fd);
return -ENFILE;
n***@suse.de
2010-06-24 03:02:35 UTC
Permalink
The nr_dentry stat costs a globally touched cacheline and an atomic operation
twice over the lifetime of each dentry. It exists only for the benefit of
userspace. We could make it a per-cpu counter or similar, but since it is only
accessed via proc, we could instead use slab stats.

XXX: must implement slab routines to return stats for a single cache, and
implement the proc handler.
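
Should slab stats turn out not to fit, the per-cpu counter alternative
mentioned above could look roughly like the sketch below. This is an
illustration only; nr_dentry_count and get_nr_dentry() are hypothetical names,
not part of this patch.

        static DEFINE_PER_CPU(long, nr_dentry_count);

        /* Cheap, contention-free bump at dentry allocation/free time. */
        static inline void dentry_stat_inc(void)
        {
                this_cpu_inc(nr_dentry_count);
        }

        /* Fold the per-cpu counters only when /proc is actually read. */
        static long get_nr_dentry(void)
        {
                long sum = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        sum += per_cpu(nr_dentry_count, cpu);
                return sum < 0 ? 0 : sum;
        }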

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 5 +----
include/linux/dcache.h | 2 +-
2 files changed, 2 insertions(+), 5 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -105,7 +105,7 @@ static struct dcache_hash_bucket *dentry

/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
- .nr_dentry = ATOMIC_INIT(0),
+ .nr_dentry = 0,
.age_limit = 45,
};

@@ -146,7 +146,6 @@ static void d_callback(struct rcu_head *
*/
static void d_free(struct dentry *dentry)
{
- atomic_dec(&dentry_stat.nr_dentry);
BUG_ON(dentry->d_count);
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
@@ -1249,8 +1248,6 @@ struct dentry *d_alloc(struct dentry * p
spin_unlock(&parent->d_lock);
}

- atomic_inc(&dentry_stat.nr_dentry);
-
return dentry;
}
EXPORT_SYMBOL(d_alloc);
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -38,7 +38,7 @@ struct qstr {
};

struct dentry_stat_t {
- atomic_t nr_dentry;
+ int nr_dentry; /* unused */
int nr_unused; /* protected by dcache_lru_lock */
int age_limit; /* age in seconds */
int want_pages; /* pages requested by system */
n***@suse.de
2010-06-24 03:02:25 UTC
Permalink
Protect d_unhashed(dentry) condition with d_lock.
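
The shape of the change, taken from the dcache_dir_lseek() hunk below, is:

        /* Before: racy against a concurrent __d_drop() */
        if (!d_unhashed(next) && next->d_inode)
                n--;

        /* After: d_flags and d_inode are sampled under next->d_lock */
        spin_lock(&next->d_lock);
        if (simple_positive(next))
                n--;
        spin_unlock(&next->d_lock);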

Signed-off-by: Nick Piggin <***@suse.de>
---
arch/powerpc/platforms/cell/spufs/inode.c | 3 +
fs/configfs/configfs_internal.h | 2 +
fs/dcache.c | 58 ++++++++++++++++++++++++------
fs/libfs.c | 29 ++++++++++-----
fs/ocfs2/dcache.c | 5 ++
fs/seq_file.c | 3 +
security/tomoyo/realpath.c | 2 +
7 files changed, 81 insertions(+), 21 deletions(-)

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -16,6 +16,11 @@

#include <asm/uaccess.h>

+static inline int simple_positive(struct dentry *dentry)
+{
+ return dentry->d_inode && !d_unhashed(dentry);
+}
+
int simple_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat)
{
@@ -100,8 +105,10 @@ loff_t dcache_dir_lseek(struct file *fil
while (n && p != &file->f_path.dentry->d_subdirs) {
struct dentry *next;
next = list_entry(p, struct dentry, d_u.d_child);
- if (!d_unhashed(next) && next->d_inode)
+ spin_lock(&next->d_lock);
+ if (simple_positive(next))
n--;
+ spin_unlock(&next->d_lock);
p = p->next;
}
list_add_tail(&cursor->d_u.d_child, p);
@@ -155,9 +162,13 @@ int dcache_readdir(struct file * filp, v
for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
struct dentry *next;
next = list_entry(p, struct dentry, d_u.d_child);
- if (d_unhashed(next) || !next->d_inode)
+ spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
+ if (!simple_positive(next)) {
+ spin_unlock(&next->d_lock);
continue;
+ }

+ spin_unlock(&next->d_lock);
spin_unlock(&dcache_lock);
if (filldir(dirent, next->d_name.name,
next->d_name.len, filp->f_pos,
@@ -262,20 +273,20 @@ int simple_link(struct dentry *old_dentr
return 0;
}

-static inline int simple_positive(struct dentry *dentry)
-{
- return dentry->d_inode && !d_unhashed(dentry);
-}
-
int simple_empty(struct dentry *dentry)
{
struct dentry *child;
int ret = 0;

spin_lock(&dcache_lock);
- list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
- if (simple_positive(child))
+ list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+ spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+ if (simple_positive(child)) {
+ spin_unlock(&child->d_lock);
goto out;
+ }
+ spin_unlock(&child->d_lock);
+ }
ret = 1;
out:
spin_unlock(&dcache_lock);
Index: linux-2.6/fs/seq_file.c
===================================================================
--- linux-2.6.orig/fs/seq_file.c
+++ linux-2.6/fs/seq_file.c
@@ -6,10 +6,13 @@
*/

#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/module.h>
#include <linux/seq_file.h>
#include <linux/slab.h>

+#include "internal.h"
+
#include <asm/uaccess.h>
#include <asm/page.h>

@@ -463,7 +466,9 @@ int seq_path_root(struct seq_file *m, st
char *p;

spin_lock(&dcache_lock);
+ br_read_lock(vfsmount_lock);
p = __d_path(path, root, buf, size);
+ br_read_unlock(vfsmount_lock);
spin_unlock(&dcache_lock);
res = PTR_ERR(p);
if (!IS_ERR(p)) {
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -46,6 +46,7 @@
* - d_name
* - d_lru
* - d_count
+ * - d_unhashed()
*
* Ordering:
* dcache_lock
@@ -53,6 +54,13 @@
* dcache_lru_lock
* dcache_hash_lock
*
+ * If there is an ancestor relationship:
+ * dentry->d_parent->...->d_parent->d_lock
+ * ...
+ * dentry->d_parent->d_lock
+ * dentry->d_lock
+ *
+ * If no ancestor relationship:
* if (dentry1 < dentry2)
* dentry1->d_lock
* dentry2->d_lock
@@ -334,7 +342,9 @@ int d_invalidate(struct dentry * dentry)
* If it's already been dropped, return OK.
*/
spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
if (d_unhashed(dentry)) {
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return 0;
}
@@ -343,9 +353,11 @@ int d_invalidate(struct dentry * dentry)
* to get rid of unused child entries.
*/
if (!list_empty(&dentry->d_subdirs)) {
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
shrink_dcache_parent(dentry);
spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
}

/*
@@ -358,7 +370,6 @@ int d_invalidate(struct dentry * dentry)
* we might still populate it if it was a
* working directory or similar).
*/
- spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
spin_unlock(&dentry->d_lock);
@@ -457,15 +468,18 @@ static struct dentry * __d_find_alias(st
next = tmp->next;
prefetch(next);
alias = list_entry(tmp, struct dentry, d_alias);
+ spin_lock(&alias->d_lock);
if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
if (IS_ROOT(alias) &&
(alias->d_flags & DCACHE_DISCONNECTED))
discon_alias = alias;
else if (!want_discon) {
- __dget_locked(alias);
+ __dget_locked_dlock(alias);
+ spin_unlock(&alias->d_lock);
return alias;
}
}
+ spin_unlock(&alias->d_lock);
}
if (discon_alias)
__dget_locked(discon_alias);
@@ -750,8 +764,8 @@ static void shrink_dcache_for_umount_sub
spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
dentry_lru_del_init(dentry);
- spin_unlock(&dentry->d_lock);
__d_drop(dentry);
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);

for (;;) {
@@ -766,8 +780,8 @@ static void shrink_dcache_for_umount_sub
d_u.d_child) {
spin_lock(&loop->d_lock);
dentry_lru_del_init(loop);
- spin_unlock(&loop->d_lock);
__d_drop(loop);
+ spin_unlock(&loop->d_lock);
cond_resched_lock(&dcache_lock);
}
spin_unlock(&dcache_lock);
@@ -1788,7 +1802,10 @@ static void d_move_locked(struct dentry
/*
* XXXX: do we really need to take target->d_lock?
*/
- if (target < dentry) {
+ if (d_ancestor(dentry, target)) {
+ spin_lock(&dentry->d_lock);
+ spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
+ } else if (d_ancestor(target, dentry) || target < dentry) {
spin_lock(&target->d_lock);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
} else {
@@ -2046,7 +2063,8 @@ static int prepend_name(char **buffer, i
* Returns a pointer into the buffer or an error code if the
* path was too long.
*
- * "buflen" should be positive. Caller holds the dcache_lock.
+ * "buflen" should be positive. Caller holds the dcache_lock and
+ * path->dentry->d_lock.
*
* If path is not reachable from the supplied root, then the value of
* root is changed (without modifying refcounts).
@@ -2059,8 +2077,9 @@ char *__d_path(const struct path *path,
char *end = buffer + buflen;
char *retval;

- br_read_lock(vfsmount_lock);
prepend(&end, &buflen, "\0", 1);
+ spin_lock(&dentry->d_lock);
+unlinked:
if (d_unlinked(dentry) &&
(prepend(&end, &buflen, " (deleted)", 10) != 0))
goto Elong;
@@ -2072,7 +2091,7 @@ char *__d_path(const struct path *path,
*retval = '/';

for (;;) {
- struct dentry * parent;
+ struct dentry *parent;

if (dentry == root->dentry && vfsmnt == root->mnt)
break;
@@ -2081,8 +2100,10 @@ char *__d_path(const struct path *path,
if (vfsmnt->mnt_parent == vfsmnt) {
goto global_root;
}
+ spin_unlock(&dentry->d_lock);
dentry = vfsmnt->mnt_mountpoint;
vfsmnt = vfsmnt->mnt_parent;
+ spin_lock(&dentry->d_lock); /* can't get unlinked because locked vfsmount */
continue;
}
parent = dentry->d_parent;
@@ -2091,11 +2112,14 @@ char *__d_path(const struct path *path,
(prepend(&end, &buflen, "/", 1) != 0))
goto Elong;
retval = end;
+ spin_unlock(&dentry->d_lock);
dentry = parent;
+ if (d_unlinked(dentry))
+ goto unlinked;
}

out:
- br_read_unlock(vfsmount_lock);
+ spin_unlock(&dentry->d_lock);
return retval;

global_root:
@@ -2147,10 +2171,14 @@ char *d_path(const struct path *path, ch
root = current->fs->root;
path_get(&root);
spin_unlock(&current->fs->lock);
+
spin_lock(&dcache_lock);
+ br_read_lock(vfsmount_lock);
tmp = root;
res = __d_path(path, &tmp, buf, buflen);
+ br_read_unlock(vfsmount_lock);
spin_unlock(&dcache_lock);
+
path_put(&root);
return res;
}
@@ -2186,7 +2214,9 @@ char *dentry_path(struct dentry *dentry,
char *retval;

spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
prepend(&end, &buflen, "\0", 1);
+unlinked:
if (d_unlinked(dentry) &&
(prepend(&end, &buflen, "//deleted", 9) != 0))
goto Elong;
@@ -2205,11 +2235,17 @@ char *dentry_path(struct dentry *dentry,
goto Elong;

retval = end;
+ spin_unlock(&dentry->d_lock);
dentry = parent;
+ spin_lock(&dentry->d_lock);
+ if (d_unlinked(dentry))
+ goto unlinked;
}
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return retval;
Elong:
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return ERR_PTR(-ENAMETOOLONG);
}
@@ -2250,12 +2286,17 @@ SYSCALL_DEFINE2(getcwd, char __user *, b

error = -ENOENT;
spin_lock(&dcache_lock);
+ br_read_lock(vfsmount_lock);
+ spin_lock(&pwd.dentry->d_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
struct path tmp = root;
char * cwd;

+ spin_unlock(&pwd.dentry->d_lock);
+ /* XXX: race here, have to close (eg. return unlinked from __d_path) */
cwd = __d_path(&pwd, &tmp, page, PAGE_SIZE);
+ br_read_unlock(vfsmount_lock);
spin_unlock(&dcache_lock);

error = PTR_ERR(cwd);
@@ -2269,8 +2310,11 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
if (copy_to_user(buf, cwd, len))
error = -EFAULT;
}
- } else
+ } else {
+ spin_unlock(&pwd.dentry->d_lock);
+ br_read_unlock(vfsmount_lock);
spin_unlock(&dcache_lock);
+ }

out:
path_put(&pwd);
@@ -2359,13 +2403,16 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- if (d_unhashed(dentry)||!dentry->d_inode)
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+ if (d_unhashed(dentry) || !dentry->d_inode) {
+ spin_unlock(&dentry->d_lock);
continue;
+ }
if (!list_empty(&dentry->d_subdirs)) {
+ spin_unlock(&dentry->d_lock);
this_parent = dentry;
goto repeat;
}
- spin_lock(&dentry->d_lock);
dentry->d_count--;
spin_unlock(&dentry->d_lock);
}
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
@@ -165,6 +165,9 @@ static void spufs_prune_dir(struct dentr
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
simple_unlink(dir->d_inode, dentry);
+ /* XXX: what is dcache_lock protecting here? Other
+ * filesystems (IB, configfs) release dcache_lock
+ * before unlink */
spin_unlock(&dcache_lock);
dput(dentry);
} else {
Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h
+++ linux-2.6/fs/configfs/configfs_internal.h
@@ -121,6 +121,7 @@ static inline struct config_item *config
struct config_item * item = NULL;

spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
if (!d_unhashed(dentry)) {
struct configfs_dirent * sd = dentry->d_fsdata;
if (sd->s_type & CONFIGFS_ITEM_LINK) {
@@ -129,6 +130,7 @@ static inline struct config_item *config
} else
item = config_item_get(sd->s_element);
}
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);

return item;
Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -156,13 +156,16 @@ struct dentry *ocfs2_find_local_alias(st
list_for_each(p, &inode->i_dentry) {
dentry = list_entry(p, struct dentry, d_alias);

+ spin_lock(&dentry->d_lock);
if (ocfs2_match_dentry(dentry, parent_blkno, skip_unhashed)) {
mlog(0, "dentry found: %.*s\n",
dentry->d_name.len, dentry->d_name.name);

- dget_locked(dentry);
+ dget_locked_dlock(dentry);
+ spin_unlock(&dentry->d_lock);
break;
}
+ spin_unlock(&dentry->d_lock);

dentry = NULL;
}
Index: linux-2.6/security/tomoyo/realpath.c
===================================================================
--- linux-2.6.orig/security/tomoyo/realpath.c
+++ linux-2.6/security/tomoyo/realpath.c
@@ -17,6 +17,7 @@
#include <linux/magic.h>
#include <linux/slab.h>
#include "common.h"
+#include "../../fs/internal.h"

/**
* tomoyo_encode: Convert binary string to ascii string.
@@ -92,8 +93,10 @@ int tomoyo_realpath_from_path2(struct pa
struct path ns_root = {.mnt = NULL, .dentry = NULL};

spin_lock(&dcache_lock);
+ br_read_lock(vfsmount_lock);
/* go to whatever namespace root we are under */
sp = __d_path(path, &ns_root, newname, newname_len);
+ br_read_unlock(vfsmount_lock);
spin_unlock(&dcache_lock);
/* Prepend "/proc" prefix if using internal proc vfs mount. */
if (!IS_ERR(sp) && (path->mnt->mnt_flags & MNT_INTERNAL) &&
Index: linux-2.6/drivers/usb/core/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -351,10 +351,13 @@ static int usbfs_empty (struct dentry *d

list_for_each(list, &dentry->d_subdirs) {
struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
+ spin_lock(&de->d_lock);
if (usbfs_positive(de)) {
+ spin_unlock(&de->d_lock);
spin_unlock(&dcache_lock);
return 0;
}
+ spin_unlock(&de->d_lock);
}

spin_unlock(&dcache_lock);
Index: linux-2.6/fs/ceph/dir.c
===================================================================
--- linux-2.6.orig/fs/ceph/dir.c
+++ linux-2.6/fs/ceph/dir.c
@@ -135,6 +135,7 @@ more:
fi->at_end = 1;
goto out_unlock;
}
+ spin_lock(&dentry->d_lock);
if (!d_unhashed(dentry) && dentry->d_inode &&
ceph_snap(dentry->d_inode) != CEPH_SNAPDIR &&
ceph_ino(dentry->d_inode) != CEPH_INO_CEPH &&
@@ -144,13 +145,13 @@ more:
dentry->d_name.len, dentry->d_name.name, di->offset,
filp->f_pos, d_unhashed(dentry) ? " unhashed" : "",
!dentry->d_inode ? " null" : "");
+ spin_unlock(&dentry->d_lock);
p = p->prev;
dentry = list_entry(p, struct dentry, d_u.d_child);
di = ceph_dentry(dentry);
}

- spin_lock(&dentry->d_lock);
- dentry->d_count++;
+ dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
spin_unlock(&inode->i_lock);
n***@suse.de
2010-06-24 03:02:30 UTC
Permalink
It is possible to run dput without taking locks up-front. In many cases
where we don't kill the dentry anyway, these locks are not required.

(I think... this needs more thought.) This also changes ->d_delete locking,
which has not been fully audited; d_delete must be idempotent, for one.
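
The resulting dput() fast path looks roughly like this (a condensed sketch of
the hunk below, not the complete function):

        spin_lock(&dentry->d_lock);
        BUG_ON(!dentry->d_count);
        if (dentry->d_count > 1) {
                /* Common case: not the last reference, no other locks taken. */
                dentry->d_count--;
                spin_unlock(&dentry->d_lock);
                return;
        }
        /*
         * Possibly the last reference: only if the dentry must actually be
         * killed do we drop d_lock, take dcache_inode_lock and the parent's
         * d_lock, then re-take d_lock and re-check d_count before proceeding.
         */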

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 60 ++++++++++++++++++++++++++++++++----------------------------
1 file changed, 32 insertions(+), 28 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -278,7 +278,8 @@ static struct dentry *d_kill(struct dent

void dput(struct dentry *dentry)
{
- struct dentry *parent = NULL;
+ struct dentry *parent;
+
if (!dentry)
return;

@@ -286,25 +287,10 @@ repeat:
if (dentry->d_count == 1)
might_sleep();
spin_lock(&dentry->d_lock);
- if (dentry->d_count == 1) {
- if (!spin_trylock(&dcache_inode_lock)) {
-drop2:
- spin_unlock(&dentry->d_lock);
- goto repeat;
- }
- parent = dentry->d_parent;
- if (parent && parent != dentry) {
- if (!spin_trylock(&parent->d_lock)) {
- spin_unlock(&dcache_inode_lock);
- goto drop2;
- }
- }
- }
- dentry->d_count--;
- if (dentry->d_count) {
+ BUG_ON(!dentry->d_count);
+ if (dentry->d_count > 1) {
+ dentry->d_count--;
spin_unlock(&dentry->d_lock);
- if (parent && parent != dentry)
- spin_unlock(&parent->d_lock);
return;
}

@@ -312,8 +298,10 @@ drop2:
* AV: ->d_delete() is _NOT_ allowed to block now.
*/
if (dentry->d_op && dentry->d_op->d_delete) {
- if (dentry->d_op->d_delete(dentry))
- goto unhash_it;
+ if (dentry->d_op->d_delete(dentry)) {
+ __d_drop(dentry);
+ goto kill_it;
+ }
}
/* Unreachable? Get rid of it */
if (d_unhashed(dentry))
@@ -322,15 +310,31 @@ drop2:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);
}
- spin_unlock(&dentry->d_lock);
- if (parent && parent != dentry)
- spin_unlock(&parent->d_lock);
- spin_unlock(&dcache_inode_lock);
- return;
+ dentry->d_count--;
+ spin_unlock(&dentry->d_lock);
+ return;

-unhash_it:
- __d_drop(dentry);
kill_it:
+ spin_unlock(&dentry->d_lock);
+ spin_lock(&dcache_inode_lock);
+relock:
+ spin_lock(&dentry->d_lock);
+ parent = dentry->d_parent;
+ if (parent && parent != dentry) {
+ if (!spin_trylock(&parent->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto relock;
+ }
+ }
+ dentry->d_count--;
+ if (dentry->d_count) {
+ /* This case should be fine */
+ spin_unlock(&dentry->d_lock);
+ if (parent && parent != dentry)
+ spin_unlock(&parent->d_lock);
+ spin_unlock(&dcache_inode_lock);
+ return;
+ }
/* if dentry was on the d_lru list delete it from there */
dentry_lru_del(dentry);
dentry = d_kill(dentry);
n***@suse.de
2010-06-24 03:02:28 UTC
Permalink
The remaining usage for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree, by excluding modifications to the tree.
It is also needed to walk in the leaf->root direction in the tree, where we
don't have a natural d_lock ordering.

This could be accomplished by taking every d_lock, but that would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations. Insert operations are not serialised. Delete operations are
tricky: when walking up the directory, our parent might have been deleted
while we dropped the locks, so we also need to check and retry for that.

XXX: hmm, we could of course just take the rename lock if there is any worry
about livelock. Most of these are slow paths.
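
The read side ends up as the usual seqlock retry loop, roughly (sketch only,
not the exact code of any one caller):

	unsigned seq;
rename_retry:
	seq = read_seqbegin(&rename_lock);
	rcu_read_lock();
	/* ... walk towards the root, taking and dropping d_lock per step ... */
	rcu_read_unlock();
	if (read_seqretry(&rename_lock, seq))
		goto rename_retry;	/* a rename raced with us: redo the walk */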

Signed-off-by: Nick Piggin <***@suse.de>
---
drivers/staging/pohmelfs/path_entry.c | 15 ++-
fs/autofs4/waitq.c | 16 +++
fs/dcache.c | 151 +++++++++++++++++++++++++++-------
fs/nfs/namespace.c | 14 ++-
fs/seq_file.c | 1
5 files changed, 163 insertions(+), 34 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -80,6 +80,7 @@ static __cacheline_aligned_in_smp DEFINE
__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

+EXPORT_SYMBOL(rename_lock);
EXPORT_SYMBOL(dcache_inode_lock);
EXPORT_SYMBOL(dcache_hash_lock);
EXPORT_SYMBOL(dcache_lock);
@@ -961,11 +962,15 @@ void shrink_dcache_for_umount(struct sup
* Return true if the parent or its subdirectories contain
* a mount point
*/
-
int have_submounts(struct dentry *parent)
{
- struct dentry *this_parent = parent;
+ struct dentry *this_parent;
struct list_head *next;
+ unsigned seq;
+
+rename_retry:
+ this_parent = parent;
+ seq = read_seqbegin(&rename_lock);

spin_lock(&dcache_lock);
if (d_mountpoint(parent))
@@ -999,17 +1004,38 @@ resume:
* All done at this level ... ascend and resume the search.
*/
if (this_parent != parent) {
- next = this_parent->d_u.d_child.next;
+ struct dentry *tmp;
+ struct dentry *child;
+
+ tmp = this_parent->d_parent;
+ rcu_read_lock();
spin_unlock(&this_parent->d_lock);
- this_parent = this_parent->d_parent;
+ child = this_parent;
+ this_parent = tmp;
spin_lock(&this_parent->d_lock);
+ /* might go back up the wrong parent if we have had a rename
+ * or deletion */
+ if (this_parent != child->d_parent ||
+ // d_unlinked(this_parent) || XXX
+ read_seqretry(&rename_lock, seq)) {
+ spin_unlock(&this_parent->d_lock);
+ spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ goto rename_retry;
+ }
+ rcu_read_unlock();
+ next = child->d_u.d_child.next;
goto resume;
}
spin_unlock(&this_parent->d_lock);
spin_unlock(&dcache_lock);
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
return 0; /* No mount points found in tree */
positive:
spin_unlock(&dcache_lock);
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
return 1;
}
EXPORT_SYMBOL(have_submounts);
@@ -1030,9 +1056,15 @@ EXPORT_SYMBOL(have_submounts);
*/
static int select_parent(struct dentry * parent)
{
- struct dentry *this_parent = parent;
+ struct dentry *this_parent;
struct list_head *next;
- int found = 0;
+ unsigned seq;
+ int found;
+
+rename_retry:
+ found = 0;
+ this_parent = parent;
+ seq = read_seqbegin(&rename_lock);

spin_lock(&dcache_lock);
spin_lock(&this_parent->d_lock);
@@ -1043,7 +1075,6 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- BUG_ON(this_parent == dentry);

spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
dentry_lru_del_init(dentry);
@@ -1084,17 +1115,33 @@ resume:
*/
if (this_parent != parent) {
struct dentry *tmp;
- next = this_parent->d_u.d_child.next;
+ struct dentry *child;
+
tmp = this_parent->d_parent;
+ rcu_read_lock();
spin_unlock(&this_parent->d_lock);
- BUG_ON(tmp == this_parent);
+ child = this_parent;
this_parent = tmp;
spin_lock(&this_parent->d_lock);
+ /* might go back up the wrong parent if we have had a rename
+ * or deletion */
+ if (this_parent != child->d_parent ||
+ // d_unlinked(this_parent) || XXX
+ read_seqretry(&rename_lock, seq)) {
+ spin_unlock(&this_parent->d_lock);
+ spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ goto rename_retry;
+ }
+ rcu_read_unlock();
+ next = child->d_u.d_child.next;
goto resume;
}
out:
spin_unlock(&this_parent->d_lock);
spin_unlock(&dcache_lock);
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
return found;
}

@@ -1606,7 +1653,7 @@ EXPORT_SYMBOL(d_add_ci);
struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
struct dentry * dentry = NULL;
- unsigned long seq;
+ unsigned seq;

do {
seq = read_seqbegin(&rename_lock);
@@ -2211,12 +2258,20 @@ static int prepend_name(char **buffer, i
char *__d_path(const struct path *path, struct path *root,
char *buffer, int buflen)
{
- struct dentry *dentry = path->dentry;
- struct vfsmount *vfsmnt = path->mnt;
- char *end = buffer + buflen;
+ struct dentry *dentry;
+ struct vfsmount *vfsmnt;
+ char *end;
char *retval;
+ unsigned seq;

+rename_retry:
+ dentry = path->dentry;
+ vfsmnt = path->mnt;
+ end = buffer + buflen;
prepend(&end, &buflen, "\0", 1);
+
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
spin_lock(&dentry->d_lock);
unlinked:
if (d_unlinked(dentry) &&
@@ -2253,12 +2308,16 @@ unlinked:
retval = end;
spin_unlock(&dentry->d_lock);
dentry = parent;
+ spin_lock(&dentry->d_lock);
if (d_unlinked(dentry))
goto unlinked;
}

out:
spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
return retval;

global_root:
@@ -2267,6 +2326,7 @@ global_root:
goto Elong;
root->mnt = vfsmnt;
root->dentry = dentry;
+ /* XXX: this could wrongly modify root if we rename retry */
goto out;

Elong:
@@ -2349,12 +2409,19 @@ char *dynamic_dname(struct dentry *dentr
*/
char *dentry_path(struct dentry *dentry, char *buf, int buflen)
{
- char *end = buf + buflen;
+ char *end;
char *retval;
+ unsigned seq;
+
+rename_retry:
+ end = buf + buflen;
+ prepend(&end, &buflen, "\0", 1);

+ seq = read_seqbegin(&rename_lock);
spin_lock(&dcache_lock);
+ br_read_lock(vfsmount_lock);
+ rcu_read_lock(); /* protect parent */
spin_lock(&dentry->d_lock);
- prepend(&end, &buflen, "\0", 1);
unlinked:
if (d_unlinked(dentry) &&
(prepend(&end, &buflen, "//deleted", 9) != 0))
@@ -2380,13 +2447,17 @@ unlinked:
if (d_unlinked(dentry))
goto unlinked;
}
+out:
spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
+ br_read_unlock(vfsmount_lock);
spin_unlock(&dcache_lock);
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
return retval;
Elong:
- spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
- return ERR_PTR(-ENAMETOOLONG);
+ retval = ERR_PTR(-ENAMETOOLONG);
+ goto out;
}

/*
@@ -2481,25 +2552,25 @@ out:
int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
{
int result;
- unsigned long seq;
+ unsigned seq;

if (new_dentry == old_dentry)
return 1;

- /*
- * Need rcu_readlock to protect against the d_parent trashing
- * due to d_move
- */
- rcu_read_lock();
do {
/* for restarting inner loop in case of seq retry */
seq = read_seqbegin(&rename_lock);
+ /*
+ * Need rcu_readlock to protect against the d_parent trashing
+ * due to d_move
+ */
+ rcu_read_lock();
if (d_ancestor(old_dentry, new_dentry))
result = 1;
else
result = 0;
+ rcu_read_unlock();
} while (read_seqretry(&rename_lock, seq));
- rcu_read_unlock();

return result;
}
@@ -2531,9 +2602,13 @@ EXPORT_SYMBOL(path_is_under);

void d_genocide(struct dentry *root)
{
- struct dentry *this_parent = root;
+ struct dentry *this_parent;
struct list_head *next;
+ unsigned seq;

+rename_retry:
+ this_parent = root;
+ seq = read_seqbegin(&rename_lock);
spin_lock(&dcache_lock);
spin_lock(&this_parent->d_lock);
repeat:
@@ -2543,6 +2618,7 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
+
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
if (d_unhashed(dentry) || !dentry->d_inode) {
spin_unlock(&dentry->d_lock);
@@ -2555,19 +2631,44 @@ resume:
spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
goto repeat;
}
- dentry->d_count--;
+ if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
+ dentry->d_flags |= DCACHE_GENOCIDE;
+ dentry->d_count--;
+ }
spin_unlock(&dentry->d_lock);
}
if (this_parent != root) {
- next = this_parent->d_u.d_child.next;
- this_parent->d_count--;
+ struct dentry *tmp;
+ struct dentry *child;
+
+ tmp = this_parent->d_parent;
+ if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
+ this_parent->d_flags |= DCACHE_GENOCIDE;
+ this_parent->d_count--;
+ }
+ rcu_read_lock();
spin_unlock(&this_parent->d_lock);
- this_parent = this_parent->d_parent;
+ child = this_parent;
+ this_parent = tmp;
spin_lock(&this_parent->d_lock);
+ /* might go back up the wrong parent if we have had a rename
+ * or deletion */
+ if (this_parent != child->d_parent ||
+ // d_unlinked(this_parent) || XXX
+ read_seqretry(&rename_lock, seq)) {
+ spin_unlock(&this_parent->d_lock);
+ spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ goto rename_retry;
+ }
+ rcu_read_unlock();
+ next = child->d_u.d_child.next;
goto resume;
}
spin_unlock(&this_parent->d_lock);
spin_unlock(&dcache_lock);
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
}

/**
Index: linux-2.6/fs/seq_file.c
===================================================================
--- linux-2.6.orig/fs/seq_file.c
+++ linux-2.6/fs/seq_file.c
@@ -470,6 +470,7 @@ int seq_path_root(struct seq_file *m, st
p = __d_path(path, root, buf, size);
br_read_unlock(vfsmount_lock);
spin_unlock(&dcache_lock);
+
res = PTR_ERR(p);
if (!IS_ERR(p)) {
char *end = mangle_path(buf, p, esc);
Index: linux-2.6/drivers/staging/pohmelfs/path_entry.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/path_entry.c
+++ linux-2.6/drivers/staging/pohmelfs/path_entry.c
@@ -83,10 +83,11 @@ out:
int pohmelfs_path_length(struct pohmelfs_inode *pi)
{
struct dentry *d, *root, *first;
- int len = 1; /* Root slash */
+ int len;
+ unsigned seq;

- first = d = d_find_alias(&pi->vfs_inode);
- if (!d) {
+ first = d_find_alias(&pi->vfs_inode);
+ if (!first) {
dprintk("%s: ino: %llu, mode: %o.\n", __func__, pi->ino, pi->vfs_inode.i_mode);
return -ENOENT;
}
@@ -95,6 +96,11 @@ int pohmelfs_path_length(struct pohmelfs
root = dget(current->fs->root.dentry);
spin_unlock(&current->fs->lock);

+rename_retry:
+ len = 1; /* Root slash */
+ d = first;
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
spin_lock(&dcache_lock);

if (!IS_ROOT(d) && d_unhashed(d))
@@ -105,6 +111,9 @@ int pohmelfs_path_length(struct pohmelfs
d = d->d_parent;
}
spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;

dput(root);
dput(first);
Index: linux-2.6/fs/autofs4/waitq.c
===================================================================
--- linux-2.6.orig/fs/autofs4/waitq.c
+++ linux-2.6/fs/autofs4/waitq.c
@@ -186,16 +186,25 @@ static int autofs4_getpath(struct autofs
{
struct dentry *root = sbi->sb->s_root;
struct dentry *tmp;
- char *buf = *name;
+ char *buf;
char *p;
- int len = 0;
+ int len;
+ unsigned seq;

+rename_retry:
+ buf = *name;
+ len = 0;
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
spin_lock(&dcache_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
len += tmp->d_name.len + 1;

if (!len || --len > NAME_MAX) {
spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
return 0;
}

@@ -209,6 +218,9 @@ static int autofs4_getpath(struct autofs
strncpy(p, tmp->d_name.name, tmp->d_name.len);
}
spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;

return len;
}
Index: linux-2.6/fs/nfs/namespace.c
===================================================================
--- linux-2.6.orig/fs/nfs/namespace.c
+++ linux-2.6/fs/nfs/namespace.c
@@ -49,11 +49,17 @@ char *nfs_path(const char *base,
const struct dentry *dentry,
char *buffer, ssize_t buflen)
{
- char *end = buffer+buflen;
+ char *end;
int namelen;
+ unsigned seq;

+rename_retry:
+ end = buffer+buflen;
*--end = '\0';
buflen--;
+
+ seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
spin_lock(&dcache_lock);
while (!IS_ROOT(dentry) && dentry != droot) {
namelen = dentry->d_name.len;
@@ -66,6 +72,9 @@ char *nfs_path(const char *base,
dentry = dentry->d_parent;
}
spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
if (*end != '/') {
if (--buflen < 0)
goto Elong;
@@ -83,6 +92,9 @@ char *nfs_path(const char *base,
return end;
Elong_unlock:
spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
Elong:
return ERR_PTR(-ENAMETOOLONG);
}
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -187,6 +187,7 @@ d_iput: no no no yes
#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 /* Parent inode is watched by some fsnotify listener */

#define DCACHE_CANT_MOUNT 0x0100
+#define DCACHE_GENOCIDE 0x0200

extern spinlock_t dcache_inode_lock;
extern spinlock_t dcache_hash_lock;
Peter Zijlstra
2010-06-24 07:58:10 UTC
Permalink
plain text document attachment (fs-dcache_lock-multi-step.patch)
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.
This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.
Solve this instead by using the rename seqlock for multi-step read-side
operations. Insert operations are not serialised. Delete operations are
tricky when walking up the directory our parent might have been deleted
when dropping locks so also need to check and retry for that.
XXX: hmm, we could of course just take the rename lock if there is any worry
about livelock. Most of these are slow paths.
Ah, does this address John's issue?
Nick Piggin
2010-06-24 15:03:34 UTC
Permalink
Post by Peter Zijlstra
plain text document attachment (fs-dcache_lock-multi-step.patch)
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.
This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.
Solve this instead by using the rename seqlock for multi-step read-side
operations. Insert operations are not serialised. Delete operations are
tricky when walking up the directory our parent might have been deleted
when dropping locks so also need to check and retry for that.
XXX: hmm, we could of course just take the rename lock if there is any worry
about livelock. Most of these are slow paths.
Ah, does this address John's issue?
This is where John's issue is introduced. I actually again couldn't
see the problem (thought I saw a problem, then lost it!).

Got to think about it and test some more... I couldn't reproduce the problem,
mind you, but I was testing mainline whereas the bug was seen on -rt.
john stultz
2010-06-24 17:22:11 UTC
Permalink
Post by Nick Piggin
Post by Peter Zijlstra
plain text document attachment (fs-dcache_lock-multi-step.patch)
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.
This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.
Solve this instead by using the rename seqlock for multi-step read-side
operations. Insert operations are not serialised. Delete operations are
tricky when walking up the directory our parent might have been deleted
when dropping locks so also need to check and retry for that.
XXX: hmm, we could of course just take the rename lock if there is any worry
about livelock. Most of these are slow paths.
Ah, does this address John's issue?
This is where John's issue is introduced. I actually again couldn't
see the problem (thought I saw a problem, then lost it!).
Ok. So I need to review this new set in full, but the issue we tripped
over with the patches in -rt was the following:

In select_parent, when it does the dentry ascending, it has to release
the current dentry's lock so it can acquire the parent's d_lock (for proper
ordering). At the point it has released the lock, before it grabs the
parent's lock, there is nothing preventing dput from being called on the
next dentry, grabbing the parent's and dentry's d_lock and killing it.
Then back in select_parent, when we try to lock the next entry, it no
longer exists and we oops.

So I can't see anything that is protecting the dentry (or even the new
parent dentry, should everything be killed under it) at this point. The
dentry's d_count might be zero, we don't hold its d_lock, and we don't
hold the parent's d_lock. What am I missing here?
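
Roughly the interleaving I mean (my simplified reading of it, as a sketch
rather than actual trace output):

	/* select_parent() ascending            dput() on the child, elsewhere
	 *
	 * spin_unlock(&child->d_lock);         (child == the old this_parent)
	 *                                      spin_lock(&child->d_lock);
	 *                                      d_kill(child): unlinks it from
	 *                                      d_subdirs and frees it
	 * spin_lock(&this_parent->d_lock);
	 * next = child->d_u.d_child.next;      <- child is gone, next is junk
	 */
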
Post by Nick Piggin
Got to think about it and test more... I couldn't reproduce the problem
mind you, but I was testing mainline wheras bug was seen on -rt.
Yea, -rt may be allowing the race to occur more easily (I only found I
could trigger it on a UP machine), since we can be preempted right after
we release the dentry lock as we try to grab the parent's. Then dput
would jump in and wreck things. I had some pretty clear trace logs that
were easily repeatable (well, it took about an hour or two to trigger),
and they showed about the same order of operations every time.

I don't remember if there's another lock being held at the point
select_parent is called, so on mainline it might be harder to trigger
(having to actually get the dput race on another cpu timed right).


thanks
-john
john stultz
2010-06-24 17:26:37 UTC
Permalink
plain text document attachment (fs-dcache_lock-multi-step.patch)
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.
This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.
Solve this instead by using the rename seqlock for multi-step read-side
operations. Insert operations are not serialised. Delete operations are
tricky when walking up the directory our parent might have been deleted
when dropping locks so also need to check and retry for that.
XXX: hmm, we could of course just take the rename lock if there is any worry
about livelock. Most of these are slow paths.
I'll try to point out exactly the spot I think we were hitting in the
-rt tree (once the dcache_lock is removed).
@@ -1030,9 +1056,15 @@ EXPORT_SYMBOL(have_submounts);
*/
static int select_parent(struct dentry * parent)
{
- struct dentry *this_parent = parent;
+ struct dentry *this_parent;
struct list_head *next;
- int found = 0;
+ unsigned seq;
+ int found;
+
+ found = 0;
+ this_parent = parent;
+ seq = read_seqbegin(&rename_lock);
spin_lock(&dcache_lock);
spin_lock(&this_parent->d_lock);
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- BUG_ON(this_parent == dentry);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
dentry_lru_del_init(dentry);
*/
if (this_parent != parent) {
struct dentry *tmp;
- next = this_parent->d_u.d_child.next;
+ struct dentry *child;
+
tmp = this_parent->d_parent;
+ rcu_read_lock();
spin_unlock(&this_parent->d_lock);
- BUG_ON(tmp == this_parent);
+ child = this_parent;
this_parent = tmp;
Ok. So right here, we get preempted, or dput() is called by another cpu
on the child dentry, or on the child->d_u.d_child.next dentry, and it
gets d_kill'ed.
spin_lock(&this_parent->d_lock);
+ /* might go back up the wrong parent if we have had a rename
+ * or deletion */
+ if (this_parent != child->d_parent ||
+ // d_unlinked(this_parent) || XXX
+ read_seqretry(&rename_lock, seq)) {
+ spin_unlock(&this_parent->d_lock);
+ spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ goto rename_retry;
+ }
+ rcu_read_unlock();
+ next = child->d_u.d_child.next;
Then at this point, next may point to junk.
goto resume;
}
spin_unlock(&this_parent->d_lock);
spin_unlock(&dcache_lock);
+ if (read_seqretry(&rename_lock, seq))
+ goto rename_retry;
return found;
}
thanks
-john
Nick Piggin
2010-06-25 06:45:56 UTC
Permalink
Post by john stultz
plain text document attachment (fs-dcache_lock-multi-step.patch)
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.
This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.
Solve this instead by using the rename seqlock for multi-step read-side
operations. Insert operations are not serialised. Delete operations are
tricky when walking up the directory our parent might have been deleted
when dropping locks so also need to check and retry for that.
XXX: hmm, we could of course just take the rename lock if there is any worry
about livelock. Most of these are slow paths.
I'll try to point out exactly the spot I think we were hitting in the
-rt tree (once the dcache_lock is removed).
@@ -1030,9 +1056,15 @@ EXPORT_SYMBOL(have_submounts);
*/
static int select_parent(struct dentry * parent)
{
- struct dentry *this_parent = parent;
+ struct dentry *this_parent;
struct list_head *next;
- int found = 0;
+ unsigned seq;
+ int found;
+
+ found = 0;
+ this_parent = parent;
+ seq = read_seqbegin(&rename_lock);
spin_lock(&dcache_lock);
spin_lock(&this_parent->d_lock);
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- BUG_ON(this_parent == dentry);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
dentry_lru_del_init(dentry);
*/
if (this_parent != parent) {
struct dentry *tmp;
- next = this_parent->d_u.d_child.next;
+ struct dentry *child;
+
tmp = this_parent->d_parent;
+ rcu_read_lock();
spin_unlock(&this_parent->d_lock);
- BUG_ON(tmp == this_parent);
+ child = this_parent;
this_parent = tmp;
Ok. So right here, we get preempted, or dput() is called by another cpu
on the child dentry, or the child->d_u.d_child.next dentry and its
d_kill'ed.
spin_lock(&this_parent->d_lock);
+ /* might go back up the wrong parent if we have had a rename
+ * or deletion */
+ if (this_parent != child->d_parent ||
+ // d_unlinked(this_parent) || XXX
+ read_seqretry(&rename_lock, seq)) {
+ spin_unlock(&this_parent->d_lock);
+ spin_unlock(&dcache_lock);
+ rcu_read_unlock();
+ goto rename_retry;
+ }
+ rcu_read_unlock();
+ next = child->d_u.d_child.next;
Then at this point, next may point to junk.
But see the test above it. We ensure that child->d_parent still points
to this_parent, with this_parent's d_lock held. Oh, I'm not clearing
d_parent! d_kill() should have
dentry->d_parent = NULL;
when it removes dentry from the list.

That should fix it, I'd hope.

Thanks,
Nick
n***@suse.de
2010-06-24 03:02:47 UTC
Permalink
Make inode_hash_lock private by adding a function __remove_inode_hash
that can be used by filesystems defining their own drop_inode functions.
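
The distinction between the two, roughly (sketch only; locking context is the
caller's responsibility):

	remove_inode_hash(inode);		/* takes inode->i_lock itself */

	spin_lock(&inode->i_lock);
	__remove_inode_hash(inode);		/* caller already holds i_lock */
	spin_unlock(&inode->i_lock);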

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/inode.c | 42 +++++++++++++++++++++++++++---------------
include/linux/fs.h | 1 +
include/linux/writeback.h | 1 -
3 files changed, 28 insertions(+), 16 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -107,7 +107,7 @@ static struct hlist_head *inode_hashtabl
*/
DEFINE_SPINLOCK(sb_inode_list_lock);
DEFINE_SPINLOCK(wb_inode_list_lock);
-DEFINE_SPINLOCK(inode_hash_lock);
+static DEFINE_SPINLOCK(inode_hash_lock);

/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -377,9 +377,7 @@ static void dispose_list(struct list_hea

spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
+ __remove_inode_hash(inode);
list_del_init(&inode->i_sb_list);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
@@ -1280,6 +1278,20 @@ void __insert_inode_hash(struct inode *i
EXPORT_SYMBOL(__insert_inode_hash);

/**
+ * __remove_inode_hash - remove an inode from the hash
+ * @inode: inode to unhash
+ *
+ * Remove an inode from the inode hash. inode->i_lock must be
+ * held.
+ */
+void __remove_inode_hash(struct inode *inode)
+{
+ spin_lock(&inode_hash_lock);
+ hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
+}
+
+/**
* remove_inode_hash - remove an inode from the hash
* @inode: inode to unhash
*
@@ -1288,9 +1300,7 @@ EXPORT_SYMBOL(__insert_inode_hash);
void remove_inode_hash(struct inode *inode)
{
spin_lock(&inode->i_lock);
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
+ __remove_inode_hash(inode);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -1332,11 +1342,15 @@ void generic_delete_inode(struct inode *
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
}
- spin_lock(&inode->i_lock);
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
- spin_unlock(&inode->i_lock);
+ /*
+ * i_lock not required to delete from hash. If there was a
+ * concurrency window, then it would be possible for the other
+ * thread to touch the inode after it has been freed, with
+ * destroy_inode.
+ * XXX: actually it is required, because find_inode_fast checks it.
+ * Maybe we can avoid that though...
+ */
+ remove_inode_hash(inode);
wake_up_inode(inode);
BUG_ON(inode->i_state != I_CLEAR);
destroy_inode(inode);
@@ -1377,9 +1391,7 @@ int generic_detach_inode(struct inode *i
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
+ __remove_inode_hash(inode);
atomic_dec(&inodes_stat.nr_unused);
}
spin_lock(&wb_inode_list_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -2181,6 +2181,7 @@ extern int should_remove_suid(struct den
extern int file_remove_suid(struct file *);

extern void __insert_inode_hash(struct inode *, unsigned long hashval);
+extern void __remove_inode_hash(struct inode *);
extern void remove_inode_hash(struct inode *);
static inline void insert_inode_hash(struct inode *inode) {
__insert_inode_hash(inode, inode->i_ino);
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -11,7 +11,6 @@ struct backing_dev_info;

extern spinlock_t sb_inode_list_lock;
extern spinlock_t wb_inode_list_lock;
-extern spinlock_t inode_hash_lock;
extern struct list_head inode_in_use;
extern struct list_head inode_unused;



n***@suse.de
2010-06-24 03:02:31 UTC
Permalink
We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.
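
The point of the hlist_bl lists is that the per-bucket lock lives in bit 0 of
the bucket head pointer, so the finer-grained locking costs no extra memory
and lookups can stay lockless under RCU. Roughly (sketch only; names as in
the patch below):

	static void hash_insert(struct dcache_hash_bucket *b, struct dentry *dentry)
	{
		spin_lock_bucket(b);	/* bit_spin_lock(0, ...) on the head */
		hlist_bl_add_head_rcu(&dentry->d_hash, &b->head);
		spin_unlock_bucket(b);
	}

	/* __d_lookup stays lock-free: rcu_read_lock() +
	 * hlist_bl_for_each_entry_rcu(), then d_lock to validate the hit. */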

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 234 +++++++++++++++++++++++++++----------------------
fs/super.c | 3
include/linux/dcache.h | 23 ----
include/linux/fs.h | 3
4 files changed, 141 insertions(+), 122 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -33,13 +33,15 @@
#include <linux/bootmem.h>
#include <linux/fs_struct.h>
#include <linux/hardirq.h>
+#include <linux/bit_spinlock.h>
+#include <linux/rculist_bl.h>
#include "internal.h"

/*
* Usage:
* dcache_inode_lock protects:
* - i_dentry, d_alias, d_inode
- * dcache_hash_lock protects:
+ * dcache_hash_bucket lock protects:
* - the dcache hash table
* dcache_lru_lock protects:
* - the dcache lru lists and counters
@@ -57,7 +59,7 @@
* dcache_inode_lock
* dentry->d_lock
* dcache_lru_lock
- * dcache_hash_lock
+ * dcache_hash_bucket lock
*
* If there is an ancestor relationship:
* dentry->d_parent->...->d_parent->d_lock
@@ -74,13 +76,11 @@ int sysctl_vfs_cache_pressure __read_mos
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

EXPORT_SYMBOL(rename_lock);
EXPORT_SYMBOL(dcache_inode_lock);
-EXPORT_SYMBOL(dcache_hash_lock);

static struct kmem_cache *dentry_cache __read_mostly;

@@ -99,7 +99,11 @@ static struct kmem_cache *dentry_cache _

static unsigned int d_hash_mask __read_mostly;
static unsigned int d_hash_shift __read_mostly;
-static struct hlist_head *dentry_hashtable __read_mostly;
+
+struct dcache_hash_bucket {
+ struct hlist_bl_head head;
+};
+static struct dcache_hash_bucket *dentry_hashtable __read_mostly;

/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
@@ -107,6 +111,24 @@ struct dentry_stat_t dentry_stat = {
.age_limit = 45,
};

+static inline struct dcache_hash_bucket *d_hash(struct dentry *parent,
+ unsigned long hash)
+{
+ hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
+ hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
+ return dentry_hashtable + (hash & D_HASHMASK);
+}
+
+static inline void spin_lock_bucket(struct dcache_hash_bucket *b)
+{
+ bit_spin_lock(0, (unsigned long *)b);
+}
+
+static inline void spin_unlock_bucket(struct dcache_hash_bucket *b)
+{
+ __bit_spin_unlock(0, (unsigned long *)b);
+}
+
static void __d_free(struct dentry *dentry)
{
WARN_ON(!list_empty(&dentry->d_alias));
@@ -131,7 +153,7 @@ static void d_free(struct dentry *dentry
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
/* if dentry was never inserted into hash, immediate free is OK */
- if (hlist_unhashed(&dentry->d_hash))
+ if (hlist_bl_unhashed(&dentry->d_hash))
__d_free(dentry);
else
call_rcu(&dentry->d_u.d_rcu, d_callback);
@@ -247,6 +269,81 @@ static struct dentry *d_kill(struct dent
return parent;
}

+void __d_drop(struct dentry *dentry)
+{
+ if (!(dentry->d_flags & DCACHE_UNHASHED)) {
+ struct dcache_hash_bucket *b;
+ b = d_hash(dentry->d_parent, dentry->d_name.hash);
+ /* XXX: put unhashed inside hash lock? (don't need to actually) */
+ dentry->d_flags |= DCACHE_UNHASHED;
+ spin_lock_bucket(b);
+ hlist_bl_del_rcu(&dentry->d_hash);
+ spin_unlock_bucket(b);
+ }
+}
+EXPORT_SYMBOL(__d_drop);
+
+void d_drop(struct dentry *dentry)
+{
+ spin_lock(&dentry->d_lock);
+ __d_drop(dentry);
+ spin_unlock(&dentry->d_lock);
+}
+EXPORT_SYMBOL(d_drop);
+
+/* This must be called with d_lock held */
+static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
+{
+ dentry->d_count++;
+ dentry_lru_del_init(dentry);
+ return dentry;
+}
+
+static inline struct dentry * __dget_locked(struct dentry *dentry)
+{
+ spin_lock(&dentry->d_lock);
+ __dget_locked_dlock(dentry);
+ spin_unlock(&dentry->d_lock);
+ return dentry;
+}
+
+struct dentry * dget_locked_dlock(struct dentry *dentry)
+{
+ return __dget_locked_dlock(dentry);
+}
+
+struct dentry * dget_locked(struct dentry *dentry)
+{
+ return __dget_locked(dentry);
+}
+EXPORT_SYMBOL(dget_locked);
+
+struct dentry *dget_parent(struct dentry *dentry)
+{
+ struct dentry *ret;
+
+repeat:
+ spin_lock(&dentry->d_lock);
+ ret = dentry->d_parent;
+ if (!ret)
+ goto out;
+ if (dentry == ret) {
+ ret->d_count++;
+ goto out;
+ }
+ if (!spin_trylock(&ret->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto repeat;
+ }
+ BUG_ON(!ret->d_count);
+ ret->d_count++;
+ spin_unlock(&ret->d_lock);
+out:
+ spin_unlock(&dentry->d_lock);
+ return ret;
+}
+EXPORT_SYMBOL(dget_parent);
+
/*
* This is dput
*
@@ -398,60 +495,6 @@ int d_invalidate(struct dentry * dentry)
}
EXPORT_SYMBOL(d_invalidate);

-/* This must be called with d_lock held */
-static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
-{
- dentry->d_count++;
- dentry_lru_del_init(dentry);
- return dentry;
-}
-
-/* This must be called with d_lock held */
-static inline struct dentry * __dget_locked(struct dentry *dentry)
-{
- spin_lock(&dentry->d_lock);
- __dget_locked_dlock(dentry);
- spin_unlock(&dentry->d_lock);
- return dentry;
-}
-
-struct dentry * dget_locked_dlock(struct dentry *dentry)
-{
- return __dget_locked_dlock(dentry);
-}
-
-struct dentry * dget_locked(struct dentry *dentry)
-{
- return __dget_locked(dentry);
-}
-EXPORT_SYMBOL(dget_locked);
-
-struct dentry *dget_parent(struct dentry *dentry)
-{
- struct dentry *ret;
-
-repeat:
- spin_lock(&dentry->d_lock);
- ret = dentry->d_parent;
- if (!ret)
- goto out;
- if (dentry == ret) {
- ret->d_count++;
- goto out;
- }
- if (!spin_trylock(&ret->d_lock)) {
- spin_unlock(&dentry->d_lock);
- goto repeat;
- }
- BUG_ON(!ret->d_count);
- ret->d_count++;
- spin_unlock(&ret->d_lock);
-out:
- spin_unlock(&dentry->d_lock);
- return ret;
-}
-EXPORT_SYMBOL(dget_parent);
-
/**
* d_find_alias - grab a hashed alias of inode
* @inode: inode in question
@@ -900,8 +943,8 @@ void shrink_dcache_for_umount(struct sup
spin_unlock(&dentry->d_lock);
shrink_dcache_for_umount_subtree(dentry);

- while (!hlist_empty(&sb->s_anon)) {
- dentry = hlist_entry(sb->s_anon.first, struct dentry, d_hash);
+ while (!hlist_bl_empty(&sb->s_anon)) {
+ dentry = hlist_bl_entry(hlist_bl_first(&sb->s_anon), struct dentry, d_hash);
shrink_dcache_for_umount_subtree(dentry);
}
}
@@ -1183,7 +1226,7 @@ struct dentry *d_alloc(struct dentry * p
dentry->d_op = NULL;
dentry->d_fsdata = NULL;
dentry->d_mounted = 0;
- INIT_HLIST_NODE(&dentry->d_hash);
+ INIT_HLIST_BL_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
INIT_LIST_HEAD(&dentry->d_alias);
@@ -1348,14 +1391,6 @@ struct dentry * d_alloc_root(struct inod
}
EXPORT_SYMBOL(d_alloc_root);

-static inline struct hlist_head *d_hash(struct dentry *parent,
- unsigned long hash)
-{
- hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
- hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
- return dentry_hashtable + (hash & D_HASHMASK);
-}
-
/**
* d_obtain_alias - find or allocate a dentry for a given inode
* @inode: inode to allocate the dentry for
@@ -1411,7 +1446,7 @@ struct dentry *d_obtain_alias(struct ino
tmp->d_flags |= DCACHE_DISCONNECTED;
tmp->d_flags &= ~DCACHE_UNHASHED;
list_add(&tmp->d_alias, &inode->i_dentry);
- hlist_add_head(&tmp->d_hash, &inode->i_sb->s_anon);
+ hlist_bl_add_head(&tmp->d_hash, &inode->i_sb->s_anon); /* XXX: make s_anon a bl list */
spin_unlock(&tmp->d_lock);
spin_unlock(&dcache_inode_lock);

@@ -1604,14 +1639,14 @@ struct dentry * __d_lookup(struct dentry
unsigned int len = name->len;
unsigned int hash = name->hash;
const unsigned char *str = name->name;
- struct hlist_head *head = d_hash(parent,hash);
+ struct dcache_hash_bucket *b = d_hash(parent, hash);
+ struct hlist_bl_node *node;
struct dentry *found = NULL;
- struct hlist_node *node;
struct dentry *dentry;

rcu_read_lock();

- hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
+ hlist_bl_for_each_entry_rcu(dentry, node, &b->head, d_hash) {
struct qstr *qstr;

if (dentry->d_name.hash != hash)
@@ -1698,8 +1733,9 @@ out:

int d_validate(struct dentry *dentry, struct dentry *dparent)
{
- struct hlist_head *base;
- struct hlist_node *lhp;
+ struct dcache_hash_bucket *b;
+ struct dentry *lhp;
+ struct hlist_bl_node *node;

/* Check whether the ptr might be valid at all.. */
if (!kmem_ptr_validate(dentry_cache, dentry))
@@ -1709,20 +1745,17 @@ int d_validate(struct dentry *dentry, st
goto out;

spin_lock(&dentry->d_lock);
- spin_lock(&dcache_hash_lock);
- base = d_hash(dparent, dentry->d_name.hash);
- hlist_for_each(lhp,base) {
- /* hlist_for_each_entry_rcu() not required for d_hash list
- * as it is parsed under dcache_hash_lock
- */
- if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
- spin_unlock(&dcache_hash_lock);
+ b = d_hash(dparent, dentry->d_name.hash);
+ rcu_read_lock();
+ hlist_bl_for_each_entry_rcu(lhp, node, &b->head, d_hash) {
+ if (dentry == lhp) {
+ rcu_read_unlock();
__dget_locked_dlock(dentry);
spin_unlock(&dentry->d_lock);
return 1;
}
}
- spin_unlock(&dcache_hash_lock);
+ rcu_read_unlock();
spin_unlock(&dentry->d_lock);
out:
return 0;
@@ -1776,11 +1809,12 @@ void d_delete(struct dentry * dentry)
}
EXPORT_SYMBOL(d_delete);

-static void __d_rehash(struct dentry * entry, struct hlist_head *list)
+static void __d_rehash(struct dentry * entry, struct dcache_hash_bucket *b)
{
-
entry->d_flags &= ~DCACHE_UNHASHED;
- hlist_add_head_rcu(&entry->d_hash, list);
+ spin_lock_bucket(b);
+ hlist_bl_add_head_rcu(&entry->d_hash, &b->head);
+ spin_unlock_bucket(b);
}

static void _d_rehash(struct dentry * entry)
@@ -1798,9 +1832,7 @@ static void _d_rehash(struct dentry * en
void d_rehash(struct dentry * entry)
{
spin_lock(&entry->d_lock);
- spin_lock(&dcache_hash_lock);
_d_rehash(entry);
- spin_unlock(&dcache_hash_lock);
spin_unlock(&entry->d_lock);
}
EXPORT_SYMBOL(d_rehash);
@@ -1879,6 +1911,7 @@ static void switch_names(struct dentry *
*/
static void d_move_locked(struct dentry * dentry, struct dentry * target)
{
+ struct dcache_hash_bucket *b;
if (!dentry->d_inode)
printk(KERN_WARNING "VFS: moving negative dcache entry\n");

@@ -1909,11 +1942,13 @@ static void d_move_locked(struct dentry
}

/* Move the dentry to the target hash queue, if on different bucket */
- spin_lock(&dcache_hash_lock);
- if (!d_unhashed(dentry))
- hlist_del_rcu(&dentry->d_hash);
+ if (!d_unhashed(dentry)) {
+ b = d_hash(dentry->d_parent, dentry->d_name.hash);
+ spin_lock_bucket(b);
+ hlist_bl_del_rcu(&dentry->d_hash);
+ spin_unlock_bucket(b);
+ }
__d_rehash(dentry, d_hash(target->d_parent, target->d_name.hash));
- spin_unlock(&dcache_hash_lock);

/* Unhash the target: dput() will then get rid of it */
__d_drop(target);
@@ -2121,9 +2156,7 @@ struct dentry *d_materialise_unique(stru
found_lock:
spin_lock(&actual->d_lock);
found:
- spin_lock(&dcache_hash_lock);
_d_rehash(actual);
- spin_unlock(&dcache_hash_lock);
spin_unlock(&actual->d_lock);
spin_unlock(&dcache_inode_lock);
out_nolock:
@@ -2631,7 +2664,7 @@ static void __init dcache_init_early(voi

dentry_hashtable =
alloc_large_system_hash("Dentry cache",
- sizeof(struct hlist_head),
+ sizeof(struct dcache_hash_bucket),
dhash_entries,
13,
HASH_EARLY,
@@ -2640,7 +2673,7 @@ static void __init dcache_init_early(voi
0);

for (loop = 0; loop < (1 << d_hash_shift); loop++)
- INIT_HLIST_HEAD(&dentry_hashtable[loop]);
+ INIT_HLIST_BL_HEAD(&dentry_hashtable[loop].head);
}

static void __init dcache_init(void)
@@ -2663,7 +2696,7 @@ static void __init dcache_init(void)

dentry_hashtable =
alloc_large_system_hash("Dentry cache",
- sizeof(struct hlist_head),
+ sizeof(struct dcache_hash_bucket),
dhash_entries,
13,
0,
@@ -2672,7 +2705,7 @@ static void __init dcache_init(void)
0);

for (loop = 0; loop < (1 << d_hash_shift); loop++)
- INIT_HLIST_HEAD(&dentry_hashtable[loop]);
+ INIT_HLIST_BL_HEAD(&dentry_hashtable[loop].head);
}

/* SLAB cache for __getname() consumers */
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -4,6 +4,7 @@
#include <asm/atomic.h>
#include <linux/list.h>
#include <linux/rculist.h>
+#include <linux/rculist_bl.h>
#include <linux/spinlock.h>
#include <linux/cache.h>
#include <linux/rcupdate.h>
@@ -97,7 +98,7 @@ struct dentry {
* The next three fields are touched by __d_lookup. Place them here
* so they all fit in a cache line.
*/
- struct hlist_node d_hash; /* lookup hash list */
+ struct hlist_bl_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;

@@ -190,7 +191,6 @@ d_iput: no no yes
#define DCACHE_GENOCIDE 0x0200

extern spinlock_t dcache_inode_lock;
-extern spinlock_t dcache_hash_lock;
extern seqlock_t rename_lock;

/**
@@ -208,23 +208,8 @@ extern seqlock_t rename_lock;
*
* __d_drop requires dentry->d_lock.
*/
-
-static inline void __d_drop(struct dentry *dentry)
-{
- if (!(dentry->d_flags & DCACHE_UNHASHED)) {
- dentry->d_flags |= DCACHE_UNHASHED;
- spin_lock(&dcache_hash_lock);
- hlist_del_rcu(&dentry->d_hash);
- spin_unlock(&dcache_hash_lock);
- }
-}
-
-static inline void d_drop(struct dentry *dentry)
-{
- spin_lock(&dentry->d_lock);
- __d_drop(dentry);
- spin_unlock(&dentry->d_lock);
-}
+void d_drop(struct dentry *dentry);
+void __d_drop(struct dentry *dentry);

static inline int dname_external(struct dentry *dentry)
{
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -30,6 +30,7 @@
#include <linux/idr.h>
#include <linux/mutex.h>
#include <linux/backing-dev.h>
+#include <linux/rculist_bl.h>
#include "internal.h"


@@ -71,7 +72,7 @@ static struct super_block *alloc_super(s
INIT_LIST_HEAD(&s->s_files);
#endif
INIT_LIST_HEAD(&s->s_instances);
- INIT_HLIST_HEAD(&s->s_anon);
+ INIT_HLIST_BL_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
INIT_LIST_HEAD(&s->s_dentry_lru);
init_rwsem(&s->s_umount);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -382,6 +382,7 @@ struct inodes_stat_t {
#include <linux/capability.h>
#include <linux/semaphore.h>
#include <linux/fiemap.h>
+#include <linux/rculist_bl.h>

#include <asm/atomic.h>
#include <asm/byteorder.h>
@@ -1341,7 +1342,7 @@ struct super_block {
const struct xattr_handler **s_xattr;

struct list_head s_inodes; /* all inodes */
- struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
+ struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */
#ifdef CONFIG_SMP
struct list_head __percpu *s_files;
#else
n***@suse.de
2010-06-24 03:02:42 UTC
Permalink
Add a new lock, wb_inode_list_lock, to protect i_list and various lists
which the inode can be put onto.
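
The documented ordering is inode->i_lock then wb_inode_list_lock, but the
writeback paths find an inode via the lists first, so they can only trylock
i_lock and must back off and rescan on failure. Roughly (sketch only, not the
exact code of any one caller):

again:
	spin_lock(&wb_inode_list_lock);
	inode = list_entry(wb->b_io.prev, struct inode, i_list);
	if (!spin_trylock(&inode->i_lock)) {
		spin_unlock(&wb_inode_list_lock);
		goto again;		/* out-of-order acquisition: retry */
	}
	/* ... requeue or write back the inode ... */
	spin_unlock(&wb_inode_list_lock);
	spin_unlock(&inode->i_lock);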

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/fs-writeback.c | 40 ++++++++++++++++++++++++++++++++++++++--
fs/inode.c | 43 +++++++++++++++++++++++++++++++++++--------
include/linux/writeback.h | 1 +
mm/backing-dev.c | 4 ++++
4 files changed, 78 insertions(+), 10 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -287,6 +287,7 @@ static void redirty_tail(struct inode *i
{
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;

+ assert_spin_locked(&wb_inode_list_lock);
if (!list_empty(&wb->b_dirty)) {
struct inode *tail;

@@ -304,6 +305,7 @@ static void requeue_io(struct inode *ino
{
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;

+ assert_spin_locked(&wb_inode_list_lock);
list_move(&inode->i_list, &wb->b_more_io);
}

@@ -344,6 +346,7 @@ static void move_expired_inodes(struct l
struct inode *inode;
int do_sb_sort = 0;

+ assert_spin_locked(&wb_inode_list_lock);
while (!list_empty(delaying_queue)) {
inode = list_entry(delaying_queue->prev, struct inode, i_list);
if (older_than_this &&
@@ -399,11 +402,13 @@ static void inode_wait_for_writeback(str

wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
while (inode->i_state & I_SYNC) {
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ spin_lock(&wb_inode_list_lock);
}
}

@@ -457,6 +462,7 @@ writeback_single_inode(struct inode *ino
/* Set I_SYNC, reset I_DIRTY_PAGES */
inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY_PAGES;
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);

@@ -493,6 +499,7 @@ writeback_single_inode(struct inode *ino

spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ spin_lock(&wb_inode_list_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
@@ -623,23 +630,31 @@ static int writeback_sb_inodes(struct su
struct bdi_writeback *wb,
struct writeback_control *wbc)
{
+again:
while (!list_empty(&wb->b_io)) {
long pages_skipped;
struct inode *inode = list_entry(wb->b_io.prev,
struct inode, i_list);
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ spin_lock(&wb_inode_list_lock);
+ goto again;
+ }
if (wbc->sb && sb != inode->i_sb) {
/* super block given and doesn't
match, skip this inode */
redirty_tail(inode);
+ spin_unlock(&inode->i_lock);
continue;
}
- if (sb != inode->i_sb)
+ if (sb != inode->i_sb) {
/* finish with this superblock */
+ spin_unlock(&inode->i_lock);
return 0;
- spin_lock(&inode->i_lock);
+ }
if (inode->i_state & (I_NEW | I_WILL_FREE)) {
- spin_unlock(&inode->i_lock);
requeue_io(inode);
+ spin_unlock(&inode->i_lock);
continue;
}
/*
@@ -662,11 +677,13 @@ static int writeback_sb_inodes(struct su
*/
redirty_tail(inode);
}
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
iput(inode);
cond_resched();
spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
if (wbc->nr_to_write <= 0) {
wbc->more_io = 1;
return 1;
@@ -685,6 +702,9 @@ static void writeback_inodes_wb(struct b

wbc->wb_start = jiffies; /* livelock avoidance */
spin_lock(&inode_lock);
+again:
+ spin_lock(&wb_inode_list_lock);
+
if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);

@@ -697,13 +717,23 @@ static void writeback_inodes_wb(struct b
if (wbc->sb && sb != wbc->sb) {
/* super block given and doesn't
match, skip this inode */
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ goto again;
+ }
redirty_tail(inode);
+ spin_unlock(&inode->i_lock);
continue;
}
state = pin_sb_for_writeback(wbc, sb);

if (state == SB_PIN_FAILED) {
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ goto again;
+ }
requeue_io(inode);
+ spin_unlock(&inode->i_lock);
continue;
}
ret = writeback_sb_inodes(sb, wb, wbc);
@@ -713,6 +743,7 @@ static void writeback_inodes_wb(struct b
if (ret)
break;
}
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);
/* Leave any unwritten inodes on b_io */
}
@@ -825,12 +856,21 @@ static long wb_writeback(struct bdi_writ
* become available for writeback. Otherwise
* we'll just busyloop.
*/
+retry:
spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
struct inode, i_list);
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&inode_lock);
+ goto retry;
+ }
inode_wait_for_writeback(inode);
+ spin_unlock(&inode->i_lock);
}
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);
}

@@ -1142,7 +1182,9 @@ void __mark_inode_dirty(struct inode *in
}

inode->dirtied_when = jiffies;
+ spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, &wb->b_dirty);
+ spin_unlock(&wb_inode_list_lock);
}
}
out:
@@ -1306,7 +1348,9 @@ int write_inode_now(struct inode *inode,
might_sleep();
spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ spin_lock(&wb_inode_list_lock);
ret = writeback_single_inode(inode, &wbc);
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (sync)
@@ -1332,7 +1376,9 @@ int sync_inode(struct inode *inode, stru

spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ spin_lock(&wb_inode_list_lock);
ret = writeback_single_inode(inode, wbc);
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
return ret;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -32,6 +32,8 @@
* s_inodes, i_sb_list
* inode_hash_lock protects:
* inode hash table, i_hash
+ * wb_inode_list_lock protects:
+ * inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
* inode->i_lock protects:
* i_state, i_count
*
@@ -39,6 +41,7 @@
* inode_lock
* sb_inode_list_lock
* inode->i_lock
+ * wb_inode_list_lock
* inode_hash_lock
*/
/*
@@ -100,6 +103,7 @@ static struct hlist_head *inode_hashtabl
*/
DEFINE_SPINLOCK(inode_lock);
DEFINE_SPINLOCK(sb_inode_list_lock);
+DEFINE_SPINLOCK(wb_inode_list_lock);
DEFINE_SPINLOCK(inode_hash_lock);

/*
@@ -309,8 +313,11 @@ void __iget(struct inode *inode)
if (inode->i_count > 1)
return;

- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+ spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, &inode_in_use);
+ spin_unlock(&wb_inode_list_lock);
+ }
inodes_stat.nr_unused--;
}

@@ -413,7 +420,9 @@ static int invalidate_list(struct list_h
}
invalidate_inode_buffers(inode);
if (!inode->i_count) {
+ spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, dispose);
+ spin_unlock(&wb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
@@ -492,6 +501,8 @@ static void prune_icache(int nr_to_scan)

down_read(&iprune_sem);
spin_lock(&inode_lock);
+again:
+ spin_lock(&wb_inode_list_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
struct inode *inode;

@@ -500,13 +511,17 @@ static void prune_icache(int nr_to_scan)

inode = list_entry(inode_unused.prev, struct inode, i_list);

- spin_lock(&inode->i_lock);
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ goto again;
+ }
if (inode->i_state || inode->i_count) {
list_move(&inode->i_list, &inode_unused);
spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ spin_unlock(&wb_inode_list_lock);
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
@@ -515,11 +530,16 @@ static void prune_icache(int nr_to_scan)
0, -1);
iput(inode);
spin_lock(&inode_lock);
+again2:
+ spin_lock(&wb_inode_list_lock);

if (inode != list_entry(inode_unused.next,
struct inode, i_list))
continue; /* wrong inode or list_empty */
- spin_lock(&inode->i_lock);
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ goto again2;
+ }
if (!can_unuse(inode)) {
spin_unlock(&inode->i_lock);
continue;
@@ -537,6 +557,7 @@ static void prune_icache(int nr_to_scan)
else
__count_vm_events(PGINODESTEAL, reap);
spin_unlock(&inode_lock);
+ spin_unlock(&wb_inode_list_lock);

dispose_list(&freeable);
up_read(&iprune_sem);
@@ -660,7 +681,9 @@ __inode_add_to_lists(struct super_block
spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
+ spin_lock(&wb_inode_list_lock);
list_add(&inode->i_list, &inode_in_use);
+ spin_unlock(&wb_inode_list_lock);
if (head) {
spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
@@ -1290,7 +1313,9 @@ void generic_delete_inode(struct inode *
{
const struct super_operations *op = inode->i_sb->s_op;

+ spin_lock(&wb_inode_list_lock);
list_del_init(&inode->i_list);
+ spin_unlock(&wb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
@@ -1335,8 +1360,11 @@ int generic_detach_inode(struct inode *i
struct super_block *sb = inode->i_sb;

if (!hlist_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+ spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, &inode_unused);
+ spin_unlock(&wb_inode_list_lock);
+ }
inodes_stat.nr_unused++;
if (sb->s_flags & MS_ACTIVE) {
spin_unlock(&inode->i_lock);
@@ -1360,7 +1388,9 @@ int generic_detach_inode(struct inode *i
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
}
+ spin_lock(&wb_inode_list_lock);
list_del_init(&inode->i_list);
+ spin_unlock(&wb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
@@ -1432,17 +1462,17 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state == I_CLEAR);

-retry:
+retry1:
spin_lock(&inode->i_lock);
if (inode->i_count == 1) {
if (!spin_trylock(&inode_lock)) {
+retry2:
spin_unlock(&inode->i_lock);
- goto retry;
+ goto retry1;
}
if (!spin_trylock(&sb_inode_list_lock)) {
spin_unlock(&inode_lock);
- spin_unlock(&inode->i_lock);
- goto retry;
+ goto retry2;
}
inode->i_count--;
iput_final(inode);
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -11,6 +11,7 @@ struct backing_dev_info;

extern spinlock_t inode_lock;
extern spinlock_t sb_inode_list_lock;
+extern spinlock_t wb_inode_list_lock;
extern spinlock_t inode_hash_lock;
extern struct list_head inode_in_use;
extern struct list_head inode_unused;
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -78,6 +78,7 @@ static int bdi_debug_stats_show(struct s
*/
nr_wb = nr_dirty = nr_io = nr_more_io = 0;
spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
list_for_each_entry(wb, &bdi->wb_list, list) {
nr_wb++;
list_for_each_entry(inode, &wb->b_dirty, i_list)
@@ -87,6 +88,7 @@ static int bdi_debug_stats_show(struct s
list_for_each_entry(inode, &wb->b_more_io, i_list)
nr_more_io++;
}
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);

get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
@@ -712,9 +714,11 @@ void bdi_destroy(struct backing_dev_info
struct bdi_writeback *dst = &default_backing_dev_info.wb;

spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
list_splice(&bdi->wb.b_io, &dst->b_io);
list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);
}
Peter Zijlstra
2010-06-24 08:58:08 UTC
Permalink
+ assert_spin_locked(&wb_inode_list_lock);
There's also lockdep_assert_held() which also validates we're the owner.
Nick Piggin
2010-06-24 15:09:08 UTC
Permalink
Post by Peter Zijlstra
+ assert_spin_locked(&wb_inode_list_lock);
There's also lockdep_assert_held() which also validates we're the owner.
These locks should have such minuscule contention now that the two
effectively mean the same thing :) But no, that's a good suggestion,
thanks. I guess _most_ assert_spin_locked calls could be changed over.
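
For reference, the difference looks roughly like this (sketch; both helpers
already exist, and lockdep_assert_held compiles away without lockdep):

	spin_lock(&wb_inode_list_lock);
	assert_spin_locked(&wb_inode_list_lock);   /* only: "somebody holds it" */
	lockdep_assert_held(&wb_inode_list_lock);  /* lockdep: "we hold it" */
	spin_unlock(&wb_inode_list_lock);
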
Peter Zijlstra
2010-06-24 15:13:32 UTC
Permalink
Post by Nick Piggin
Post by Peter Zijlstra
+ assert_spin_locked(&wb_inode_list_lock);
There's also lockdep_assert_held() which also validates we're the owner.
These locks should have such miniscule contention now that they
effectively mean the same thing :) But no that's a good suggestion
thanks. I guess _most_ assert_spin_locked could be changed over.
Probably, I just haven't felt like actually visiting all sites to
check ;-)
n***@suse.de
2010-06-24 03:02:26 UTC
Permalink
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

XXX: probably don't need parent lock in inotify (because child lock
should stabilize parent). Also, possibly some filesystems don't need so
much locking (eg. of child dentry when modifying d_child, so long as
parent is locked)... but be on the safe side. Hmm, maybe we should just
say d_child list is protected by d_parent->d_lock. d_parent could remain
protected with d_lock.

XXX: leave dcache_lock in there until remove dcache_lock patch

Signed-off-by: Nick Piggin <***@suse.de>
---
drivers/usb/core/inode.c | 8 +-
fs/autofs4/expire.c | 87 +++++++++++++++-------
fs/autofs4/root.c | 6 +
fs/ceph/dir.c | 6 +
fs/ceph/inode.c | 8 +-
fs/coda/cache.c | 2
fs/dcache.c | 164 ++++++++++++++++++++++++++++++++++---------
fs/libfs.c | 40 ++++++----
fs/ncpfs/dir.c | 3
fs/ncpfs/ncplib_kernel.h | 4 +
fs/notify/fsnotify.c | 4 -
fs/notify/inotify/inotify.c | 4 -
fs/smbfs/cache.c | 4 +
include/linux/dcache.h | 1
kernel/cgroup.c | 19 ++++
security/selinux/selinuxfs.c | 12 ++-
16 files changed, 283 insertions(+), 89 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -47,6 +47,8 @@
* - d_lru
* - d_count
* - d_unhashed()
+ * - d_parent and d_subdirs
+ * - childrens' d_child and d_parent
*
* Ordering:
* dcache_lock
@@ -219,7 +221,8 @@ static void dentry_lru_del_init(struct d
*
* If this is the root of the dentry tree, return NULL.
*
- * dcache_lock and d_lock must be held by caller, are dropped by d_kill.
+ * dcache_lock and d_lock and d_parent->d_lock must be held by caller, and
+ * are dropped by d_kill.
*/
static struct dentry *d_kill(struct dentry *dentry)
__releases(dentry->d_lock)
@@ -228,12 +231,14 @@ static struct dentry *d_kill(struct dent
struct dentry *parent;

list_del(&dentry->d_u.d_child);
- /*drops the locks, at that point nobody can reach this dentry */
- dentry_iput(dentry);
+ if (dentry->d_parent && dentry != dentry->d_parent)
+ spin_unlock(&dentry->d_parent->d_lock);
if (IS_ROOT(dentry))
parent = NULL;
else
parent = dentry->d_parent;
+ /*drops the locks, at that point nobody can reach this dentry */
+ dentry_iput(dentry);
d_free(dentry);
return parent;
}
@@ -269,6 +274,7 @@ static struct dentry *d_kill(struct dent

void dput(struct dentry *dentry)
{
+ struct dentry *parent = NULL;
if (!dentry)
return;

@@ -287,10 +293,20 @@ repeat:
spin_unlock(&dentry->d_lock);
goto repeat;
}
+ parent = dentry->d_parent;
+ if (parent && parent != dentry) {
+ if (!spin_trylock(&parent->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ goto repeat;
+ }
+ }
}
dentry->d_count--;
if (dentry->d_count) {
spin_unlock(&dentry->d_lock);
+ if (parent && parent != dentry)
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
return;
}
@@ -310,6 +326,8 @@ repeat:
dentry_lru_add(dentry);
}
spin_unlock(&dentry->d_lock);
+ if (parent && parent != dentry)
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
return;

@@ -545,10 +563,22 @@ static void prune_one_dentry(struct dent
* because dcache_lock needs to be taken anyway.
*/
while (dentry) {
+ struct dentry *parent = NULL;
+
spin_lock(&dcache_lock);
+again:
spin_lock(&dentry->d_lock);
+ if (dentry->d_parent && dentry != dentry->d_parent) {
+ if (!spin_trylock(&dentry->d_parent->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto again;
+ }
+ parent = dentry->d_parent;
+ }
dentry->d_count--;
if (dentry->d_count) {
+ if (parent)
+ spin_unlock(&parent->d_lock);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return;
@@ -626,20 +656,27 @@ again:
dentry = list_entry(tmp.prev, struct dentry, d_lru);

if (!spin_trylock(&dentry->d_lock)) {
+again1:
spin_unlock(&dcache_lru_lock);
goto again;
}
- __dentry_lru_del_init(dentry);
/*
* We found an inuse dentry which was not removed from
* the LRU because of laziness during lookup. Do not free
* it - just keep it off the LRU list.
*/
if (dentry->d_count) {
+ __dentry_lru_del_init(dentry);
spin_unlock(&dentry->d_lock);
continue;
}
-
+ if (dentry->d_parent && dentry->d_parent != dentry) {
+ if (!spin_trylock(&dentry->d_parent->d_lock)) {
+ spin_unlock(&dentry->d_lock);
+ goto again1;
+ }
+ }
+ __dentry_lru_del_init(dentry);
spin_unlock(&dcache_lru_lock);
prune_one_dentry(dentry);
/* dcache_lock and dentry->d_lock dropped */
@@ -776,14 +813,15 @@ static void shrink_dcache_for_umount_sub
/* this is a branch with children - detach all of them
* from the system in one go */
spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
list_for_each_entry(loop, &dentry->d_subdirs,
d_u.d_child) {
- spin_lock(&loop->d_lock);
+ spin_lock_nested(&loop->d_lock, DENTRY_D_LOCK_NESTED);
dentry_lru_del_init(loop);
__d_drop(loop);
spin_unlock(&loop->d_lock);
- cond_resched_lock(&dcache_lock);
}
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);

/* move to the first child */
@@ -811,16 +849,17 @@ static void shrink_dcache_for_umount_sub
BUG();
}

- if (IS_ROOT(dentry))
+ if (IS_ROOT(dentry)) {
parent = NULL;
- else {
+ list_del(&dentry->d_u.d_child);
+ } else {
parent = dentry->d_parent;
spin_lock(&parent->d_lock);
parent->d_count--;
+ list_del(&dentry->d_u.d_child);
spin_unlock(&parent->d_lock);
}

- list_del(&dentry->d_u.d_child);
detached++;

inode = dentry->d_inode;
@@ -905,6 +944,7 @@ int have_submounts(struct dentry *parent
spin_lock(&dcache_lock);
if (d_mountpoint(parent))
goto positive;
+ spin_lock(&this_parent->d_lock);
repeat:
next = this_parent->d_subdirs.next;
resume:
@@ -912,22 +952,34 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
+
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
/* Have we found a mount point ? */
- if (d_mountpoint(dentry))
+ if (d_mountpoint(dentry)) {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&this_parent->d_lock);
goto positive;
+ }
if (!list_empty(&dentry->d_subdirs)) {
+ spin_unlock(&this_parent->d_lock);
+ spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
this_parent = dentry;
+ spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
goto repeat;
}
+ spin_unlock(&dentry->d_lock);
}
/*
* All done at this level ... ascend and resume the search.
*/
if (this_parent != parent) {
next = this_parent->d_u.d_child.next;
+ spin_unlock(&this_parent->d_lock);
this_parent = this_parent->d_parent;
+ spin_lock(&this_parent->d_lock);
goto resume;
}
+ spin_unlock(&this_parent->d_lock);
spin_unlock(&dcache_lock);
return 0; /* No mount points found in tree */
positive:
@@ -957,6 +1009,7 @@ static int select_parent(struct dentry *
int found = 0;

spin_lock(&dcache_lock);
+ spin_lock(&this_parent->d_lock);
repeat:
next = this_parent->d_subdirs.next;
resume:
@@ -964,8 +1017,9 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
+ BUG_ON(this_parent == dentry);

- spin_lock(&dentry->d_lock);
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
dentry_lru_del_init(dentry);
/*
* move only zero ref count dentries to the end
@@ -975,33 +1029,45 @@ resume:
dentry_lru_add_tail(dentry);
found++;
}
- spin_unlock(&dentry->d_lock);

/*
* We can return to the caller if we have found some (this
* ensures forward progress). We'll be coming back to find
* the rest.
*/
- if (found && need_resched())
+ if (found && need_resched()) {
+ spin_unlock(&dentry->d_lock);
goto out;
+ }

/*
* Descend a level if the d_subdirs list is non-empty.
*/
if (!list_empty(&dentry->d_subdirs)) {
+ spin_unlock(&this_parent->d_lock);
+ spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
this_parent = dentry;
+ spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
goto repeat;
}
+
+ spin_unlock(&dentry->d_lock);
}
/*
* All done at this level ... ascend and resume the search.
*/
if (this_parent != parent) {
+ struct dentry *tmp;
next = this_parent->d_u.d_child.next;
- this_parent = this_parent->d_parent;
+ tmp = this_parent->d_parent;
+ spin_unlock(&this_parent->d_lock);
+ BUG_ON(tmp == this_parent);
+ this_parent = tmp;
+ spin_lock(&this_parent->d_lock);
goto resume;
}
out:
+ spin_unlock(&this_parent->d_lock);
spin_unlock(&dcache_lock);
return found;
}
@@ -1098,19 +1164,20 @@ struct dentry *d_alloc(struct dentry * p
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
INIT_LIST_HEAD(&dentry->d_alias);
-
- if (parent) {
- dentry->d_parent = dget(parent);
- dentry->d_sb = parent->d_sb;
- } else {
- INIT_LIST_HEAD(&dentry->d_u.d_child);
- }
+ INIT_LIST_HEAD(&dentry->d_u.d_child);

if (parent) {
spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+ dentry->d_parent = dget_dlock(parent);
+ dentry->d_sb = parent->d_sb;
list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
}
+
atomic_inc(&dentry_stat.nr_dentry);

return dentry;
@@ -1802,15 +1869,26 @@ static void d_move_locked(struct dentry
/*
* XXXX: do we really need to take target->d_lock?
*/
- if (d_ancestor(dentry, target)) {
- spin_lock(&dentry->d_lock);
- spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
- } else if (d_ancestor(target, dentry) || target < dentry) {
- spin_lock(&target->d_lock);
- spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
- } else {
- spin_lock(&dentry->d_lock);
- spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
+ BUG_ON(d_ancestor(dentry, target));
+ BUG_ON(d_ancestor(target, dentry));
+
+ if (dentry->d_parent == target->d_parent)
+ spin_lock(&dentry->d_parent->d_lock);
+ else {
+ if (d_ancestor(dentry->d_parent, target->d_parent)) {
+ spin_lock(&dentry->d_parent->d_lock);
+ spin_lock_nested(&target->d_parent->d_lock, DENTRY_D_LOCK_NESTED);
+ } else {
+ spin_lock(&target->d_parent->d_lock);
+ spin_lock_nested(&dentry->d_parent->d_lock, DENTRY_D_LOCK_NESTED);
+ }
+ }
+ if (target < dentry) {
+ spin_lock_nested(&target->d_lock, 2);
+ spin_lock_nested(&dentry->d_lock, 3);
+ } else {
+ spin_lock_nested(&dentry->d_lock, 2);
+ spin_lock_nested(&target->d_lock, 3);
}

/* Move the dentry to the target hash queue, if on different bucket */
@@ -1843,6 +1921,10 @@ static void d_move_locked(struct dentry
}

list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
+ if (target->d_parent != dentry->d_parent)
+ spin_unlock(&dentry->d_parent->d_lock);
+ if (target->d_parent != target)
+ spin_unlock(&target->d_parent->d_lock);
spin_unlock(&target->d_lock);
fsnotify_d_move(dentry);
spin_unlock(&dentry->d_lock);
@@ -1943,6 +2025,13 @@ static void __d_materialise_dentry(struc
dparent = dentry->d_parent;
aparent = anon->d_parent;

+ /* XXX: hack */
+ /* returns with anon->d_lock held! */
+ spin_lock(&aparent->d_lock);
+ spin_lock(&dparent->d_lock);
+ spin_lock(&dentry->d_lock);
+ spin_lock(&anon->d_lock);
+
dentry->d_parent = (aparent == anon) ? dentry : aparent;
list_del(&dentry->d_u.d_child);
if (!IS_ROOT(dentry))
@@ -1957,6 +2046,10 @@ static void __d_materialise_dentry(struc
else
INIT_LIST_HEAD(&anon->d_u.d_child);

+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dparent->d_lock);
+ spin_unlock(&aparent->d_lock);
+
anon->d_flags &= ~DCACHE_DISCONNECTED;
}

@@ -1992,7 +2085,6 @@ struct dentry *d_materialise_unique(stru
/* Is this an anonymous mountpoint that we could splice
* into our tree? */
if (IS_ROOT(alias)) {
- spin_lock(&alias->d_lock);
__d_materialise_dentry(dentry, alias);
__d_drop(alias);
goto found;
@@ -2396,6 +2488,7 @@ void d_genocide(struct dentry *root)
struct list_head *next;

spin_lock(&dcache_lock);
+ spin_lock(&this_parent->d_lock);
repeat:
next = this_parent->d_subdirs.next;
resume:
@@ -2409,8 +2502,10 @@ resume:
continue;
}
if (!list_empty(&dentry->d_subdirs)) {
- spin_unlock(&dentry->d_lock);
+ spin_unlock(&this_parent->d_lock);
+ spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
this_parent = dentry;
+ spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
goto repeat;
}
dentry->d_count--;
@@ -2418,12 +2513,13 @@ resume:
}
if (this_parent != root) {
next = this_parent->d_u.d_child.next;
- spin_lock(&this_parent->d_lock);
this_parent->d_count--;
spin_unlock(&this_parent->d_lock);
this_parent = this_parent->d_parent;
+ spin_lock(&this_parent->d_lock);
goto resume;
}
+ spin_unlock(&this_parent->d_lock);
spin_unlock(&dcache_lock);
}

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -81,7 +81,8 @@ int dcache_dir_close(struct inode *inode

loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
{
- mutex_lock(&file->f_path.dentry->d_inode->i_mutex);
+ struct dentry *dentry = file->f_path.dentry;
+ mutex_lock(&dentry->d_inode->i_mutex);
switch (origin) {
case 1:
offset += file->f_pos;
@@ -89,7 +90,7 @@ loff_t dcache_dir_lseek(struct file *fil
if (offset >= 0)
break;
default:
- mutex_unlock(&file->f_path.dentry->d_inode->i_mutex);
+ mutex_unlock(&dentry->d_inode->i_mutex);
return -EINVAL;
}
if (offset != file->f_pos) {
@@ -99,23 +100,27 @@ loff_t dcache_dir_lseek(struct file *fil
struct dentry *cursor = file->private_data;
loff_t n = file->f_pos - 2;

- spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
+ spin_lock_nested(&cursor->d_lock, DENTRY_D_LOCK_NESTED);
list_del(&cursor->d_u.d_child);
- p = file->f_path.dentry->d_subdirs.next;
- while (n && p != &file->f_path.dentry->d_subdirs) {
+ spin_unlock(&cursor->d_lock);
+ p = dentry->d_subdirs.next;
+ while (n && p != &dentry->d_subdirs) {
struct dentry *next;
next = list_entry(p, struct dentry, d_u.d_child);
- spin_lock(&next->d_lock);
+ spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
if (simple_positive(next))
n--;
spin_unlock(&next->d_lock);
p = p->next;
}
+ spin_lock_nested(&cursor->d_lock, DENTRY_D_LOCK_NESTED);
list_add_tail(&cursor->d_u.d_child, p);
- spin_unlock(&dcache_lock);
+ spin_unlock(&cursor->d_lock);
+ spin_unlock(&dentry->d_lock);
}
}
- mutex_unlock(&file->f_path.dentry->d_inode->i_mutex);
+ mutex_unlock(&dentry->d_inode->i_mutex);
return offset;
}

@@ -155,9 +160,12 @@ int dcache_readdir(struct file * filp, v
i++;
/* fallthrough */
default:
- spin_lock(&dcache_lock);
- if (filp->f_pos == 2)
+ spin_lock(&dentry->d_lock);
+ if (filp->f_pos == 2) {
+ spin_lock_nested(&cursor->d_lock, DENTRY_D_LOCK_NESTED);
list_move(q, &dentry->d_subdirs);
+ spin_unlock(&cursor->d_lock);
+ }

for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
struct dentry *next;
@@ -169,19 +177,21 @@ int dcache_readdir(struct file * filp, v
}

spin_unlock(&next->d_lock);
- spin_unlock(&dcache_lock);
+ spin_unlock(&dentry->d_lock);
if (filldir(dirent, next->d_name.name,
next->d_name.len, filp->f_pos,
next->d_inode->i_ino,
dt_type(next->d_inode)) < 0)
return 0;
- spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
+ spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
/* next is still alive */
list_move(q, p);
+ spin_unlock(&next->d_lock);
p = q;
filp->f_pos++;
}
- spin_unlock(&dcache_lock);
+ spin_unlock(&dentry->d_lock);
}
return 0;
}
@@ -278,7 +288,7 @@ int simple_empty(struct dentry *dentry)
struct dentry *child;
int ret = 0;

- spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
if (simple_positive(child)) {
@@ -289,7 +299,7 @@ int simple_empty(struct dentry *dentry)
}
ret = 1;
out:
- spin_unlock(&dcache_lock);
+ spin_unlock(&dentry->d_lock);
return ret;
}

Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -185,17 +185,19 @@ static void set_dentry_child_flags(struc
list_for_each_entry(alias, &inode->i_dentry, d_alias) {
struct dentry *child;

+ spin_lock(&alias->d_lock);
list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
if (!child->d_inode)
continue;

- spin_lock(&child->d_lock);
+ spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
if (watched)
child->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
else
child->d_flags &=~DCACHE_INOTIFY_PARENT_WATCHED;
spin_unlock(&child->d_lock);
}
+ spin_unlock(&alias->d_lock);
}
spin_unlock(&dcache_lock);
}
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -342,6 +342,7 @@ static inline struct dentry *dget_dlock(
}
return dentry;
}
+
static inline struct dentry *dget(struct dentry *dentry)
{
if (dentry) {
Index: linux-2.6/drivers/usb/core/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -348,18 +348,20 @@ static int usbfs_empty (struct dentry *d
struct list_head *list;

spin_lock(&dcache_lock);
-
+ spin_lock(&dentry->d_lock);
list_for_each(list, &dentry->d_subdirs) {
struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
- spin_lock(&de->d_lock);
+
+ spin_lock_nested(&de->d_lock, DENTRY_D_LOCK_NESTED);
if (usbfs_positive(de)) {
spin_unlock(&de->d_lock);
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return 0;
}
spin_unlock(&de->d_lock);
}
-
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return 1;
}
Index: linux-2.6/fs/autofs4/expire.c
===================================================================
--- linux-2.6.orig/fs/autofs4/expire.c
+++ linux-2.6/fs/autofs4/expire.c
@@ -93,22 +93,59 @@ done:
/*
* Calculate next entry in top down tree traversal.
* From next_mnt in namespace.c - elegant.
+ *
+ * How is this supposed to work if we drop dcache_lock between calls anyway?
+ * How does it cope with renames?
+ * And also callers dput the returned dentry before taking dcache_lock again
+ * so what prevents it from being freed??
*/
-static struct dentry *next_dentry(struct dentry *p, struct dentry *root)
+static struct dentry *get_next_positive_dentry(struct dentry *p,
+ struct dentry *root)
{
- struct list_head *next = p->d_subdirs.next;
+ struct list_head *next;
+ struct dentry *ret;

+ spin_lock(&dcache_lock);
+again:
+ spin_lock(&p->d_lock);
+ next = p->d_subdirs.next;
if (next == &p->d_subdirs) {
while (1) {
- if (p == root)
+ struct dentry *parent;
+
+ if (p == root) {
+ spin_unlock(&p->d_lock);
+ spin_unlock(&dcache_lock);
return NULL;
+ }
+
+ parent = p->d_parent;
+ if (!spin_trylock(&parent->d_lock)) {
+ spin_unlock(&p->d_lock);
+ goto again;
+ }
+ spin_unlock(&p->d_lock);
next = p->d_u.d_child.next;
- if (next != &p->d_parent->d_subdirs)
+ p = parent;
+ if (next != &parent->d_subdirs)
break;
- p = p->d_parent;
}
}
- return list_entry(next, struct dentry, d_u.d_child);
+ ret = list_entry(next, struct dentry, d_u.d_child);
+
+ spin_lock_nested(&ret->d_lock, DENTRY_D_LOCK_NESTED);
+ /* Negative dentry - try next */
+ if (!simple_positive(ret)) {
+ spin_unlock(&ret->d_lock);
+ p = ret;
+ goto again;
+ }
+ dget_dlock(ret);
+ spin_unlock(&ret->d_lock);
+ spin_unlock(&p->d_lock);
+ spin_unlock(&dcache_lock);
+
+ return ret;
}

/*
@@ -158,18 +195,11 @@ static int autofs4_tree_busy(struct vfsm
if (!simple_positive(top))
return 1;

- spin_lock(&dcache_lock);
- for (p = top; p; p = next_dentry(p, top)) {
- /* Negative dentry - give up */
- if (!simple_positive(p))
- continue;
+ for (p = dget(top); p; p = get_next_positive_dentry(p, top)) {

DPRINTK("dentry %p %.*s",
p, (int) p->d_name.len, p->d_name.name);

- p = dget(p);
- spin_unlock(&dcache_lock);
-
/*
* Is someone visiting anywhere in the subtree ?
* If there's no mount we need to check the usage
@@ -205,9 +235,7 @@ static int autofs4_tree_busy(struct vfsm
}
}
dput(p);
- spin_lock(&dcache_lock);
}
- spin_unlock(&dcache_lock);

/* Timeout of a tree mount is ultimately determined by its top dentry */
if (!autofs4_can_expire(top, timeout, do_now))
@@ -226,18 +254,11 @@ static struct dentry *autofs4_check_leav
DPRINTK("parent %p %.*s",
parent, (int)parent->d_name.len, parent->d_name.name);

- spin_lock(&dcache_lock);
- for (p = parent; p; p = next_dentry(p, parent)) {
- /* Negative dentry - give up */
- if (!simple_positive(p))
- continue;
+ for (p = dget(parent); p; p = get_next_positive_dentry(p, parent)) {

DPRINTK("dentry %p %.*s",
p, (int) p->d_name.len, p->d_name.name);

- p = dget(p);
- spin_unlock(&dcache_lock);
-
if (d_mountpoint(p)) {
/* Can we umount this guy */
if (autofs4_mount_busy(mnt, p))
@@ -249,9 +270,7 @@ static struct dentry *autofs4_check_leav
}
cont:
dput(p);
- spin_lock(&dcache_lock);
}
- spin_unlock(&dcache_lock);
return NULL;
}

@@ -294,6 +313,8 @@ struct dentry *autofs4_expire_direct(str
* A tree is eligible if :-
* - it is unused by any user process
* - it has been unused for exp_timeout time
+ * This seems to be racy dropping dcache_lock and asking for next->next after
+ * the lock has been dropped.
*/
struct dentry *autofs4_expire_indirect(struct super_block *sb,
struct vfsmount *mnt,
@@ -316,6 +337,7 @@ struct dentry *autofs4_expire_indirect(s
timeout = sbi->exp_timeout;

spin_lock(&dcache_lock);
+ spin_lock(&root->d_lock);
next = root->d_subdirs.next;

/* On exit from the loop expire is set to a dgot dentry
@@ -329,7 +351,10 @@ struct dentry *autofs4_expire_indirect(s
continue;
}

- dentry = dget(dentry);
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+ dentry = dget_dlock(dentry);
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&root->d_lock);
spin_unlock(&dcache_lock);

spin_lock(&sbi->fs_lock);
@@ -396,8 +421,10 @@ next:
spin_unlock(&sbi->fs_lock);
dput(dentry);
spin_lock(&dcache_lock);
+ spin_lock(&root->d_lock);
next = next->next;
}
+ spin_unlock(&root->d_lock);
spin_unlock(&dcache_lock);
return NULL;

@@ -409,7 +436,11 @@ found:
init_completion(&ino->expire_complete);
spin_unlock(&sbi->fs_lock);
spin_lock(&dcache_lock);
- list_move(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
+ spin_lock(&expired->d_parent->d_lock);
+ spin_lock_nested(&expired->d_lock, DENTRY_D_LOCK_NESTED);
+ list_move(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
+ spin_unlock(&expired->d_lock);
+ spin_unlock(&expired->d_parent->d_lock);
spin_unlock(&dcache_lock);
return expired;
}
Index: linux-2.6/fs/autofs4/root.c
===================================================================
--- linux-2.6.orig/fs/autofs4/root.c
+++ linux-2.6/fs/autofs4/root.c
@@ -135,10 +135,13 @@ static int autofs4_dir_open(struct inode
* it.
*/
spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
if (!d_mountpoint(dentry) && list_empty(&dentry->d_subdirs)) {
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
return -ENOENT;
}
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);

out:
@@ -246,7 +249,9 @@ static void *autofs4_follow_link(struct
lookup_type = autofs4_need_mount(nd->flags);
spin_lock(&sbi->fs_lock);
spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
if (!(lookup_type || ino->flags & AUTOFS_INF_PENDING)) {
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
spin_unlock(&sbi->fs_lock);
goto follow;
@@ -268,6 +273,7 @@ static void *autofs4_follow_link(struct

goto follow;
}
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
spin_unlock(&sbi->fs_lock);
follow:
Index: linux-2.6/fs/coda/cache.c
===================================================================
--- linux-2.6.orig/fs/coda/cache.c
+++ linux-2.6/fs/coda/cache.c
@@ -87,6 +87,7 @@ static void coda_flag_children(struct de
struct dentry *de;

spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);
list_for_each(child, &parent->d_subdirs)
{
de = list_entry(child, struct dentry, d_u.d_child);
@@ -95,6 +96,7 @@ static void coda_flag_children(struct de
continue;
coda_flag_inode(de->d_inode, flag);
}
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
return;
}
Index: linux-2.6/fs/ncpfs/dir.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/dir.c
+++ linux-2.6/fs/ncpfs/dir.c
@@ -365,6 +365,7 @@ ncp_dget_fpos(struct dentry *dentry, str

/* If a pointer is invalid, we search the dentry. */
spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
dent = list_entry(next, struct dentry, d_u.d_child);
@@ -373,11 +374,13 @@ ncp_dget_fpos(struct dentry *dentry, str
dget_locked(dent);
else
dent = NULL;
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
goto out;
}
next = next->next;
}
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
return NULL;

Index: linux-2.6/fs/ncpfs/ncplib_kernel.h
===================================================================
--- linux-2.6.orig/fs/ncpfs/ncplib_kernel.h
+++ linux-2.6/fs/ncpfs/ncplib_kernel.h
@@ -193,6 +193,7 @@ ncp_renew_dentries(struct dentry *parent
struct dentry *dentry;

spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -204,6 +205,7 @@ ncp_renew_dentries(struct dentry *parent

next = next->next;
}
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
}

@@ -215,6 +217,7 @@ ncp_invalidate_dircache_entries(struct d
struct dentry *dentry;

spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -222,6 +225,7 @@ ncp_invalidate_dircache_entries(struct d
ncp_age_dentry(server, dentry);
next = next->next;
}
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
}

Index: linux-2.6/fs/smbfs/cache.c
===================================================================
--- linux-2.6.orig/fs/smbfs/cache.c
+++ linux-2.6/fs/smbfs/cache.c
@@ -63,6 +63,7 @@ smb_invalidate_dircache_entries(struct d
struct dentry *dentry;

spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -70,6 +71,7 @@ smb_invalidate_dircache_entries(struct d
smb_age_dentry(server, dentry);
next = next->next;
}
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
}

@@ -97,6 +99,7 @@ smb_dget_fpos(struct dentry *dentry, str

/* If a pointer is invalid, we search the dentry. */
spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
dent = list_entry(next, struct dentry, d_u.d_child);
@@ -111,6 +114,7 @@ smb_dget_fpos(struct dentry *dentry, str
}
dent = NULL;
out_unlock:
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
return dent;
}
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -870,23 +870,31 @@ static void cgroup_clear_directory(struc

BUG_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
spin_lock(&dcache_lock);
+ spin_lock(&dentry->d_lock);
node = dentry->d_subdirs.next;
while (node != &dentry->d_subdirs) {
struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
+
+ spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
list_del_init(node);
if (d->d_inode) {
/* This should never be called on a cgroup
* directory with child cgroups */
BUG_ON(d->d_inode->i_mode & S_IFDIR);
- d = dget_locked(d);
+ dget_locked_dlock(d);
+ spin_unlock(&d->d_lock);
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
d_delete(d);
simple_unlink(dentry->d_inode, d);
dput(d);
spin_lock(&dcache_lock);
- }
+ spin_lock(&dentry->d_lock);
+ } else
+ spin_unlock(&d->d_lock);
node = dentry->d_subdirs.next;
}
+ spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
}

@@ -895,10 +903,17 @@ static void cgroup_clear_directory(struc
*/
static void cgroup_d_remove_dir(struct dentry *dentry)
{
+ struct dentry *parent;
+
cgroup_clear_directory(dentry);

spin_lock(&dcache_lock);
+ parent = dentry->d_parent;
+ spin_lock(&parent->d_lock);
+ spin_lock(&dentry->d_lock);
list_del_init(&dentry->d_u.d_child);
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
remove_dir(dentry);
}
Index: linux-2.6/security/selinux/selinuxfs.c
===================================================================
--- linux-2.6.orig/security/selinux/selinuxfs.c
+++ linux-2.6/security/selinux/selinuxfs.c
@@ -942,22 +942,30 @@ static void sel_remove_entries(struct de
struct list_head *node;

spin_lock(&dcache_lock);
+ spin_lock(&de->d_lock);
node = de->d_subdirs.next;
while (node != &de->d_subdirs) {
struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
+
+ spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
list_del_init(node);

if (d->d_inode) {
- d = dget_locked(d);
+ dget_locked_dlock(d);
+ spin_unlock(&de->d_lock);
+ spin_unlock(&d->d_lock);
spin_unlock(&dcache_lock);
d_delete(d);
simple_unlink(de->d_inode, d);
dput(d);
spin_lock(&dcache_lock);
- }
+ spin_lock(&de->d_lock);
+ } else
+ spin_unlock(&d->d_lock);
node = de->d_subdirs.next;
}

+ spin_unlock(&de->d_lock);
spin_unlock(&dcache_lock);
}

Index: linux-2.6/fs/notify/fsnotify.c
===================================================================
--- linux-2.6.orig/fs/notify/fsnotify.c
+++ linux-2.6/fs/notify/fsnotify.c
@@ -62,17 +62,19 @@ void __fsnotify_update_child_dentry_flag
/* run all of the children of the original inode and fix their
* d_flags to indicate parental interest (their parent is the
* original inode) */
+ spin_lock(&alias->d_lock);
list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
if (!child->d_inode)
continue;

- spin_lock(&child->d_lock);
+ spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
if (watched)
child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
else
child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
spin_unlock(&child->d_lock);
}
+ spin_unlock(&alias->d_lock);
}
spin_unlock(&dcache_lock);
}
Index: linux-2.6/fs/ceph/dir.c
===================================================================
--- linux-2.6.orig/fs/ceph/dir.c
+++ linux-2.6/fs/ceph/dir.c
@@ -112,6 +112,7 @@ static int __dcache_readdir(struct file
last);

spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);

/* start at beginning? */
if (filp->f_pos == 2 || (last &&
@@ -135,7 +136,7 @@ more:
fi->at_end = 1;
goto out_unlock;
}
- spin_lock(&dentry->d_lock);
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
if (!d_unhashed(dentry) && dentry->d_inode &&
ceph_snap(dentry->d_inode) != CEPH_SNAPDIR &&
ceph_ino(dentry->d_inode) != CEPH_INO_CEPH &&
@@ -153,6 +154,7 @@ more:

dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
spin_unlock(&inode->i_lock);

@@ -177,6 +179,7 @@ more:

spin_lock(&inode->i_lock);
spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);

last = dentry;

@@ -193,6 +196,7 @@ more:
err = -EAGAIN;

out_unlock:
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);

if (last) {
Index: linux-2.6/fs/ceph/inode.c
===================================================================
--- linux-2.6.orig/fs/ceph/inode.c
+++ linux-2.6/fs/ceph/inode.c
@@ -826,11 +826,13 @@ static void ceph_set_dentry_offset(struc
spin_unlock(&inode->i_lock);

spin_lock(&dcache_lock);
- spin_lock(&dn->d_lock);
+ spin_lock(&dir->d_lock);
+ spin_lock_nested(&dn->d_lock, DENTRY_D_LOCK_NESTED);
list_move(&dn->d_u.d_child, &dir->d_subdirs);
dout("set_dentry_offset %p %lld (%p %p)\n", dn, di->offset,
dn->d_u.d_child.prev, dn->d_u.d_child.next);
spin_unlock(&dn->d_lock);
+ spin_unlock(&dir->d_lock);
spin_unlock(&dcache_lock);
}

@@ -1212,9 +1214,11 @@ retry_lookup:
} else {
/* reorder parent's d_subdirs */
spin_lock(&dcache_lock);
- spin_lock(&dn->d_lock);
+ spin_lock(&parent->d_lock);
+ spin_lock_nested(&dn->d_lock, DENTRY_D_LOCK_NESTED);
list_move(&dn->d_u.d_child, &parent->d_subdirs);
spin_unlock(&dn->d_lock);
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
}
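
The recurring shape of the conversions in this patch, when iterating a
directory's children, is to take the parent's d_lock first and each
child's d_lock with the nested subclass. Distilled from the hunks above
rather than copied from any one of them:

	spin_lock(&parent->d_lock);
	list_for_each_entry(child, &parent->d_subdirs, d_u.d_child) {
		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
		/* child's membership of d_subdirs is stable here */
		spin_unlock(&child->d_lock);
	}
	spin_unlock(&parent->d_lock);

Parent and child share a lock class, so the child side uses
spin_lock_nested() to tell lockdep the two acquisitions are ordered
rather than recursive.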
Peter Zijlstra
2010-06-24 07:56:01 UTC
Permalink
plain text document attachment (fs-dcache-scale-d_subdirs.patch)
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
XXX: probably don't need parent lock in inotify (because child lock
should stabilize parent). Also, possibly some filesystems don't need so
much locking (eg. of child dentry when modifying d_child, so long as
parent is locked)... but be on the safe side. Hmm, maybe we should just
say d_child list is protected by d_parent->d_lock. d_parent could remain
protected with d_lock.
XXX: leave dcache_lock in there until remove dcache_lock patch
This still suffers from the problem John found, right?
Andi Kleen
2010-06-24 09:50:17 UTC
Permalink
Post by n***@suse.de
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
Different locking for different file systems seems a bit confusing.
Could they be unified?

-Andi
--
***@linux.intel.com -- Speaking for myself only.
Nick Piggin
2010-06-24 15:53:49 UTC
Permalink
Post by Andi Kleen
Post by n***@suse.de
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
Different locking for different file systems seems a bit confusing.
Could they be unified?
Yes, well, it is a bit misleading. d_subdirs should always be modified under
the spinlocks, but some filesystems use i_mutex for read access,
which should be fine (and should not require knowledge of any other code).
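
As a sketch of that last point (a hypothetical filesystem, not code from
the series): if every path that modifies d_subdirs holds i_mutex in
addition to the dentry spinlocks, a read-only walk can rely on i_mutex
alone:

	static int count_children(struct dentry *dir)
	{
		struct dentry *child;
		int n = 0;

		/* writers hold i_mutex too, so no d_lock is needed here */
		mutex_lock(&dir->d_inode->i_mutex);
		list_for_each_entry(child, &dir->d_subdirs, d_u.d_child)
			if (child->d_inode)
				n++;
		mutex_unlock(&dir->d_inode->i_mutex);
		return n;
	}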
n***@suse.de
2010-06-24 03:02:27 UTC
Permalink
Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/affs/amigaffs.c | 2 +
fs/dcache.c | 56 +++++++++++++++++++++++++++++++++++++++-----
fs/exportfs/expfs.c | 4 +++
fs/nfs/getroot.c | 4 +++
fs/notify/fsnotify.c | 2 +
fs/notify/inotify/inotify.c | 2 +
fs/ocfs2/dcache.c | 3 +-
fs/sysfs/dir.c | 3 ++
include/linux/dcache.h | 1
9 files changed, 70 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -37,6 +37,8 @@

/*
* Usage:
+ * dcache_inode_lock protects:
+ * - i_dentry, d_alias, d_inode
* dcache_hash_lock protects:
* - the dcache hash table
* dcache_lru_lock protects:
@@ -49,12 +51,14 @@
* - d_unhashed()
* - d_parent and d_subdirs
* - childrens' d_child and d_parent
+ * - d_alias, d_inode
*
* Ordering:
* dcache_lock
- * dentry->d_lock
- * dcache_lru_lock
- * dcache_hash_lock
+ * dcache_inode_lock
+ * dentry->d_lock
+ * dcache_lru_lock
+ * dcache_hash_lock
*
* If there is an ancestor relationship:
* dentry->d_parent->...->d_parent->d_lock
@@ -70,11 +74,13 @@
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

+EXPORT_SYMBOL(dcache_inode_lock);
EXPORT_SYMBOL(dcache_hash_lock);
EXPORT_SYMBOL(dcache_lock);

@@ -139,6 +145,7 @@ static void d_free(struct dentry *dentry
*/
static void dentry_iput(struct dentry * dentry)
__releases(dentry->d_lock)
+ __releases(dcache_inode_lock)
__releases(dcache_lock)
{
struct inode *inode = dentry->d_inode;
@@ -146,6 +153,7 @@ static void dentry_iput(struct dentry *
dentry->d_inode = NULL;
list_del_init(&dentry->d_alias);
spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
if (!inode->i_nlink)
fsnotify_inoderemove(inode);
@@ -155,6 +163,7 @@ static void dentry_iput(struct dentry *
iput(inode);
} else {
spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
}
}
@@ -226,6 +235,7 @@ static void dentry_lru_del_init(struct d
*/
static struct dentry *d_kill(struct dentry *dentry)
__releases(dentry->d_lock)
+ __releases(dcache_inode_lock)
__releases(dcache_lock)
{
struct dentry *parent;
@@ -290,15 +300,20 @@ repeat:
* want to reduce dcache_lock anyway so this will
* get improved.
*/
+drop1:
spin_unlock(&dentry->d_lock);
goto repeat;
}
+ if (!spin_trylock(&dcache_inode_lock)) {
+drop2:
+ spin_unlock(&dcache_lock);
+ goto drop1;
+ }
parent = dentry->d_parent;
if (parent && parent != dentry) {
if (!spin_trylock(&parent->d_lock)) {
- spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
- goto repeat;
+ spin_unlock(&dcache_inode_lock);
+ goto drop2;
}
}
}
@@ -328,6 +343,7 @@ repeat:
spin_unlock(&dentry->d_lock);
if (parent && parent != dentry)
spin_unlock(&parent->d_lock);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
return;

@@ -510,7 +526,9 @@ struct dentry * d_find_alias(struct inod

if (!list_empty(&inode->i_dentry)) {
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
de = __d_find_alias(inode, 0);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
}
return de;
@@ -526,18 +544,21 @@ void d_prune_aliases(struct inode *inode
struct dentry *dentry;
restart:
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
spin_lock(&dentry->d_lock);
if (!dentry->d_count) {
__dget_locked_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
dput(dentry);
goto restart;
}
spin_unlock(&dentry->d_lock);
}
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
}
EXPORT_SYMBOL(d_prune_aliases);
@@ -566,6 +587,7 @@ static void prune_one_dentry(struct dent
struct dentry *parent = NULL;

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
again:
spin_lock(&dentry->d_lock);
if (dentry->d_parent && dentry != dentry->d_parent) {
@@ -580,6 +602,7 @@ again:
if (parent)
spin_unlock(&parent->d_lock);
spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
return;
}
@@ -650,6 +673,7 @@ restart:
spin_unlock(&dcache_lru_lock);

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
again:
spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
while (!list_empty(&tmp)) {
@@ -681,8 +705,10 @@ again1:
prune_one_dentry(dentry);
/* dcache_lock and dentry->d_lock dropped */
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
spin_lock(&dcache_lru_lock);
}
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);

if (count == NULL && !list_empty(&sb->s_dentry_lru))
@@ -1198,9 +1224,11 @@ EXPORT_SYMBOL(d_alloc_name);
/* the caller must hold dcache_lock */
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
+ spin_lock(&dentry->d_lock);
if (inode)
list_add(&dentry->d_alias, &inode->i_dentry);
dentry->d_inode = inode;
+ spin_unlock(&dentry->d_lock);
fsnotify_d_instantiate(dentry, inode);
}

@@ -1223,7 +1251,9 @@ void d_instantiate(struct dentry *entry,
{
BUG_ON(!list_empty(&entry->d_alias));
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
__d_instantiate(entry, inode);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
security_d_instantiate(entry, inode);
}
@@ -1284,7 +1314,9 @@ struct dentry *d_instantiate_unique(stru
BUG_ON(!list_empty(&entry->d_alias));

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
result = __d_instantiate_unique(entry, inode);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);

if (!result) {
@@ -1375,8 +1407,10 @@ struct dentry *d_obtain_alias(struct ino
tmp->d_parent = tmp; /* make sure dput doesn't croak */

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
res = __d_find_alias(inode, 0);
if (res) {
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
dput(tmp);
goto out_iput;
@@ -1391,6 +1425,7 @@ struct dentry *d_obtain_alias(struct ino
list_add(&tmp->d_alias, &inode->i_dentry);
hlist_add_head(&tmp->d_hash, &inode->i_sb->s_anon);
spin_unlock(&tmp->d_lock);
+ spin_unlock(&dcache_inode_lock);

spin_unlock(&dcache_lock);
return tmp;
@@ -1423,9 +1458,11 @@ struct dentry *d_splice_alias(struct ino

if (inode && S_ISDIR(inode->i_mode)) {
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
new = __d_find_alias(inode, 1);
if (new) {
BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
security_d_instantiate(new, inode);
d_move(new, dentry);
@@ -1433,6 +1470,7 @@ struct dentry *d_splice_alias(struct ino
} else {
/* already taking dcache_lock, so d_add() by hand */
__d_instantiate(dentry, inode);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
security_d_instantiate(dentry, inode);
d_rehash(dentry);
@@ -1507,8 +1545,10 @@ struct dentry *d_add_ci(struct dentry *d
* already has a dentry.
*/
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
__d_instantiate(found, inode);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
security_d_instantiate(found, inode);
return found;
@@ -1520,6 +1560,7 @@ struct dentry *d_add_ci(struct dentry *d
*/
new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
dget_locked(new);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
security_d_instantiate(found, inode);
d_move(new, found);
@@ -1738,6 +1779,7 @@ void d_delete(struct dentry * dentry)
* Are we the only user?
*/
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
spin_lock(&dentry->d_lock);
isdir = S_ISDIR(dentry->d_inode->i_mode);
if (dentry->d_count == 1) {
@@ -1751,6 +1793,7 @@ void d_delete(struct dentry * dentry)
__d_drop(dentry);

spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);

fsnotify_nameremove(dentry, isdir);
@@ -2003,6 +2046,7 @@ out_unalias:
d_move_locked(alias, dentry);
ret = alias;
out_err:
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
if (m2)
mutex_unlock(m2);
@@ -2068,6 +2112,7 @@ struct dentry *d_materialise_unique(stru
BUG_ON(!d_unhashed(dentry));

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);

if (!inode) {
actual = dentry;
@@ -2111,6 +2156,7 @@ found:
_d_rehash(actual);
spin_unlock(&dcache_hash_lock);
spin_unlock(&actual->d_lock);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
out_nolock:
if (actual == dentry) {
@@ -2122,6 +2168,7 @@ out_nolock:
return actual;

shouldnt_be_hashed:
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
BUG();
}
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -188,6 +188,7 @@ d_iput: no no no yes

#define DCACHE_CANT_MOUNT 0x0100

+extern spinlock_t dcache_inode_lock;
extern spinlock_t dcache_hash_lock;
extern spinlock_t dcache_lock;
extern seqlock_t rename_lock;
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -182,6 +182,7 @@ static void set_dentry_child_flags(struc
struct dentry *alias;

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
list_for_each_entry(alias, &inode->i_dentry, d_alias) {
struct dentry *child;

@@ -199,6 +200,7 @@ static void set_dentry_child_flags(struc
}
spin_unlock(&alias->d_lock);
}
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
}

Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -48,8 +48,10 @@ find_acceptable_alias(struct dentry *res
return result;

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
dget_locked(dentry);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
if (toput)
dput(toput);
@@ -58,8 +60,10 @@ find_acceptable_alias(struct dentry *res
return dentry;
}
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
toput = dentry;
}
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);

if (toput)
Index: linux-2.6/fs/affs/amigaffs.c
===================================================================
--- linux-2.6.orig/fs/affs/amigaffs.c
+++ linux-2.6/fs/affs/amigaffs.c
@@ -129,6 +129,7 @@ affs_fix_dcache(struct dentry *dentry, u
struct list_head *head, *next;

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
head = &inode->i_dentry;
next = head->next;
while (next != head) {
@@ -139,6 +140,7 @@ affs_fix_dcache(struct dentry *dentry, u
}
next = next->next;
}
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
}

Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -152,7 +152,7 @@ struct dentry *ocfs2_find_local_alias(st
struct dentry *dentry = NULL;

spin_lock(&dcache_lock);
-
+ spin_lock(&dcache_inode_lock);
list_for_each(p, &inode->i_dentry) {
dentry = list_entry(p, struct dentry, d_alias);

@@ -170,6 +170,7 @@ struct dentry *ocfs2_find_local_alias(st
dentry = NULL;
}

+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);

return dentry;
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c
+++ linux-2.6/fs/nfs/getroot.c
@@ -65,7 +65,11 @@ static int nfs_superblock_set_dummy_root
* Oops, since the test for IS_ROOT() will fail.
*/
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
+ spin_lock(&sb->s_root->d_lock);
list_del_init(&sb->s_root->d_alias);
+ spin_unlock(&sb->s_root->d_lock);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
}
return 0;
Index: linux-2.6/fs/notify/fsnotify.c
===================================================================
--- linux-2.6.orig/fs/notify/fsnotify.c
+++ linux-2.6/fs/notify/fsnotify.c
@@ -54,6 +54,7 @@ void __fsnotify_update_child_dentry_flag
watched = fsnotify_inode_watches_children(inode);

spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
/* run all of the dentries associated with this inode. Since this is a
* directory, there damn well better only be one item on this list */
list_for_each_entry(alias, &inode->i_dentry, d_alias) {
@@ -76,6 +77,7 @@ void __fsnotify_update_child_dentry_flag
}
spin_unlock(&alias->d_lock);
}
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
}
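
To summarise the nesting this patch establishes for the alias list,
dcache_lock -> dcache_inode_lock -> dentry->d_lock; a reader of i_dentry
under the new rules looks roughly like this (distilled from the
fsnotify/inotify hunks above, not verbatim):

	spin_lock(&dcache_lock);
	spin_lock(&dcache_inode_lock);
	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
		spin_lock(&alias->d_lock);
		/* alias->d_inode and its d_alias linkage are stable here */
		spin_unlock(&alias->d_lock);
	}
	spin_unlock(&dcache_inode_lock);
	spin_unlock(&dcache_lock);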



n***@suse.de
2010-06-24 03:02:46 UTC
Permalink
Remove the global inode_lock, as it doesn't protect anything.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/buffer.c | 2 -
fs/drop_caches.c | 4 --
fs/fs-writeback.c | 35 ++++-------------------
fs/hugetlbfs/inode.c | 2 -
fs/inode.c | 65 ++------------------------------------------
fs/notify/inode_mark.c | 14 +++++----
fs/notify/inotify/inotify.c | 16 ++++++----
fs/quota/dquot.c | 6 ----
include/linux/writeback.h | 1
mm/backing-dev.c | 4 --
10 files changed, 30 insertions(+), 119 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1144,7 +1144,7 @@ __getblk_slow(struct block_device *bdev,
* inode list.
*
* mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * and mapping->tree_lock.
*/
void mark_buffer_dirty(struct buffer_head *bh)
{
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -16,7 +16,6 @@ static void drop_pagecache_sb(struct sup
{
struct inode *inode, *toput_inode = NULL;

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
@@ -28,15 +27,12 @@ static void drop_pagecache_sb(struct sup
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
iput(toput_inode);
}

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -312,7 +312,7 @@ static void requeue_io(struct inode *ino
static void inode_sync_complete(struct inode *inode)
{
/*
- * Prevent speculative execution through spin_unlock(&inode_lock);
+ * Prevent speculative execution through spin_unlock(&inode->i_lock);
*/
smp_mb();
wake_up_bit(&inode->i_state, __I_SYNC);
@@ -404,9 +404,7 @@ static void inode_wait_for_writeback(str
while (inode->i_state & I_SYNC) {
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
}
@@ -464,7 +462,6 @@ writeback_single_inode(struct inode *ino
inode->i_state &= ~I_DIRTY_PAGES;
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);

ret = do_writepages(mapping, wbc);

@@ -484,12 +481,10 @@ writeback_single_inode(struct inode *ino
* due to delalloc, clear dirty metadata flags right before
* write_inode()
*/
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY;
inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
int err = write_inode(inode, wbc);
@@ -497,7 +492,6 @@ writeback_single_inode(struct inode *ino
ret = err;
}

- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
inode->i_state &= ~I_SYNC;
@@ -679,10 +673,8 @@ again:
}
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
iput(inode);
cond_resched();
- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
if (wbc->nr_to_write <= 0) {
wbc->more_io = 1;
@@ -701,7 +693,6 @@ static void writeback_inodes_wb(struct b
int ret = 0;

wbc->wb_start = jiffies; /* livelock avoidance */
- spin_lock(&inode_lock);
again:
spin_lock(&wb_inode_list_lock);

@@ -744,7 +735,6 @@ again:
break;
}
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
/* Leave any unwritten inodes on b_io */
}

@@ -857,21 +847,18 @@ static long wb_writeback(struct bdi_writ
* we'll just busyloop.
*/
retry:
- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
struct inode, i_list);
if (!spin_trylock(&inode->i_lock)) {
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
goto retry;
}
inode_wait_for_writeback(inode);
spin_unlock(&inode->i_lock);
}
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
}

return wrote;
@@ -1141,7 +1128,6 @@ void __mark_inode_dirty(struct inode *in
if (unlikely(block_dump))
block_dump___mark_inode_dirty(inode);

- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
@@ -1190,7 +1176,6 @@ void __mark_inode_dirty(struct inode *in
}
out:
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__mark_inode_dirty);

@@ -1221,7 +1206,6 @@ static void wait_sb_inodes(struct super_
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);

/*
@@ -1246,14 +1230,12 @@ static void wait_sb_inodes(struct super_
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
/*
- * We hold a reference to 'inode' so it couldn't have
- * been removed from s_inodes list while we dropped the
- * inode_lock. We cannot iput the inode now as we can
- * be holding the last reference and we cannot iput it
- * under inode_lock. So we keep the reference and iput
- * it later.
+ * We hold a reference to 'inode' so it couldn't have been
+ * removed from s_inodes list while we dropped the
+ * sb_inode_list_lock. We cannot iput the inode now as we can
+ * be holding the last reference and we cannot iput it under
+ * spinlock. So we keep the reference and iput it later.
*/
iput(old_inode);
old_inode = inode;
@@ -1262,11 +1244,9 @@ static void wait_sb_inodes(struct super_

cond_resched();

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
iput(old_inode);
}

@@ -1348,13 +1328,11 @@ int write_inode_now(struct inode *inode,
wbc.nr_to_write = 0;

might_sleep();
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
ret = writeback_single_inode(inode, &wbc);
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
if (sync)
inode_sync_wait(inode);
return ret;
@@ -1376,13 +1354,11 @@ int sync_inode(struct inode *inode, stru
{
int ret;

- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
ret = writeback_single_inode(inode, wbc);
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
return ret;
}
EXPORT_SYMBOL(sync_inode);
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -377,7 +377,7 @@ static void hugetlbfs_delete_inode(struc
clear_inode(inode);
}

-static void hugetlbfs_forget_inode(struct inode *inode) __releases(inode_lock)
+static void hugetlbfs_forget_inode(struct inode *inode)
{
if (generic_detach_inode(inode)) {
truncate_hugepages(inode, 0);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -105,7 +105,6 @@ static struct hlist_head *inode_hashtabl
* NOTE! You also have to own the lock if you change
* the i_state of an inode while it is in use..
*/
-DEFINE_SPINLOCK(inode_lock);
DEFINE_SPINLOCK(sb_inode_list_lock);
DEFINE_SPINLOCK(wb_inode_list_lock);
DEFINE_SPINLOCK(inode_hash_lock);
@@ -376,16 +375,14 @@ static void dispose_list(struct list_hea
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
list_del_init(&inode->i_sb_list);
- spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
+ spin_unlock(&sb_inode_list_lock);

wake_up_inode(inode);
destroy_inode(inode);
@@ -413,7 +410,6 @@ static int invalidate_list(struct list_h
* change during umount anymore, and because iprune_sem keeps
* shrink_icache_memory() away.
*/
- cond_resched_lock(&inode_lock);
cond_resched_lock(&sb_inode_list_lock);

next = next->next;
@@ -458,13 +454,11 @@ int invalidate_inodes(struct super_block
LIST_HEAD(throw_away);

down_write(&iprune_sem);
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
inotify_unmount_inodes(&sb->s_inodes);
fsnotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(&sb->s_inodes, &throw_away);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);

dispose_list(&throw_away);
up_write(&iprune_sem);
@@ -507,7 +501,6 @@ static void prune_icache(int nr_to_scan)
unsigned long reap = 0;

down_read(&iprune_sem);
- spin_lock(&inode_lock);
again:
spin_lock(&wb_inode_list_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
@@ -531,12 +524,10 @@ again:
spin_unlock(&wb_inode_list_lock);
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
- spin_lock(&inode_lock);
again2:
spin_lock(&wb_inode_list_lock);

@@ -563,7 +554,6 @@ again2:
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
__count_vm_events(PGINODESTEAL, reap);
- spin_unlock(&inode_lock);
spin_unlock(&wb_inode_list_lock);

dispose_list(&freeable);
@@ -714,12 +704,10 @@ void inode_add_to_lists(struct super_blo
{
struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
__inode_add_to_lists(sb, head, inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);

@@ -745,18 +733,14 @@ struct inode *new_inode(struct super_blo
static atomic_t last_ino = ATOMIC_INIT(0);
struct inode *inode;

- spin_lock_prefetch(&inode_lock);
-
inode = alloc_inode(sb);
if (inode) {
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
inode->i_ino = atomic_inc_return(&last_ino);
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
return inode;
}
@@ -815,7 +799,6 @@ static struct inode *get_new_inode(struc
if (inode) {
struct inode *old;

- spin_lock(&inode_lock);
/* We released the lock, so.. */
old = find_inode(sb, head, test, data);
if (!old) {
@@ -827,7 +810,6 @@ static struct inode *get_new_inode(struc
inode->i_state = I_NEW;
__inode_add_to_lists(sb, head, inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);

/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -842,7 +824,6 @@ static struct inode *get_new_inode(struc
*/
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
wait_on_inode(inode);
@@ -852,7 +833,6 @@ static struct inode *get_new_inode(struc
set_failed:
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
destroy_inode(inode);
return NULL;
}
@@ -870,7 +850,6 @@ static struct inode *get_new_inode_fast(
if (inode) {
struct inode *old;

- spin_lock(&inode_lock);
/* We released the lock, so.. */
old = find_inode_fast(sb, head, ino);
if (!old) {
@@ -880,7 +859,6 @@ static struct inode *get_new_inode_fast(
inode->i_state = I_NEW;
__inode_add_to_lists(sb, head, inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);

/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -895,7 +873,6 @@ static struct inode *get_new_inode_fast(
*/
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
wait_on_inode(inode);
@@ -945,7 +922,6 @@ ino_t iunique(struct super_block *sb, in
struct hlist_head *head;
ino_t res;

- spin_lock(&inode_lock);
spin_lock(&unique_lock);
do {
if (counter <= max_reserved)
@@ -954,7 +930,6 @@ ino_t iunique(struct super_block *sb, in
head = inode_hashtable + hash(sb, res);
} while (!test_inode_iunique(sb, head, res));
spin_unlock(&unique_lock);
- spin_unlock(&inode_lock);

return res;
}
@@ -964,7 +939,6 @@ struct inode *igrab(struct inode *inode)
{
struct inode *ret = inode;

- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
if (!(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)))
__iget(inode);
@@ -976,7 +950,6 @@ struct inode *igrab(struct inode *inode)
*/
ret = NULL;
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);

return ret;
}
@@ -1007,17 +980,14 @@ static struct inode *ifind(struct super_
{
struct inode *inode;

- spin_lock(&inode_lock);
inode = find_inode(sb, head, test, data);
if (inode) {
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
if (likely(wait))
wait_on_inode(inode);
return inode;
}
- spin_unlock(&inode_lock);
return NULL;
}

@@ -1041,16 +1011,13 @@ static struct inode *ifind_fast(struct s
{
struct inode *inode;

- spin_lock(&inode_lock);
inode = find_inode_fast(sb, head, ino);
if (inode) {
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
wait_on_inode(inode);
return inode;
}
- spin_unlock(&inode_lock);
return NULL;
}

@@ -1214,7 +1181,6 @@ int insert_inode_locked(struct inode *in
struct hlist_node *node;
struct inode *old = NULL;

- spin_lock(&inode_lock);
repeat:
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
@@ -1233,13 +1199,11 @@ repeat:
if (likely(!node)) {
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_hash_lock);
- spin_unlock(&inode_lock);
return 0;
}
spin_unlock(&inode_hash_lock);
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
iput(old);
@@ -1262,7 +1226,6 @@ int insert_inode_locked4(struct inode *i
struct hlist_node *node;
struct inode *old = NULL;

- spin_lock(&inode_lock);
repeat:
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
@@ -1281,13 +1244,11 @@ repeat:
if (likely(!node)) {
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_hash_lock);
- spin_unlock(&inode_lock);
return 0;
}
spin_unlock(&inode_hash_lock);
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
iput(old);
@@ -1310,13 +1271,11 @@ void __insert_inode_hash(struct inode *i
{
struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);

- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_hash_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);

@@ -1328,13 +1287,11 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
void remove_inode_hash(struct inode *inode)
{
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(remove_inode_hash);

@@ -1362,7 +1319,6 @@ void generic_delete_inode(struct inode *
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
atomic_dec(&inodes_stat.nr_inodes);

if (op->delete_inode) {
@@ -1376,13 +1332,11 @@ void generic_delete_inode(struct inode *
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
}
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
wake_up_inode(inode);
BUG_ON(inode->i_state != I_CLEAR);
destroy_inode(inode);
@@ -1412,16 +1366,13 @@ int generic_detach_inode(struct inode *i
if (sb->s_flags & MS_ACTIVE) {
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
return 0;
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
write_inode_now(inode, 1);
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
@@ -1439,7 +1390,6 @@ int generic_detach_inode(struct inode *i
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
atomic_dec(&inodes_stat.nr_inodes);
return 1;
}
@@ -1505,17 +1455,12 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state == I_CLEAR);

-retry1:
+retry:
spin_lock(&inode->i_lock);
if (inode->i_count == 1) {
- if (!spin_trylock(&inode_lock)) {
-retry2:
- spin_unlock(&inode->i_lock);
- goto retry1;
- }
if (!spin_trylock(&sb_inode_list_lock)) {
- spin_unlock(&inode_lock);
- goto retry2;
+ spin_unlock(&inode->i_lock);
+ goto retry;
}
inode->i_count--;
iput_final(inode);
@@ -1713,10 +1658,8 @@ static void __wait_on_freeing_inode(stru
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
schedule();
finish_wait(wq, &wait.wait);
- spin_lock(&inode_lock);
}

static __initdata unsigned long ihash_entries;
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -390,13 +390,16 @@ void inotify_unmount_inodes(struct list_
struct inode *need_iput_tmp;
struct list_head *watches;

+ spin_lock(&inode->i_lock);
/*
* We cannot __iget() an inode in state I_CLEAR, I_FREEING,
* I_WILL_FREE, or I_NEW which is fine because by that point
* the inode cannot have any associated watches.
*/
- if (inode->i_state & (I_CLEAR|I_FREEING|I_WILL_FREE|I_NEW))
+ if (inode->i_state & (I_CLEAR|I_FREEING|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }

/*
* If i_count is zero, the inode cannot have any watches and
@@ -404,18 +407,21 @@ void inotify_unmount_inodes(struct list_
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- if (!inode->i_count)
+ if (!inode->i_count) {
+ spin_unlock(&inode->i_lock);
continue;
+ }

need_iput_tmp = need_iput;
need_iput = NULL;
/* In case inotify_remove_watch_locked() drops a reference. */
if (inode != need_iput_tmp) {
- spin_lock(&inode->i_lock);
__iget(inode);
- spin_unlock(&inode->i_lock);
} else
need_iput_tmp = NULL;
+
+ spin_unlock(&inode->i_lock);
+
/* In case the dropping of a reference would nuke next_i. */
if (&next_i->i_sb_list != list) {
spin_lock(&next_i->i_lock);
@@ -435,7 +441,6 @@ void inotify_unmount_inodes(struct list_
* iprune_mutex keeps shrink_icache_memory() away.
*/
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);

if (need_iput_tmp)
iput(need_iput_tmp);
@@ -455,7 +460,6 @@ void inotify_unmount_inodes(struct list_
mutex_unlock(&inode->inotify_mutex);
iput(inode);

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
}
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -9,7 +9,6 @@

struct backing_dev_info;

-extern spinlock_t inode_lock;
extern spinlock_t sb_inode_list_lock;
extern spinlock_t wb_inode_list_lock;
extern spinlock_t inode_hash_lock;
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -883,7 +883,6 @@ static void add_dquot_ref(struct super_b
int reserved = 0;
#endif

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
@@ -907,7 +906,6 @@ static void add_dquot_ref(struct super_b
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);

iput(old_inode);
__dquot_initialize(inode, type);
@@ -917,11 +915,9 @@ static void add_dquot_ref(struct super_b
* reference and we cannot iput it under inode_lock. So we
* keep the reference and iput it later. */
old_inode = inode;
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
iput(old_inode);

#ifdef CONFIG_QUOTA_DEBUG
@@ -999,7 +995,6 @@ static void remove_dquot_ref(struct supe
{
struct inode *inode;

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
/*
@@ -1012,7 +1007,6 @@ static void remove_dquot_ref(struct supe
remove_inode_dquot_ref(inode, type, tofree_head);
}
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
}

/* Gather all references from inodes and drop them */
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c
+++ linux-2.6/fs/notify/inode_mark.c
@@ -368,13 +368,16 @@ void fsnotify_unmount_inodes(struct list
list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
struct inode *need_iput_tmp;

+ spin_lock(&inode->i_lock);
/*
* We cannot __iget() an inode in state I_CLEAR, I_FREEING,
* I_WILL_FREE, or I_NEW which is fine because by that point
* the inode cannot have any associated watches.
*/
- if (inode->i_state & (I_CLEAR|I_FREEING|I_WILL_FREE|I_NEW))
+ if (inode->i_state & (I_CLEAR|I_FREEING|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }

/*
* If i_count is zero, the inode cannot have any watches and
@@ -382,19 +385,20 @@ void fsnotify_unmount_inodes(struct list
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- if (!inode->i_count)
+ if (!inode->i_count) {
+ spin_unlock(&inode->i_lock);
continue;
+ }

need_iput_tmp = need_iput;
need_iput = NULL;

/* In case fsnotify_inode_delete() drops a reference. */
if (inode != need_iput_tmp) {
- spin_lock(&inode->i_lock);
__iget(inode);
- spin_unlock(&inode->i_lock);
} else
need_iput_tmp = NULL;
+ spin_unlock(&inode->i_lock);

/* In case the dropping of a reference would nuke next_i. */
if (&next_i->i_sb_list != list) {
@@ -415,7 +419,6 @@ void fsnotify_unmount_inodes(struct list
* iprune_mutex keeps shrink_icache_memory() away.
*/
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);

if (need_iput_tmp)
iput(need_iput_tmp);
@@ -427,7 +430,6 @@ void fsnotify_unmount_inodes(struct list

iput(inode);

- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
}
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -77,7 +77,6 @@ static int bdi_debug_stats_show(struct s
* RCU on the reader side
*/
nr_wb = nr_dirty = nr_io = nr_more_io = 0;
- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
list_for_each_entry(wb, &bdi->wb_list, list) {
nr_wb++;
@@ -89,7 +88,6 @@ static int bdi_debug_stats_show(struct s
nr_more_io++;
}
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);

get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);

@@ -713,13 +711,11 @@ void bdi_destroy(struct backing_dev_info
if (bdi_has_dirty_io(bdi)) {
struct bdi_writeback *dst = &default_backing_dev_info.wb;

- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
list_splice(&bdi->wb.b_io, &dst->b_io);
list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
}

bdi_unregister(bdi);
n***@suse.de
2010-06-24 03:02:23 UTC
Permalink
Make dentry_stat_t.nr_dentry an atomic_t type, and move it from under
dcache_lock.
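
A minimal sketch of the resulting usage (not additional patch text): the counter
is updated with plain atomic ops, so neither allocation nor freeing needs
dcache_lock just for statistics, and readers can sample it locklessly:

	atomic_inc(&dentry_stat.nr_dentry);	/* in d_alloc() */
	atomic_dec(&dentry_stat.nr_dentry);	/* in d_free() */
	int nr = atomic_read(&dentry_stat.nr_dentry);	/* lockless reader */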

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 20 +++++++++-----------
include/linux/dcache.h | 4 ++--
kernel/sysctl.c | 6 ++++++
3 files changed, 17 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -88,6 +88,7 @@ static struct hlist_head *dentry_hashtab

/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
+ .nr_dentry = ATOMIC_INIT(0),
.age_limit = 45,
};

@@ -106,11 +107,11 @@ static void d_callback(struct rcu_head *
}

/*
- * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
*/
static void d_free(struct dentry *dentry)
{
+ atomic_dec(&dentry_stat.nr_dentry);
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
/* if dentry was never inserted into hash, immediate free is OK */
@@ -217,7 +218,6 @@ static struct dentry *d_kill(struct dent
struct dentry *parent;

list_del(&dentry->d_u.d_child);
- dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
if (IS_ROOT(dentry))
@@ -787,10 +787,7 @@ static void shrink_dcache_for_umount_sub
struct dentry, d_u.d_child);
}
out:
- /* several dentries were freed, need to correct nr_dentry */
- spin_lock(&dcache_lock);
- dentry_stat.nr_dentry -= detached;
- spin_unlock(&dcache_lock);
+ return;
}

/*
@@ -1047,11 +1044,12 @@ struct dentry *d_alloc(struct dentry * p
INIT_LIST_HEAD(&dentry->d_u.d_child);
}

- spin_lock(&dcache_lock);
- if (parent)
+ if (parent) {
+ spin_lock(&dcache_lock);
list_add(&dentry->d_u.d_child, &parent->d_subdirs);
- dentry_stat.nr_dentry++;
- spin_unlock(&dcache_lock);
+ spin_unlock(&dcache_lock);
+ }
+ atomic_inc(&dentry_stat.nr_dentry);

return dentry;
}
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -37,8 +37,8 @@ struct qstr {
};

struct dentry_stat_t {
- int nr_dentry;
- int nr_unused;
+ atomic_t nr_dentry;
+ int nr_unused; /* protected by dcache_lru_lock */
int age_limit; /* age in seconds */
int want_pages; /* pages requested by system */
int dummy[2];
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -1371,6 +1371,12 @@ static struct ctl_table fs_table[] = {
.extra2 = &sysctl_nr_open_max,
},
{
+ /*
+ * dentry_stat has an atomic_t member, so this is a bit of
+ * a hack, but it works for the moment, and I won't bother
+ * changing it now because we'll probably want to change to
+ * a more scalable counter anyway.
+ */
.procname = "dentry-state",
.data = &dentry_stat,
.maxlen = 6*sizeof(int),


n***@suse.de
2010-06-24 03:02:52 UTC
Permalink
Regardless of how much we try to scale dcache, there is likely always going to
be some fundamental contention when adding or removing children under the same
parent. Pseudo filesystems do not seem to need connected dentries, because by
definition they are disconnected.
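
As a rough usage sketch (names like 'myfs_mnt', 'name' and 'inode' are
placeholders, not taken from the patch), a pseudo filesystem switches from
allocating under the mount's s_root to the new helper, so no parent d_subdirs
manipulation or parent locking is involved:

	struct qstr this;
	struct dentry *dentry;

	this.name = name;
	this.len = strlen(name);
	this.hash = 0;
	dentry = d_alloc_pseudo(myfs_mnt->mnt_sb, &this);
	if (!dentry)
		return -ENOMEM;
	/* the dentry is DCACHE_DISCONNECTED and is its own parent */
	d_instantiate(dentry, inode);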

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/anon_inodes.c | 4 +++-
fs/pipe.c | 5 +++--
net/socket.c | 5 +++--
3 files changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c
+++ linux-2.6/net/socket.c
@@ -371,7 +371,7 @@ static int sock_alloc_file(struct socket
if (unlikely(fd < 0))
return fd;

- path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+ path.dentry = d_alloc_pseudo(sock_mnt->mnt_sb, &name);
if (unlikely(!path.dentry)) {
put_unused_fd(fd);
return -ENOMEM;
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c
+++ linux-2.6/fs/anon_inodes.c
@@ -104,7 +104,7 @@ struct file *anon_inode_getfile(const ch
this.name = name;
this.len = strlen(name);
this.hash = 0;
- path.dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+ path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
if (!path.dentry)
goto err_module;

Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c
+++ linux-2.6/fs/pipe.c
@@ -997,7 +997,7 @@ struct file *create_write_pipe(int flags
goto err;

err = -ENOMEM;
- path.dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+ path.dentry = d_alloc_pseudo(pipe_mnt->mnt_sb, &name);
if (!path.dentry)
goto err_inode;
path.mnt = mntget(pipe_mnt);
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1255,6 +1255,18 @@ struct dentry *d_alloc(struct dentry * p
}
EXPORT_SYMBOL(d_alloc);

+struct dentry *d_alloc_pseudo(struct super_block *sb, const struct qstr *name)
+{
+ struct dentry *dentry = d_alloc(NULL, name);
+ if (dentry) {
+ dentry->d_sb = sb;
+ dentry->d_parent = dentry;
+ dentry->d_flags |= DCACHE_DISCONNECTED;
+ }
+ return dentry;
+}
+EXPORT_SYMBOL(d_alloc_pseudo);
+
struct dentry *d_alloc_name(struct dentry *parent, const char *name)
{
struct qstr q;
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -225,6 +225,7 @@ extern void d_delete(struct dentry *);

/* allocate/de-allocate */
extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
+extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *);
extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
extern struct dentry * d_obtain_alias(struct inode *);
n***@suse.de
2010-06-24 03:03:04 UTC
Permalink
The problem with inode reclaim is that it puts inodes into I_FREEING state
and then continues to gather more, during which it may iput, call
invalidate_mapping_pages, be preempted, etc. Holding these inodes in
I_FREEING for that long can cause pauses.

After the inode scalability work, there is no longer a big reason to batch
up inodes for reclaim, so dispose of them as they are found on the LRU.
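
The resulting reclaim pattern, sketched roughly (simplified from the patch
below): mark the inode I_FREEING, drop the locks, dispose of that single
inode, and give the scheduler a chance before retaking the LRU lock.

	WARN_ON(inode->i_state & I_NEW);
	inode->i_state |= I_FREEING;
	spin_unlock(&inode->i_lock);
	zone->inode_nr_lru--;
	spin_unlock(&zone->inode_lru_lock);

	dispose_one_inode(inode);	/* clear, unhash, unlink, destroy */
	cond_resched();

	spin_lock(&zone->inode_lru_lock);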

Signed-off-by: Nick Piggin <***@suse.de>

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -390,6 +390,19 @@ EXPORT_SYMBOL(clear_inode);

static void inode_sb_list_del(struct inode *inode);

+static void dispose_one_inode(struct inode *inode)
+{
+ clear_inode(inode);
+
+ spin_lock(&inode->i_lock);
+ __remove_inode_hash(inode);
+ inode_sb_list_del(inode);
+ spin_unlock(&inode->i_lock);
+
+ wake_up_inode(inode);
+ destroy_inode(inode);
+}
+
/*
* dispose_list - dispose of the contents of a local list
* @head: the head of the list to free
@@ -409,15 +422,8 @@ static void dispose_list(struct list_hea

if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
- clear_inode(inode);
-
- spin_lock(&inode->i_lock);
- __remove_inode_hash(inode);
- inode_sb_list_del(inode);
- spin_unlock(&inode->i_lock);
-
- wake_up_inode(inode);
- destroy_inode(inode);
+ dispose_one_inode(inode);
+ cond_resched();
nr_disposed++;
}
}
@@ -526,7 +532,6 @@ EXPORT_SYMBOL(invalidate_inodes);
*/
static void prune_icache(struct zone *zone, unsigned long nr_to_scan)
{
- LIST_HEAD(freeable);
unsigned long reap = 0;

down_read(&iprune_sem);
@@ -563,8 +568,6 @@ again:
__iget(inode);
spin_unlock(&inode->i_lock);

- dispose_list(&freeable);
-
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
@@ -572,11 +575,15 @@ again:
spin_lock(&zone->inode_lru_lock);
continue;
}
- list_move(&inode->i_lru, &freeable);
+ list_del_init(&inode->i_lru);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
zone->inode_nr_lru--;
+ spin_unlock(&zone->inode_lru_lock);
+ dispose_one_inode(inode);
+ cond_resched();
+ spin_lock(&zone->inode_lru_lock);
}
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -584,7 +591,6 @@ again:
__count_vm_events(PGINODESTEAL, reap);
spin_unlock(&zone->inode_lru_lock);

- dispose_list(&freeable);
up_read(&iprune_sem);
}



n***@suse.de
2010-06-24 03:02:49 UTC
Permalink
Implement a lazy inode LRU, similarly to dcache. This will reduce lock
acquisition and will help to improve lock ordering subsequently.
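
In rough outline (a simplified sketch, not literal patch text): the final
iput() no longer shuffles the inode between in-use and unused lists; it sets
I_REFERENCED and only adds the inode to the LRU if it is not already there,
while the pruner gives referenced inodes one more trip around the list
instead of freeing them immediately:

	/* iput_final() on an MS_ACTIVE sb */
	inode->i_state |= I_REFERENCED;
	if (!(inode->i_state & (I_DIRTY|I_SYNC)) && list_empty(&inode->i_list)) {
		list_add(&inode->i_list, &inode_unused);
		atomic_inc(&inodes_stat.nr_unused);
	}

	/* prune_icache(): recently referenced inodes get a second chance */
	if (inode->i_state & I_REFERENCED) {
		list_move(&inode->i_list, &inode_unused);
		inode->i_state &= ~I_REFERENCED;
		continue;
	}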

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/fs-writeback.c | 21 +++++++--------
fs/inode.c | 61 +++++++++++++++++++---------------------------
include/linux/fs.h | 7 ++++-
include/linux/writeback.h | 1
4 files changed, 42 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -96,7 +96,6 @@ static unsigned int i_hash_shift __read_
* allowing for low-overhead inode sync() operations.
*/

-LIST_HEAD(inode_in_use);
LIST_HEAD(inode_unused);

struct inode_hash_bucket {
@@ -298,6 +297,7 @@ void inode_init_once(struct inode *inode
INIT_HLIST_BL_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
+ INIT_LIST_HEAD(&inode->i_list);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -323,25 +323,6 @@ static void init_once(void *foo)
inode_init_once(inode);
}

-/*
- * inode_lock must be held
- */
-void __iget(struct inode *inode)
-{
- assert_spin_locked(&inode->i_lock);
-
- inode->i_count++;
- if (inode->i_count > 1)
- return;
-
- if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
- spin_lock(&wb_inode_list_lock);
- list_move(&inode->i_list, &inode_in_use);
- spin_unlock(&wb_inode_list_lock);
- }
- atomic_dec(&inodes_stat.nr_unused);
-}
-
/**
* clear_inode - clear an inode
* @inode: inode to clear
@@ -384,7 +365,7 @@ static void dispose_list(struct list_hea
struct inode *inode;

inode = list_first_entry(head, struct inode, i_list);
- list_del(&inode->i_list);
+ list_del_init(&inode->i_list);

if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
@@ -437,11 +418,12 @@ static int invalidate_list(struct list_h
invalidate_inode_buffers(inode);
if (!inode->i_count) {
spin_lock(&wb_inode_list_lock);
- list_move(&inode->i_list, dispose);
+ list_del(&inode->i_list);
spin_unlock(&wb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
+ list_add(&inode->i_list, dispose);
count++;
continue;
}
@@ -480,19 +462,6 @@ int invalidate_inodes(struct super_block
}
EXPORT_SYMBOL(invalidate_inodes);

-static int can_unuse(struct inode *inode)
-{
- if (inode->i_state)
- return 0;
- if (inode_has_buffers(inode))
- return 0;
- if (inode->i_count)
- return 0;
- if (inode->i_data.nrpages)
- return 0;
- return 1;
-}
-
/*
* Scan `goal' inodes on the unused list for freeable ones. They are moved to
* a temporary list and then are freed outside inode_lock by dispose_list().
@@ -510,13 +479,12 @@ static void prune_icache(int nr_to_scan)
{
LIST_HEAD(freeable);
int nr_pruned = 0;
- int nr_scanned;
unsigned long reap = 0;

down_read(&iprune_sem);
again:
spin_lock(&wb_inode_list_lock);
- for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
+ for (; nr_to_scan; nr_to_scan--) {
struct inode *inode;

if (list_empty(&inode_unused))
@@ -528,33 +496,30 @@ again:
spin_unlock(&wb_inode_list_lock);
goto again;
}
- if (inode->i_state || inode->i_count) {
+ if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
+ list_del_init(&inode->i_list);
+ spin_unlock(&inode->i_lock);
+ atomic_dec(&inodes_stat.nr_unused);
+ continue;
+ }
+ if (inode->i_state) {
list_move(&inode->i_list, &inode_unused);
+ inode->i_state &= ~I_REFERENCED;
spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ list_move(&inode->i_list, &inode_unused);
spin_unlock(&wb_inode_list_lock);
__iget(inode);
spin_unlock(&inode->i_lock);
+
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
-again2:
spin_lock(&wb_inode_list_lock);
-
- if (inode != list_entry(inode_unused.next,
- struct inode, i_list))
- continue; /* wrong inode or list_empty */
- if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&wb_inode_list_lock);
- goto again2;
- }
- if (!can_unuse(inode)) {
- spin_unlock(&inode->i_lock);
- continue;
- }
+ continue;
}
list_move(&inode->i_list, &freeable);
WARN_ON(inode->i_state & I_NEW);
@@ -694,9 +659,6 @@ __inode_add_to_lists(struct super_block
atomic_inc(&inodes_stat.nr_inodes);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
- spin_lock(&wb_inode_list_lock);
- list_add(&inode->i_list, &inode_in_use);
- spin_unlock(&wb_inode_list_lock);
if (b) {
spin_lock_bucket(b);
hlist_bl_add_head(&inode->i_hash, &b->head);
@@ -1343,9 +1305,13 @@ void generic_delete_inode(struct inode *
{
const struct super_operations *op = inode->i_sb->s_op;

- spin_lock(&wb_inode_list_lock);
- list_del_init(&inode->i_list);
- spin_unlock(&wb_inode_list_lock);
+ if (!list_empty(&inode->i_list)) {
+ spin_lock(&wb_inode_list_lock);
+ list_del_init(&inode->i_list);
+ spin_unlock(&wb_inode_list_lock);
+ if (!inode->i_state)
+ atomic_dec(&inodes_stat.nr_unused);
+ }
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
@@ -1393,13 +1359,15 @@ int generic_detach_inode(struct inode *i
struct super_block *sb = inode->i_sb;

if (!hlist_bl_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
- spin_lock(&wb_inode_list_lock);
- list_move(&inode->i_list, &inode_unused);
- spin_unlock(&wb_inode_list_lock);
- }
- atomic_inc(&inodes_stat.nr_unused);
if (sb->s_flags & MS_ACTIVE) {
+ inode->i_state |= I_REFERENCED;
+ if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
+ list_empty(&inode->i_list)) {
+ spin_lock(&wb_inode_list_lock);
+ list_add(&inode->i_list, &inode_unused);
+ spin_unlock(&wb_inode_list_lock);
+ atomic_inc(&inodes_stat.nr_unused);
+ }
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
return 0;
@@ -1414,11 +1382,14 @@ int generic_detach_inode(struct inode *i
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
__remove_inode_hash(inode);
- atomic_dec(&inodes_stat.nr_unused);
}
- spin_lock(&wb_inode_list_lock);
- list_del_init(&inode->i_list);
- spin_unlock(&wb_inode_list_lock);
+ if (!list_empty(&inode->i_list)) {
+ spin_lock(&wb_inode_list_lock);
+ list_del_init(&inode->i_list);
+ spin_unlock(&wb_inode_list_lock);
+ if (!inode->i_state)
+ atomic_dec(&inodes_stat.nr_unused);
+ }
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1634,16 +1634,17 @@ struct super_operations {
*
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*/
-#define I_DIRTY_SYNC 1
-#define I_DIRTY_DATASYNC 2
-#define I_DIRTY_PAGES 4
+#define I_DIRTY_SYNC 0x01
+#define I_DIRTY_DATASYNC 0x02
+#define I_DIRTY_PAGES 0x04
#define __I_NEW 3
#define I_NEW (1 << __I_NEW)
-#define I_WILL_FREE 16
-#define I_FREEING 32
-#define I_CLEAR 64
+#define I_WILL_FREE 0x10
+#define I_FREEING 0x20
+#define I_CLEAR 0x40
#define __I_SYNC 7
#define I_SYNC (1 << __I_SYNC)
+#define I_REFERENCED 0x100

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

@@ -2171,7 +2172,6 @@ extern int insert_inode_locked4(struct i
extern int insert_inode_locked(struct inode *);
extern void unlock_new_inode(struct inode *);

-extern void __iget(struct inode * inode);
extern void iget_failed(struct inode *);
extern void clear_inode(struct inode *);
extern void destroy_inode(struct inode *);
@@ -2422,6 +2422,12 @@ extern int generic_show_options(struct s
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);

+static inline void __iget(struct inode *inode)
+{
+ assert_spin_locked(&inode->i_lock);
+ inode->i_count++;
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -552,15 +552,8 @@ select_queue:
inode->i_state |= I_DIRTY_PAGES;
redirty_tail(inode);
}
- } else if (inode->i_count) {
- /*
- * The inode is clean, inuse
- */
- list_move(&inode->i_list, &inode_in_use);
} else {
- /*
- * The inode is clean, unused
- */
+ /* The inode is clean */
list_move(&inode->i_list, &inode_unused);
}
}
@@ -1206,8 +1199,6 @@ static void wait_sb_inodes(struct super_
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- spin_lock(&sb_inode_list_lock);
-
/*
* Data integrity sync. Must wait for all pages under writeback,
* because there may have been pages dirtied before our sync
@@ -1215,7 +1206,8 @@ static void wait_sb_inodes(struct super_
* In which case, the inode may not be on the dirty list, but
* we still have to wait for that writeout.
*/
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
struct address_space *mapping;

mapping = inode->i_mapping;
@@ -1229,13 +1221,13 @@ static void wait_sb_inodes(struct super_
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&sb_inode_list_lock);
+ rcu_read_unlock();
/*
* We hold a reference to 'inode' so it couldn't have been
- * removed from s_inodes list while we dropped the
- * sb_inode_list_lock. We cannot iput the inode now as we can
- * be holding the last reference and we cannot iput it under
- * spinlock. So we keep the reference and iput it later.
+ * removed from s_inodes list while we dropped the i_lock. We
+ * cannot iput the inode now as we can be holding the last
+ * reference and we cannot iput it under spinlock. So we keep
+ * the reference and iput it later.
*/
iput(old_inode);
old_inode = inode;
@@ -1244,9 +1236,9 @@ static void wait_sb_inodes(struct super_

cond_resched();

- spin_lock(&sb_inode_list_lock);
+ rcu_read_lock();
}
- spin_unlock(&sb_inode_list_lock);
+ rcu_read_unlock();
iput(old_inode);
}

Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -11,7 +11,6 @@ struct backing_dev_info;

extern spinlock_t sb_inode_list_lock;
extern spinlock_t wb_inode_list_lock;
-extern struct list_head inode_in_use;
extern struct list_head inode_unused;

/*
Andi Kleen
2010-06-24 09:52:58 UTC
Permalink
Post by n***@suse.de
Implement a lazy inode LRU, similarly to dcache. This will reduce lock
acquisition and will help to improve lock ordering subsequently.
Or just drop inode LRU completely and only rely on the dcache for that?

-Andi
--
***@linux.intel.com -- Speaking for myself only.
Nick Piggin
2010-06-24 15:59:04 UTC
Permalink
Post by Andi Kleen
Post by n***@suse.de
Implement a lazy inode LRU, similarly to dcache. This will reduce lock
acquisition and will help to improve lock ordering subsequently.
Or just drop inode LRU completely and only rely on the dcache for that?
Possible, yes. There has been some talk about it. I prefer not to do
anything too controversial yet if possible :)
n***@suse.de
2010-06-24 03:03:01 UTC
Permalink
Split the wb_inode_list_lock into two locks: inode_lru_lock to protect the
inode LRU list, and a per-bdi lock to protect the inode writeback lists.
The inode is given another list anchor so it can be present on both the LRU
and the writeback lists, for simplicity.
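
A sketch of the resulting locking (illustrative, not literal patch text):
i_lru is manipulated under the global inode_lru_lock and i_io under the
owning bdi_writeback's b_lock, each nested inside the inode's i_lock:

	spin_lock(&inode->i_lock);

	spin_lock(&inode_lru_lock);
	list_del_init(&inode->i_lru);		/* LRU membership */
	spin_unlock(&inode_lru_lock);

	spin_lock(&inode_to_wb(inode)->b_lock);
	list_del_init(&inode->i_io);		/* writeback membership */
	spin_unlock(&inode_to_wb(inode)->b_lock);

	spin_unlock(&inode->i_lock);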

Signed-off-by: Nick Piggin <***@suse.de>
--
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -283,11 +283,9 @@ void bdi_start_writeback(struct backing_
* the case then the inode must have been redirtied while it was being written
* out and we don't reset its dirtied_when.
*/
-static void redirty_tail(struct inode *inode)
+static void redirty_tail(struct bdi_writeback *wb, struct inode *inode)
{
- struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
- assert_spin_locked(&wb_inode_list_lock);
+ assert_spin_locked(&wb->b_lock);
if (!list_empty(&wb->b_dirty)) {
struct inode *tail;

@@ -301,11 +299,9 @@ static void redirty_tail(struct inode *i
/*
* requeue inode for re-scanning after bdi->b_io list is exhausted.
*/
-static void requeue_io(struct inode *inode)
+static void requeue_io(struct bdi_writeback *wb, struct inode *inode)
{
- struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
- assert_spin_locked(&wb_inode_list_lock);
+ assert_spin_locked(&wb->b_lock);
list_move(&inode->i_io, &wb->b_more_io);
}

@@ -346,7 +342,6 @@ static void move_expired_inodes(struct l
struct inode *inode;
int do_sb_sort = 0;

- assert_spin_locked(&wb_inode_list_lock);
while (!list_empty(delaying_queue)) {
inode = list_entry(delaying_queue->prev, struct inode, i_io);
if (older_than_this &&
@@ -395,18 +390,19 @@ static int write_inode(struct inode *ino
/*
* Wait for writeback on an inode to complete.
*/
-static void inode_wait_for_writeback(struct inode *inode)
+static void inode_wait_for_writeback(struct bdi_writeback *wb,
+ struct inode *inode)
{
DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
wait_queue_head_t *wqh;

wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
while (inode->i_state & I_SYNC) {
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
spin_unlock(&inode->i_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
spin_lock(&inode->i_lock);
- spin_lock(&wb_inode_list_lock);
+ spin_lock(&wb->b_lock);
}
}

@@ -424,7 +420,8 @@ static void inode_wait_for_writeback(str
* Called under inode_lock.
*/
static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
+ struct writeback_control *wbc)
{
struct address_space *mapping = inode->i_mapping;
unsigned dirty;
@@ -445,14 +442,14 @@ writeback_single_inode(struct inode *ino
* completed a full scan of b_io.
*/
if (wbc->sync_mode != WB_SYNC_ALL) {
- requeue_io(inode);
+ requeue_io(wb, inode);
return 0;
}

/*
* It's a data-integrity sync. We must wait.
*/
- inode_wait_for_writeback(inode);
+ inode_wait_for_writeback(wb, inode);
}

BUG_ON(inode->i_state & I_SYNC);
@@ -460,7 +457,7 @@ writeback_single_inode(struct inode *ino
/* Set I_SYNC, reset I_DIRTY_PAGES */
inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY_PAGES;
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
spin_unlock(&inode->i_lock);

ret = do_writepages(mapping, wbc);
@@ -495,7 +492,7 @@ writeback_single_inode(struct inode *ino
spin_lock(&inode->i_lock);
}

- spin_lock(&wb_inode_list_lock);
+ spin_lock(&wb->b_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
@@ -508,7 +505,7 @@ writeback_single_inode(struct inode *ino
* At least XFS will redirty the inode during the
* writeback (delalloc) and on io completion (isize).
*/
- redirty_tail(inode);
+ redirty_tail(wb, inode);
} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
/*
* We didn't write back all the pages. nfs_writepages()
@@ -536,12 +533,12 @@ select_queue:
/*
* slice used up: queue for next turn
*/
- requeue_io(inode);
+ requeue_io(wb, inode);
} else {
/*
* somehow blocked: retry later
*/
- redirty_tail(inode);
+ redirty_tail(wb, inode);
}
} else {
/*
@@ -552,15 +549,13 @@ select_queue:
* all the other files.
*/
inode->i_state |= I_DIRTY_PAGES;
- redirty_tail(inode);
+ redirty_tail(wb, inode);
}
} else {
/* The inode is clean */
list_del_init(&inode->i_io);
- if (list_empty(&inode->i_lru)) {
- list_add(&inode->i_lru, &inode_unused);
- inodes_stat.nr_unused++;
- }
+ if (list_empty(&inode->i_lru))
+ __inode_lru_list_add(inode);
}
}
inode_sync_complete(inode);
@@ -629,14 +624,15 @@ again:
struct inode *inode = list_entry(wb->b_io.prev,
struct inode, i_io);
if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&wb_inode_list_lock);
- spin_lock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
+ cpu_relax();
+ spin_lock(&wb->b_lock);
goto again;
}
if (wbc->sb && sb != inode->i_sb) {
/* super block given and doesn't
match, skip this inode */
- redirty_tail(inode);
+ redirty_tail(wb, inode);
spin_unlock(&inode->i_lock);
continue;
}
@@ -646,7 +642,7 @@ again:
return 0;
}
if (inode->i_state & (I_NEW | I_WILL_FREE)) {
- requeue_io(inode);
+ requeue_io(wb, inode);
spin_unlock(&inode->i_lock);
continue;
}
@@ -662,19 +658,19 @@ again:
BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
__iget(inode);
pages_skipped = wbc->pages_skipped;
- writeback_single_inode(inode, wbc);
+ writeback_single_inode(wb, inode, wbc);
if (wbc->pages_skipped != pages_skipped) {
/*
* writeback is not making progress due to locked
* buffers. Skip this inode for now.
*/
- redirty_tail(inode);
+ redirty_tail(wb, inode);
}
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
spin_unlock(&inode->i_lock);
iput(inode);
cond_resched();
- spin_lock(&wb_inode_list_lock);
+ spin_lock(&wb->b_lock);
if (wbc->nr_to_write <= 0) {
wbc->more_io = 1;
return 1;
@@ -693,7 +689,7 @@ static void writeback_inodes_wb(struct b

wbc->wb_start = jiffies; /* livelock avoidance */
again:
- spin_lock(&wb_inode_list_lock);
+ spin_lock(&wb->b_lock);

if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
@@ -708,10 +704,11 @@ again:
/* super block given and doesn't
match, skip this inode */
if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
+ cpu_relax();
goto again;
}
- redirty_tail(inode);
+ redirty_tail(wb, inode);
spin_unlock(&inode->i_lock);
continue;
}
@@ -719,10 +716,11 @@ again:

if (state == SB_PIN_FAILED) {
if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
+ cpu_relax();
goto again;
}
- requeue_io(inode);
+ requeue_io(wb, inode);
spin_unlock(&inode->i_lock);
continue;
}
@@ -733,7 +731,7 @@ again:
if (ret)
break;
}
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
/* Leave any unwritten inodes on b_io */
}

@@ -846,18 +844,19 @@ static long wb_writeback(struct bdi_writ
* we'll just busyloop.
*/
retry:
- spin_lock(&wb_inode_list_lock);
+ spin_lock(&wb->b_lock);
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
struct inode, i_io);
if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
+ cpu_relax();
goto retry;
}
- inode_wait_for_writeback(inode);
+ inode_wait_for_writeback(wb, inode);
spin_unlock(&inode->i_lock);
}
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
}

return wrote;
@@ -1156,7 +1155,7 @@ void __mark_inode_dirty(struct inode *in
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
- struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+ struct bdi_writeback *wb = inode_to_wb(inode);
struct backing_dev_info *bdi = wb->bdi;

if (bdi_cap_writeback_dirty(bdi) &&
@@ -1167,9 +1166,10 @@ void __mark_inode_dirty(struct inode *in
}

inode->dirtied_when = jiffies;
- spin_lock(&wb_inode_list_lock);
- list_move(&inode->i_io, &wb->b_dirty);
- spin_unlock(&wb_inode_list_lock);
+ spin_lock(&wb->b_lock);
+ BUG_ON(!list_empty(&inode->i_io));
+ list_add(&inode->i_io, &wb->b_dirty);
+ spin_unlock(&wb->b_lock);
}
}
out:
@@ -1313,6 +1313,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
*/
int write_inode_now(struct inode *inode, int sync)
{
+ struct bdi_writeback *wb = inode_to_wb(inode);
int ret;
struct writeback_control wbc = {
.nr_to_write = LONG_MAX,
@@ -1326,9 +1327,9 @@ int write_inode_now(struct inode *inode,

might_sleep();
spin_lock(&inode->i_lock);
- spin_lock(&wb_inode_list_lock);
- ret = writeback_single_inode(inode, &wbc);
- spin_unlock(&wb_inode_list_lock);
+ spin_lock(&wb->b_lock);
+ ret = writeback_single_inode(wb, inode, &wbc);
+ spin_unlock(&wb->b_lock);
spin_unlock(&inode->i_lock);
if (sync)
inode_sync_wait(inode);
@@ -1349,12 +1350,13 @@ EXPORT_SYMBOL(write_inode_now);
*/
int sync_inode(struct inode *inode, struct writeback_control *wbc)
{
+ struct bdi_writeback *wb = inode_to_wb(inode);
int ret;

spin_lock(&inode->i_lock);
- spin_lock(&wb_inode_list_lock);
- ret = writeback_single_inode(inode, wbc);
- spin_unlock(&wb_inode_list_lock);
+ spin_lock(&wb->b_lock);
+ ret = writeback_single_inode(wb, inode, wbc);
+ spin_unlock(&wb->b_lock);
spin_unlock(&inode->i_lock);
return ret;
}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -27,6 +27,7 @@
#include <linux/posix_acl.h>
#include <linux/bit_spinlock.h>
#include <linux/lglock.h>
+#include "internal.h"

/*
* Usage:
@@ -34,8 +35,10 @@
* s_inodes, i_sb_list
* inode_hash_bucket lock protects:
* inode hash table, i_hash
- * wb_inode_list_lock protects:
- * inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_io, i_lru
+ * inode_lru_lock protects:
+ * inode_lru, i_lru
+ * wb->b_lock protects:
+ * b_io, b_more_io, b_dirty, i_io, i_lru
* inode->i_lock protects:
* i_state
* i_count
@@ -48,7 +51,8 @@
* inode_lock
* inode->i_lock
* inode_list_lglock
- * wb_inode_list_lock
+ * inode_lru_lock
+ * wb->b_lock
* inode_hash_bucket lock
*/
/*
@@ -98,7 +102,7 @@ static unsigned int i_hash_shift __read_
* allowing for low-overhead inode sync() operations.
*/

-LIST_HEAD(inode_unused);
+static LIST_HEAD(inode_lru);

struct inode_hash_bucket {
struct hlist_bl_head head;
@@ -125,7 +129,7 @@ static struct inode_hash_bucket *inode_h
DECLARE_LGLOCK(inode_list_lglock);
DEFINE_LGLOCK(inode_list_lglock);

-DEFINE_SPINLOCK(wb_inode_list_lock);
+static DEFINE_SPINLOCK(inode_lru_lock);

/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -422,6 +426,22 @@ static void dispose_list(struct list_hea
}
}

+void __inode_lru_list_add(struct inode *inode)
+{
+ spin_lock(&inode_lru_lock);
+ list_add(&inode->i_lru, &inode_lru);
+ inodes_stat.nr_unused++;
+ spin_unlock(&inode_lru_lock);
+}
+
+void __inode_lru_list_del(struct inode *inode)
+{
+ spin_lock(&inode_lru_lock);
+ list_del_init(&inode->i_lru);
+ inodes_stat.nr_unused--;
+ spin_unlock(&inode_lru_lock);
+}
+
/*
* Invalidate all inodes for a device.
*/
@@ -438,11 +458,17 @@ static int invalidate_sb_inodes(struct s
}
invalidate_inode_buffers(inode);
if (!inode->i_count) {
- spin_lock(&wb_inode_list_lock);
+ struct bdi_writeback *wb = inode_to_wb(inode);
+
+ spin_lock(&wb->b_lock);
list_del_init(&inode->i_io);
+ spin_unlock(&wb->b_lock);
+
+ spin_lock(&inode_lru_lock);
list_del(&inode->i_lru);
inodes_stat.nr_unused--;
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&inode_lru_lock);
+
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
@@ -494,7 +520,7 @@ EXPORT_SYMBOL(invalidate_inodes);
*
* Any inodes which are pinned purely because of attached pagecache have their
* pagecache removed. We expect the final iput() on that inode to add it to
- * the front of the inode_unused list. So look for it there and if the
+ * the front of the inode_lru list. So look for it there and if the
* inode is still freeable, proceed. The right inode is found 99.9% of the
* time in testing on a 4-way.
*
@@ -508,17 +534,17 @@ static void prune_icache(int nr_to_scan)

down_read(&iprune_sem);
again:
- spin_lock(&wb_inode_list_lock);
+ spin_lock(&inode_lru_lock);
for (; nr_to_scan; nr_to_scan--) {
struct inode *inode;

- if (list_empty(&inode_unused))
+ if (list_empty(&inode_lru))
break;

- inode = list_entry(inode_unused.prev, struct inode, i_lru);
+ inode = list_entry(inode_lru.prev, struct inode, i_lru);

if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&inode_lru_lock);
goto again;
}
if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
@@ -528,14 +554,14 @@ again:
continue;
}
if (inode->i_state) {
- list_move(&inode->i_lru, &inode_unused);
+ list_move(&inode->i_lru, &inode_lru);
inode->i_state &= ~I_REFERENCED;
spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
- list_move(&inode->i_lru, &inode_unused);
- spin_unlock(&wb_inode_list_lock);
+ list_move(&inode->i_lru, &inode_lru);
+ spin_unlock(&inode_lru_lock);
__iget(inode);
spin_unlock(&inode->i_lock);

@@ -543,7 +569,7 @@ again:
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
- spin_lock(&wb_inode_list_lock);
+ spin_lock(&inode_lru_lock);
continue;
}
list_move(&inode->i_lru, &freeable);
@@ -556,7 +582,7 @@ again:
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
__count_vm_events(PGINODESTEAL, reap);
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&inode_lru_lock);

dispose_list(&freeable);
up_read(&iprune_sem);
@@ -1400,15 +1426,16 @@ void generic_delete_inode(struct inode *
const struct super_operations *op = inode->i_sb->s_op;

if (!list_empty(&inode->i_lru)) {
- spin_lock(&wb_inode_list_lock);
+ spin_lock(&inode_lru_lock);
list_del_init(&inode->i_lru);
inodes_stat.nr_unused--;
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&inode_lru_lock);
}
if (!list_empty(&inode->i_io)) {
- spin_lock(&wb_inode_list_lock);
+ struct bdi_writeback *wb = inode_to_wb(inode);
+ spin_lock(&wb->b_lock);
list_del_init(&inode->i_io);
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
}
inode_sb_list_del(inode);
percpu_counter_dec(&nr_inodes);
@@ -1460,10 +1487,10 @@ int generic_detach_inode(struct inode *i
inode->i_state |= I_REFERENCED;
if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
list_empty(&inode->i_lru)) {
- spin_lock(&wb_inode_list_lock);
- list_add(&inode->i_lru, &inode_unused);
+ spin_lock(&inode_lru_lock);
+ list_add(&inode->i_lru, &inode_lru);
inodes_stat.nr_unused++;
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&inode_lru_lock);
}
spin_unlock(&inode->i_lock);
return 0;
@@ -1478,15 +1505,16 @@ int generic_detach_inode(struct inode *i
__remove_inode_hash(inode);
}
if (!list_empty(&inode->i_lru)) {
- spin_lock(&wb_inode_list_lock);
+ spin_lock(&inode_lru_lock);
list_del_init(&inode->i_lru);
inodes_stat.nr_unused--;
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&inode_lru_lock);
}
if (!list_empty(&inode->i_io)) {
- spin_lock(&wb_inode_list_lock);
+ struct bdi_writeback *wb = inode_to_wb(inode);
+ spin_lock(&wb->b_lock);
list_del_init(&inode->i_io);
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&wb->b_lock);
}
inode_sb_list_del(inode);
percpu_counter_dec(&nr_inodes);
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -16,6 +16,7 @@
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/writeback.h>
+#include <linux/spinlock.h>
#include <asm/atomic.h>

struct page;
@@ -53,6 +54,7 @@ struct bdi_writeback {
unsigned long last_old_flush; /* last old data flush */

struct task_struct *task; /* writeback task */
+ spinlock_t b_lock; /* lock for inode lists */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -9,9 +9,6 @@

struct backing_dev_info;

-extern spinlock_t wb_inode_list_lock;
-extern struct list_head inode_unused;
-
/*
* fs/fs-writeback.c
*/
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -75,19 +75,22 @@ static int bdi_debug_stats_show(struct s
/*
* inode lock is enough here, the bdi->wb_list is protected by
* RCU on the reader side
+ * (so why not for_each_entry_rcu, and why no explicit rcu disable??)
*/
nr_wb = nr_dirty = nr_io = nr_more_io = 0;
- spin_lock(&wb_inode_list_lock);
- list_for_each_entry(wb, &bdi->wb_list, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(wb, &bdi->wb_list, list) {
nr_wb++;
+ spin_lock(&wb->b_lock);
list_for_each_entry(inode, &wb->b_dirty, i_io)
nr_dirty++;
list_for_each_entry(inode, &wb->b_io, i_io)
nr_io++;
list_for_each_entry(inode, &wb->b_more_io, i_io)
nr_more_io++;
+ spin_unlock(&wb->b_lock);
}
- spin_unlock(&wb_inode_list_lock);
+ rcu_read_unlock();

get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);

@@ -267,6 +270,7 @@ static void bdi_wb_init(struct bdi_write

wb->bdi = bdi;
wb->last_old_flush = jiffies;
+ spin_lock_init(&wb->b_lock);
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
@@ -700,6 +704,17 @@ err:
}
EXPORT_SYMBOL(bdi_init);

+static void bdi_lock_two(struct backing_dev_info *bdi1, struct backing_dev_info *bdi2)
+{
+ if (bdi1 < bdi2) {
+ spin_lock(&bdi1->wb.b_lock);
+ spin_lock_nested(&bdi2->wb.b_lock, 1);
+ } else {
+ spin_lock(&bdi2->wb.b_lock);
+ spin_lock_nested(&bdi1->wb.b_lock, 1);
+ }
+}
+
void mapping_set_bdi(struct address_space *mapping, struct backing_dev_info *bdi)
{
struct inode *inode = mapping->host;
@@ -708,7 +723,7 @@ void mapping_set_bdi(struct address_spac
if (unlikely(old == bdi))
return;

- spin_lock(&wb_inode_list_lock);
+ bdi_lock_two(bdi, old);
if (!list_empty(&inode->i_io)) {
struct inode *i;

@@ -737,7 +752,8 @@ void mapping_set_bdi(struct address_spac
}
found:
mapping->a_bdi = bdi;
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&bdi->wb.b_lock);
+ spin_unlock(&old->wb.b_lock);
}
EXPORT_SYMBOL(mapping_set_bdi);

@@ -753,7 +769,7 @@ void bdi_destroy(struct backing_dev_info
struct bdi_writeback *dst = &default_backing_dev_info.wb;
struct inode *i;

- spin_lock(&wb_inode_list_lock);
+ bdi_lock_two(bdi, &default_backing_dev_info);
list_for_each_entry(i, &bdi->wb.b_dirty, i_io) {
list_del(&i->i_io);
list_add(&i->i_io, &dst->b_dirty);
@@ -769,7 +785,8 @@ void bdi_destroy(struct backing_dev_info
list_add(&i->i_io, &dst->b_more_io);
i->i_mapping->a_bdi = bdi;
}
- spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&bdi->wb.b_lock);
+ spin_unlock(&dst->b_lock);
}

bdi_unregister(bdi);
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h
+++ linux-2.6/fs/internal.h
@@ -15,6 +15,8 @@ struct super_block;
struct linux_binprm;
struct path;

+#define inode_to_wb(inode) (&(inode)->i_mapping->a_bdi->wb)
+
/*
* block_dev.c
*/
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -2076,6 +2076,8 @@ extern int check_disk_change(struct bloc
extern int __invalidate_device(struct block_device *);
extern int invalidate_partition(struct gendisk *, int);
#endif
+extern void __inode_lru_list_add(struct inode *inode);
+extern void __inode_lru_list_del(struct inode *inode);
extern int invalidate_inodes(struct super_block *);
unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end);
n***@suse.de
2010-06-24 03:02:59 UTC
Permalink
Having an inode on the writeback lists of a bdi other than
inode->i_mapping->backing_dev_info makes it very difficult to do per-bdi
locking of the writeback lists. Add functions to move these inodes over when
the mapping's backing device is changed.

i_mapping.backing_dev_info is renamed to i_mapping.a_bdi while we're here.
Succinct is nice, and the rename catches conversion errors.
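
A sketch of the conversion at a call site (the 'before' line is the old
open-coded assignment; the driver variables are as in the existing code):

	/* before */
	filp->f_mapping->backing_dev_info = dev->dev_info;

	/* after: migrates inode->i_io to the new bdi's writeback list if needed */
	mapping_set_bdi(filp->f_mapping, dev->dev_info);

	/* freshly created mappings with no writeback state can use the cheap form */
	mapping_new_set_bdi(&inode->i_data, &default_backing_dev_info);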

Signed-off-by: Nick Piggin <***@suse.de>
--
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -316,19 +316,25 @@ static inline bool bdi_cap_flush_forker(
return bdi == &default_backing_dev_info;
}

+void mapping_set_bdi(struct address_space *mapping, struct backing_dev_info *bdi);
+static inline void mapping_new_set_bdi(struct address_space *mapping, struct backing_dev_info *bdi)
+{
+ mapping->a_bdi = bdi;
+}
+
static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
{
- return bdi_cap_writeback_dirty(mapping->backing_dev_info);
+ return bdi_cap_writeback_dirty(mapping->a_bdi);
}

static inline bool mapping_cap_account_dirty(struct address_space *mapping)
{
- return bdi_cap_account_dirty(mapping->backing_dev_info);
+ return bdi_cap_account_dirty(mapping->a_bdi);
}

static inline bool mapping_cap_swap_backed(struct address_space *mapping)
{
- return bdi_cap_swap_backed(mapping->backing_dev_info);
+ return bdi_cap_swap_backed(mapping->a_bdi);
}

static inline int bdi_sched_wait(void *word)
@@ -347,7 +353,7 @@ static inline void blk_run_backing_dev(s
static inline void blk_run_address_space(struct address_space *mapping)
{
if (mapping)
- blk_run_backing_dev(mapping->backing_dev_info, NULL);
+ blk_run_backing_dev(mapping->a_bdi, NULL);
}

#endif /* _LINUX_BACKING_DEV_H */
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -635,7 +635,7 @@ struct address_space {
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
- struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+ struct backing_dev_info *a_bdi; /* device readahead, etc */
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -700,6 +700,47 @@ err:
}
EXPORT_SYMBOL(bdi_init);

+void mapping_set_bdi(struct address_space *mapping, struct backing_dev_info *bdi)
+{
+ struct inode *inode = mapping->host;
+ struct backing_dev_info *old = mapping->a_bdi;
+
+ if (unlikely(old == bdi))
+ return;
+
+ spin_lock(&wb_inode_list_lock);
+ if (!list_empty(&inode->i_io)) {
+ struct inode *i;
+
+ list_for_each_entry(i, &old->wb.b_dirty, i_io) {
+ if (inode == i) {
+ list_del(&inode->i_io);
+ list_add(&inode->i_io, &bdi->wb.b_dirty);
+ goto found;
+ }
+ }
+ list_for_each_entry(i, &old->wb.b_io, i_io) {
+ if (inode == i) {
+ list_del(&inode->i_io);
+ list_add(&inode->i_io, &bdi->wb.b_io);
+ goto found;
+ }
+ }
+ list_for_each_entry(i, &old->wb.b_more_io, i_io) {
+ if (inode == i) {
+ list_del(&inode->i_io);
+ list_add(&inode->i_io, &bdi->wb.b_more_io);
+ goto found;
+ }
+ }
+ BUG();
+ }
+found:
+ mapping->a_bdi = bdi;
+ spin_unlock(&wb_inode_list_lock);
+}
+EXPORT_SYMBOL(mapping_set_bdi);
+
void bdi_destroy(struct backing_dev_info *bdi)
{
int i;
@@ -710,11 +751,24 @@ void bdi_destroy(struct backing_dev_info
*/
if (bdi_has_dirty_io(bdi)) {
struct bdi_writeback *dst = &default_backing_dev_info.wb;
+ struct inode *i;

spin_lock(&wb_inode_list_lock);
- list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
- list_splice(&bdi->wb.b_io, &dst->b_io);
- list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+ while (!list_empty(&bdi->wb.b_dirty)) {
+ i = list_first_entry(&bdi->wb.b_dirty, struct inode, i_io);
+ list_move(&i->i_io, &dst->b_dirty);
+ i->i_mapping->a_bdi = &default_backing_dev_info;
+ }
+ while (!list_empty(&bdi->wb.b_io)) {
+ i = list_first_entry(&bdi->wb.b_io, struct inode, i_io);
+ list_move(&i->i_io, &dst->b_io);
+ i->i_mapping->a_bdi = &default_backing_dev_info;
+ }
+ while (!list_empty(&bdi->wb.b_more_io)) {
+ i = list_first_entry(&bdi->wb.b_more_io, struct inode, i_io);
+ list_move(&i->i_io, &dst->b_more_io);
+ i->i_mapping->a_bdi = &default_backing_dev_info;
+ }
spin_unlock(&wb_inode_list_lock);
}

Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c
+++ linux-2.6/drivers/char/mem.c
@@ -871,7 +871,7 @@ static int memory_open(struct inode *ino

filp->f_op = dev->fops;
if (dev->dev_info)
- filp->f_mapping->backing_dev_info = dev->dev_info;
+ mapping_set_bdi(filp->f_mapping, dev->dev_info);

if (dev->fops->open)
return dev->fops->open(inode, filp);
Index: linux-2.6/drivers/char/raw.c
===================================================================
--- linux-2.6.orig/drivers/char/raw.c
+++ linux-2.6/drivers/char/raw.c
@@ -109,7 +109,7 @@ static int raw_release(struct inode *ino
if (--raw_devices[minor].inuse == 0) {
/* Here inode->i_mapping == bdev->bd_inode->i_mapping */
inode->i_mapping = &inode->i_data;
- inode->i_mapping->backing_dev_info = &default_backing_dev_info;
+ mapping_set_bdi(inode->i_mapping, &default_backing_dev_info);
}
mutex_unlock(&raw_mutex);

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -539,7 +539,7 @@ struct block_device *bdget(dev_t dev)
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
- inode->i_data.backing_dev_info = &default_backing_dev_info;
+ mapping_new_set_bdi(&inode->i_data, &default_backing_dev_info);
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
spin_unlock(&bdev_lock);
@@ -1404,7 +1404,7 @@ static int __blkdev_get(struct block_dev
bdi = blk_get_backing_dev_info(bdev);
if (bdi == NULL)
bdi = &default_backing_dev_info;
- bdev->bd_inode->i_data.backing_dev_info = bdi;
+ mapping_set_bdi(&bdev->bd_inode->i_data, bdi);
}
if (bdev->bd_invalidated)
rescan_partitions(disk, bdev);
@@ -1419,8 +1419,8 @@ static int __blkdev_get(struct block_dev
if (ret)
goto out_clear;
bdev->bd_contains = whole;
- bdev->bd_inode->i_data.backing_dev_info =
- whole->bd_inode->i_data.backing_dev_info;
+ mapping_set_bdi(&bdev->bd_inode->i_data,
+ whole->bd_inode->i_data.a_bdi);
bdev->bd_part = disk_get_part(disk, partno);
if (!(disk->flags & GENHD_FL_UP) ||
!bdev->bd_part || !bdev->bd_part->nr_sects) {
@@ -1454,7 +1454,7 @@ static int __blkdev_get(struct block_dev
disk_put_part(bdev->bd_part);
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
- bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+ mapping_set_bdi(&bdev->bd_inode->i_data, &default_backing_dev_info);
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
@@ -1551,7 +1551,8 @@ static int __blkdev_put(struct block_dev
disk_put_part(bdev->bd_part);
bdev->bd_part = NULL;
bdev->bd_disk = NULL;
- bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+ mapping_set_bdi(&bdev->bd_inode->i_data,
+ &default_backing_dev_info);
if (bdev != bdev->bd_contains)
victim = bdev->bd_contains;
bdev->bd_contains = NULL;
Index: linux-2.6/fs/btrfs/disk-io.c
===================================================================
--- linux-2.6.orig/fs/btrfs/disk-io.c
+++ linux-2.6/fs/btrfs/disk-io.c
@@ -1636,7 +1636,7 @@ struct btrfs_root *open_ctree(struct sup
*/
fs_info->btree_inode->i_size = OFFSET_MAX;
fs_info->btree_inode->i_mapping->a_ops = &btree_aops;
- fs_info->btree_inode->i_mapping->backing_dev_info = &fs_info->bdi;
+ mapping_new_set_bdi(fs_info->btree_inode->i_mapping, &fs_info->bdi);

RB_CLEAR_NODE(&BTRFS_I(fs_info->btree_inode)->rb_node);
extent_io_tree_init(&BTRFS_I(fs_info->btree_inode)->io_tree,
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c
+++ linux-2.6/fs/btrfs/inode.c
@@ -2480,7 +2480,7 @@ static void btrfs_read_locked_inode(stru
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
inode->i_mapping->a_ops = &btrfs_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
@@ -2495,7 +2495,7 @@ static void btrfs_read_locked_inode(stru
case S_IFLNK:
inode->i_op = &btrfs_symlink_inode_operations;
inode->i_mapping->a_ops = &btrfs_symlink_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
break;
default:
inode->i_op = &btrfs_special_inode_operations;
@@ -4705,7 +4705,7 @@ static int btrfs_create(struct inode *di
drop_inode = 1;
else {
inode->i_mapping->a_ops = &btrfs_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
@@ -6700,7 +6700,7 @@ static int btrfs_symlink(struct inode *d
drop_inode = 1;
else {
inode->i_mapping->a_ops = &btrfs_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
@@ -6740,7 +6740,7 @@ static int btrfs_symlink(struct inode *d

inode->i_op = &btrfs_symlink_inode_operations;
inode->i_mapping->a_ops = &btrfs_symlink_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
inode_set_bytes(inode, name_len);
btrfs_i_size_write(inode, name_len - 1);
err = btrfs_update_inode(trans, root, inode);
Index: linux-2.6/fs/ceph/inode.c
===================================================================
--- linux-2.6.orig/fs/ceph/inode.c
+++ linux-2.6/fs/ceph/inode.c
@@ -623,8 +623,8 @@ static int fill_inode(struct inode *inod
}

inode->i_mapping->a_ops = &ceph_aops;
- inode->i_mapping->backing_dev_info =
- &ceph_sb_to_client(inode->i_sb)->backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping,
+ &ceph_sb_to_client(inode->i_sb)->backing_dev_info);

switch (inode->i_mode & S_IFMT) {
case S_IFIFO:
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -136,7 +136,8 @@ struct inode * configfs_new_inode(mode_t
struct inode * inode = new_inode(configfs_sb);
if (inode) {
inode->i_mapping->a_ops = &configfs_aops;
- inode->i_mapping->backing_dev_info = &configfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping,
+ &configfs_backing_dev_info);
inode->i_op = &configfs_inode_operations;

if (sd->s_iattr) {
Index: linux-2.6/fs/gfs2/glock.c
===================================================================
--- linux-2.6.orig/fs/gfs2/glock.c
+++ linux-2.6/fs/gfs2/glock.c
@@ -8,6 +8,7 @@
*/

#include <linux/sched.h>
+#include <linux/backing-dev.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/buffer_head.h>
@@ -796,7 +797,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp,
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
mapping->assoc_mapping = NULL;
- mapping->backing_dev_info = s->s_bdi;
+ mapping_new_set_bdi(mapping, s->s_bdi);
mapping->writeback_index = 0;
}

Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -476,7 +476,8 @@ static struct inode *hugetlbfs_get_inode
inode->i_uid = uid;
inode->i_gid = gid;
inode->i_mapping->a_ops = &hugetlbfs_aops;
- inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping,
+ &hugetlbfs_backing_dev_info);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
INIT_LIST_HEAD(&inode->i_mapping->private_list);
info = HUGETLBFS_I(inode);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -237,7 +237,7 @@ int inode_init_always(struct super_block
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
mapping->assoc_mapping = NULL;
- mapping->backing_dev_info = &default_backing_dev_info;
+ mapping_new_set_bdi(mapping, &default_backing_dev_info);
mapping->writeback_index = 0;

/*
@@ -248,8 +248,8 @@ int inode_init_always(struct super_block
if (sb->s_bdev) {
struct backing_dev_info *bdi;

- bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
- mapping->backing_dev_info = bdi;
+ bdi = sb->s_bdev->bd_inode->i_mapping->a_bdi;
+ mapping_new_set_bdi(mapping, bdi);
}
inode->i_private = NULL;
inode->i_mapping = mapping;
Index: linux-2.6/fs/logfs/inode.c
===================================================================
--- linux-2.6.orig/fs/logfs/inode.c
+++ linux-2.6/fs/logfs/inode.c
@@ -258,7 +258,7 @@ struct inode *logfs_new_meta_inode(struc
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
mapping->assoc_mapping = NULL;
- mapping->backing_dev_info = &default_backing_dev_info;
+ mapping_new_set_bdi(mapping, &default_backing_dev_info);
inode->i_mapping = mapping;
inode->i_nlink = 1;
}
Index: linux-2.6/fs/nilfs2/btnode.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/btnode.c
+++ linux-2.6/fs/nilfs2/btnode.c
@@ -59,7 +59,7 @@ void nilfs_btnode_cache_init(struct addr
btnc->flags = 0;
mapping_set_gfp_mask(btnc, GFP_NOFS);
btnc->assoc_mapping = NULL;
- btnc->backing_dev_info = bdi;
+ mapping_new_set_bdi(btnc, bdi);
btnc->a_ops = &def_btnode_aops;
}

Index: linux-2.6/fs/nilfs2/mdt.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/mdt.c
+++ linux-2.6/fs/nilfs2/mdt.c
@@ -516,7 +516,7 @@ nilfs_mdt_new_common(struct the_nilfs *n
mapping->flags = 0;
mapping_set_gfp_mask(mapping, gfp_mask);
mapping->assoc_mapping = NULL;
- mapping->backing_dev_info = nilfs->ns_bdi;
+ mapping_new_set_bdi(mapping, nilfs->ns_bdi);

inode->i_mapping = mapping;
}
Index: linux-2.6/fs/ocfs2/dlmfs/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlmfs/dlmfs.c
+++ linux-2.6/fs/ocfs2/dlmfs/dlmfs.c
@@ -403,7 +403,7 @@ static struct inode *dlmfs_get_root_inod
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
- inode->i_mapping->backing_dev_info = &dlmfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &dlmfs_backing_dev_info);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
inc_nlink(inode);

@@ -428,7 +428,7 @@ static struct inode *dlmfs_get_inode(str
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
- inode->i_mapping->backing_dev_info = &dlmfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &dlmfs_backing_dev_info);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;

ip = DLMFS_I(inode);
Index: linux-2.6/fs/ramfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ramfs/inode.c
+++ linux-2.6/fs/ramfs/inode.c
@@ -60,7 +60,7 @@ struct inode *ramfs_get_inode(struct sup
if (inode) {
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
- inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &ramfs_backing_dev_info);
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
mapping_set_unevictable(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
Index: linux-2.6/fs/sysfs/inode.c
===================================================================
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -251,7 +251,7 @@ static void sysfs_init_inode(struct sysf

inode->i_private = sysfs_get(sd);
inode->i_mapping->a_ops = &sysfs_aops;
- inode->i_mapping->backing_dev_info = &sysfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &sysfs_backing_dev_info);
inode->i_op = &sysfs_inode_operations;

set_default_inode_attr(inode, sd->s_mode);
Index: linux-2.6/fs/ubifs/dir.c
===================================================================
--- linux-2.6.orig/fs/ubifs/dir.c
+++ linux-2.6/fs/ubifs/dir.c
@@ -109,7 +109,7 @@ struct inode *ubifs_new_inode(struct ubi
ubifs_current_time(inode);
inode->i_mapping->nrpages = 0;
/* Disable readahead */
- inode->i_mapping->backing_dev_info = &c->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &c->bdi);

switch (mode & S_IFMT) {
case S_IFREG:
Index: linux-2.6/fs/ubifs/super.c
===================================================================
--- linux-2.6.orig/fs/ubifs/super.c
+++ linux-2.6/fs/ubifs/super.c
@@ -157,7 +157,7 @@ struct inode *ubifs_iget(struct super_bl
goto out_invalid;

/* Disable read-ahead */
- inode->i_mapping->backing_dev_info = &c->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &c->bdi);

switch (inode->i_mode & S_IFMT) {
case S_IFREG:
Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
@@ -636,7 +636,7 @@ xfs_buf_readahead(
{
struct backing_dev_info *bdi;

- bdi = target->bt_mapping->backing_dev_info;
+ bdi = target->bt_mapping->a_bdi;
if (bdi_read_congested(bdi))
return;

@@ -1610,7 +1610,7 @@ xfs_mapping_buftarg(
bdi = &default_backing_dev_info;
mapping = &inode->i_data;
mapping->a_ops = &mapping_aops;
- mapping->backing_dev_info = bdi;
+ mapping_new_set_bdi(mapping, bdi);
mapping_set_gfp_mask(mapping, GFP_NOFS);
btp->bt_mapping = mapping;
return 0;
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -782,7 +782,7 @@ static struct inode *cgroup_new_inode(mo
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
- inode->i_mapping->backing_dev_info = &cgroup_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &cgroup_backing_dev_info);
}
return inode;
}
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1561,7 +1561,7 @@ static struct inode *shmem_get_inode(str
if (inode) {
inode_init_owner(inode, dir, mode);
inode->i_blocks = 0;
- inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &shmem_backing_dev_info);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
inode->i_generation = get_seconds();
info = SHMEM_I(inode);
Index: linux-2.6/fs/afs/write.c
===================================================================
--- linux-2.6.orig/fs/afs/write.c
+++ linux-2.6/fs/afs/write.c
@@ -438,7 +438,7 @@ no_more:
*/
int afs_writepage(struct page *page, struct writeback_control *wbc)
{
- struct backing_dev_info *bdi = page->mapping->backing_dev_info;
+ struct backing_dev_info *bdi = page->mapping->a_bdi;
struct afs_writeback *wb;
int ret;

@@ -469,7 +469,7 @@ static int afs_writepages_region(struct
struct writeback_control *wbc,
pgoff_t index, pgoff_t end, pgoff_t *_next)
{
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
struct afs_writeback *wb;
struct page *page;
int ret, n;
@@ -548,7 +548,7 @@ static int afs_writepages_region(struct
int afs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
pgoff_t start, end, next;
int ret;

@@ -680,7 +680,7 @@ int afs_writeback_all(struct afs_vnode *
{
struct address_space *mapping = vnode->vfs_inode.i_mapping;
struct writeback_control wbc = {
- .bdi = mapping->backing_dev_info,
+ .bdi = mapping->a_bdi,
.sync_mode = WB_SYNC_ALL,
.nr_to_write = LONG_MAX,
.range_cyclic = 1,
Index: linux-2.6/fs/btrfs/extent_io.c
===================================================================
--- linux-2.6.orig/fs/btrfs/extent_io.c
+++ linux-2.6/fs/btrfs/extent_io.c
@@ -2628,7 +2628,7 @@ int extent_write_locked_range(struct ext
.sync_io = mode == WB_SYNC_ALL,
};
struct writeback_control wbc_writepages = {
- .bdi = inode->i_mapping->backing_dev_info,
+ .bdi = inode->i_mapping->a_bdi,
.sync_mode = mode,
.older_than_this = NULL,
.nr_to_write = nr_pages * 2,
Index: linux-2.6/fs/btrfs/file.c
===================================================================
--- linux-2.6.orig/fs/btrfs/file.c
+++ linux-2.6/fs/btrfs/file.c
@@ -872,7 +872,7 @@ static ssize_t btrfs_file_aio_write(stru
goto out;
count = ocount;

- current->backing_dev_info = inode->i_mapping->backing_dev_info;
+ current->backing_dev_info = inode->i_mapping->a_bdi;
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
goto out;
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -3264,7 +3264,7 @@ void block_sync_page(struct page *page)
smp_mb();
mapping = page_mapping(page);
if (mapping)
- blk_run_backing_dev(mapping->backing_dev_info, page);
+ blk_run_backing_dev(mapping->a_bdi, page);
}
EXPORT_SYMBOL(block_sync_page);

Index: linux-2.6/mm/fadvise.c
===================================================================
--- linux-2.6.orig/mm/fadvise.c
+++ linux-2.6/mm/fadvise.c
@@ -72,7 +72,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
else
endbyte--; /* inclusive */

- bdi = mapping->backing_dev_info;
+ bdi = mapping->a_bdi;

switch (advice) {
case POSIX_FADV_NORMAL:
@@ -116,7 +116,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
- if (!bdi_write_congested(mapping->backing_dev_info))
+ if (!bdi_write_congested(mapping->a_bdi))
filemap_flush(mapping);

/* First and last FULL page! */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -136,7 +136,7 @@ void __remove_from_page_cache(struct pag
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
}
}

@@ -2375,7 +2375,7 @@ ssize_t __generic_file_aio_write(struct
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

/* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;
written = 0;

err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -409,7 +409,7 @@ xip_file_write(struct file *filp, const
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

/* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;

ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
if (ret)
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -491,7 +491,7 @@ static void balance_dirty_pages(struct a
unsigned long pages_written = 0;
unsigned long pause = 1;

- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;

for (;;) {
struct writeback_control wbc = {
@@ -633,7 +633,7 @@ void balance_dirty_pages_ratelimited_nr(
unsigned long *p;

ratelimit = ratelimit_pages;
- if (mapping->backing_dev_info->dirty_exceeded)
+ if (mapping->a_bdi->dirty_exceeded)
ratelimit = 8;

/*
@@ -1093,7 +1093,7 @@ void account_page_dirtied(struct page *p
{
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
- __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ __inc_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
task_dirty_inc(current);
task_io_account_write(PAGE_CACHE_SIZE);
}
@@ -1268,8 +1268,7 @@ int clear_page_dirty_for_io(struct page
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
return 1;
}
return 0;
@@ -1284,7 +1283,7 @@ int test_clear_page_writeback(struct pag
int ret;

if (mapping) {
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
unsigned long flags;

spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -1313,7 +1312,7 @@ int test_set_page_writeback(struct page
int ret;

if (mapping) {
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
unsigned long flags;

spin_lock_irqsave(&mapping->tree_lock, flags);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -25,7 +25,7 @@
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
- ra->ra_pages = mapping->backing_dev_info->ra_pages;
+ ra->ra_pages = mapping->a_bdi->ra_pages;
ra->prev_pos = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);
@@ -549,7 +549,7 @@ page_cache_async_readahead(struct addres
/*
* Defer asynchronous read-ahead on IO congestion.
*/
- if (bdi_read_congested(mapping->backing_dev_info))
+ if (bdi_read_congested(mapping->a_bdi))
return;

/* do read-ahead */
@@ -564,7 +564,7 @@ page_cache_async_readahead(struct addres
* explicitly kick off the IO.
*/
if (PageUptodate(page))
- blk_run_backing_dev(mapping->backing_dev_info, NULL);
+ blk_run_backing_dev(mapping->a_bdi, NULL);
#endif
}
EXPORT_SYMBOL_GPL(page_cache_async_readahead);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -501,7 +501,7 @@ void __init swap_setup(void)
unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);

#ifdef CONFIG_SWAP
- bdi_init(swapper_space.backing_dev_info);
+ bdi_init(swapper_space.a_bdi);
#endif

/* Use a smaller cluster for small-memory machines */
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -45,7 +45,7 @@ struct address_space swapper_space = {
.tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
- .backing_dev_info = &swap_backing_dev_info,
+ .a_bdi = &swap_backing_dev_info,
};

#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -116,7 +116,7 @@ void swap_unplug_io_fn(struct backing_de
*/
WARN_ON(page_count(page) <= 1);

- bdi = bdev->bd_inode->i_mapping->backing_dev_info;
+ bdi = bdev->bd_inode->i_mapping->a_bdi;
blk_run_backing_dev(bdi, page);
}
up_read(&swap_unplug_sem);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -75,8 +75,7 @@ void cancel_dirty_page(struct page *page
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
if (account_size)
task_io_account_cancelled_write(account_size);
}
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -361,7 +361,7 @@ static pageout_t pageout(struct page *pa
}
if (mapping->a_ops->writepage == NULL)
return PAGE_ACTIVATE;
- if (!may_write_to_queue(mapping->backing_dev_info))
+ if (!may_write_to_queue(mapping->a_bdi))
return PAGE_KEEP;

if (clear_page_dirty_for_io(page)) {
Index: linux-2.6/fs/ext2/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext2/ialloc.c
+++ linux-2.6/fs/ext2/ialloc.c
@@ -177,7 +177,7 @@ static void ext2_preread_inode(struct in
struct ext2_group_desc * gdp;
struct backing_dev_info *bdi;

- bdi = inode->i_mapping->backing_dev_info;
+ bdi = inode->i_mapping->a_bdi;
if (bdi_read_congested(bdi))
return;
if (bdi_write_congested(bdi))
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -28,7 +28,7 @@
#include <linux/buffer_head.h>
#include "internal.h"

-#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
+#define inode_to_bdi(inode) ((inode)->i_mapping->a_bdi)

/*
* We don't actually have pdflush, but this one is exported though /proc...
Index: linux-2.6/drivers/mtd/mtdchar.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdchar.c
+++ linux-2.6/drivers/mtd/mtdchar.c
@@ -99,7 +99,7 @@ static int mtd_open(struct inode *inode,
if (mtd_ino->i_state & I_NEW) {
mtd_ino->i_private = mtd;
mtd_ino->i_mode = S_IFCHR;
- mtd_ino->i_data.backing_dev_info = mtd->backing_dev_info;
+ mapping_new_set_bdi(&mtd_ino->i_data, mtd->backing_dev_info);
unlock_new_inode(mtd_ino);
}
file->f_mapping = mtd_ino->i_mapping;
Index: linux-2.6/fs/ceph/addr.c
===================================================================
--- linux-2.6.orig/fs/ceph/addr.c
+++ linux-2.6/fs/ceph/addr.c
@@ -108,8 +108,7 @@ static int ceph_set_page_dirty(struct pa

if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
- __inc_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ __inc_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
@@ -593,7 +592,7 @@ static int ceph_writepages_start(struct
struct writeback_control *wbc)
{
struct inode *inode = mapping->host;
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_client *client;
pgoff_t index, start, end;
Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c
+++ linux-2.6/fs/cifs/file.c
@@ -1365,7 +1365,7 @@ static int cifs_partialpagewrite(struct
static int cifs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
unsigned int bytes_to_write;
unsigned int bytes_written;
struct cifs_sb_info *cifs_sb;
Index: linux-2.6/fs/fuse/file.c
===================================================================
--- linux-2.6.orig/fs/fuse/file.c
+++ linux-2.6/fs/fuse/file.c
@@ -945,7 +945,7 @@ static ssize_t fuse_file_aio_write(struc
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

/* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;

err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
@@ -1133,7 +1133,7 @@ static void fuse_writepage_finish(struct
{
struct inode *inode = req->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
- struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+ struct backing_dev_info *bdi = inode->i_mapping->a_bdi;

list_del(&req->writepages_entry);
dec_bdi_stat(bdi, BDI_WRITEBACK);
@@ -1247,7 +1247,7 @@ static int fuse_writepage_locked(struct
req->end = fuse_writepage_end;
req->inode = inode;

- inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ inc_bdi_stat(mapping->a_bdi, BDI_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
end_page_writeback(page);

Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -254,7 +254,7 @@ struct inode *fuse_iget(struct super_blo
if ((inode->i_state & I_NEW)) {
inode->i_flags |= S_NOATIME|S_NOCMTIME;
inode->i_generation = generation;
- inode->i_data.backing_dev_info = &fc->bdi;
+ mapping_new_set_bdi(&inode->i_data, &fc->bdi);
fuse_init_inode(inode, attr);
unlock_new_inode(inode);
} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -280,7 +280,7 @@ nfs_fhget(struct super_block *sb, struct
if (S_ISREG(inode->i_mode)) {
inode->i_fop = &nfs_file_operations;
inode->i_data.a_ops = &nfs_file_aops;
- inode->i_data.backing_dev_info = &NFS_SB(sb)->backing_dev_info;
+ mapping_new_set_bdi(&inode->i_data, &NFS_SB(sb)->backing_dev_info);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
inode->i_fop = &nfs_dir_operations;
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -444,7 +444,7 @@ nfs_mark_request_commit(struct nfs_page
nfsi->ncommit++;
spin_unlock(&inode->i_lock);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+ inc_bdi_stat(req->wb_page->mapping->a_bdi, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}

@@ -455,7 +455,7 @@ nfs_clear_request_commit(struct nfs_page

if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
dec_zone_page_state(page, NR_UNSTABLE_NFS);
- dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+ dec_bdi_stat(page->mapping->a_bdi, BDI_RECLAIMABLE);
return 1;
}
return 0;
@@ -1307,8 +1307,7 @@ nfs_commit_list(struct inode *inode, str
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_bdi_stat(req->wb_page->mapping->a_bdi, BDI_RECLAIMABLE);
nfs_clear_page_tag_locked(req);
}
nfs_commit_clear_lock(NFS_I(inode));
Index: linux-2.6/fs/nilfs2/the_nilfs.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/the_nilfs.c
+++ linux-2.6/fs/nilfs2/the_nilfs.c
@@ -613,7 +613,7 @@ int init_nilfs(struct the_nilfs *nilfs,

nilfs->ns_mount_state = le16_to_cpu(sbp->s_state);

- bdi = nilfs->ns_bdev->bd_inode->i_mapping->backing_dev_info;
+ bdi = nilfs->ns_bdev->bd_inode->i_mapping->a_bdi;
nilfs->ns_bdi = bdi ? : &default_backing_dev_info;

/* Finding last segment */
Index: linux-2.6/fs/ntfs/file.c
===================================================================
--- linux-2.6.orig/fs/ntfs/file.c
+++ linux-2.6/fs/ntfs/file.c
@@ -2088,7 +2088,7 @@ static ssize_t ntfs_file_aio_write_noloc
pos = *ppos;
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim. */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;
written = 0;
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c
+++ linux-2.6/fs/ocfs2/file.c
@@ -2129,7 +2129,7 @@ relock:
goto out_dio;
}
} else {
- current->backing_dev_info = file->f_mapping->backing_dev_info;
+ current->backing_dev_info = file->f_mapping->a_bdi;
written = generic_file_buffered_write(iocb, iov, nr_segs, *ppos,
ppos, count, 0);
current->backing_dev_info = NULL;
Index: linux-2.6/fs/romfs/super.c
===================================================================
--- linux-2.6.orig/fs/romfs/super.c
+++ linux-2.6/fs/romfs/super.c
@@ -356,8 +356,8 @@ static struct inode *romfs_iget(struct s
i->i_fop = &romfs_ro_fops;
i->i_data.a_ops = &romfs_aops;
if (i->i_sb->s_mtd)
- i->i_data.backing_dev_info =
- i->i_sb->s_mtd->backing_dev_info;
+ mapping_new_set_bdi(&i->i_data,
+ i->i_sb->s_mtd->backing_dev_info);
if (nextfh & ROMFH_EXEC)
mode |= S_IXUGO;
break;
Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c
@@ -756,7 +756,7 @@ start:
goto out_unlock_internal;

/* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;

if ((ioflags & IO_ISDIRECT)) {
if (mapping->nrpages) {


n***@suse.de
2010-06-24 03:02:56 UTC
Permalink
Signed-off-by: Nick Piggin <***@suse.de>
---
fs/drop_caches.c | 37 ++++++----
fs/fs-writeback.c | 82 +++++++++++++-----------
fs/inode.c | 128 +++++++++++++++++++++++++------------
fs/notify/inode_mark.c | 108 +++++++++++++++++--------------
fs/notify/inotify/inotify.c | 132 ++++++++++++++++++++-------------------
fs/quota/dquot.c | 98 +++++++++++++++++-----------
fs/super.c | 19 +++++
include/linux/fs.h | 10 ++
include/linux/fsnotify_backend.h | 4 -
include/linux/inotify.h | 4 -
10 files changed, 373 insertions(+), 249 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -26,10 +26,11 @@
#include <linux/async.h>
#include <linux/posix_acl.h>
#include <linux/bit_spinlock.h>
+#include <linux/lglock.h>

/*
* Usage:
- * sb_inode_list_lock protects:
+ * inode_list_lglock protects:
* s_inodes, i_sb_list
* inode_hash_bucket lock protects:
* inode hash table, i_hash
@@ -45,7 +46,7 @@
* Ordering:
* inode_lock
* inode->i_lock
- * sb_inode_list_lock
+ * inode_list_lglock
* wb_inode_list_lock
* inode_hash_bucket lock
*/
@@ -120,7 +121,9 @@ static struct inode_hash_bucket *inode_h
* NOTE! You also have to own the lock if you change
* the i_state of an inode while it is in use..
*/
-DEFINE_SPINLOCK(sb_inode_list_lock);
+DECLARE_LGLOCK(inode_list_lglock);
+DEFINE_LGLOCK(inode_list_lglock);
+
DEFINE_SPINLOCK(wb_inode_list_lock);

/*
@@ -382,6 +385,8 @@ void clear_inode(struct inode *inode)
}
EXPORT_SYMBOL(clear_inode);

+static void inode_sb_list_del(struct inode *inode);
+
/*
* dispose_list - dispose of the contents of a local list
* @head: the head of the list to free
@@ -405,9 +410,7 @@ static void dispose_list(struct list_hea

spin_lock(&inode->i_lock);
__remove_inode_hash(inode);
- spin_lock(&sb_inode_list_lock);
- list_del_rcu(&inode->i_sb_list);
- spin_unlock(&sb_inode_list_lock);
+ inode_sb_list_del(inode);
spin_unlock(&inode->i_lock);

wake_up_inode(inode);
@@ -419,20 +422,12 @@ static void dispose_list(struct list_hea
/*
* Invalidate all inodes for a device.
*/
-static int invalidate_list(struct list_head *head, struct list_head *dispose)
+static int invalidate_sb_inodes(struct super_block *sb, struct list_head *dispose)
{
- struct list_head *next;
+ struct inode *inode;
int busy = 0;

- next = head->next;
- for (;;) {
- struct list_head *tmp = next;
- struct inode *inode;
-
- next = next->next;
- if (tmp == head)
- break;
- inode = list_entry(tmp, struct inode, i_sb_list);
+ do_inode_list_for_each_entry_rcu(sb, inode) {
spin_lock(&inode->i_lock);
if (inode->i_state & I_NEW) {
spin_unlock(&inode->i_lock);
@@ -452,7 +447,8 @@ static int invalidate_list(struct list_h
}
spin_unlock(&inode->i_lock);
busy = 1;
- }
+ } while_inode_list_for_each_entry_rcu
+
return busy;
}

@@ -476,9 +472,9 @@ int invalidate_inodes(struct super_block
*/
down_write(&iprune_sem);
// spin_lock(&sb_inode_list_lock); XXX: is this safe?
- inotify_unmount_inodes(&sb->s_inodes);
- fsnotify_unmount_inodes(&sb->s_inodes);
- busy = invalidate_list(&sb->s_inodes, &throw_away);
+ inotify_unmount_inodes(sb);
+ fsnotify_unmount_inodes(sb);
+ busy = invalidate_sb_inodes(sb, &throw_away);
// spin_unlock(&sb_inode_list_lock);

dispose_list(&throw_away);
@@ -675,13 +671,63 @@ static unsigned long hash(struct super_b
return tmp & I_HASHMASK;
}

+static inline int inode_list_cpu(struct inode *inode)
+{
+#ifdef CONFIG_SMP
+ return inode->i_sb_list_cpu;
+#else
+ return smp_processor_id();
+#endif
+}
+
+/* helper for inode_sb_list_add to reduce ifdefs */
+static inline void __inode_sb_list_add(struct inode *inode, struct super_block *sb)
+{
+ struct list_head *list;
+#ifdef CONFIG_SMP
+ int cpu;
+ cpu = smp_processor_id();
+ inode->i_sb_list_cpu = cpu;
+ list = per_cpu_ptr(sb->s_inodes, cpu);
+#else
+ list = &sb->s_inodes;
+#endif
+ list_add_rcu(&inode->i_sb_list, list);
+}
+
+/**
+ * inode_sb_list_add - add an inode to the sb's inode list
+ * @inode: inode to add
+ * @sb: sb to add it to
+ *
+ * Use this function to associate an inode with the superblock it belongs to.
+ */
+static void inode_sb_list_add(struct inode *inode, struct super_block *sb)
+{
+ lg_local_lock(inode_list_lglock);
+ __inode_sb_list_add(inode, sb);
+ lg_local_unlock(inode_list_lglock);
+}
+
+/**
+ * inode_sb_list_del - remove an inode from its sb's inode list
+ * @inode: inode to remove
+ *
+ * Use this function to remove an inode from the per-cpu list it was added
+ * to; the cpu recorded at add time selects which lglock cpu to lock.
+ */
+static void inode_sb_list_del(struct inode *inode)
+{
+ lg_local_lock_cpu(inode_list_lglock, inode_list_cpu(inode));
+ list_del_rcu(&inode->i_sb_list);
+ lg_local_unlock_cpu(inode_list_lglock, inode_list_cpu(inode));
+}
+
static inline void
__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
struct inode *inode)
{
- spin_lock(&sb_inode_list_lock);
- list_add_rcu(&inode->i_sb_list, &sb->s_inodes);
- spin_unlock(&sb_inode_list_lock);
+ inode_sb_list_add(inode, sb);
percpu_counter_inc(&nr_inodes);
if (b) {
spin_lock_bucket(b);
@@ -1221,6 +1267,7 @@ repeat:
continue;
if (!spin_trylock(&old->i_lock)) {
spin_unlock_bucket(b);
+ cpu_relax();
goto repeat;
}
goto found_old;
@@ -1266,7 +1313,6 @@ repeat:
if (!spin_trylock(&old->i_lock)) {
spin_unlock_bucket(b);
cpu_relax();
- cpu_relax();
goto repeat;
}
goto found_old;
@@ -1361,9 +1407,7 @@ void generic_delete_inode(struct inode *
inodes_stat.nr_unused--;
spin_unlock(&wb_inode_list_lock);
}
- spin_lock(&sb_inode_list_lock);
- list_del_rcu(&inode->i_sb_list);
- spin_unlock(&sb_inode_list_lock);
+ inode_sb_list_del(inode);
percpu_counter_dec(&nr_inodes);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
@@ -1437,9 +1481,7 @@ int generic_detach_inode(struct inode *i
inodes_stat.nr_unused--;
spin_unlock(&wb_inode_list_lock);
}
- spin_lock(&sb_inode_list_lock);
- list_del_rcu(&inode->i_sb_list);
- spin_unlock(&sb_inode_list_lock);
+ inode_sb_list_del(inode);
percpu_counter_dec(&nr_inodes);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
@@ -1759,6 +1801,8 @@ void __init inode_init(void)
init_once);
register_shrinker(&icache_shrinker);

+ lg_lock_init(inode_list_lglock);
+
/* Hash may have been set up in inode_init_early */
if (!hashdist)
return;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -733,6 +733,9 @@ struct inode {
struct rcu_head i_rcu;
};
unsigned long i_ino;
+#ifdef CONFIG_SMP
+ int i_sb_list_cpu;
+#endif
unsigned int i_count;
unsigned int i_nlink;
uid_t i_uid;
@@ -1345,11 +1348,12 @@ struct super_block {
#endif
const struct xattr_handler **s_xattr;

- struct list_head s_inodes; /* all inodes */
struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */
#ifdef CONFIG_SMP
+ struct list_head __percpu *s_inodes;
struct list_head __percpu *s_files;
#else
+ struct list_head s_inodes; /* all inodes */
struct list_head s_files;
#endif
/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
@@ -2194,6 +2198,58 @@ static inline void insert_inode_hash(str

extern void file_sb_list_add(struct file *f, struct super_block *sb);
extern void file_sb_list_del(struct file *f);
+#ifdef CONFIG_SMP
+
+/*
+ * These macros iterate all inodes on all CPUs for a given superblock.
+ * rcu_read_lock must be held.
+ */
+#define do_inode_list_for_each_entry_rcu(__sb, __inode) \
+{ \
+ int i; \
+ for_each_possible_cpu(i) { \
+ struct list_head *list; \
+ list = per_cpu_ptr((__sb)->s_inodes, i); \
+ list_for_each_entry_rcu((__inode), list, i_sb_list)
+
+#define while_inode_list_for_each_entry_rcu \
+ } \
+}
+
+#define do_inode_list_for_each_entry_safe(__sb, __inode, __tmp) \
+{ \
+ int i; \
+ for_each_possible_cpu(i) { \
+ struct list_head *list; \
+ list = per_cpu_ptr((__sb)->s_inodes, i); \
+ list_for_each_entry_safe((__inode), (__tmp), list, i_sb_list)
+
+#define while_inode_list_for_each_entry_safe \
+ } \
+}
+
+#else
+
+#define do_inode_list_for_each_entry_rcu(__sb, __inode) \
+{ \
+ struct list_head *list; \
+ list = &(__sb)->s_inodes; \
+ list_for_each_entry_rcu((__inode), list, i_sb_list)
+
+#define while_inode_list_for_each_entry_rcu \
+}
+
+#define do_inode_list_for_each_entry_safe(__sb, __inode, __tmp) \
+{ \
+ struct list_head *list; \
+ list = &(__sb)->s_inodes; \
+ list_for_each_entry_safe((__inode), (__tmp), list, i_sb_list)
+
+#define while_inode_list_for_each_entry_safe \
+}
+
+#endif
+
#ifdef CONFIG_BLOCK
struct bio;
extern void submit_bio(int, struct bio *);
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -68,12 +68,26 @@ static struct super_block *alloc_super(s
for_each_possible_cpu(i)
INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
}
+
+ s->s_inodes = alloc_percpu(struct list_head);
+ if (!s->s_inodes) {
+ free_percpu(s->s_files);
+ security_sb_free(s);
+ kfree(s);
+ s = NULL;
+ goto out;
+ } else {
+ int i;
+
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(per_cpu_ptr(s->s_inodes, i));
+ }
#else
INIT_LIST_HEAD(&s->s_files);
+ INIT_LIST_HEAD(&s->s_inodes);
#endif
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
- INIT_LIST_HEAD(&s->s_inodes);
INIT_LIST_HEAD(&s->s_dentry_lru);
init_rwsem(&s->s_umount);
mutex_init(&s->s_lock);
@@ -125,6 +139,7 @@ out:
static inline void destroy_super(struct super_block *s)
{
#ifdef CONFIG_SMP
+ free_percpu(s->s_inodes);
free_percpu(s->s_files);
#endif
security_sb_free(s);
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -17,7 +17,7 @@ static void drop_pagecache_sb(struct sup
struct inode *inode, *toput_inode = NULL;

rcu_read_lock();
- list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
+ do_inode_list_for_each_entry_rcu(sb, inode) {
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)
|| inode->i_mapping->nrpages == 0) {
@@ -31,7 +31,7 @@ static void drop_pagecache_sb(struct sup
iput(toput_inode);
toput_inode = inode;
rcu_read_lock();
- }
+ } while_inode_list_for_each_entry_rcu
rcu_read_unlock();
iput(toput_inode);
}
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -1198,17 +1198,17 @@ static void wait_sb_inodes(struct super_
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- /*
- * Data integrity sync. Must wait for all pages under writeback,
- * because there may have been pages dirtied before our sync
- * call, but which had writeout started before we write it out.
- * In which case, the inode may not be on the dirty list, but
- * we still have to wait for that writeout.
- */
rcu_read_lock();
- list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
+ do_inode_list_for_each_entry_rcu(sb, inode) {
struct address_space *mapping;

+ /*
+ * Data integrity sync. Must wait for all pages under writeback,
+ * because there may have been pages dirtied before our sync
+ * call, but which had writeout started before we write it out.
+ * In which case, the inode may not be on the dirty list, but
+ * we still have to wait for that writeout.
+ */
mapping = inode->i_mapping;
if (mapping->nrpages == 0)
continue;
@@ -1222,11 +1222,12 @@ static void wait_sb_inodes(struct super_
spin_unlock(&inode->i_lock);
rcu_read_unlock();
/*
- * We hold a reference to 'inode' so it couldn't have been
- * removed from s_inodes list while we dropped the i_lock. We
- * cannot iput the inode now as we can be holding the last
- * reference and we cannot iput it under spinlock. So we keep
- * the reference and iput it later.
+ * We hold a reference to 'inode' so it couldn't have
+ * been removed from s_inodes list while we dropped the
+ * i_lock. We cannot iput the inode now as we can be
+ * holding the last reference and we cannot iput it
+ * under spinlock. So we keep the reference and iput it
+ * later.
*/
iput(old_inode);
old_inode = inode;
@@ -1236,7 +1237,7 @@ static void wait_sb_inodes(struct super_
cond_resched();

rcu_read_lock();
- }
+ } while_inode_list_for_each_entry_rcu
rcu_read_unlock();
iput(old_inode);
}
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c
+++ linux-2.6/fs/notify/inode_mark.c
@@ -361,11 +361,11 @@ int fsnotify_add_mark(struct fsnotify_ma
* of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
* We temporarily drop inode_lock, however, and CAN block.
*/
-void fsnotify_unmount_inodes(struct list_head *list)
+void fsnotify_unmount_inodes(struct super_block *sb)
{
struct inode *inode, *next_i, *need_iput = NULL;

- list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+ do_inode_list_for_each_entry_safe(sb, inode, next_i) {
struct inode *need_iput_tmp;

spin_lock(&inode->i_lock);
@@ -421,5 +421,5 @@ void fsnotify_unmount_inodes(struct list
fsnotify_inode_delete(inode);

iput(inode);
- }
+ } while_inode_list_for_each_entry_safe
}
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -381,11 +381,11 @@ EXPORT_SYMBOL_GPL(inotify_get_cookie);
* of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
* We temporarily drop inode_lock, however, and CAN block.
*/
-void inotify_unmount_inodes(struct list_head *list)
+void inotify_unmount_inodes(struct super_block *sb)
{
struct inode *inode, *next_i, *need_iput = NULL;

- list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+ do_inode_list_for_each_entry_safe(sb, inode, next_i) {
struct inotify_watch *watch, *next_w;
struct inode *need_iput_tmp;
struct list_head *watches;
@@ -450,8 +450,8 @@ void inotify_unmount_inodes(struct list_
put_inotify_watch(watch);
}
mutex_unlock(&inode->inotify_mutex);
- iput(inode);
- }
+ iput(inode);
+ } while_inode_list_for_each_entry_safe
}
EXPORT_SYMBOL_GPL(inotify_unmount_inodes);

Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -884,16 +884,12 @@ static void add_dquot_ref(struct super_b
#endif

rcu_read_lock();
- list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
+ do_inode_list_for_each_entry_rcu(sb, inode) {
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
continue;
}
-#ifdef CONFIG_QUOTA_DEBUG
- if (unlikely(inode_get_rsv_space(inode) > 0))
- reserved = 1;
-#endif
if (!atomic_read(&inode->i_writecount)) {
spin_unlock(&inode->i_lock);
continue;
@@ -916,7 +912,7 @@ static void add_dquot_ref(struct super_b
* keep the reference and iput it later. */
old_inode = inode;
rcu_read_lock();
- }
+ } while_inode_list_for_each_entry_rcu
rcu_read_unlock();
iput(old_inode);

@@ -996,7 +992,7 @@ static void remove_dquot_ref(struct supe
struct inode *inode;

rcu_read_lock();
- list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
+ do_inode_list_for_each_entry_rcu(sb, inode) {
/*
* We have to scan also I_NEW inodes because they can already
* have quota pointer initialized. Luckily, we need to touch
@@ -1005,7 +1001,7 @@ static void remove_dquot_ref(struct supe
*/
if (!IS_NOQUOTA(inode))
remove_inode_dquot_ref(inode, type, tofree_head);
- }
+ } while_inode_list_for_each_entry_rcu
rcu_read_unlock();
}

Index: linux-2.6/include/linux/fsnotify_backend.h
===================================================================
--- linux-2.6.orig/include/linux/fsnotify_backend.h
+++ linux-2.6/include/linux/fsnotify_backend.h
@@ -344,7 +344,7 @@ extern void fsnotify_destroy_mark_by_ent
extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
extern void fsnotify_get_mark(struct fsnotify_mark_entry *entry);
extern void fsnotify_put_mark(struct fsnotify_mark_entry *entry);
-extern void fsnotify_unmount_inodes(struct list_head *list);
+extern void fsnotify_unmount_inodes(struct super_block *sb);

/* put here because inotify does some weird stuff when destroying watches */
extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u32 mask,
@@ -374,7 +374,7 @@ static inline u32 fsnotify_get_cookie(vo
return 0;
}

-static inline void fsnotify_unmount_inodes(struct list_head *list)
+static inline void fsnotify_unmount_inodes(struct super_block *sb)
{}

#endif /* CONFIG_FSNOTIFY */
Index: linux-2.6/include/linux/inotify.h
===================================================================
--- linux-2.6.orig/include/linux/inotify.h
+++ linux-2.6/include/linux/inotify.h
@@ -111,7 +111,7 @@ extern void inotify_inode_queue_event(st
const char *, struct inode *);
extern void inotify_dentry_parent_queue_event(struct dentry *, __u32, __u32,
const char *);
-extern void inotify_unmount_inodes(struct list_head *);
+extern void inotify_unmount_inodes(struct super_block *);
extern void inotify_inode_is_dead(struct inode *);
extern u32 inotify_get_cookie(void);

@@ -161,7 +161,7 @@ static inline void inotify_dentry_parent
{
}

-static inline void inotify_unmount_inodes(struct list_head *list)
+static inline void inotify_unmount_inodes(struct super_block *sb)
{
}

Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -9,7 +9,6 @@

struct backing_dev_info;

-extern spinlock_t sb_inode_list_lock;
extern spinlock_t wb_inode_list_lock;
extern struct list_head inode_unused;
n***@suse.de
2010-06-24 03:02:55 UTC
Permalink
From: Eric Dumazet <***@cosmosbay.com>

Avoid cache line ping-pong between cpus and prepare for the next patch,
since updates of nr_inodes no longer need inode_lock.
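
As a minimal sketch of the pattern (the names below are hypothetical, not
taken from the patch): writers only touch their own cpu's slot of the
counter, and the occasional reader pays for summing across cpus.

    /* Assumes percpu_counter_init(&nr_widgets, 0) has been called once. */
    static struct percpu_counter nr_widgets;

    static void widget_created(void)
    {
            percpu_counter_inc(&nr_widgets);    /* per-cpu, no shared line */
    }

    static s64 widgets_in_use(void)
    {
            return percpu_counter_sum_positive(&nr_widgets);    /* slow path */
    }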

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
Signed-off-by: Nick Piggin <***@suse.de>
---
fs/fs-writeback.c | 4 ++--
fs/inode.c | 31 ++++++++++++++++++++++++++++---
include/linux/fs.h | 5 ++++-
kernel/sysctl.c | 4 ++--
4 files changed, 36 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -904,7 +904,7 @@ static long wb_check_old_data_flush(stru
wb->last_old_flush = jiffies;
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- inodes_stat.nr_inodes - inodes_stat.nr_unused;
+ get_nr_inodes() - inodes_stat.nr_unused;

if (nr_pages) {
struct wb_writeback_args args = {
@@ -1257,7 +1257,7 @@ void writeback_inodes_sb(struct super_bl
long nr_to_write;

nr_to_write = nr_dirty + nr_unstable +
- inodes_stat.nr_inodes - inodes_stat.nr_unused;
+ get_nr_inodes() - inodes_stat.nr_unused;

bdi_start_writeback(sb->s_bdi, sb, nr_to_write);
}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -144,9 +144,33 @@ struct inodes_stat_t inodes_stat = {
.nr_inodes = 0,
.nr_unused = 0,
};
+struct percpu_counter nr_inodes;

static struct kmem_cache *inode_cachep __read_mostly;

+int get_nr_inodes(void)
+{
+ return percpu_counter_sum_positive(&nr_inodes);
+}
+
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ inodes_stat.nr_inodes = get_nr_inodes();
+ return proc_dointvec(table, write, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
+}
+#endif
+
static void wake_up_inode(struct inode *inode)
{
/*
@@ -657,8 +681,8 @@ __inode_add_to_lists(struct super_block
{
spin_lock(&sb_inode_list_lock);
list_add_rcu(&inode->i_sb_list, &sb->s_inodes);
- inodes_stat.nr_inodes++;
spin_unlock(&sb_inode_list_lock);
+ percpu_counter_inc(&nr_inodes);
if (b) {
spin_lock_bucket(b);
hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
@@ -1337,8 +1361,8 @@ void generic_delete_inode(struct inode *
}
spin_lock(&sb_inode_list_lock);
list_del_rcu(&inode->i_sb_list);
- inodes_stat.nr_inodes--;
spin_unlock(&sb_inode_list_lock);
+ percpu_counter_dec(&nr_inodes);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
@@ -1413,8 +1437,8 @@ int generic_detach_inode(struct inode *i
}
spin_lock(&sb_inode_list_lock);
list_del_rcu(&inode->i_sb_list);
- inodes_stat.nr_inodes--;
spin_unlock(&sb_inode_list_lock);
+ percpu_counter_dec(&nr_inodes);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
@@ -1723,6 +1747,7 @@ void __init inode_init(void)
{
int loop;

+ percpu_counter_init(&nr_inodes, 0);
/* inode slab cache */
inode_cachep = kmem_cache_create("inode_cache",
sizeof(struct inode),
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -406,6 +406,8 @@ extern struct files_stat_struct files_st
extern int get_max_files(void);
extern int sysctl_nr_open;
extern struct inodes_stat_t inodes_stat;
+extern struct percpu_counter nr_inodes;
+extern int get_nr_inodes(void);
extern int leases_enable, lease_break_time;
#ifdef CONFIG_DNOTIFY
extern int dir_notify_enable;
@@ -2514,7 +2516,8 @@ ssize_t simple_attr_write(struct file *f
struct ctl_table;
int proc_nr_files(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
-
+int proc_nr_inodes(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
int __init get_filesystem_list(char *buf);

#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -1338,14 +1338,14 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 2*sizeof(int),
.mode = 0444,
- .proc_handler = proc_dointvec,
+ .proc_handler = proc_nr_inodes,
},
{
.procname = "inode-state",
.data = &inodes_stat,
.maxlen = 7*sizeof(int),
.mode = 0444,
- .proc_handler = proc_dointvec,
+ .proc_handler = proc_nr_inodes,
},
{
.procname = "file-nr",
n***@suse.de
2010-06-24 03:02:40 UTC
Permalink
Protect i_state updates with i_lock

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/drop_caches.c | 9 ++++--
fs/fs-writeback.c | 37 +++++++++++++++++++-----
fs/inode.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++--------
fs/nilfs2/gcdat.c | 1
fs/quota/dquot.c | 14 +++++++--
5 files changed, 117 insertions(+), 25 deletions(-)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -19,11 +19,14 @@ static void drop_pagecache_sb(struct sup
spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
- continue;
- if (inode->i_mapping->nrpages == 0)
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)
+ || inode->i_mapping->nrpages == 0) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -398,10 +398,12 @@ static void inode_wait_for_writeback(str
wait_queue_head_t *wqh;

wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
- while (inode->i_state & I_SYNC) {
+ while (inode->i_state & I_SYNC) {
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
}
}

@@ -455,6 +457,7 @@ writeback_single_inode(struct inode *ino
/* Set I_SYNC, reset I_DIRTY_PAGES */
inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY_PAGES;
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);

ret = do_writepages(mapping, wbc);
@@ -476,8 +479,10 @@ writeback_single_inode(struct inode *ino
* write_inode()
*/
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY;
inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -487,6 +492,7 @@ writeback_single_inode(struct inode *ino
}

spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
@@ -630,7 +636,9 @@ static int writeback_sb_inodes(struct su
if (sb != inode->i_sb)
/* finish with this superblock */
return 0;
+ spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_WILL_FREE)) {
+ spin_unlock(&inode->i_lock);
requeue_io(inode);
continue;
}
@@ -638,8 +646,10 @@ static int writeback_sb_inodes(struct su
* Was this inode dirtied after sync_sb_inodes was called?
* This keeps sync from extra jobs and livelock.
*/
- if (inode_dirtied_after(inode, wbc->wb_start))
+ if (inode_dirtied_after(inode, wbc->wb_start)) {
+ spin_unlock(&inode->i_lock);
return 1;
+ }

BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
__iget(inode);
@@ -652,6 +662,7 @@ static int writeback_sb_inodes(struct su
*/
redirty_tail(inode);
}
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
iput(inode);
cond_resched();
@@ -1090,6 +1101,7 @@ void __mark_inode_dirty(struct inode *in
block_dump___mark_inode_dirty(inode);

spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;

@@ -1134,6 +1146,7 @@ void __mark_inode_dirty(struct inode *in
}
}
out:
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__mark_inode_dirty);
@@ -1178,12 +1191,17 @@ static void wait_sb_inodes(struct super_
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct address_space *mapping;

- if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
- continue;
mapping = inode->i_mapping;
if (mapping->nrpages == 0)
continue;
+
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
/*
@@ -1287,7 +1305,9 @@ int write_inode_now(struct inode *inode,

might_sleep();
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
ret = writeback_single_inode(inode, &wbc);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (sync)
inode_sync_wait(inode);
@@ -1311,7 +1331,9 @@ int sync_inode(struct inode *inode, stru
int ret;

spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
ret = writeback_single_inode(inode, wbc);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
return ret;
}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -32,10 +32,13 @@
* s_inodes, i_sb_list
* inode_hash_lock protects:
* inode hash table, i_hash
+ * inode->i_lock protects:
+ * i_state
*
* Ordering:
* inode_lock
* sb_inode_list_lock
+ * inode->i_lock
* inode_lock
* inode_hash_lock
*/
@@ -301,6 +304,8 @@ static void init_once(void *foo)
*/
void __iget(struct inode *inode)
{
+ assert_spin_locked(&inode->i_lock);
+
if (atomic_inc_return(&inode->i_count) != 1)
return;

@@ -401,16 +406,21 @@ static int invalidate_list(struct list_h
if (tmp == head)
break;
inode = list_entry(tmp, struct inode, i_sb_list);
- if (inode->i_state & I_NEW)
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & I_NEW) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
list_move(&inode->i_list, dispose);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ spin_unlock(&inode->i_lock);
count++;
continue;
}
+ spin_unlock(&inode->i_lock);
busy = 1;
}
/* only unused inodes may be cached with i_count zero */
@@ -490,12 +500,15 @@ static void prune_icache(int nr_to_scan)

inode = list_entry(inode_unused.prev, struct inode, i_list);

+ spin_lock(&inode->i_lock);
if (inode->i_state || atomic_read(&inode->i_count)) {
list_move(&inode->i_list, &inode_unused);
+ spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
@@ -506,12 +519,16 @@ static void prune_icache(int nr_to_scan)
if (inode != list_entry(inode_unused.next,
struct inode, i_list))
continue; /* wrong inode or list_empty */
- if (!can_unuse(inode))
+ spin_lock(&inode->i_lock);
+ if (!can_unuse(inode)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
}
list_move(&inode->i_list, &freeable);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ spin_unlock(&inode->i_lock);
nr_pruned++;
}
inodes_stat.nr_unused -= nr_pruned;
@@ -574,8 +591,14 @@ repeat:
hlist_for_each_entry(inode, node, head, i_hash) {
if (inode->i_sb != sb)
continue;
- if (!test(inode, data))
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&inode_hash_lock);
+ goto repeat;
+ }
+ if (!test(inode, data)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
@@ -604,6 +627,10 @@ repeat:
continue;
if (inode->i_sb != sb)
continue;
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&inode_hash_lock);
+ goto repeat;
+ }
if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
@@ -630,10 +657,10 @@ __inode_add_to_lists(struct super_block
struct inode *inode)
{
inodes_stat.nr_inodes++;
- list_add(&inode->i_list, &inode_in_use);
spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
+ list_add(&inode->i_list, &inode_in_use);
if (head) {
spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
@@ -690,9 +717,9 @@ struct inode *new_inode(struct super_blo
inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode_lock);
- __inode_add_to_lists(sb, NULL, inode);
inode->i_ino = ++last_ino;
inode->i_state = 0;
+ __inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode_lock);
}
return inode;
@@ -759,8 +786,8 @@ static struct inode *get_new_inode(struc
if (set(inode, data))
goto set_failed;

- __inode_add_to_lists(sb, head, inode);
inode->i_state = I_NEW;
+ __inode_add_to_lists(sb, head, inode);
spin_unlock(&inode_lock);

/* Return the locked inode with I_NEW set, the
@@ -775,6 +802,7 @@ static struct inode *get_new_inode(struc
* allocated.
*/
__iget(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -783,6 +811,7 @@ static struct inode *get_new_inode(struc
return inode;

set_failed:
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
return NULL;
@@ -806,8 +835,8 @@ static struct inode *get_new_inode_fast(
old = find_inode_fast(sb, head, ino);
if (!old) {
inode->i_ino = ino;
- __inode_add_to_lists(sb, head, inode);
inode->i_state = I_NEW;
+ __inode_add_to_lists(sb, head, inode);
spin_unlock(&inode_lock);

/* Return the locked inode with I_NEW set, the
@@ -822,6 +851,7 @@ static struct inode *get_new_inode_fast(
* allocated.
*/
__iget(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -863,6 +893,7 @@ ino_t iunique(struct super_block *sb, in
res = counter++;
head = inode_hashtable + hash(sb, res);
inode = find_inode_fast(sb, head, res);
+ spin_unlock(&inode->i_lock);
} while (inode != NULL);
spin_unlock(&inode_lock);

@@ -872,7 +903,10 @@ EXPORT_SYMBOL(iunique);

struct inode *igrab(struct inode *inode)
{
+ struct inode *ret = inode;
+
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
if (!(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)))
__iget(inode);
else
@@ -881,9 +915,11 @@ struct inode *igrab(struct inode *inode)
* called yet, and somebody is calling igrab
* while the inode is getting freed.
*/
- inode = NULL;
+ ret = NULL;
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
- return inode;
+
+ return ret;
}
EXPORT_SYMBOL(igrab);

@@ -916,6 +952,7 @@ static struct inode *ifind(struct super_
inode = find_inode(sb, head, test, data);
if (inode) {
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (likely(wait))
wait_on_inode(inode);
@@ -949,6 +986,7 @@ static struct inode *ifind_fast(struct s
inode = find_inode_fast(sb, head, ino);
if (inode) {
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(inode);
return inode;
@@ -1118,6 +1156,7 @@ int insert_inode_locked(struct inode *in
struct inode *old = NULL;

spin_lock(&inode_lock);
+repeat:
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
if (old->i_ino != ino)
@@ -1126,6 +1165,10 @@ int insert_inode_locked(struct inode *in
continue;
if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
continue;
+ if (!spin_trylock(&old->i_lock)) {
+ spin_unlock(&inode_hash_lock);
+ goto repeat;
+ }
break;
}
if (likely(!node)) {
@@ -1136,6 +1179,7 @@ int insert_inode_locked(struct inode *in
}
spin_unlock(&inode_hash_lock);
__iget(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1160,6 +1204,7 @@ int insert_inode_locked4(struct inode *i
struct inode *old = NULL;

spin_lock(&inode_lock);
+repeat:
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
if (old->i_sb != sb)
@@ -1168,6 +1213,10 @@ int insert_inode_locked4(struct inode *i
continue;
if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
continue;
+ if (!spin_trylock(&old->i_lock)) {
+ spin_unlock(&inode_hash_lock);
+ goto repeat;
+ }
break;
}
if (likely(!node)) {
@@ -1178,6 +1227,7 @@ int insert_inode_locked4(struct inode *i
}
spin_unlock(&inode_hash_lock);
__iget(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1240,12 +1290,14 @@ void generic_delete_inode(struct inode *
{
const struct super_operations *op = inode->i_sb->s_op;

- list_del_init(&inode->i_list);
spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
+ list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ spin_unlock(&inode->i_lock);
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);

@@ -1284,19 +1336,27 @@ int generic_detach_inode(struct inode *i
{
struct super_block *sb = inode->i_sb;

+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
if (!hlist_unhashed(&inode->i_hash)) {
if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_unused);
inodes_stat.nr_unused++;
if (sb->s_flags & MS_ACTIVE) {
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
return 0;
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_WILL_FREE;
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
write_inode_now(inode, 1);
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
inodes_stat.nr_unused--;
@@ -1305,12 +1365,12 @@ int generic_detach_inode(struct inode *i
spin_unlock(&inode_hash_lock);
}
list_del_init(&inode->i_list);
- spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
return 1;
}
@@ -1558,6 +1618,8 @@ EXPORT_SYMBOL(inode_wait);
* wake_up_inode() after removing from the hash list will DTRT.
*
* This is called with inode_lock held.
+ *
+ * Called with i_lock held and returns with it dropped.
*/
static void __wait_on_freeing_inode(struct inode *inode)
{
@@ -1565,6 +1627,7 @@ static void __wait_on_freeing_inode(stru
DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
schedule();
finish_wait(wq, &wait.wait);
Index: linux-2.6/fs/nilfs2/gcdat.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/gcdat.c
+++ linux-2.6/fs/nilfs2/gcdat.c
@@ -27,6 +27,7 @@
#include "page.h"
#include "mdt.h"

+/* XXX: what protects i_state? */
int nilfs_init_gcdat_inode(struct the_nilfs *nilfs)
{
struct inode *dat = nilfs->ns_dat, *gcdat = nilfs->ns_gc_dat;
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -886,18 +886,26 @@ static void add_dquot_ref(struct super_b
spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
#ifdef CONFIG_QUOTA_DEBUG
if (unlikely(inode_get_rsv_space(inode) > 0))
reserved = 1;
#endif
- if (!atomic_read(&inode->i_writecount))
+ if (!atomic_read(&inode->i_writecount)) {
+ spin_unlock(&inode->i_lock);
continue;
- if (!dqinit_needed(inode, type))
+ }
+ if (!dqinit_needed(inode, type)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }

__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);



n***@suse.de
2010-06-24 03:03:03 UTC
Permalink
Per-zone LRUs and shrinkers for dentry and inode caches.

Signed-off-by: Nick Piggin <***@suse.de>

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -382,6 +382,7 @@ struct files_stat_struct {
#include <linux/semaphore.h>
#include <linux/fiemap.h>
#include <linux/rculist_bl.h>
+#include <linux/mmzone.h>

#include <asm/atomic.h>
#include <asm/byteorder.h>
@@ -1325,6 +1326,25 @@ extern int send_sigurg(struct fown_struc
extern struct list_head super_blocks;
extern spinlock_t sb_lock;

+#define sb_zone_info(sb, ___z) \
+ &sb->s_reclaim.node[zone_to_nid(___z)].zone[zone_idx(___z)]
+
+struct sb_zoneinfo {
+ /* protected by s_dentry_lru_lock */
+ spinlock_t s_dentry_lru_lock;
+ struct list_head s_dentry_lru;
+ long s_nr_dentry_scan;
+ unsigned long s_nr_dentry_unused;
+};
+
+struct sb_nodeinfo {
+ struct sb_zoneinfo zone[MAX_NR_ZONES];
+};
+
+struct sb_reclaim {
+ struct sb_nodeinfo node[MAX_NUMNODES];
+};
+
struct super_block {
struct list_head s_list; /* Keep this first */
dev_t s_dev; /* search index; _not_ kdev_t */
@@ -1357,9 +1377,7 @@ struct super_block {
struct list_head s_inodes; /* all inodes */
struct list_head s_files;
#endif
- /* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
- struct list_head s_dentry_lru; /* unused dentry lru */
- int s_nr_dentry_unused; /* # of dentry on lru */
+ struct sb_reclaim s_reclaim;

struct block_device *s_bdev;
struct backing_dev_info *s_bdi;
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -43,8 +43,8 @@
* - i_dentry, d_alias, d_inode of aliases
* dcache_hash_bucket lock protects:
* - the dcache hash table
- * dcache_lru_lock protects:
- * - the dcache lru lists and counters
+ * sbz->s_dentry_lru_lock protects:
+ * - the per-sb x per-zone dcache lru lists and counters
* d_lock protects:
* - d_flags
* - d_name
@@ -58,7 +58,7 @@
* Ordering:
* dentry->d_inode->i_lock
* dentry->d_lock
- * dcache_lru_lock
+ * sbz->s_dentry_lru_lock
* dcache_hash_bucket lock
*
* If there is an ancestor relationship:
@@ -75,7 +75,6 @@
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

EXPORT_SYMBOL(rename_lock);
@@ -186,51 +185,55 @@ static void dentry_iput(struct dentry *
*/
static void dentry_lru_add(struct dentry *dentry)
{
- spin_lock(&dcache_lru_lock);
- list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
- dentry->d_sb->s_nr_dentry_unused++;
- dentry_stat.nr_unused++;
- spin_unlock(&dcache_lru_lock);
+ struct sb_zoneinfo *sbz = sb_zone_info(dentry->d_sb,
+ page_zone(virt_to_page(dentry)));
+ spin_lock(&sbz->s_dentry_lru_lock);
+ list_add(&dentry->d_lru, &sbz->s_dentry_lru);
+ sbz->s_nr_dentry_unused++;
+ spin_unlock(&sbz->s_dentry_lru_lock);
}

static void dentry_lru_add_tail(struct dentry *dentry)
{
- spin_lock(&dcache_lru_lock);
- list_add_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
- dentry->d_sb->s_nr_dentry_unused++;
- dentry_stat.nr_unused++;
- spin_unlock(&dcache_lru_lock);
+ struct sb_zoneinfo *sbz = sb_zone_info(dentry->d_sb,
+ page_zone(virt_to_page(dentry)));
+ spin_lock(&sbz->s_dentry_lru_lock);
+ list_add_tail(&dentry->d_lru, &sbz->s_dentry_lru);
+ sbz->s_nr_dentry_unused++;
+ spin_unlock(&sbz->s_dentry_lru_lock);
}

-static void __dentry_lru_del(struct dentry *dentry)
+static void __dentry_lru_del(struct sb_zoneinfo *sbz, struct dentry *dentry)
{
list_del(&dentry->d_lru);
- dentry->d_sb->s_nr_dentry_unused--;
- dentry_stat.nr_unused--;
+ sbz->s_nr_dentry_unused--;
}

-static void __dentry_lru_del_init(struct dentry *dentry)
+static void __dentry_lru_del_init(struct sb_zoneinfo *sbz,struct dentry *dentry)
{
list_del_init(&dentry->d_lru);
- dentry->d_sb->s_nr_dentry_unused--;
- dentry_stat.nr_unused--;
+ sbz->s_nr_dentry_unused--;
}

static void dentry_lru_del(struct dentry *dentry)
{
if (!list_empty(&dentry->d_lru)) {
- spin_lock(&dcache_lru_lock);
- __dentry_lru_del(dentry);
- spin_unlock(&dcache_lru_lock);
+ struct sb_zoneinfo *sbz = sb_zone_info(dentry->d_sb,
+ page_zone(virt_to_page(dentry)));
+ spin_lock(&sbz->s_dentry_lru_lock);
+ __dentry_lru_del(sbz, dentry);
+ spin_unlock(&sbz->s_dentry_lru_lock);
}
}

static void dentry_lru_del_init(struct dentry *dentry)
{
if (likely(!list_empty(&dentry->d_lru))) {
- spin_lock(&dcache_lru_lock);
- __dentry_lru_del_init(dentry);
- spin_unlock(&dcache_lru_lock);
+ struct sb_zoneinfo *sbz = sb_zone_info(dentry->d_sb,
+ page_zone(virt_to_page(dentry)));
+ spin_lock(&sbz->s_dentry_lru_lock);
+ __dentry_lru_del_init(sbz, dentry);
+ spin_unlock(&sbz->s_dentry_lru_lock);
}
}

@@ -638,32 +641,33 @@ again:
* which flags are set. This means we don't need to maintain multiple
* similar copies of this loop.
*/
-static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
+static void __shrink_dcache_sb_zone(struct super_block *sb,
+ struct sb_zoneinfo *sbz, unsigned long *count, int flags)
{
LIST_HEAD(referenced);
LIST_HEAD(tmp);
struct dentry *dentry;
- int cnt = 0;
+ unsigned long cnt = 0;

BUG_ON(!sb);
BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
if (count != NULL)
/* called from prune_dcache() and shrink_dcache_parent() */
cnt = *count;
-relock:
- spin_lock(&dcache_lru_lock);
restart:
if (count == NULL)
- list_splice_init(&sb->s_dentry_lru, &tmp);
+ list_splice_init(&sbz->s_dentry_lru, &tmp);
else {
- while (!list_empty(&sb->s_dentry_lru)) {
- dentry = list_entry(sb->s_dentry_lru.prev,
- struct dentry, d_lru);
+ while (!list_empty(&sbz->s_dentry_lru)) {
+ dentry = list_entry(sbz->s_dentry_lru.prev,
+ struct dentry, d_lru);
BUG_ON(dentry->d_sb != sb);

if (!spin_trylock(&dentry->d_lock)) {
- spin_unlock(&dcache_lru_lock);
- goto relock;
+ spin_unlock(&sbz->s_dentry_lru_lock);
+ cpu_relax();
+ spin_lock(&sbz->s_dentry_lru_lock);
+ continue;
}
/*
* If we are honouring the DCACHE_REFERENCED flag and
@@ -682,13 +686,10 @@ restart:
if (!cnt)
break;
}
- cond_resched_lock(&dcache_lru_lock);
+ cond_resched_lock(&sbz->s_dentry_lru_lock);
}
}
- spin_unlock(&dcache_lru_lock);

-again:
- spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
while (!list_empty(&tmp)) {
struct inode *inode;

@@ -696,8 +697,10 @@ again:

if (!spin_trylock(&dentry->d_lock)) {
again1:
- spin_unlock(&dcache_lru_lock);
- goto again;
+ spin_unlock(&sbz->s_dentry_lru_lock);
+ cpu_relax();
+ spin_lock(&sbz->s_dentry_lru_lock);
+ continue;
}
/*
* We found an inuse dentry which was not removed from
@@ -705,7 +708,7 @@ again1:
* it - just keep it off the LRU list.
*/
if (dentry->d_count) {
- __dentry_lru_del_init(dentry);
+ __dentry_lru_del_init(sbz, dentry);
spin_unlock(&dentry->d_lock);
continue;
}
@@ -722,21 +725,33 @@ again2:
goto again2;
}
}
- __dentry_lru_del_init(dentry);
- spin_unlock(&dcache_lru_lock);
+ __dentry_lru_del_init(sbz, dentry);
+ spin_unlock(&sbz->s_dentry_lru_lock);

prune_one_dentry(dentry);
+ cond_resched();
/* dentry->d_lock dropped */
- spin_lock(&dcache_lru_lock);
+ spin_lock(&sbz->s_dentry_lru_lock);
}

- if (count == NULL && !list_empty(&sb->s_dentry_lru))
+ if (count == NULL && !list_empty(&sbz->s_dentry_lru))
goto restart;
if (count != NULL)
*count = cnt;
if (!list_empty(&referenced))
- list_splice(&referenced, &sb->s_dentry_lru);
- spin_unlock(&dcache_lru_lock);
+ list_splice(&referenced, &sbz->s_dentry_lru);
+}
+
+static void __shrink_dcache_sb(struct super_block *sb, unsigned long *count, int flags)
+{
+ struct zone *zone;
+ for_each_zone(zone) {
+ struct sb_zoneinfo *sbz = sb_zone_info(sb, zone);
+
+ spin_lock(&sbz->s_dentry_lru_lock);
+ __shrink_dcache_sb_zone(sb, sbz, count, flags);
+ spin_unlock(&sbz->s_dentry_lru_lock);
+ }
}

/**
@@ -749,31 +764,29 @@ again2:
* This function may fail to free any resources if all the dentries are in use.
*/
static void prune_dcache(struct zone *zone, unsigned long scanned,
- unsigned long total, gfp_t gfp_mask)
-
+ unsigned long total, gfp_t gfp_mask)
{
- unsigned long nr_to_scan;
struct super_block *sb, *n;
- int w_count;
- int prune_ratio;
- int count, pruned;

- shrinker_add_scan(&nr_to_scan, scanned, total, dentry_stat.nr_unused,
- DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
-done:
- count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
- if (dentry_stat.nr_unused == 0 || count == 0)
- return;
- if (count >= dentry_stat.nr_unused)
- prune_ratio = 1;
- else
- prune_ratio = dentry_stat.nr_unused / count;
spin_lock(&sb_lock);
list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
+ struct sb_zoneinfo *sbz = sb_zone_info(sb, zone);
+ unsigned long nr;
+
if (list_empty(&sb->s_instances))
continue;
- if (sb->s_nr_dentry_unused == 0)
+ if (sbz->s_nr_dentry_unused == 0)
+ continue;
+
+ shrinker_add_scan(&sbz->s_nr_dentry_scan, scanned, total,
+ sbz->s_nr_dentry_unused,
+ DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
+ if (!(gfp_mask & __GFP_FS))
+ continue;
+ nr = ACCESS_ONCE(sbz->s_nr_dentry_scan);
+ if (nr < SHRINK_BATCH)
continue;
+
sb->s_count++;
/* Now, we reclaim unused dentrins with fairness.
* We reclaim them same percentage from each superblock.
@@ -785,11 +798,8 @@ done:
* number of dentries in the machine)
*/
spin_unlock(&sb_lock);
- if (prune_ratio != 1)
- w_count = (sb->s_nr_dentry_unused / prune_ratio) + 1;
- else
- w_count = sb->s_nr_dentry_unused;
- pruned = w_count;
+
+
/*
* We need to be sure this filesystem isn't being unmounted,
* otherwise we could race with generic_shutdown_super(), and
@@ -798,28 +808,24 @@ done:
* s_root isn't NULL.
*/
if (down_read_trylock(&sb->s_umount)) {
- if ((sb->s_root != NULL) &&
- (!list_empty(&sb->s_dentry_lru))) {
- __shrink_dcache_sb(sb, &w_count,
- DCACHE_REFERENCED);
- pruned -= w_count;
+ spin_lock(&sbz->s_dentry_lru_lock);
+ if (sb->s_root != NULL &&
+ !list_empty(&sbz->s_dentry_lru)) {
+ count_vm_events(SLABS_SCANNED, nr);
+ sbz->s_nr_dentry_scan = 0;
+ __shrink_dcache_sb_zone(sb, sbz,
+ &nr, DCACHE_REFERENCED);
+ sbz->s_nr_dentry_scan += nr;
}
+ spin_unlock(&sbz->s_dentry_lru_lock);
up_read(&sb->s_umount);
}
spin_lock(&sb_lock);
/* lock was dropped, must reset next */
list_safe_reset_next(sb, n, s_list);
- count -= pruned;
__put_super(sb);
- /* more work left to do? */
- if (count <= 0)
- break;
}
spin_unlock(&sb_lock);
- if (count <= 0) {
- cond_resched();
- goto done;
- }
}

/**
@@ -1167,8 +1173,9 @@ out:
void shrink_dcache_parent(struct dentry * parent)
{
struct super_block *sb = parent->d_sb;
- int found;
+ unsigned long found;

+ /* doesn't work well anymore :( */
while ((found = select_parent(parent)) != 0)
__shrink_dcache_sb(sb, &found, 0);
}
@@ -1189,7 +1196,7 @@ EXPORT_SYMBOL(shrink_dcache_parent);
static int shrink_dcache_memory(struct zone *zone, unsigned long scanned,
unsigned long total, unsigned long global, gfp_t gfp_mask)
{
- prune_dcache(zone, scanned, global, gfp_mask);
+ prune_dcache(zone, scanned, total, gfp_mask);
return 0;
}

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -35,7 +35,7 @@
* s_inodes, i_sb_list
* inode_hash_bucket lock protects:
* inode hash table, i_hash
- * inode_lru_lock protects:
+ * zone->inode_lru_lock protects:
* inode_lru, i_lru
* wb->b_lock protects:
* b_io, b_more_io, b_dirty, i_io, i_lru
@@ -51,7 +51,7 @@
* inode_lock
* inode->i_lock
* inode_list_lglock
- * inode_lru_lock
+ * zone->inode_lru_lock
* wb->b_lock
* inode_hash_bucket lock
*/
@@ -102,8 +102,6 @@ static unsigned int i_hash_shift __read_
* allowing for low-overhead inode sync() operations.
*/

-static LIST_HEAD(inode_lru);
-
struct inode_hash_bucket {
struct hlist_bl_head head;
};
@@ -129,8 +127,6 @@ static struct inode_hash_bucket *inode_h
DECLARE_LGLOCK(inode_list_lglock);
DEFINE_LGLOCK(inode_list_lglock);

-static DEFINE_SPINLOCK(inode_lru_lock);
-
/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
* icache shrinking path, and the umount path. Without this exclusion,
@@ -428,18 +424,22 @@ static void dispose_list(struct list_hea

void __inode_lru_list_add(struct inode *inode)
{
- spin_lock(&inode_lru_lock);
- list_add(&inode->i_lru, &inode_lru);
- inodes_stat.nr_unused++;
- spin_unlock(&inode_lru_lock);
+ struct zone *z = page_zone(virt_to_page(inode));
+
+ spin_lock(&z->inode_lru_lock);
+ list_add(&inode->i_lru, &z->inode_lru);
+ z->inode_nr_lru++;
+ spin_unlock(&z->inode_lru_lock);
}

void __inode_lru_list_del(struct inode *inode)
{
- spin_lock(&inode_lru_lock);
+ struct zone *z = page_zone(virt_to_page(inode));
+
+ spin_lock(&z->inode_lru_lock);
list_del_init(&inode->i_lru);
- inodes_stat.nr_unused--;
- spin_unlock(&inode_lru_lock);
+ z->inode_nr_lru--;
+ spin_unlock(&z->inode_lru_lock);
}

/*
@@ -464,10 +464,7 @@ static int invalidate_sb_inodes(struct s
list_del_init(&inode->i_io);
spin_unlock(&wb->b_lock);

- spin_lock(&inode_lru_lock);
- list_del(&inode->i_lru);
- inodes_stat.nr_unused--;
- spin_unlock(&inode_lru_lock);
+ __inode_lru_list_del(inode);

WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
@@ -534,55 +531,58 @@ static void prune_icache(struct zone *zo

down_read(&iprune_sem);
again:
- spin_lock(&inode_lru_lock);
+ spin_lock(&zone->inode_lru_lock);
for (; nr_to_scan; nr_to_scan--) {
struct inode *inode;

- if (list_empty(&inode_lru))
+ if (list_empty(&zone->inode_lru))
break;

- inode = list_entry(inode_lru.prev, struct inode, i_lru);
+ inode = list_entry(zone->inode_lru.prev, struct inode, i_lru);

if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&inode_lru_lock);
+ spin_unlock(&zone->inode_lru_lock);
+ cpu_relax();
goto again;
}
if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
list_del_init(&inode->i_lru);
spin_unlock(&inode->i_lock);
- inodes_stat.nr_unused--;
+ zone->inode_nr_lru--;
continue;
}
if (inode->i_state) {
- list_move(&inode->i_lru, &inode_lru);
+ list_move(&inode->i_lru, &zone->inode_lru);
inode->i_state &= ~I_REFERENCED;
spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
- list_move(&inode->i_lru, &inode_lru);
- spin_unlock(&inode_lru_lock);
+ list_move(&inode->i_lru, &zone->inode_lru);
+ spin_unlock(&zone->inode_lru_lock);
__iget(inode);
spin_unlock(&inode->i_lock);

+ dispose_list(&freeable);
+
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
- spin_lock(&inode_lru_lock);
+ spin_lock(&zone->inode_lru_lock);
continue;
}
list_move(&inode->i_lru, &freeable);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- inodes_stat.nr_unused--;
+ zone->inode_nr_lru--;
}
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
__count_vm_events(PGINODESTEAL, reap);
- spin_unlock(&inode_lru_lock);
+ spin_unlock(&zone->inode_lru_lock);

dispose_list(&freeable);
up_read(&iprune_sem);
@@ -600,19 +600,24 @@ again:
static int shrink_icache_memory(struct zone *zone, unsigned long scanned,
unsigned long total, unsigned long global, gfp_t gfp_mask)
{
- static unsigned long nr_to_scan;
unsigned long nr;

- shrinker_add_scan(&nr_to_scan, scanned, global,
- inodes_stat.nr_unused,
+ shrinker_add_scan(&zone->inode_nr_scan, scanned, total,
+ zone->inode_nr_lru,
DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100);
+ /*
+ * Nasty deadlock avoidance. We may hold various FS locks,
+ * and we don't want to recurse into the FS that called us
+ * in clear_inode() and friends..
+ */
if (!(gfp_mask & __GFP_FS))
- return 0;
+ return 0;

- while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
- prune_icache(zone, nr);
- cond_resched();
- }
+ nr = ACCESS_ONCE(zone->inode_nr_scan);
+ if (nr < SHRINK_BATCH)
+ return 0;
+ zone->inode_nr_scan = 0;
+ prune_icache(zone, nr);

return 0;
}
@@ -1431,12 +1436,8 @@ void generic_delete_inode(struct inode *
{
const struct super_operations *op = inode->i_sb->s_op;

- if (!list_empty(&inode->i_lru)) {
- spin_lock(&inode_lru_lock);
- list_del_init(&inode->i_lru);
- inodes_stat.nr_unused--;
- spin_unlock(&inode_lru_lock);
- }
+ if (!list_empty(&inode->i_lru))
+ __inode_lru_list_del(inode);
if (!list_empty(&inode->i_io)) {
struct bdi_writeback *wb = inode_to_wb(inode);
spin_lock(&wb->b_lock);
@@ -1493,10 +1494,7 @@ int generic_detach_inode(struct inode *i
inode->i_state |= I_REFERENCED;
if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
list_empty(&inode->i_lru)) {
- spin_lock(&inode_lru_lock);
- list_add(&inode->i_lru, &inode_lru);
- inodes_stat.nr_unused++;
- spin_unlock(&inode_lru_lock);
+ __inode_lru_list_add(inode);
}
spin_unlock(&inode->i_lock);
return 0;
@@ -1510,12 +1508,8 @@ int generic_detach_inode(struct inode *i
inode->i_state &= ~I_WILL_FREE;
__remove_inode_hash(inode);
}
- if (!list_empty(&inode->i_lru)) {
- spin_lock(&inode_lru_lock);
- list_del_init(&inode->i_lru);
- inodes_stat.nr_unused--;
- spin_unlock(&inode_lru_lock);
- }
+ if (!list_empty(&inode->i_lru))
+ __inode_lru_list_del(inode);
if (!list_empty(&inode->i_io)) {
struct bdi_writeback *wb = inode_to_wb(inode);
spin_lock(&wb->b_lock);
@@ -1831,6 +1825,7 @@ void __init inode_init_early(void)
void __init inode_init(void)
{
int loop;
+ struct zone *zone;

percpu_counter_init(&nr_inodes, 0);
/* inode slab cache */
@@ -1840,6 +1835,12 @@ void __init inode_init(void)
(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
SLAB_MEM_SPREAD),
init_once);
+ for_each_zone(zone) {
+ spin_lock_init(&zone->inode_lru_lock);
+ INIT_LIST_HEAD(&zone->inode_lru);
+ zone->inode_nr_lru = 0;
+ zone->inode_nr_scan = 0;
+ }
register_shrinker(&icache_shrinker);

lg_lock_init(inode_list_lglock);
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -50,6 +50,8 @@ static struct super_block *alloc_super(s
static const struct super_operations default_op;

if (s) {
+ struct zone *zone;
+
if (security_sb_alloc(s)) {
kfree(s);
s = NULL;
@@ -88,7 +90,6 @@ static struct super_block *alloc_super(s
#endif
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
- INIT_LIST_HEAD(&s->s_dentry_lru);
init_rwsem(&s->s_umount);
mutex_init(&s->s_lock);
lockdep_set_class(&s->s_umount, &type->s_umount_key);
@@ -125,6 +126,15 @@ static struct super_block *alloc_super(s
s->s_maxbytes = MAX_NON_LFS;
s->s_op = &default_op;
s->s_time_gran = 1000000000;
+
+ for_each_zone(zone) {
+ struct sb_zoneinfo *sbz = sb_zone_info(s, zone);
+
+ spin_lock_init(&sbz->s_dentry_lru_lock);
+ INIT_LIST_HEAD(&sbz->s_dentry_lru);
+ sbz->s_nr_dentry_scan = 0;
+ sbz->s_nr_dentry_unused = 0;
+ }
}
out:
return s;
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -370,6 +370,13 @@ struct zone {


ZONE_PADDING(_pad2_)
+
+ spinlock_t inode_lru_lock;
+ struct list_head inode_lru;
+ unsigned long inode_nr_lru;
+ unsigned long inode_nr_scan;
+
+ ZONE_PADDING(_pad3_)
/* Rarely used or read-mostly fields */

/*
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1004,7 +1004,8 @@ struct shrinker {
/* These are for internal use */
struct list_head list;
};
-#define DEFAULT_SEEKS (128UL*2) /* A good number if you don't know better. */
+#define SHRINK_FIXED (128UL) /* Fixed point for shrinker ratio */
+#define DEFAULT_SEEKS (SHRINK_FIXED*2) /* A good number if you don't know better. */
#define SHRINK_BATCH 128 /* A good number if you don't know better */
extern void register_shrinker(struct shrinker *);
extern void unregister_shrinker(struct shrinker *);
n***@suse.de
2010-06-24 03:02:43 UTC
Permalink
Protect inodes_stat statistics with atomic ops rather than inode_lock.

Signed-off-by: Nick Piggin <***@suse.de>
---
fs/fs-writeback.c | 6 ++++--
fs/inode.c | 28 +++++++++++++++-------------
include/linux/fs.h | 5 +++--
3 files changed, 22 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -924,7 +924,8 @@ static long wb_check_old_data_flush(stru
wb->last_old_flush = jiffies;
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ (atomic_read(&inodes_stat.nr_inodes) -
+ atomic_read(&inodes_stat.nr_unused));

if (nr_pages) {
struct wb_writeback_args args = {
@@ -1285,7 +1286,8 @@ void writeback_inodes_sb(struct super_bl
long nr_to_write;

nr_to_write = nr_dirty + nr_unstable +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ (atomic_read(&inodes_stat.nr_inodes) -
+ atomic_read(&inodes_stat.nr_unused));

bdi_start_writeback(sb->s_bdi, sb, nr_to_write);
}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -123,7 +123,10 @@ static DECLARE_RWSEM(iprune_sem);
/*
* Statistics gathering..
*/
-struct inodes_stat_t inodes_stat;
+struct inodes_stat_t inodes_stat = {
+ .nr_inodes = ATOMIC_INIT(0),
+ .nr_unused = ATOMIC_INIT(0),
+};

static struct kmem_cache *inode_cachep __read_mostly;

@@ -318,7 +321,7 @@ void __iget(struct inode *inode)
list_move(&inode->i_list, &inode_in_use);
spin_unlock(&wb_inode_list_lock);
}
- inodes_stat.nr_unused--;
+ atomic_dec(&inodes_stat.nr_unused);
}

/**
@@ -382,9 +385,7 @@ static void dispose_list(struct list_hea
destroy_inode(inode);
nr_disposed++;
}
- spin_lock(&inode_lock);
- inodes_stat.nr_inodes -= nr_disposed;
- spin_unlock(&inode_lock);
+ atomic_sub(nr_disposed, &inodes_stat.nr_inodes);
}

/*
@@ -433,7 +434,7 @@ static int invalidate_list(struct list_h
busy = 1;
}
/* only unused inodes may be cached with i_count zero */
- inodes_stat.nr_unused -= count;
+ atomic_sub(count, &inodes_stat.nr_unused);
return busy;
}

@@ -551,7 +552,7 @@ again2:
spin_unlock(&inode->i_lock);
nr_pruned++;
}
- inodes_stat.nr_unused -= nr_pruned;
+ atomic_sub(nr_pruned, &inodes_stat.nr_unused);
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
@@ -584,7 +585,8 @@ static int shrink_icache_memory(int nr,
return -1;
prune_icache(nr);
}
- return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
+ return (atomic_read(&inodes_stat.nr_unused) / 100) *
+ sysctl_vfs_cache_pressure;
}

static struct shrinker icache_shrinker = {
@@ -677,7 +679,7 @@ static inline void
__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
struct inode *inode)
{
- inodes_stat.nr_inodes++;
+ atomic_inc(&inodes_stat.nr_inodes);
spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
@@ -1321,8 +1323,8 @@ void generic_delete_inode(struct inode *
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+ atomic_dec(&inodes_stat.nr_inodes);

if (op->delete_inode) {
void (*delete)(struct inode *) = op->delete_inode;
@@ -1365,7 +1367,7 @@ int generic_detach_inode(struct inode *i
list_move(&inode->i_list, &inode_unused);
spin_unlock(&wb_inode_list_lock);
}
- inodes_stat.nr_unused++;
+ atomic_inc(&inodes_stat.nr_unused);
if (sb->s_flags & MS_ACTIVE) {
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
@@ -1383,7 +1385,7 @@ int generic_detach_inode(struct inode *i
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- inodes_stat.nr_unused--;
+ atomic_dec(&inodes_stat.nr_unused);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
@@ -1395,9 +1397,9 @@ int generic_detach_inode(struct inode *i
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
+ atomic_dec(&inodes_stat.nr_inodes);
return 1;
}
EXPORT_SYMBOL_GPL(generic_detach_inode);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -38,13 +38,6 @@ struct files_stat_struct {
int max_files; /* tunable */
};

-struct inodes_stat_t {
- int nr_inodes;
- int nr_unused;
- int dummy[5]; /* padding for sysctl ABI compatibility */
-};
-
-
#define NR_FILE 8192 /* this can well be larger on a larger system */

#define MAY_EXEC 1
@@ -418,6 +411,12 @@ typedef int (get_block_t)(struct inode *
typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
ssize_t bytes, void *private);

+struct inodes_stat_t {
+ atomic_t nr_inodes;
+ atomic_t nr_unused;
+ int dummy[5]; /* padding for sysctl ABI compatibility */
+};
+
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c
+++ linux-2.6/drivers/staging/pohmelfs/inode.c
@@ -1283,11 +1283,11 @@ static void pohmelfs_put_super(struct su
dprintk("%s: ino: %llu, pi: %p, inode: %p, count: %u.\n",
__func__, pi->ino, pi, inode, count);

- if (atomic_read(&inode->i_count) != count) {
+ if (inode->i_count != count) {
printk("%s: ino: %llu, pi: %p, inode: %p, count: %u, i_count: %d.\n",
__func__, pi->ino, pi, inode, count,
- atomic_read(&inode->i_count));
- count = atomic_read(&inode->i_count);
+ inode->i_count);
+ count = inode->i_count;
in_drop_list++;
}

@@ -1299,7 +1299,7 @@ static void pohmelfs_put_super(struct su
pi = POHMELFS_I(inode);

dprintk("%s: ino: %llu, pi: %p, inode: %p, i_count: %u.\n",
- __func__, pi->ino, pi, inode, atomic_read(&inode->i_count));
+ __func__, pi->ino, pi, inode, inode->i_count);

/*
* These are special inodes, they were created during
@@ -1307,7 +1307,7 @@ static void pohmelfs_put_super(struct su
* so they live here with reference counter being 1 and prevent
* umount from succeed since it believes that they are busy.
*/
- count = atomic_read(&inode->i_count);
+ count = inode->i_count;
if (count) {
list_del_init(&inode->i_sb_list);
while (count--)
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c
+++ linux-2.6/fs/btrfs/inode.c
@@ -1964,8 +1964,13 @@ void btrfs_add_delayed_iput(struct inode
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct delayed_iput *delayed;

- if (atomic_add_unless(&inode->i_count, -1, 1))
+ spin_lock(&inode->i_lock);
+ if (inode->i_count > 1) {
+ inode->i_count--;
+ spin_unlock(&inode->i_lock);
return;
+ }
+ spin_unlock(&inode->i_lock);

delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
delayed->inode = inode;
@@ -2718,10 +2723,10 @@ static struct btrfs_trans_handle *__unli
return ERR_PTR(-ENOSPC);

/* check if there is someone else holds reference */
- if (S_ISDIR(inode->i_mode) && atomic_read(&inode->i_count) > 1)
+ if (S_ISDIR(inode->i_mode) && inode->i_count > 1)
return ERR_PTR(-ENOSPC);

- if (atomic_read(&inode->i_count) > 2)
+ if (inode->i_count > 2)
return ERR_PTR(-ENOSPC);

if (xchg(&root->fs_info->enospc_unlink, 1))
@@ -3934,7 +3939,7 @@ again:
inode = igrab(&entry->vfs_inode);
if (inode) {
spin_unlock(&root->inode_lock);
- if (atomic_read(&inode->i_count) > 1)
+ if (inode->i_count > 1)
d_prune_aliases(inode);
/*
* btrfs_drop_inode will remove it from
Index: linux-2.6/fs/ceph/mds_client.c
===================================================================
--- linux-2.6.orig/fs/ceph/mds_client.c
+++ linux-2.6/fs/ceph/mds_client.c
@@ -1028,7 +1028,7 @@ static int trim_caps_cb(struct inode *in
spin_unlock(&inode->i_lock);
d_prune_aliases(inode);
dout("trim_caps_cb %p cap %p pruned, count now %d\n",
- inode, cap, atomic_read(&inode->i_count));
+ inode, cap, inode->i_count);
return 0;
}

Index: linux-2.6/fs/logfs/dir.c
===================================================================
--- linux-2.6.orig/fs/logfs/dir.c
+++ linux-2.6/fs/logfs/dir.c
@@ -566,7 +566,9 @@ static int logfs_link(struct dentry *old
return -EMLINK;

inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
inode->i_nlink++;
mark_inode_dirty_sync(inode);

Index: linux-2.6/fs/logfs/readwrite.c
===================================================================
--- linux-2.6.orig/fs/logfs/readwrite.c
+++ linux-2.6/fs/logfs/readwrite.c
@@ -1002,7 +1002,7 @@ static int __logfs_is_valid_block(struct
{
struct logfs_inode *li = logfs_inode(inode);

- if ((inode->i_nlink == 0) && atomic_read(&inode->i_count) == 1)
+ if (inode->i_nlink == 0 && inode->i_count == 1)
return 0;

if (bix < I0_BLOCKS)
n***@suse.de
2010-06-24 03:02:54 UTC
Permalink
From: Eric Dumazet <***@cosmosbay.com>

new_inode() dirties a contended cache line to get increasing inode numbers.

Solve this problem by providing each cpu with a per_cpu variable, fed from the
shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino and gives the same spread of
inode numbers as before (same wraparound after 2^32 allocations).
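
For reference, the batching idea as a self-contained C11 sketch (a userspace
stand-in only: the thread-local variable below plays the role of the per-CPU
variable, and the names are illustrative rather than the kernel's):

#include <stdatomic.h>
#include <stdio.h>

#define LAST_INO_BATCH 1024

static atomic_int shared_last_ino;
static _Thread_local int last_ino;      /* stands in for the per-CPU variable */

static int last_ino_get(void)
{
        int res = last_ino;

        /* Private range exhausted: refill from the shared counter */
        if ((res & (LAST_INO_BATCH - 1)) == 0)
                res = atomic_fetch_add(&shared_last_ino, LAST_INO_BATCH);

        last_ino = ++res;
        return res;
}

int main(void)
{
        for (int i = 0; i < 5; i++)
                printf("ino %d\n", last_ino_get());
        return 0;
}

The shared counter's cache line is only dirtied on the refill, i.e. once per
1024 allocations per thread.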

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
Signed-off-by: Nick Piggin <***@suse.de>
---
fs/inode.c | 42 +++++++++++++++++++++++++++++++++++-------
1 file changed, 35 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -664,6 +664,40 @@ __inode_add_to_lists(struct super_block
}
}

+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+static DEFINE_PER_CPU(int, last_ino);
+static atomic_t shared_last_ino;
+
+static int last_ino_get(void)
+{
+ int *p = &get_cpu_var(last_ino);
+ int res = *p;
+
+ if (unlikely((res & 1023) == 0))
+ res = atomic_add_return(1024, &shared_last_ino) - 1024;
+
+ *p = ++res;
+ put_cpu_var(last_ino);
+ return res;
+}
+#else
+static int last_ino_get(void)
+{
+ static int last_ino;
+
+ return ++last_ino;
+}
+#endif
+
/**
* inode_add_to_lists - add a new inode to relevant lists
* @sb: superblock inode belongs to
@@ -700,19 +734,13 @@ EXPORT_SYMBOL_GPL(inode_add_to_lists);
*/
struct inode *new_inode(struct super_block *sb)
{
- /*
- * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
- * error if st_ino won't fit in target struct field. Use 32bit counter
- * here to attempt to avoid that.
- */
- static atomic_t last_ino = ATOMIC_INIT(0);
struct inode *inode;

inode = alloc_inode(sb);
if (inode) {
/* XXX: init as locked for speedup */
spin_lock(&inode->i_lock);
- inode->i_ino = atomic_inc_return(&last_ino);
+ inode->i_ino = last_ino_get();
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode->i_lock);
Andi Kleen
2010-06-24 09:48:13 UTC
Permalink
Post by n***@suse.de
new_inode() dirties a contended cache line to get increasing inode numbers.
Solve this problem by providing to each cpu a per_cpu variable, feeded by the
shared last_ino, but once every 1024 allocations.
Most file systems don't even need this because they
allocate their own inode numbers, right? So perhaps it could be turned
off for all of those, e.g. with a superblock flag.

I guess the main customer is sockets only.
Post by n***@suse.de
+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
I don't understand how the 32bit counter should prevent that.
Post by n***@suse.de
+ */
+static DEFINE_PER_CPU(int, last_ino);
+static atomic_t shared_last_ino;
With the 1024 skip, isn't overflow much more likely, scaling with the
number of CPUs on systems with a large CPU count, even if there
aren't that many new inodes?
Post by n***@suse.de
+static int last_ino_get(void)
+{
+ int *p = &get_cpu_var(last_ino);
+ int res = *p;
+
+ if (unlikely((res & 1023) == 0))
+ res = atomic_add_return(1024, &shared_last_ino) - 1024;
The magic numbers really want to be defines?

-Andi
--
***@linux.intel.com -- Speaking for myself only.
Nick Piggin
2010-06-24 15:52:43 UTC
Permalink
Post by Andi Kleen
Post by n***@suse.de
new_inode() dirties a contended cache line to get increasing inode numbers.
Solve this problem by providing to each cpu a per_cpu variable, feeded by the
shared last_ino, but once every 1024 allocations.
Most file systems don't even need this because they
allocate their own inode numbers, right?. So perhaps it could be turned
off for all of those, e.g. with a superblock flag.
That's right. More or less it just requires alloc_inode to be exported,
adding more branches in new_inode would not be a good way to go.

But I didn't want to start microoptimisations in filesystems just yet.
Post by Andi Kleen
I guess the main customer is sockets only.
I guess. Sockets and ram based filesystems. Interestingly I don't know
really what it's for (in socket code it's mostly for reporting and
hashing it seems). It sure isn't guaranteed to be unique.

Anyway it's outside the scope of this patchset to change functionality
at all.
Post by Andi Kleen
Post by n***@suse.de
+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
I don't understand how the 32bit counter should prevent that.
Well I think glibc will convert 64 bit stat struct to 32bit for
old apps. It detects if the ino can't fit in 32 bits.
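
Roughly along these lines (just a sketch of the idea, not glibc's actual
code):

#include <stdint.h>
#include <errno.h>

/* Sketch: how a 64->32 bit stat conversion can detect an overflowing inode */
static int ino_fits_32bit(uint64_t st_ino)
{
        if (st_ino != (uint32_t)st_ino)
                return -EOVERFLOW;      /* caller fails the stat() with EOVERFLOW */
        return 0;
}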
Post by Andi Kleen
Post by n***@suse.de
+static DEFINE_PER_CPU(int, last_ino);
+static atomic_t shared_last_ino;
With the 1024 skip, isn't overflow much more likely, just scaling
with the number of CPUs on a large CPU number systems, even if there
aren't that many new inodes?
Well EOVERFLOW should never happen with only the low 32 significant
bits set in the inode. If you are worried about wrapping the counter,
then no I don't think it is much more likely.

Because each CPU will only reserve another 1024 inode interval after
it has already allocated 1024 numbers. So the most wastage you will
get is (1024-1)*NR_CPUS -- somewhere around 1/1000th of the available
range.
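(For a concrete worst case: with NR_CPUS = 4096 that is 1023 * 4096, roughly
4.2 million skipped numbers, about 0.1% of the 2^32 range.)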

I guess overflow will be more common now because it will be possible
to allocate inodes much faster on such a huge machine :)
Post by Andi Kleen
Post by n***@suse.de
+static int last_ino_get(void)
+{
+ int *p = &get_cpu_var(last_ino);
+ int res = *p;
+
+ if (unlikely((res & 1023) == 0))
+ res = atomic_add_return(1024, &shared_last_ino) - 1024;
The magic numbers really want to be defines?
Sure OK.
Andi Kleen
2010-06-24 16:19:49 UTC
Permalink
Post by Nick Piggin
That's right. More or less it just requires alloc_inode to be exported,
adding more branches in new_inode would not be a good way to go.
One test/branch shouldn't hurt much.
Post by Nick Piggin
Post by Andi Kleen
I guess the main customer is sockets only.
I guess. Sockets and ram based filesystems. Interestingly I don't know
really what it's for (in socket code it's mostly for reporting and
hashing it seems). It sure isn't guaranteed to be unique.
Maybe it could be generated lazily on access for those?
I suppose stat on a socket is relatively rare.
The only problem is it would need an accessor.

But ok out of scope.
Post by Nick Piggin
Well I think glibc will convert 64 bit stat struct to 32bit for
old apps. It detects if the ino can't fit in 32 bits.
... and will fail the stat.

-Andi
--
***@linux.intel.com -- Speaking for myself only.
Nick Piggin
2010-06-24 16:38:59 UTC
Permalink
Post by Andi Kleen
Post by Nick Piggin
That's right. More or less it just requires alloc_inode to be exported,
adding more branches in new_inode would not be a good way to go.
One test/branch shouldn't hurt much.
If we go through filesystems anyway may as well just use alloc_inode.
Post by Andi Kleen
Post by Nick Piggin
Post by Andi Kleen
I guess the main customer is sockets only.
I guess. Sockets and ram based filesystems. Interestingly I don't know
really what it's for (in socket code it's mostly for reporting and
hashing it seems). It sure isn't guaranteed to be unique.
Maybe it could be generated lazily on access for those?
I suppose stat on a socket is relatively rare.
The only problem is would need an accessor.
But ok out of scope.
Yea that might work. sock_i_ino() and ->dname cover a lot.
Post by Andi Kleen
Post by Nick Piggin
Well I think glibc will convert 64 bit stat struct to 32bit for
old apps. It detects if the ino can't fit in 32 bits.
... and will fail the stat.
Which is what we're trying to avoid, I guess.
n***@suse.de
2010-06-24 03:02:15 UTC
Permalink
struct fs_struct.lock is an rwlock with the read-side used to protect
root and pwd members while taking references to them. Taking a reference
to a path typically requires just 2 atomic ops, so the critical section
is very small. Parallel read-side operations would have cacheline contention
on the lock, the dentry, and the vfsmount cachelines, so the rwlock is
unlikely to ever give a real parallelism increase.

Replace it with a spinlock to avoid one or two atomic operations in the
typical path lookup fastpath.

Signed-off-by: Nick Piggin <***@suse.de>
---
drivers/staging/pohmelfs/path_entry.c | 8 +++----
fs/cachefiles/daemon.c | 8 +++----
fs/dcache.c | 8 +++----
fs/exec.c | 4 +--
fs/fs_struct.c | 36 +++++++++++++++++-----------------
fs/namei.c | 8 +++----
fs/namespace.c | 4 +--
fs/proc/base.c | 4 +--
include/linux/fs_struct.h | 2 -
kernel/auditsc.c | 4 +--
kernel/fork.c | 10 ++++-----
11 files changed, 48 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -2012,10 +2012,10 @@ char *d_path(const struct path *path, ch
if (path->dentry->d_op && path->dentry->d_op->d_dname)
return path->dentry->d_op->d_dname(path->dentry, buf, buflen);

- read_lock(&current->fs->lock);
+ spin_lock(&current->fs->lock);
root = current->fs->root;
path_get(&root);
- read_unlock(&current->fs->lock);
+ spin_unlock(&current->fs->lock);
spin_lock(&dcache_lock);
tmp = root;
res = __d_path(path, &tmp, buf, buflen);
@@ -2110,12 +2110,12 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
if (!page)
return -ENOMEM;

- read_lock(&current->fs->lock);
+ spin_lock(&current->fs->lock);
pwd = current->fs->pwd;
path_get(&pwd);
root = current->fs->root;
path_get(&root);
- read_unlock(&current->fs->lock);
+ spin_unlock(&current->fs->lock);

error = -ENOENT;
spin_lock(&dcache_lock);
Index: linux-2.6/fs/fs_struct.c
===================================================================
--- linux-2.6.orig/fs/fs_struct.c
+++ linux-2.6/fs/fs_struct.c
@@ -13,11 +13,11 @@ void set_fs_root(struct fs_struct *fs, s
{
struct path old_root;

- write_lock(&fs->lock);
+ spin_lock(&fs->lock);
old_root = fs->root;
fs->root = *path;
path_get(path);
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
if (old_root.dentry)
path_put(&old_root);
}
@@ -30,11 +30,11 @@ void set_fs_pwd(struct fs_struct *fs, st
{
struct path old_pwd;

- write_lock(&fs->lock);
+ spin_lock(&fs->lock);
old_pwd = fs->pwd;
fs->pwd = *path;
path_get(path);
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);

if (old_pwd.dentry)
path_put(&old_pwd);
@@ -51,7 +51,7 @@ void chroot_fs_refs(struct path *old_roo
task_lock(p);
fs = p->fs;
if (fs) {
- write_lock(&fs->lock);
+ spin_lock(&fs->lock);
if (fs->root.dentry == old_root->dentry
&& fs->root.mnt == old_root->mnt) {
path_get(new_root);
@@ -64,7 +64,7 @@ void chroot_fs_refs(struct path *old_roo
fs->pwd = *new_root;
count++;
}
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
}
task_unlock(p);
} while_each_thread(g, p);
@@ -87,10 +87,10 @@ void exit_fs(struct task_struct *tsk)
if (fs) {
int kill;
task_lock(tsk);
- write_lock(&fs->lock);
+ spin_lock(&fs->lock);
tsk->fs = NULL;
kill = !--fs->users;
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
task_unlock(tsk);
if (kill)
free_fs_struct(fs);
@@ -104,14 +104,14 @@ struct fs_struct *copy_fs_struct(struct
if (fs) {
fs->users = 1;
fs->in_exec = 0;
- rwlock_init(&fs->lock);
+ spin_lock_init(&fs->lock);
fs->umask = old->umask;
- read_lock(&old->lock);
+ spin_lock(&old->lock);
fs->root = old->root;
path_get(&old->root);
fs->pwd = old->pwd;
path_get(&old->pwd);
- read_unlock(&old->lock);
+ spin_unlock(&old->lock);
}
return fs;
}
@@ -126,10 +126,10 @@ int unshare_fs_struct(void)
return -ENOMEM;

task_lock(current);
- write_lock(&fs->lock);
+ spin_lock(&fs->lock);
kill = !--fs->users;
current->fs = new_fs;
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
task_unlock(current);

if (kill)
@@ -148,7 +148,7 @@ EXPORT_SYMBOL(current_umask);
/* to be mentioned only in INIT_TASK */
struct fs_struct init_fs = {
.users = 1,
- .lock = __RW_LOCK_UNLOCKED(init_fs.lock),
+ .lock = __SPIN_LOCK_UNLOCKED(init_fs.lock),
.umask = 0022,
};

@@ -161,14 +161,14 @@ void daemonize_fs_struct(void)

task_lock(current);

- write_lock(&init_fs.lock);
+ spin_lock(&init_fs.lock);
init_fs.users++;
- write_unlock(&init_fs.lock);
+ spin_unlock(&init_fs.lock);

- write_lock(&fs->lock);
+ spin_lock(&fs->lock);
current->fs = &init_fs;
kill = !--fs->users;
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);

task_unlock(current);
if (kill)
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -486,10 +486,10 @@ static __always_inline void set_root(str
{
if (!nd->root.mnt) {
struct fs_struct *fs = current->fs;
- read_lock(&fs->lock);
+ spin_lock(&fs->lock);
nd->root = fs->root;
path_get(&nd->root);
- read_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
}
}

@@ -1017,10 +1017,10 @@ static int path_init(int dfd, const char
path_get(&nd->root);
} else if (dfd == AT_FDCWD) {
struct fs_struct *fs = current->fs;
- read_lock(&fs->lock);
+ spin_lock(&fs->lock);
nd->path = fs->pwd;
path_get(&fs->pwd);
- read_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
} else {
struct dentry *dentry;

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -2208,10 +2208,10 @@ SYSCALL_DEFINE2(pivot_root, const char _
goto out1;
}

- read_lock(&current->fs->lock);
+ spin_lock(&current->fs->lock);
root = current->fs->root;
path_get(&current->fs->root);
- read_unlock(&current->fs->lock);
+ spin_unlock(&current->fs->lock);
down_write(&namespace_sem);
mutex_lock(&old.dentry->d_inode->i_mutex);
error = -EINVAL;
Index: linux-2.6/include/linux/fs_struct.h
===================================================================
--- linux-2.6.orig/include/linux/fs_struct.h
+++ linux-2.6/include/linux/fs_struct.h
@@ -5,7 +5,7 @@

struct fs_struct {
int users;
- rwlock_t lock;
+ spinlock_t lock;
int umask;
int in_exec;
struct path root, pwd;
Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -1117,7 +1117,7 @@ int check_unsafe_exec(struct linux_binpr
bprm->unsafe = tracehook_unsafe_exec(p);

n_fs = 1;
- write_lock(&p->fs->lock);
+ spin_lock(&p->fs->lock);
rcu_read_lock();
for (t = next_thread(p); t != p; t = next_thread(t)) {
if (t->fs == p->fs)
@@ -1134,7 +1134,7 @@ int check_unsafe_exec(struct linux_binpr
res = 1;
}
}
- write_unlock(&p->fs->lock);
+ spin_unlock(&p->fs->lock);

return res;
}
Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -156,10 +156,10 @@ static int get_fs_path(struct task_struc
task_lock(task);
fs = task->fs;
if (fs) {
- read_lock(&fs->lock);
+ spin_lock(&fs->lock);
*path = root ? fs->root : fs->pwd;
path_get(path);
- read_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
result = 0;
}
task_unlock(task);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -752,13 +752,13 @@ static int copy_fs(unsigned long clone_f
struct fs_struct *fs = current->fs;
if (clone_flags & CLONE_FS) {
/* tsk->fs is already what we want */
- write_lock(&fs->lock);
+ spin_lock(&fs->lock);
if (fs->in_exec) {
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
return -EAGAIN;
}
fs->users++;
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
return 0;
}
tsk->fs = copy_fs_struct(fs);
@@ -1675,13 +1675,13 @@ SYSCALL_DEFINE1(unshare, unsigned long,

if (new_fs) {
fs = current->fs;
- write_lock(&fs->lock);
+ spin_lock(&fs->lock);
current->fs = new_fs;
if (--fs->users)
new_fs = NULL;
else
new_fs = fs;
- write_unlock(&fs->lock);
+ spin_unlock(&fs->lock);
}

if (new_mm) {
Index: linux-2.6/drivers/staging/pohmelfs/path_entry.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/path_entry.c
+++ linux-2.6/drivers/staging/pohmelfs/path_entry.c
@@ -44,9 +44,9 @@ int pohmelfs_construct_path_string(struc
return -ENOENT;
}

- read_lock(&current->fs->lock);
+ spin_lock(&current->fs->lock);
path.mnt = mntget(current->fs->root.mnt);
- read_unlock(&current->fs->lock);
+ spin_unlock(&current->fs->lock);

path.dentry = d;

@@ -91,9 +91,9 @@ int pohmelfs_path_length(struct pohmelfs
return -ENOENT;
}

- read_lock(&current->fs->lock);
+ spin_lock(&current->fs->lock);
root = dget(current->fs->root.dentry);
- read_unlock(&current->fs->lock);
+ spin_unlock(&current->fs->lock);

spin_lock(&dcache_lock);

Index: linux-2.6/fs/cachefiles/daemon.c
===================================================================
--- linux-2.6.orig/fs/cachefiles/daemon.c
+++ linux-2.6/fs/cachefiles/daemon.c
@@ -574,9 +574,9 @@ static int cachefiles_daemon_cull(struct

/* extract the directory dentry from the cwd */
fs = current->fs;
- read_lock(&fs->lock);
+ spin_lock(&fs->lock);
dir = dget(fs->pwd.dentry);
- read_unlock(&fs->lock);
+ spin_unlock(&fs->lock);

if (!S_ISDIR(dir->d_inode->i_mode))
goto notdir;
@@ -650,9 +650,9 @@ static int cachefiles_daemon_inuse(struc

/* extract the directory dentry from the cwd */
fs = current->fs;
- read_lock(&fs->lock);
+ spin_lock(&fs->lock);
dir = dget(fs->pwd.dentry);
- read_unlock(&fs->lock);
+ spin_unlock(&fs->lock);

if (!S_ISDIR(dir->d_inode->i_mode))
goto notdir;
Index: linux-2.6/kernel/auditsc.c
===================================================================
--- linux-2.6.orig/kernel/auditsc.c
+++ linux-2.6/kernel/auditsc.c
@@ -1838,10 +1838,10 @@ void __audit_getname(const char *name)
context->names[context->name_count].osid = 0;
++context->name_count;
if (!context->pwd.dentry) {
- read_lock(&current->fs->lock);
+ spin_lock(&current->fs->lock);
context->pwd = current->fs->pwd;
path_get(&current->fs->pwd);
- read_unlock(&current->fs->lock);
+ spin_unlock(&current->fs->lock);
}

}
n***@suse.de
2010-06-24 03:02:13 UTC
Permalink
Introduce a type of hlist that can support the use of the lowest bit in the
hlist_head. This will be subsequently used to implement per-bucket bit spinlock
for inode and dentry hashes.
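
For illustration, a rough sketch of how a hash bucket ends up being locked via
the low bit (hypothetical wrapper names, not part of this patch; the real
lock/unlock helpers come with the inode/dentry hash conversion patches later
in the series):

#include <linux/bit_spinlock.h>

static inline void hlist_bl_lock(struct hlist_bl_head *b)
{
	bit_spin_lock(0, (unsigned long *)&b->first);
}

static inline void hlist_bl_unlock(struct hlist_bl_head *b)
{
	__bit_spin_unlock(0, (unsigned long *)&b->first);
}

	/* typical insertion into one hash bucket */
	hlist_bl_lock(head);
	hlist_bl_add_head(node, head);
	hlist_bl_unlock(head);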

Signed-off-by: Nick Piggin <***@suse.de>

---
include/linux/list_bl.h | 99 +++++++++++++++++++++++++++++++++++++
include/linux/rculist_bl.h | 120 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 219 insertions(+)

Index: linux-2.6/include/linux/list_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/list_bl.h
@@ -0,0 +1,99 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ */
+
+struct hlist_bl_head {
+ struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+ struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+ h->next = NULL;
+ h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+ return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)h->first & ~1UL);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+ h->first = (struct hlist_bl_node *)((unsigned long)n | ((unsigned long)h->first & 1UL));
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *h)
+{
+ return !((unsigned long)h->first & ~1UL);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+ struct hlist_bl_node *next = n->next;
+ struct hlist_bl_node **pprev = n->pprev;
+ *pprev = (struct hlist_bl_node *)((unsigned long)next | ((unsigned long)*pprev & 1UL));
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+static inline void hlist_bl_del_init(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ INIT_HLIST_BL_NODE(n);
+ }
+}
+
+/**
+ * hlist_bl_for_each_entry - iterate over list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member) \
+ for (pos = hlist_bl_first(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+ pos = pos->next)
+
+#endif
Index: linux-2.6/include/linux/rculist_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rculist_bl.h
@@ -0,0 +1,120 @@
+#ifndef _LINUX_RCULIST_BL_H
+#define _LINUX_RCULIST_BL_H
+
+#ifdef __KERNEL__
+
+/*
+ * RCU-protected list version
+ */
+#include <linux/list_bl.h>
+#include <linux/rcupdate.h>
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+ rcu_assign_pointer(h->first, (struct hlist_bl_node *)((unsigned long)n | ((unsigned long)h->first & 1UL)));
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)rcu_dereference(h->first) & ~1UL);
+}
+
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on the node return true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_bl_add_head_rcu() or
+ * hlist_bl_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ n->pprev = NULL;
+ }
+}
+
+/**
+ * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry().
+ */
+static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_bl,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs. Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first_rcu(h, n);
+}
+/**
+ * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_bl_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member) \
+ for (pos = hlist_bl_first_rcu(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+ pos = rcu_dereference_raw(pos->next))
+
+#endif
+#endif


Eric Dumazet
2010-06-24 06:04:22 UTC
Permalink
Attachment: plain text document (list-bitlock.patch)
Introduce a type of hlist that can support the use of the lowest bit in the
hlist_head. This will be subsequently used to implement per-bucket bit spinlock
for inode and dentry hashes.


---
include/linux/list_bl.h | 99 +++++++++++++++++++++++++++++++++++++
include/linux/rculist_bl.h | 120 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 219 insertions(+)

Index: linux-2.6/include/linux/list_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/list_bl.h
@@ -0,0 +1,99 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ */
+
+struct hlist_bl_head {
+ struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+ struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+ h->next = NULL;
+ h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+ return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)h->first & ~1UL);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+ h->first = (struct hlist_bl_node *)((unsigned long)n | ((unsigned long)h->first & 1UL));

Hmm, shouldn't hlist_bl_set_first() be used only with bit lock held ?

h->first = (struct hlist_bl_node *)((unsigned long)n | 1UL);
+}
+
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+ rcu_assign_pointer(h->first, (struct hlist_bl_node *)((unsigned long)n | ((unsigned long)h->first & 1UL)));

Same question here.
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)rcu_dereference(h->first) & ~1UL);
+}
Looks really nice Nick, maybe we should push this so that other
subsystems can start using it.

Thanks
Nick Piggin
2010-06-24 14:42:28 UTC
Permalink
Post by Eric Dumazet
+static inline void hlist_bl_set_first(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+ h->first = (struct hlist_bl_node *)((unsigned long)n | ((unsigned long)h->first & 1UL));

Hmm, shouldn't hlist_bl_set_first() be used only with bit lock held ?

h->first = (struct hlist_bl_node *)((unsigned long)n | 1UL);
+}
I had it that way but changed it for some reason. Thinking about it
again though, you're right I'm sure (it could have been some other
bug in my code making me think I needed it).

Thanks.
Post by Eric Dumazet
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+ rcu_assign_pointer(h->first, (struct hlist_bl_node *)((unsigned long)n | ((unsigned long)h->first & 1UL)));

Same question here.

+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)rcu_dereference(h->first) & ~1UL);
+}


Looks really nice Nick, maybe we should push this so that other
subsystems can start using it.
Sure, if you have an interest in using it, it will be trivial to send
upstream. Should we merge it before any users appear? I don't know...

Eric Dumazet
2010-06-24 16:01:53 UTC
Permalink
Post by Nick Piggin
Sure, if you have an interest in using it, it will be trivial to send
upstream. Should we merge it before any users appear? I don't know...

I'll work on using this for the ip route cache, as you suggested.

BTW, I think we only need to care about this 0 bit if it could possibly be
set, ie:

#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)

since bit_spin_lock() won't set it anyway if CONFIG_SMP=n and
CONFIG_DEBUG_SPINLOCK=n
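
Something along these lines (just a sketch of the idea, not tested):

#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
#define LIST_BL_LOCKMASK	1UL
#else
#define LIST_BL_LOCKMASK	0UL
#endif

so hlist_bl_first()/hlist_bl_set_first() would mask with LIST_BL_LOCKMASK
instead of a hard-coded 1UL, and the masking compiles away on UP builds.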



Paul E. McKenney
2010-06-28 21:37:39 UTC
Permalink
Post by n***@suse.de
Introduce a type of hlist that can support the use of the lowest bit in the
hlist_head. This will be subsequently used to implement per-bucket bit spinlock
for inode and dentry hashes.
Looks good! One question on non-RCU pointer poisoning and a typo.
Post by n***@suse.de
---
include/linux/list_bl.h | 99 +++++++++++++++++++++++++++++++++++++
include/linux/rculist_bl.h | 120 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 219 insertions(+)
Index: linux-2.6/include/linux/list_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/list_bl.h
@@ -0,0 +1,99 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ */
+
+struct hlist_bl_head {
+ struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+ struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+ h->next = NULL;
+ h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+ return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)h->first & ~1UL);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+ h->first = (struct hlist_bl_node *)((unsigned long)n | ((unsigned long)h->first & 1UL));
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *h)
+{
+ return !((unsigned long)h->first & ~1UL);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+ struct hlist_bl_node *next = n->next;
+ struct hlist_bl_node **pprev = n->pprev;
+ *pprev = (struct hlist_bl_node *)((unsigned long)next | ((unsigned long)*pprev & 1UL));
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
OK, I'll bite... Why don't we poison the ->next pointer?
Post by n***@suse.de
+}
+
+static inline void hlist_bl_del_init(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ INIT_HLIST_BL_NODE(n);
+ }
+}
+
+/**
+ * hlist_bl_for_each_entry - iterate over list of given type
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member) \
+ for (pos = hlist_bl_first(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+ pos = pos->next)
+
+#endif
Index: linux-2.6/include/linux/rculist_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rculist_bl.h
@@ -0,0 +1,120 @@
+#ifndef _LINUX_RCULIST_BL_H
+#define _LINUX_RCULIST_BL_H
+
+#ifdef __KERNEL__
+
+/*
+ * RCU-protected list version
+ */
+#include <linux/list_bl.h>
+#include <linux/rcupdate.h>
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+ rcu_assign_pointer(h->first, (struct hlist_bl_node *)((unsigned long)n | ((unsigned long)h->first & 1UL)));
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)rcu_dereference(h->first) & ~1UL);
+}
+
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ *
+ * Note: hlist_bl_unhashed() on the node return true after this. It is
returns
Post by n***@suse.de
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_bl_add_head_rcu() or
+ * hlist_bl_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ n->pprev = NULL;
+ }
+}
+
+/**
+ * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
+ *
+ * Note: hlist_bl_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry().
+ */
+static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_add_head_rcu
+ *
+ * Adds the specified element to the specified hlist_bl,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs. Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first_rcu(h, n);
+}
+/**
+ * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
+ *
+ */
+#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member) \
+ for (pos = hlist_bl_first_rcu(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+ pos = rcu_dereference_raw(pos->next))
+
+#endif
+#endif
Nick Piggin
2010-06-29 06:30:10 UTC
Permalink
Post by Paul E. McKenney
Post by n***@suse.de
Introduce a type of hlist that can support the use of the lowest bit in the
hlist_head. This will be subsequently used to implement per-bucket bit spinlock
for inode and dentry hashes.
Looks good! One question on non-RCU pointer poisoning and a typo.
..
Post by Paul E. McKenney
Post by n***@suse.de
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
OK, I'll bite... Why don't we poison the ->next pointer?
Ah, I took the code from list_nulls.h, but actually, except for the
hlist anchoring, the code is much more similar to the standard hlist.
This can be poisoned, and I'll go through and look for other possible
differences with hlists.
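
That is, something like this (untested):

static inline void hlist_bl_del(struct hlist_bl_node *n)
{
	__hlist_bl_del(n);
	n->next = LIST_POISON1;
	n->pprev = LIST_POISON2;
}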
Post by Paul E. McKenney
Post by n***@suse.de
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ *
+ * Note: hlist_bl_unhashed() on the node return true after this. It is
returns
Yes, thanks!

n***@suse.de
2010-06-24 03:02:19 UTC
Permalink
Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.

The number of atomics should remain the same for fastpath rlock cases, though
the code will be slightly slower due to per-cpu access. Scalability will
probably not be much improved in common cases yet, due to other locks getting
in the way. However, independent path lookups over mountpoints should be one
case where scalability is improved.

The slowpath will be made significantly slower due to the use of the brlock.
On a 64 core, 64 socket, 32 node Altix system (so a decent amount of latency
to remote nodes), a simple umount microbenchmark (mount --bind mnt mnt2 ;
umount mnt2, looped 1000 times) took 6.8s before this patch and 7.1s
afterwards, for about a 5% increase in elapsed time.
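
For illustration, the resulting usage pattern (sketch only; the real
conversions are in the diff below):

	/* mount hash lookups and walking back up the tree: read side */
	br_read_lock(vfsmount_lock);
	child_mnt = __lookup_mnt(path->mnt, path->dentry, 1);
	if (child_mnt)
		mntget(child_mnt);
	br_read_unlock(vfsmount_lock);

	/* modifying the mount hash or tree: write side */
	br_write_lock(vfsmount_lock);
	umount_tree(mnt, 1, &umount_list);
	br_write_unlock(vfsmount_lock);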

Cc: linux-***@vger.kernel.org
Cc: linux-***@vger.kernel.org
Cc: Al Viro <***@ZenIV.linux.org.uk>
Cc: Frank Mayhar <***@google.com>,
Cc: John Stultz <***@us.ibm.com>
Cc: Andi Kleen <***@linux.intel.com>
Signed-off-by: Nick Piggin <***@suse.de>
---
fs/dcache.c | 11 +--
fs/internal.h | 5 +
fs/namei.c | 7 +-
fs/namespace.c | 177 +++++++++++++++++++++++++++++++++++----------------------
fs/pnode.c | 11 ++-
5 files changed, 134 insertions(+), 77 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1928,7 +1928,7 @@ char *__d_path(const struct path *path,
char *end = buffer + buflen;
char *retval;

- spin_lock(&vfsmount_lock);
+ br_read_lock(vfsmount_lock);
prepend(&end, &buflen, "\0", 1);
if (d_unlinked(dentry) &&
(prepend(&end, &buflen, " (deleted)", 10) != 0))
@@ -1964,7 +1964,7 @@ char *__d_path(const struct path *path,
}

out:
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return retval;

global_root:
@@ -2195,11 +2195,12 @@ int path_is_under(struct path *path1, st
struct vfsmount *mnt = path1->mnt;
struct dentry *dentry = path1->dentry;
int res;
- spin_lock(&vfsmount_lock);
+
+ br_read_lock(vfsmount_lock);
if (mnt != path2->mnt) {
for (;;) {
if (mnt->mnt_parent == mnt) {
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return 0;
}
if (mnt->mnt_parent == path2->mnt)
@@ -2209,7 +2210,7 @@ int path_is_under(struct path *path1, st
dentry = mnt->mnt_mountpoint;
}
res = is_subdir(dentry, path2->dentry);
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return res;
}
EXPORT_SYMBOL(path_is_under);
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -601,15 +601,16 @@ int follow_up(struct path *path)
{
struct vfsmount *parent;
struct dentry *mountpoint;
- spin_lock(&vfsmount_lock);
+
+ br_read_lock(vfsmount_lock);
parent = path->mnt->mnt_parent;
if (parent == path->mnt) {
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return 0;
}
mntget(parent);
mountpoint = dget(path->mnt->mnt_mountpoint);
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
dput(path->dentry);
path->dentry = mountpoint;
mntput(path->mnt);
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -11,6 +11,8 @@
#include <linux/syscalls.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
#include <linux/smp_lock.h>
#include <linux/init.h>
#include <linux/kernel.h>
@@ -37,12 +39,10 @@
#define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
#define HASH_SIZE (1UL << HASH_SHIFT)

-/* spinlock for vfsmount related operations, inplace of dcache_lock */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
-
static int event;
static DEFINE_IDA(mnt_id_ida);
static DEFINE_IDA(mnt_group_ida);
+static DEFINE_SPINLOCK(mnt_id_lock);
static int mnt_id_start = 0;
static int mnt_group_start = 1;

@@ -54,6 +54,16 @@ static struct rw_semaphore namespace_sem
struct kobject *fs_kobj;
EXPORT_SYMBOL_GPL(fs_kobj);

+/*
+ * vfsmount lock may be taken for read to prevent changes to the
+ * vfsmount hash, ie. during mountpoint lookups or walking back
+ * up the tree.
+ *
+ * It should be taken for write in all cases where the vfsmount
+ * tree or hash is modified or when a vfsmount structure is modified.
+ */
+DEFINE_BRLOCK(vfsmount_lock);
+
static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
{
unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
@@ -64,18 +74,21 @@ static inline unsigned long hash(struct

#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)

-/* allocation is serialized by namespace_sem */
+/*
+ * allocation is serialized by namespace_sem, but we need the spinlock to
+ * serialize with freeing.
+ */
static int mnt_alloc_id(struct vfsmount *mnt)
{
int res;

retry:
ida_pre_get(&mnt_id_ida, GFP_KERNEL);
- spin_lock(&vfsmount_lock);
+ spin_lock(&mnt_id_lock);
res = ida_get_new_above(&mnt_id_ida, mnt_id_start, &mnt->mnt_id);
if (!res)
mnt_id_start = mnt->mnt_id + 1;
- spin_unlock(&vfsmount_lock);
+ spin_unlock(&mnt_id_lock);
if (res == -EAGAIN)
goto retry;

@@ -85,11 +98,11 @@ retry:
static void mnt_free_id(struct vfsmount *mnt)
{
int id = mnt->mnt_id;
- spin_lock(&vfsmount_lock);
+ spin_lock(&mnt_id_lock);
ida_remove(&mnt_id_ida, id);
if (mnt_id_start > id)
mnt_id_start = id;
- spin_unlock(&vfsmount_lock);
+ spin_unlock(&mnt_id_lock);
}

/*
@@ -344,7 +357,7 @@ static int mnt_make_readonly(struct vfsm
{
int ret = 0;

- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
mnt->mnt_flags |= MNT_WRITE_HOLD;
/*
* After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -378,15 +391,15 @@ static int mnt_make_readonly(struct vfsm
*/
smp_wmb();
mnt->mnt_flags &= ~MNT_WRITE_HOLD;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
return ret;
}

static void __mnt_unmake_readonly(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
mnt->mnt_flags &= ~MNT_READONLY;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}

void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
@@ -410,6 +423,7 @@ void free_vfsmnt(struct vfsmount *mnt)
/*
* find the first or last mount at @dentry on vfsmount @mnt depending on
* @dir. If @dir is set return the first mount else return the last mount.
+ * vfsmount_lock must be held for read or write.
*/
struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
int dir)
@@ -439,10 +453,11 @@ struct vfsmount *__lookup_mnt(struct vfs
struct vfsmount *lookup_mnt(struct path *path)
{
struct vfsmount *child_mnt;
- spin_lock(&vfsmount_lock);
+
+ br_read_lock(vfsmount_lock);
if ((child_mnt = __lookup_mnt(path->mnt, path->dentry, 1)))
mntget(child_mnt);
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
return child_mnt;
}

@@ -451,6 +466,9 @@ static inline int check_mnt(struct vfsmo
return mnt->mnt_ns == current->nsproxy->mnt_ns;
}

+/*
+ * vfsmount lock must be held for write
+ */
static void touch_mnt_namespace(struct mnt_namespace *ns)
{
if (ns) {
@@ -459,6 +477,9 @@ static void touch_mnt_namespace(struct m
}
}

+/*
+ * vfsmount lock must be held for write
+ */
static void __touch_mnt_namespace(struct mnt_namespace *ns)
{
if (ns && ns->event != event) {
@@ -467,6 +488,9 @@ static void __touch_mnt_namespace(struct
}
}

+/*
+ * vfsmount lock must be held for write
+ */
static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
{
old_path->dentry = mnt->mnt_mountpoint;
@@ -478,6 +502,9 @@ static void detach_mnt(struct vfsmount *
old_path->dentry->d_mounted--;
}

+/*
+ * vfsmount lock must be held for write
+ */
void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
struct vfsmount *child_mnt)
{
@@ -486,6 +513,9 @@ void mnt_set_mountpoint(struct vfsmount
dentry->d_mounted++;
}

+/*
+ * vfsmount lock must be held for write
+ */
static void attach_mnt(struct vfsmount *mnt, struct path *path)
{
mnt_set_mountpoint(path->mnt, path->dentry, mnt);
@@ -495,7 +525,7 @@ static void attach_mnt(struct vfsmount *
}

/*
- * the caller must hold vfsmount_lock
+ * vfsmount lock must be held for write
*/
static void commit_tree(struct vfsmount *mnt)
{
@@ -618,39 +648,43 @@ static inline void __mntput(struct vfsmo
void mntput_no_expire(struct vfsmount *mnt)
{
repeat:
- if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
- if (likely(!mnt->mnt_pinned)) {
- spin_unlock(&vfsmount_lock);
- __mntput(mnt);
- return;
- }
- atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
- mnt->mnt_pinned = 0;
- spin_unlock(&vfsmount_lock);
- acct_auto_close_mnt(mnt);
- goto repeat;
+ if (atomic_add_unless(&mnt->mnt_count, -1, 1))
+ return;
+ br_write_lock(vfsmount_lock);
+ if (!atomic_dec_and_test(&mnt->mnt_count)) {
+ br_write_unlock(vfsmount_lock);
+ return;
+ }
+ if (likely(!mnt->mnt_pinned)) {
+ br_write_unlock(vfsmount_lock);
+ __mntput(mnt);
+ return;
}
+ atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
+ mnt->mnt_pinned = 0;
+ br_write_unlock(vfsmount_lock);
+ acct_auto_close_mnt(mnt);
+ goto repeat;
}
-
EXPORT_SYMBOL(mntput_no_expire);

void mnt_pin(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
mnt->mnt_pinned++;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}

EXPORT_SYMBOL(mnt_pin);

void mnt_unpin(struct vfsmount *mnt)
{
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
if (mnt->mnt_pinned) {
atomic_inc(&mnt->mnt_count);
mnt->mnt_pinned--;
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}

EXPORT_SYMBOL(mnt_unpin);
@@ -741,12 +775,12 @@ int mnt_had_events(struct proc_mounts *p
struct mnt_namespace *ns = p->ns;
int res = 0;

- spin_lock(&vfsmount_lock);
+ br_read_lock(vfsmount_lock);
if (p->event != ns->event) {
p->event = ns->event;
res = 1;
}
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);

return res;
}
@@ -948,12 +982,12 @@ int may_umount_tree(struct vfsmount *mnt
int minimum_refs = 0;
struct vfsmount *p;

- spin_lock(&vfsmount_lock);
+ br_read_lock(vfsmount_lock);
for (p = mnt; p; p = next_mnt(p, mnt)) {
actual_refs += atomic_read(&p->mnt_count);
minimum_refs += 2;
}
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);

if (actual_refs > minimum_refs)
return 0;
@@ -980,10 +1014,10 @@ int may_umount(struct vfsmount *mnt)
{
int ret = 1;
down_read(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_read_lock(vfsmount_lock);
if (propagate_mount_busy(mnt, 2))
ret = 0;
- spin_unlock(&vfsmount_lock);
+ br_read_unlock(vfsmount_lock);
up_read(&namespace_sem);
return ret;
}
@@ -999,13 +1033,14 @@ void release_mounts(struct list_head *he
if (mnt->mnt_parent != mnt) {
struct dentry *dentry;
struct vfsmount *m;
- spin_lock(&vfsmount_lock);
+
+ br_write_lock(vfsmount_lock);
dentry = mnt->mnt_mountpoint;
m = mnt->mnt_parent;
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
m->mnt_ghosts--;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
dput(dentry);
mntput(m);
}
@@ -1013,6 +1048,10 @@ void release_mounts(struct list_head *he
}
}

+/*
+ * vfsmount lock must be held for write
+ * namespace_sem must be held for write
+ */
void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
{
struct vfsmount *p;
@@ -1103,7 +1142,7 @@ static int do_umount(struct vfsmount *mn
}

down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
event++;

if (!(flags & MNT_DETACH))
@@ -1115,7 +1154,7 @@ static int do_umount(struct vfsmount *mn
umount_tree(mnt, 1, &umount_list);
retval = 0;
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_write(&namespace_sem);
release_mounts(&umount_list);
return retval;
@@ -1227,19 +1266,19 @@ struct vfsmount *copy_tree(struct vfsmou
q = clone_mnt(p, p->mnt_root, flag);
if (!q)
goto Enomem;
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
list_add_tail(&q->mnt_list, &res->mnt_list);
attach_mnt(q, &path);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
}
return res;
Enomem:
if (res) {
LIST_HEAD(umount_list);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
umount_tree(res, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
release_mounts(&umount_list);
}
return NULL;
@@ -1258,9 +1297,9 @@ void drop_collected_mounts(struct vfsmou
{
LIST_HEAD(umount_list);
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
umount_tree(mnt, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_write(&namespace_sem);
release_mounts(&umount_list);
}
@@ -1388,7 +1427,7 @@ static int attach_recursive_mnt(struct v
if (err)
goto out_cleanup_ids;

- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);

if (IS_MNT_SHARED(dest_mnt)) {
for (p = source_mnt; p; p = next_mnt(p, source_mnt))
@@ -1407,7 +1446,8 @@ static int attach_recursive_mnt(struct v
list_del_init(&child->mnt_hash);
commit_tree(child);
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
+
return 0;

out_cleanup_ids:
@@ -1462,10 +1502,10 @@ static int do_change_type(struct path *p
goto out_unlock;
}

- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
change_mnt_propagation(m, type);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);

out_unlock:
up_write(&namespace_sem);
@@ -1509,9 +1549,10 @@ static int do_loopback(struct path *path
err = graft_tree(mnt, path);
if (err) {
LIST_HEAD(umount_list);
- spin_lock(&vfsmount_lock);
+
+ br_write_lock(vfsmount_lock);
umount_tree(mnt, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
release_mounts(&umount_list);
}

@@ -1564,16 +1605,16 @@ static int do_remount(struct path *path,
else
err = do_remount_sb(sb, flags, data, 0);
if (!err) {
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
mnt_flags |= path->mnt->mnt_flags & MNT_PROPAGATION_MASK;
path->mnt->mnt_flags = mnt_flags;
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
up_write(&sb->s_umount);
if (!err) {
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
touch_mnt_namespace(path->mnt->mnt_ns);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
}
return err;
}
@@ -1750,7 +1791,7 @@ void mark_mounts_for_expiry(struct list_
return;

down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);

/* extract from the expiration list every vfsmount that matches the
* following criteria:
@@ -1769,7 +1810,7 @@ void mark_mounts_for_expiry(struct list_
touch_mnt_namespace(mnt->mnt_ns);
umount_tree(mnt, 1, &umounts);
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_write(&namespace_sem);

release_mounts(&umounts);
@@ -1826,6 +1867,8 @@ resume:
/*
* process a list of expirable mountpoints with the intent of discarding any
* submounts of a specific parent mountpoint
+ *
+ * vfsmount_lock must be held for write
*/
static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts)
{
@@ -2044,9 +2087,9 @@ static struct mnt_namespace *dup_mnt_ns(
kfree(new_ns);
return ERR_PTR(-ENOMEM);
}
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);

/*
* Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -2243,7 +2286,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
goto out2; /* not attached */
/* make sure we can reach put_old from new_root */
tmp = old.mnt;
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
if (tmp != new.mnt) {
for (;;) {
if (tmp->mnt_parent == tmp)
@@ -2263,7 +2306,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
/* mount new_root on / */
attach_mnt(new.mnt, &root_parent);
touch_mnt_namespace(current->nsproxy->mnt_ns);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
chroot_fs_refs(&root, &new);
error = 0;
path_put(&root_parent);
@@ -2278,7 +2321,7 @@ out1:
out0:
return error;
out3:
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
goto out2;
}

@@ -2325,6 +2368,8 @@ void __init mnt_init(void)
for (u = 0; u < HASH_SIZE; u++)
INIT_LIST_HEAD(&mount_hashtable[u]);

+ br_lock_init(vfsmount_lock);
+
err = sysfs_init();
if (err)
printk(KERN_WARNING "%s: sysfs_init error: %d\n",
@@ -2343,9 +2388,9 @@ void put_mnt_ns(struct mnt_namespace *ns
if (!atomic_dec_and_test(&ns->count))
return;
down_write(&namespace_sem);
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
umount_tree(ns->root, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_write(&namespace_sem);
release_mounts(&umount_list);
kfree(ns);
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c
+++ linux-2.6/fs/pnode.c
@@ -126,6 +126,9 @@ static int do_make_slave(struct vfsmount
return 0;
}

+/*
+ * vfsmount lock must be held for write
+ */
void change_mnt_propagation(struct vfsmount *mnt, int type)
{
if (type == MS_SHARED) {
@@ -270,12 +273,12 @@ int propagate_mnt(struct vfsmount *dest_
prev_src_mnt = child;
}
out:
- spin_lock(&vfsmount_lock);
+ br_write_lock(vfsmount_lock);
while (!list_empty(&tmp_list)) {
child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash);
umount_tree(child, 0, &umount_list);
}
- spin_unlock(&vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
release_mounts(&umount_list);
return ret;
}
@@ -296,6 +299,8 @@ static inline int do_refcount_check(stru
* other mounts its parent propagates to.
* Check if any of these mounts that **do not have submounts**
* have more references than 'refcnt'. If so return busy.
+ *
+ * vfsmount lock must be held for read or write
*/
int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
{
@@ -353,6 +358,8 @@ static void __propagate_umount(struct vf
* collect all mounts that receive propagation from the mount in @list,
* and return these additional mounts in the same list.
* @list: the list of mounts to be unmounted.
+ *
+ * vfsmount lock must be held for write
*/
int propagate_umount(struct list_head *list)
{
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h
+++ linux-2.6/fs/internal.h
@@ -9,6 +9,8 @@
* 2 of the License, or (at your option) any later version.
*/

+#include <linux/lglock.h>
+
struct super_block;
struct linux_binprm;
struct path;
@@ -70,7 +72,8 @@ extern struct vfsmount *copy_tree(struct

extern void __init mnt_init(void);

-extern spinlock_t vfsmount_lock;
+DECLARE_BRLOCK(vfsmount_lock);
+

/*
* fs_struct.c
n***@suse.de
2010-06-24 03:02:17 UTC
Permalink
This patch introduces "local-global" locks (lglocks). These can be used to:

- Provide fast exclusive access to per-CPU data, with exclusive access to
another CPU's data allowed but possibly subject to contention, and to provide
very slow exclusive access to all per-CPU data.
- Or to provide very fast and scalable read serialisation, and to provide
very slow exclusive serialisation of data (not necessarily per-CPU data).

Brlocks are also implemented as a short-hand notation for the latter use
case.

Thanks to Paul for local/global naming convention.
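
For illustration, declaring and using a brlock with these macros looks like
this (sketch only, mirroring the vfsmount_lock conversion in the next patch):

DEFINE_BRLOCK(my_lock);			/* in one .c file */
DECLARE_BRLOCK(my_lock);		/* in a shared header */

	br_lock_init(my_lock);		/* once, at init time */

	br_read_lock(my_lock);		/* fast path: takes only this CPU's lock */
	/* ... read-side critical section ... */
	br_read_unlock(my_lock);

	br_write_lock(my_lock);		/* slow path: takes every online CPU's lock */
	/* ... exclusive critical section ... */
	br_write_unlock(my_lock);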

Cc: linux-***@vger.kernel.org
Cc: linux-***@vger.kernel.org
Cc: Al Viro <***@ZenIV.linux.org.uk>
Cc: "Paul E. McKenney" <***@linux.vnet.ibm.com>
Cc: Frank Mayhar <***@google.com>,
Cc: John Stultz <***@us.ibm.com>
Cc: Andi Kleen <***@linux.intel.com>
Signed-off-by: Nick Piggin <***@suse.de>
---
include/linux/lglock.h | 172 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 172 insertions(+)

Index: linux-2.6/include/linux/lglock.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/lglock.h
@@ -0,0 +1,172 @@
+/*
+ * Specialised local-global spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * "local/global locks" (lglocks) can be used to:
+ *
+ * - Provide fast exclusive access to per-CPU data, with exclusive access to
+ * another CPU's data allowed but possibly subject to contention, and to
+ * provide very slow exclusive access to all per-CPU data.
+ * - Or to provide very fast and scalable read serialisation, and to provide
+ * very slow exclusive serialisation of data (not necessarily per-CPU data).
+ *
+ * Brlocks are also implemented as a short-hand notation for the latter use
+ * case.
+ *
+ * Copyright 2009, 2010, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_LGLOCK_H
+#define __LINUX_LGLOCK_H
+
+#include <linux/spinlock.h>
+#include <linux/lockdep.h>
+#include <linux/percpu.h>
+
+/* can make br locks by using local lock for read side, global lock for write */
+#define br_lock_init(name) name##_lock_init()
+#define br_read_lock(name) name##_local_lock()
+#define br_read_unlock(name) name##_local_unlock()
+#define br_write_lock(name) name##_global_lock_online()
+#define br_write_unlock(name) name##_global_unlock_online()
+
+#define DECLARE_BRLOCK(name) DECLARE_LGLOCK(name)
+#define DEFINE_BRLOCK(name) DEFINE_LGLOCK(name)
+
+
+#define lg_lock_init(name) name##_lock_init()
+#define lg_local_lock(name) name##_local_lock()
+#define lg_local_unlock(name) name##_local_unlock()
+#define lg_local_lock_cpu(name, cpu) name##_local_lock_cpu(cpu)
+#define lg_local_unlock_cpu(name, cpu) name##_local_unlock_cpu(cpu)
+#define lg_global_lock(name) name##_global_lock()
+#define lg_global_unlock(name) name##_global_unlock()
+#define lg_global_lock_online(name) name##_global_lock_online()
+#define lg_global_unlock_online(name) name##_global_unlock_online()
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#define LOCKDEP_INIT_MAP lockdep_init_map
+
+#define DEFINE_LGLOCK_LOCKDEP(name) \
+ struct lock_class_key name##_lock_key; \
+ struct lockdep_map name##_lock_dep_map; \
+ EXPORT_SYMBOL(name##_lock_dep_map)
+
+#else
+#define LOCKDEP_INIT_MAP(a, b, c, d)
+
+#define DEFINE_LGLOCK_LOCKDEP(name)
+#endif
+
+
+#define DECLARE_LGLOCK(name) \
+ extern void name##_lock_init(void); \
+ extern void name##_local_lock(void); \
+ extern void name##_local_unlock(void); \
+ extern void name##_local_lock_cpu(int cpu); \
+ extern void name##_local_unlock_cpu(int cpu); \
+ extern void name##_global_lock(void); \
+ extern void name##_global_unlock(void); \
+ extern void name##_global_lock_online(void); \
+ extern void name##_global_unlock_online(void); \
+
+#define DEFINE_LGLOCK(name) \
+ \
+ DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \
+ DEFINE_LGLOCK_LOCKDEP(name); \
+ \
+ void name##_lock_init(void) { \
+ int i; \
+ LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
+ for_each_possible_cpu(i) { \
+ arch_spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ *lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; \
+ } \
+ } \
+ EXPORT_SYMBOL(name##_lock_init); \
+ \
+ void name##_local_lock(void) { \
+ arch_spinlock_t *lock; \
+ preempt_disable(); \
+ rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_); \
+ lock = &__get_cpu_var(name##_lock); \
+ arch_spin_lock(lock); \
+ } \
+ EXPORT_SYMBOL(name##_local_lock); \
+ \
+ void name##_local_unlock(void) { \
+ arch_spinlock_t *lock; \
+ rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_); \
+ lock = &__get_cpu_var(name##_lock); \
+ arch_spin_unlock(lock); \
+ preempt_enable(); \
+ } \
+ EXPORT_SYMBOL(name##_local_unlock); \
+ \
+ void name##_local_lock_cpu(int cpu) { \
+ arch_spinlock_t *lock; \
+ preempt_disable(); \
+ rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_); \
+ lock = &per_cpu(name##_lock, cpu); \
+ arch_spin_lock(lock); \
+ } \
+ EXPORT_SYMBOL(name##_local_lock_cpu); \
+ \
+ void name##_local_unlock_cpu(int cpu) { \
+ arch_spinlock_t *lock; \
+ rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_); \
+ lock = &per_cpu(name##_lock, cpu); \
+ arch_spin_unlock(lock); \
+ preempt_enable(); \
+ } \
+ EXPORT_SYMBOL(name##_local_unlock_cpu); \
+ \
+ void name##_global_lock_online(void) { \
+ int i; \
+ preempt_disable(); \
+ rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \
+ for_each_online_cpu(i) { \
+ arch_spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ arch_spin_lock(lock); \
+ } \
+ } \
+ EXPORT_SYMBOL(name##_global_lock_online); \
+ \
+ void name##_global_unlock_online(void) { \
+ int i; \
+ rwlock_release(&name##_lock_dep_map, 1, _RET_IP_); \
+ for_each_online_cpu(i) { \
+ arch_spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ arch_spin_unlock(lock); \
+ } \
+ preempt_enable(); \
+ } \
+ EXPORT_SYMBOL(name##_global_unlock_online); \
+ \
+ void name##_global_lock(void) { \
+ int i; \
+ preempt_disable(); \
+ rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \
+ for_each_online_cpu(i) { \
+ arch_spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ arch_spin_lock(lock); \
+ } \
+ } \
+ EXPORT_SYMBOL(name##_global_lock); \
+ \
+ void name##_global_unlock(void) { \
+ int i; \
+ rwlock_release(&name##_lock_dep_map, 1, _RET_IP_); \
+ for_each_online_cpu(i) { \
+ arch_spinlock_t *lock; \
+ lock = &per_cpu(name##_lock, i); \
+ arch_spin_unlock(lock); \
+ } \
+ preempt_enable(); \
+ } \
+ EXPORT_SYMBOL(name##_global_unlock);
+#endif
Thomas Gleixner
2010-06-24 18:15:54 UTC
Permalink
Post by n***@suse.de
+#define DEFINE_LGLOCK(name) \
+ \
+ DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \
Uuurgh. You want to make that an arch_spinlock ? Just to avoid the
preempt_count overflow when you lock all cpu locks nested ?

I'm really not happy about that, it's going to be a complete nightmare
for RT. If you wanted to make this a present for RT giving the
scalability stuff massive testing, then you failed miserably :)

I know how to fix it, but can't we go for an approach which
does not require massive RT patching again ?

struct percpu_lock {
	spinlock_t lock;
	unsigned global_state;
};

And let the lock function do:

	spin_lock(&pcp->lock);
	while (pcp->global_state)
		cpu_relax();

So the global lock side can take each single lock, modify the percpu
"global state" and release the lock. On unlock you just need to reset
the global state w/o taking the percpu lock and be done.

I doubt that the extra conditional in the lock path is going to be
relevant overhead, compared to the spin_lock it's noise.
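
Filling in the rest of that scheme, roughly (just a sketch; memory ordering
and the RT boosting details left aside):

struct percpu_lock {
	spinlock_t lock;
	unsigned global_state;
};

/* local/read side */
static void pcp_lock(struct percpu_lock *pcp)
{
	spin_lock(&pcp->lock);
	while (pcp->global_state)
		cpu_relax();
}

static void pcp_unlock(struct percpu_lock *pcp)
{
	spin_unlock(&pcp->lock);
}

/* global/write side: take each lock, flag the global state, release */
static void pcp_global_lock(struct percpu_lock __percpu *locks)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct percpu_lock *pcp = per_cpu_ptr(locks, cpu);

		spin_lock(&pcp->lock);
		pcp->global_state = 1;
		spin_unlock(&pcp->lock);
	}
}

/* global unlock: just reset the state, no per-cpu locks needed */
static void pcp_global_unlock(struct percpu_lock __percpu *locks)
{
	int cpu;

	for_each_possible_cpu(cpu)
		per_cpu_ptr(locks, cpu)->global_state = 0;
}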

Thanks,

tglx
Nick Piggin
2010-06-25 06:22:50 UTC
Permalink
Post by Thomas Gleixner
Post by n***@suse.de
+#define DEFINE_LGLOCK(name) \
+ \
+ DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \
Uuurgh. You want to make that an arch_spinlock ? Just to avoid the
preempt_count overflow when you lock all cpu locks nested ?
Yep, and the lockdep wreckage too :)

Actually it's nice to avoid the function call too (lglock/brlocks
are already out of line). Calls aren't totally free, especially
on small chips without RSBs. And even with RSBs it's helpful not
to overflow them, although Nehalem seems to have 12-16 entries.
Post by Thomas Gleixner
I'm really not happy about that, it's going to be a complete nightmare
for RT. If you wanted to make this a present for RT giving the
scalability stuff massive testing, then you failed miserably :)
Heh, it's a present for -rt because the locking is quite isolated
(I did the same thing with hashtable bitlocks, added a new structure
for them, in case you prefer spinlocks than bit spinlocks there).

-rt already changes locking primitives, so in the worst case you
might have to tweak this. My previous patches were open coding
these locks in fs/ so I can understand why that was a headache.
Post by Thomas Gleixner
I know how to fix it, but can't we go for an approach which
does not require massive RT patching again ?
struct percpu_lock {
spinlock_t lock;
unsigned global_state;
};
spin_lock(&pcp->lock);
while (pcp->global_state)
cpu_relax();
So the global lock side can take each single lock, modify the percpu
"global state" and release the lock. On unlock you just need to reset
the global state w/o taking the percpu lock and be done.
I doubt that the extra conditional in the lock path is going to be
relevant overhead, compared to the spin_lock it's noise.
Hmm. We need a smp_rmb() which costs a bit (on non-x86). For -rt you would
need to do priority boosting too.

reader lock:
spin_lock(&pcp->rlock);
if (unlikely(pcp->global_state)) {
spin_unlock(&pcp->rlock);
spin_lock(&wlock);
spin_lock(&pcp->rlock);
spin_unlock(&wlock);
} else
smp_rmb();

I think I'll keep it as is for now, it's hard enough to keep single
threaded performance up. But it should be much easier to override
this in -rt and I'll be happy to try restructuring things to help rt
if and when it's possible.
Thomas Gleixner
2010-06-25 09:50:02 UTC
Permalink
Post by Nick Piggin
Post by Thomas Gleixner
Post by n***@suse.de
+#define DEFINE_LGLOCK(name) \
+ \
+ DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \
Uuurgh. You want to make that an arch_spinlock ? Just to avoid the
preempt_count overflow when you lock all cpu locks nested ?
Yep, and the lockdep wreckage too :)
Actually it's nice to avoid the function call too (lglock/brlocks
are already out of line). Calls aren't totally free, especially
on small chips without RSBs. And even with RSBs it's helpful not
to overflow them, although Nehalem seems to have 12-16 entries.
Post by Thomas Gleixner
I'm really not happy about that, it's going to be a complete nightmare
for RT. If you wanted to make this a present for RT giving the
scalability stuff massive testing, then you failed miserably :)
Heh, it's a present for -rt because the locking is quite isolated
(I did the same thing with hashtable bitlocks, added a new structure
for them, in case you prefer spinlocks to bit spinlocks there).
Sure, bitlocks are equally horrible.
Post by Nick Piggin
-rt already changes locking primitives, so in the worst case you
might have to tweak this. My previous patches were open coding
these locks in fs/ so I can understand why that was a headache.
I agree that having the code isolated makes my life easier, but I'm a
bit worried about the various new locking primitives which pop up in
all corners of the kernel.
Post by Nick Piggin
I think I'll keep it as is for now, it's hard enough to keep single
threaded performance up. But it should be much easier to override
this in -rt and I'll be happy to try restructuring things to help rt
if and when it's possible.
Ok, let's see how bad it gets :)

Thanks,

tglx
Nick Piggin
2010-06-25 10:11:08 UTC
Permalink
Post by Thomas Gleixner
Post by Nick Piggin
Post by Thomas Gleixner
Post by n***@suse.de
+#define DEFINE_LGLOCK(name) \
+ \
+ DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \
Uuurgh. You want to make that an arch_spinlock ? Just to avoid the
preempt_count overflow when you lock all cpu locks nested ?
Yep, and the lockdep wreckage too :)
Actually it's nice to avoid the function call too (lglock/brlocks
are already out of line). Calls aren't totally free, especially
on small chips without RSBs. And even with RSBs it's helpful not
to overflow them, although Nehalem seems to have 12-16 entries.
Post by Thomas Gleixner
I'm really not happy about that, it's going to be a complete nightmare
for RT. If you wanted to make this a present for RT giving the
scalability stuff massive testing, then you failed miserably :)
Heh, it's a present for -rt because the locking is quite isolated
(I did the same thing with hashtable bitlocks, added a new structure
for them, in case you prefer spinlocks to bit spinlocks there).
Sure, bitlocks are equally horrible.
Post by Nick Piggin
-rt already changes locking primitives, so in the worst case you
might have to tweak this. My previous patches were open coding
these locks in fs/ so I can understand why that was a headache.
I agree that having the code isolated makes my life easier, but I'm a
bit worried about the various new locking primitives which pop up in
all corners of the kernel.
Be sure to shout at people for it :). Locking primitives really need
to be reviewed, and having them in common places lets other people
use them too.
Post by Thomas Gleixner
Post by Nick Piggin
I think I'll keep it as is for now, it's hard enough to keep single
threaded performance up. But it should be much easier to override
this in -rt and I'll be happy to try restructuring things to help rt
if and when it's possible.
Ok, let's see how bad it gets :)
Ok good.

n***@suse.de
2010-06-24 03:02:16 UTC
Permalink
Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.

Cc: linux-***@vger.kernel.org
Cc: linux-***@vger.kernel.org
Cc: Al Viro <***@ZenIV.linux.org.uk>
Cc: Frank Mayhar <***@google.com>,
Cc: John Stultz <***@us.ibm.com>
Cc: Andi Kleen <***@linux.intel.com>,
Cc: Alan Cox <***@lxorguk.ukuu.org.uk>,
Cc: "Eric W. Biederman" <***@xmission.com>,
Acked-by: Andi Kleen <***@linux.intel.com>
Acked-by: Greg Kroah-Hartman <***@suse.de>
Signed-off-by: Nick Piggin <***@suse.de>
---
drivers/char/pty.c | 6 +++++-
drivers/char/tty_io.c | 26 ++++++++++++++++++--------
fs/file_table.c | 42 ++++++++++++++++++------------------------
fs/open.c | 4 ++--
include/linux/fs.h | 7 ++-----
include/linux/tty.h | 1 +
security/selinux/hooks.c | 4 ++--
7 files changed, 48 insertions(+), 42 deletions(-)

Index: linux-2.6/security/selinux/hooks.c
===================================================================
--- linux-2.6.orig/security/selinux/hooks.c
+++ linux-2.6/security/selinux/hooks.c
@@ -2219,7 +2219,7 @@ static inline void flush_unauthorized_fi

tty = get_current_tty();
if (tty) {
- file_list_lock();
+ spin_lock(&tty_files_lock);
if (!list_empty(&tty->tty_files)) {
struct inode *inode;

@@ -2235,7 +2235,7 @@ static inline void flush_unauthorized_fi
drop_tty = 1;
}
}
- file_list_unlock();
+ spin_unlock(&tty_files_lock);
tty_kref_put(tty);
}
/* Reset controlling tty. */
Index: linux-2.6/drivers/char/pty.c
===================================================================
--- linux-2.6.orig/drivers/char/pty.c
+++ linux-2.6/drivers/char/pty.c
@@ -650,7 +650,11 @@ static int __ptmx_open(struct inode *ino

set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
filp->private_data = tty;
- file_move(filp, &tty->tty_files);
+
+ file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
+ spin_lock(&tty_files_lock);
+ list_add(&filp->f_u.fu_list, &tty->tty_files);
+ spin_unlock(&tty_files_lock);

retval = devpts_pty_new(inode, tty->link);
if (retval)
Index: linux-2.6/drivers/char/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/char/tty_io.c
+++ linux-2.6/drivers/char/tty_io.c
@@ -136,6 +136,9 @@ LIST_HEAD(tty_drivers); /* linked list
DEFINE_MUTEX(tty_mutex);
EXPORT_SYMBOL(tty_mutex);

+/* Spinlock to protect the tty->tty_files list */
+DEFINE_SPINLOCK(tty_files_lock);
+
static ssize_t tty_read(struct file *, char __user *, size_t, loff_t *);
static ssize_t tty_write(struct file *, const char __user *, size_t, loff_t *);
ssize_t redirected_tty_write(struct file *, const char __user *,
@@ -234,11 +237,11 @@ static int check_tty_count(struct tty_st
struct list_head *p;
int count = 0;

- file_list_lock();
+ spin_lock(&tty_files_lock);
list_for_each(p, &tty->tty_files) {
count++;
}
- file_list_unlock();
+ spin_unlock(&tty_files_lock);
if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
tty->driver->subtype == PTY_TYPE_SLAVE &&
tty->link && tty->link->count)
@@ -517,7 +520,7 @@ static void do_tty_hangup(struct work_st
lock_kernel();
check_tty_count(tty, "do_tty_hangup");

- file_list_lock();
+ spin_lock(&tty_files_lock);
/* This breaks for file handles being sent over AF_UNIX sockets ? */
list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
if (filp->f_op->write == redirected_tty_write)
@@ -528,7 +531,7 @@ static void do_tty_hangup(struct work_st
tty_fasync(-1, filp, 0); /* can't block */
filp->f_op = &hung_up_tty_fops;
}
- file_list_unlock();
+ spin_unlock(&tty_files_lock);

tty_ldisc_hangup(tty);

@@ -1419,9 +1422,9 @@ static void release_one_tty(struct work_
tty_driver_kref_put(driver);
module_put(driver->owner);

- file_list_lock();
+ spin_lock(&tty_files_lock);
list_del_init(&tty->tty_files);
- file_list_unlock();
+ spin_unlock(&tty_files_lock);

put_pid(tty->pgrp);
put_pid(tty->session);
@@ -1666,7 +1669,10 @@ int tty_release(struct inode *inode, str
* - do_tty_hangup no longer sees this file descriptor as
* something that needs to be handled for hangups.
*/
- file_kill(filp);
+ spin_lock(&tty_files_lock);
+ BUG_ON(list_empty(&filp->f_u.fu_list));
+ list_del_init(&filp->f_u.fu_list);
+ spin_unlock(&tty_files_lock);
filp->private_data = NULL;

/*
@@ -1835,7 +1841,11 @@ got_driver:
}

filp->private_data = tty;
- file_move(filp, &tty->tty_files);
+ BUG_ON(list_empty(&filp->f_u.fu_list));
+ file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
+ spin_lock(&tty_files_lock);
+ list_add(&filp->f_u.fu_list, &tty->tty_files);
+ spin_unlock(&tty_files_lock);
check_tty_count(tty, "tty_open");
if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
tty->driver->subtype == PTY_TYPE_MASTER)
Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -32,8 +32,7 @@ struct files_stat_struct files_stat = {
.max_files = NR_FILE
};

-/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);

/* SLAB cache for file structures */
static struct kmem_cache *filp_cachep __read_mostly;
@@ -249,7 +248,7 @@ static void __fput(struct file *file)
cdev_put(inode->i_cdev);
fops_put(file->f_op);
put_pid(file->f_owner.pid);
- file_kill(file);
+ file_sb_list_del(file);
if (file->f_mode & FMODE_WRITE)
drop_file_write_access(file);
file->f_path.dentry = NULL;
@@ -319,31 +318,29 @@ struct file *fget_light(unsigned int fd,
return file;
}

-
void put_filp(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count)) {
security_file_free(file);
- file_kill(file);
+ file_sb_list_del(file);
file_free(file);
}
}

-void file_move(struct file *file, struct list_head *list)
+void file_sb_list_add(struct file *file, struct super_block *sb)
{
- if (!list)
- return;
- file_list_lock();
- list_move(&file->f_u.fu_list, list);
- file_list_unlock();
+ spin_lock(&files_lock);
+ BUG_ON(!list_empty(&file->f_u.fu_list));
+ list_add(&file->f_u.fu_list, &sb->s_files);
+ spin_unlock(&files_lock);
}

-void file_kill(struct file *file)
+void file_sb_list_del(struct file *file)
{
if (!list_empty(&file->f_u.fu_list)) {
- file_list_lock();
+ spin_lock(&files_lock);
list_del_init(&file->f_u.fu_list);
- file_list_unlock();
+ spin_unlock(&files_lock);
}
}

@@ -352,7 +349,7 @@ int fs_may_remount_ro(struct super_block
struct file *file;

/* Check that no files are currently opened for writing. */
- file_list_lock();
+ spin_lock(&files_lock);
list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
struct inode *inode = file->f_path.dentry->d_inode;

@@ -364,10 +361,10 @@ int fs_may_remount_ro(struct super_block
if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
goto too_bad;
}
- file_list_unlock();
+ spin_unlock(&files_lock);
return 1; /* Tis' cool bro. */
too_bad:
- file_list_unlock();
+ spin_unlock(&files_lock);
return 0;
}

@@ -383,7 +380,7 @@ void mark_files_ro(struct super_block *s
struct file *f;

retry:
- file_list_lock();
+ spin_lock(&files_lock);
list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
struct vfsmount *mnt;
if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
@@ -399,16 +396,13 @@ retry:
continue;
file_release_write(f);
mnt = mntget(f->f_path.mnt);
- file_list_unlock();
- /*
- * This can sleep, so we can't hold
- * the file_list_lock() spinlock.
- */
+ /* This can sleep, so we can't hold the spinlock. */
+ spin_unlock(&files_lock);
mnt_drop_write(mnt);
mntput(mnt);
goto retry;
}
- file_list_unlock();
+ spin_unlock(&files_lock);
}

void __init files_init(unsigned long mempages)
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -675,7 +675,7 @@ static struct file *__dentry_open(struct
f->f_path.mnt = mnt;
f->f_pos = 0;
f->f_op = fops_get(inode->i_fop);
- file_move(f, &inode->i_sb->s_files);
+ file_sb_list_add(f, inode->i_sb);

error = security_dentry_open(f, cred);
if (error)
@@ -721,7 +721,7 @@ cleanup_all:
mnt_drop_write(mnt);
}
}
- file_kill(f);
+ file_sb_list_del(f);
f->f_path.dentry = NULL;
f->f_path.mnt = NULL;
cleanup_file:
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -949,9 +949,6 @@ struct file {
unsigned long f_mnt_write_state;
#endif
};
-extern spinlock_t files_lock;
-#define file_list_lock() spin_lock(&files_lock);
-#define file_list_unlock() spin_unlock(&files_lock);

#define get_file(x) atomic_long_inc(&(x)->f_count)
#define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1)
@@ -2182,8 +2179,8 @@ static inline void insert_inode_hash(str
__insert_inode_hash(inode, inode->i_ino);
}

-extern void file_move(struct file *f, struct list_head *list);
-extern void file_kill(struct file *f);
+extern void file_sb_list_add(struct file *f, struct super_block *sb);
+extern void file_sb_list_del(struct file *f);
#ifdef CONFIG_BLOCK
struct bio;
extern void submit_bio(int, struct bio *);
Index: linux-2.6/include/linux/tty.h
===================================================================
--- linux-2.6.orig/include/linux/tty.h
+++ linux-2.6/include/linux/tty.h
@@ -467,6 +467,7 @@ extern struct tty_struct *tty_pair_get_t
extern struct tty_struct *tty_pair_get_pty(struct tty_struct *tty);

extern struct mutex tty_mutex;
+extern spinlock_t tty_files_lock;

extern void tty_write_unlock(struct tty_struct *tty);
extern int tty_write_lock(struct tty_struct *tty, int ndelay);
n***@suse.de
2010-06-24 03:02:18 UTC
Permalink
Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on. Scalability
could suffer if files are frequently removed from a different CPU's list.

However, loads with frequent removal of files imply a short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files that are subsequently freed by N
CPUs degenerates to contending on a single lock, which is no worse than before.
When more than one CPU is allocating files, even if they are always freed by
different CPUs, there will be more parallelism than in the single-lock case.


Testing results:

On a 2-socket, 8-core Opteron, I measured the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting: locks= 25049 cpu-hits= 23174 (92.5%) node-hits= 23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64 locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.


Tim Chen ran some numbers on a 64-thread Nehalem system performing a compile.

throughput
2.6.34-rc2 24.5
+patch 24.9

us sys idle IO wait (in %)
2.6.34-rc2 51.25 28.25 17.25 3.25
+patch 53.75 18.5 19 8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.


Single-threaded performance difference was within the noise of microbenchmarks.
That is not to say one does not exist; the code is larger and more memory
accesses are required, so it will be slightly slower.

Cc: linux-***@vger.kernel.org
Cc: linux-***@vger.kernel.org
Cc: Al Viro <***@ZenIV.linux.org.uk>
Cc: Frank Mayhar <***@google.com>,
Cc: John Stultz <***@us.ibm.com>
Cc: "Eric W. Biederman" <***@xmission.com>,
Cc: Tim Chen <***@linux.intel.com>
Cc: Andi Kleen <***@linux.intel.com>
Signed-off-by: Nick Piggin <***@suse.de>
---
fs/file_table.c | 108 ++++++++++++++++++++++++++++++++++++++++++++---------
fs/super.c | 18 ++++++++
include/linux/fs.h | 7 +++
3 files changed, 115 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -20,7 +20,9 @@
#include <linux/cdev.h>
#include <linux/fsnotify.h>
#include <linux/sysctl.h>
+#include <linux/lglock.h>
#include <linux/percpu_counter.h>
+#include <linux/percpu.h>
#include <linux/ima.h>

#include <asm/atomic.h>
@@ -32,7 +34,8 @@ struct files_stat_struct files_stat = {
.max_files = NR_FILE
};

-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+DECLARE_LGLOCK(files_lglock);
+DEFINE_LGLOCK(files_lglock);

/* SLAB cache for file structures */
static struct kmem_cache *filp_cachep __read_mostly;
@@ -327,30 +330,98 @@ void put_filp(struct file *file)
}
}

+static inline int file_list_cpu(struct file *file)
+{
+#ifdef CONFIG_SMP
+ return file->f_sb_list_cpu;
+#else
+ return smp_processor_id();
+#endif
+}
+
+/* helper for file_sb_list_add to reduce ifdefs */
+static inline void __file_sb_list_add(struct file *file, struct super_block *sb)
+{
+ struct list_head *list;
+#ifdef CONFIG_SMP
+ int cpu;
+ cpu = smp_processor_id();
+ file->f_sb_list_cpu = cpu;
+ list = per_cpu_ptr(sb->s_files, cpu);
+#else
+ list = &sb->s_files;
+#endif
+ list_add(&file->f_u.fu_list, list);
+}
+
+/**
+ * file_sb_list_add - add a file to the sb's file list
+ * @file: file to add
+ * @sb: sb to add it to
+ *
+ * Use this function to associate a file with the superblock of the inode it
+ * refers to.
+ */
void file_sb_list_add(struct file *file, struct super_block *sb)
{
- spin_lock(&files_lock);
- BUG_ON(!list_empty(&file->f_u.fu_list));
- list_add(&file->f_u.fu_list, &sb->s_files);
- spin_unlock(&files_lock);
+ lg_local_lock(files_lglock);
+ __file_sb_list_add(file, sb);
+ lg_local_unlock(files_lglock);
}

+/**
+ * file_sb_list_del - remove a file from the sb's file list
+ * @file: file to remove
+ * @sb: sb to remove it from
+ *
+ * Use this function to remove a file from its superblock.
+ */
void file_sb_list_del(struct file *file)
{
if (!list_empty(&file->f_u.fu_list)) {
- spin_lock(&files_lock);
+ lg_local_lock_cpu(files_lglock, file_list_cpu(file));
list_del_init(&file->f_u.fu_list);
- spin_unlock(&files_lock);
+ lg_local_unlock_cpu(files_lglock, file_list_cpu(file));
}
}

+#ifdef CONFIG_SMP
+
+/*
+ * These macros iterate all files on all CPUs for a given superblock.
+ * files_lglock must be held globally.
+ */
+#define do_file_list_for_each_entry(__sb, __file) \
+{ \
+ int i; \
+ for_each_possible_cpu(i) { \
+ struct list_head *list; \
+ list = per_cpu_ptr((__sb)->s_files, i); \
+ list_for_each_entry((__file), list, f_u.fu_list)
+
+#define while_file_list_for_each_entry \
+ } \
+}
+
+#else
+
+#define do_file_list_for_each_entry(__sb, __file) \
+{ \
+ struct list_head *list; \
+ list = &(__sb)->s_files; \
+ list_for_each_entry((__file), list, f_u.fu_list)
+
+#define while_file_list_for_each_entry \
+}
+
+#endif
+
int fs_may_remount_ro(struct super_block *sb)
{
struct file *file;
-
/* Check that no files are currently opened for writing. */
- spin_lock(&files_lock);
- list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
+ lg_global_lock(files_lglock);
+ do_file_list_for_each_entry(sb, file) {
struct inode *inode = file->f_path.dentry->d_inode;

/* File with pending delete? */
@@ -360,11 +431,11 @@ int fs_may_remount_ro(struct super_block
/* Writeable file? */
if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
goto too_bad;
- }
- spin_unlock(&files_lock);
+ } while_file_list_for_each_entry;
+ lg_global_unlock(files_lglock);
return 1; /* Tis' cool bro. */
too_bad:
- spin_unlock(&files_lock);
+ lg_global_unlock(files_lglock);
return 0;
}

@@ -380,8 +451,8 @@ void mark_files_ro(struct super_block *s
struct file *f;

retry:
- spin_lock(&files_lock);
- list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
+ lg_global_lock(files_lglock);
+ do_file_list_for_each_entry(sb, f) {
struct vfsmount *mnt;
if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
continue;
@@ -397,12 +468,12 @@ retry:
file_release_write(f);
mnt = mntget(f->f_path.mnt);
/* This can sleep, so we can't hold the spinlock. */
- spin_unlock(&files_lock);
+ lg_global_unlock(files_lglock);
mnt_drop_write(mnt);
mntput(mnt);
goto retry;
- }
- spin_unlock(&files_lock);
+ } while_file_list_for_each_entry;
+ lg_global_unlock(files_lglock);
}

void __init files_init(unsigned long mempages)
@@ -422,5 +493,6 @@ void __init files_init(unsigned long mem
if (files_stat.max_files < NR_FILE)
files_stat.max_files = NR_FILE;
files_defer_init();
+ lg_lock_init(files_lglock);
percpu_counter_init(&nr_files, 0);
}
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -54,7 +54,22 @@ static struct super_block *alloc_super(s
s = NULL;
goto out;
}
+#ifdef CONFIG_SMP
+ s->s_files = alloc_percpu(struct list_head);
+ if (!s->s_files) {
+ security_sb_free(s);
+ kfree(s);
+ s = NULL;
+ goto out;
+ } else {
+ int i;
+
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
+ }
+#else
INIT_LIST_HEAD(&s->s_files);
+#endif
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
@@ -108,6 +123,9 @@ out:
*/
static inline void destroy_super(struct super_block *s)
{
+#ifdef CONFIG_SMP
+ free_percpu(s->s_files);
+#endif
security_sb_free(s);
kfree(s->s_subtype);
kfree(s->s_options);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -925,6 +925,9 @@ struct file {
#define f_vfsmnt f_path.mnt
const struct file_operations *f_op;
spinlock_t f_lock; /* f_ep_links, f_flags, no IRQ */
+#ifdef CONFIG_SMP
+ int f_sb_list_cpu;
+#endif
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
@@ -1339,7 +1342,11 @@ struct super_block {

struct list_head s_inodes; /* all inodes */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
+#ifdef CONFIG_SMP
+ struct list_head __percpu *s_files;
+#else
struct list_head s_files;
+#endif
/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */


Peter Zijlstra
2010-06-24 07:52:17 UTC
Permalink
Post by n***@suse.de
One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on. Scalability
could suffer if files are frequently removed from a different CPU's list.
Is this really a lot less complex than what I did with my fine-grained
locked list?


Nick Piggin
2010-06-24 15:00:23 UTC
Permalink
Post by Peter Zijlstra
Post by n***@suse.de
One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on. Scalability
could suffer if files are frequently removed from a different CPU's list.
Is this really a lot less complex than what I did with my fine-grained
locked list?
http://www.mail-archive.com/linux-***@vger.kernel.org/msg115071.html

Honestly the filevec code seemed overkill to me, and yes it was a bit
complex. The only reason to consider it AFAIKS would be if the space
overhead of the per-cpu structures, or the slowpath cost of the brlock
was unbearable.

filevecs probably don't perform as well in the fastpath. My patch doesn't
add any atomics. The cost of adding or removing a file from its list is
one atomic for the spinlock.

The cost of adding a file with filevecs is a spinlock to put it on the
vec, a spinlock to take it off the vec, a spinlock to put it on the
lock-list. 3 atomics. A heap more icache and branches.

Removing a file with filevecs is a spinlock to check the vec, and 1 or 2
spinlocks to take it off the list (common case).

Scalability will be improved, but it will still hit the global list
1/15th of the time (and there is even no lock batching on the list but I
assume that could be fixed). Compared with never for my patch (unless
there is a cross-CPU removal, in which case they both need to hit a
remote-CPU cacheline).

But before we even get to scalability, I think filevecs already lose from a
complexity and single-threaded performance point of view.

Christoph Hellwig
2010-06-25 07:12:21 UTC
Permalink
If you actually want to get this work in, reposting a huge patchkit again
and again probably doesn't help. Start to prioritize areas and work on
small sets to get them ready.

files_lock and vfsmount_lock seem like rather easy targets to start
with. But for files_lock I really want to see something to generalize
the tty special case. If you touch that area in detail that wart needs
to go. Al didn't seem to like my variant very much, so he might have
a better idea for it - otherwise it really makes the VFS locking simple
by removing any tty interaction with the superblock files list. The
other suggestion would be to only add regular (maybe even just
writeable) files to the list. In addition to reducing the number of
list operations required it will also make the tty code a lot easier.
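
A hypothetical sketch of that second suggestion, based on the plain-spinlock
file_sb_list_add() from the earlier patch (not part of the posted series):
fs_may_remount_ro() and mark_files_ro() only look at regular files opened for
write, and file_sb_list_del() already tolerates files that were never added,
so the filter can live entirely in the add path:

/*
 * Hypothetical variant, not from the posted series: only track the
 * files the remount-ro paths actually care about.
 */
void file_sb_list_add(struct file *file, struct super_block *sb)
{
        struct inode *inode = file->f_path.dentry->d_inode;

        /*
         * fs_may_remount_ro() and mark_files_ro() only inspect regular
         * files opened for write; nothing else needs to be on the list.
         * Dropping the FMODE_WRITE test gives the "regular files only"
         * variant.
         */
        if (!S_ISREG(inode->i_mode) || !(file->f_mode & FMODE_WRITE))
                return;

        spin_lock(&files_lock);
        BUG_ON(!list_empty(&file->f_u.fu_list));
        list_add(&file->f_u.fu_list, &sb->s_files);
        spin_unlock(&files_lock);
}

file_sb_list_del() would need no change since it already skips files whose
fu_list is empty, and tty files (character devices) would simply never appear
on the sb list, so the tty code would not have to take them off it at all.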

As for the other patches: I don't think the massive fine-grained
locking in the hash tables is a good idea. I would recommend to defer
them for now, and then look into better data structures for these caches
instead of working around the inherent problems of global hash tables.
Nick Piggin
2010-06-25 08:05:00 UTC
Permalink
Post by Christoph Hellwig
If you actually want to get this work in, reposting a huge patchkit again
and again probably doesn't help. Start to prioritize areas and work on
small sets to get them ready.
Sure, I haven't been posting the same thing (haven't posted it for a
long time). This simply had a lot of new stuff and improvements to all
existing patches.

I didn't cc anyone in particular because it's only for interested
people to take a look at. As you saw last time when I cc'ed Al, I was
just trying to get exactly those easier targets merged.
Post by Christoph Hellwig
files_lock and vfsmount_lock seem like rather easy targets to start
with. But for files_lock I really want to see something to generalize
the tty special case. If you touch that area in detail that wart needs
to go. Al didn't seem to like my variant very much, so he might have
a better idea for it - otherwise it really makes the VFS locking simple
by removing any tty interaction with the superblock files list.
Actually I didn't like it because the error handling in the tty code
was broken and difficult to fix properly. The concept was OK though.

But the fact is that today the tty code already "knows" that the vfs doesn't
need its files on the superblock list, and so it may take them off and use
that list_head privately. Currently it is also using files_lock to protect
that private usage. These are two independent problems. My patch fixes
the second, and anything that fixes the first also needs to fix the
second in exactly the same way.
Post by Christoph Hellwig
The
other suggestion would be to only add regular (maybe even just
writeable) files to the list. In addition to reducing the number of
list operations required it will also make the tty code a lot easier.
This was my suggestion, yes. Either way is conceptually the same, this
one just avoids the memory allocation and error handling problems that
yours had.

But again, a locking change is still required and it would look exactly
the same as my patch really.
Post by Christoph Hellwig
As for the other patches: I don't think the massive fine-grained
locking in the hash tables is a good idea. I would recommend to defer
them for now, and then look into better data structures for these caches
instead of working around the inherent problems of global hash tables.
I don't agree, actually. I don't think there is any downside to
fine-grained locking of the hash with bit spinlocks. Until I see one, I
will keep them.

I agree that some other data structure may be better, but it should be
compared with the best possible hash implementation, which is a scalable
hash like this one.

Also, our big impending performance problem is SMP scalability, not hash
lookup, AFAIKS.