[RFC] lustre treatment of dentry->d

Post by Al Viro
Similar to that, where is ll_d_mountpoint() ever called with NULL dchild,
why do you have
if (unlikely(dchild))
in there when it's true on every call and why does it exist in the first
place? All its callers are reachable only from vfs_{unlink,rmdir,rename}
and we *do* d_mountpoint() checks there.

Could somebody explain what this is for?
/* Try to find the child dentry by its name.
If found, put the result fid into @fid. */
static void ll_get_child_fid(struct inode * dir, struct qstr *name,
struct lu_fid *fid)
{
struct dentry *parent, *child;

parent = ll_d_hlist_entry(dir->i_dentry, struct dentry, d_alias);
child = d_lookup(parent, name);
if (child) {
if (child->d_inode)
*fid = *ll_inode2fid(child->d_inode);
dput(child);
}
}

The funny thing being, it's always called from ll_rmdir(), ll_unlink()
or ll_rename(), with name being equal to &dentry->d_name and dir -
dentry->d_parent->d_inode. IOW, that child is already known to caller.
What the hell?

Obvious jokes about exotic adenoidectomy technics aside, what's the story
with that file? Is that just a trimmed down dual-use code that is sometimes
called by VFS and sometimes lives in userland and called by hell knows what?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Drokin, Oleg

2014-10-21 03:55:58 UTC

Similar to that, where is ll_d_mountpoint() ever called with NULL dc=

hild,

why do you have
if (unlikely(dchild))
in there when it's true on every call and why does it exist in the f=

irst

place? All its callers are reachable only from vfs_{unlink,rmdir,re=

name}

and we *do* d_mountpoint() checks there.

=20
Could somebody explain what this is for?
/* Try to find the child dentry by its name.

Hm=85 This is really unused.
It's also in the patch queue for syncing.
Something like this, if you are interested: http://review.whamcloud.com=
/6648

Post by Al Viro
Obvious jokes about exotic adenoidectomy technics aside, what's the s=

tory

Post by Al Viro
with that file? Is that just a trimmed down dual-use code that is so=

metimes

Post by Al Viro
called by VFS and sometimes lives in userland and called by hell know=

s what?

Those are all remains of that old intent-code of long ago, I believe.

Bye,
Oleg--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Drokin, Oleg

2014-10-21 03:46:02 UTC

Hello!

Post by Al Viro
a) what protects ->d_name in ll_intent_file_open()? It copies
->d_name.name and ->d_name.len into local variables and proceeds to
use those; what's to guarantee that dentry won't get hit with d_move(=

)

Post by Al Viro
halfway through that? None of the locks that would give an exclusion
against d_move() appear to be held=85

You are right. We hit something very similar not too long ago.

Post by Al Viro
=20
b) what stabilizes *dentryp->d_name in do_statahead_enter()?
=20
c) what stabilizes fdentry->d_parent->d_name in llog_lvfs_destroy()?

I don't think there could be possible d_move in llog path. Those files
are never visible in the proper namespace, are not exposed to rename
and other user-influenced actions so should be safe.

Post by Al Viro
Unless I'm missing something subtle, all three can race with d_move()=

Post by Al Viro
with obvious unpleasant results. The next bunch doesn't (the callers
are holding ->i_mutex on parents), but it's also bloody odd - why are=

Post by Al Viro
playing these games in llite/namei.c?

This is a throwback from the past where we patched "raw" rename methods=
into
VFS, intent-like (I am including a big explanation below, that you can =
skip
if you are not into Luste archeology. TL;DR: patches are in the queue t=
o get rid of that).
So basically we had parent inode and the child name we wanted
operation to be performed on. All of that was shipped directly to the s=
erver
and server returned the result, without any intermediate lookup RPCs th=
at would
have happened otherwise, and a significant erformance boost ensued (at =
certain
cost of course, like no local SELinux enforcement and such).
After all wedecided to switch to "patchless clients" after open intents=
appeared
in 2.6.15, and had a pairof possible methofds here: ll_XXX_raw for the =
old-style
intent call and ll_XXX are the normal vfs-style calls that both in the =
end
did some processign and called ll_XXX_generic that talked to the server=
=2E

Since we don't need any of this for some time, we are getting rid of th=
ose things
and I am pretty sure the patches for removal of this extra wrapping is =
in the
queue of stuff to send.

Post by Al Viro
While we are at it, where is ll_new_inode() ever called with NULL dch=

ild?

This is the deprecated "intent" code path explained above.

Post by Al Viro
Similar to that, where is ll_d_mountpoint() ever called with NULL dch=

ild,

Post by Al Viro
why do you have
if (unlikely(dchild))

Same as above.
All of that would go away once we get rid of the wrapper and unroll the
vfs ops.

Hopefully this makes at least some sense to you.

Bye,
Oleg--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Al Viro

2014-10-21 04:02:10 UTC

e()

Post by Al Viro
halfway through that? None of the locks that would give an exclusi=

Post by Al Viro
against d_move() appear to be held=E2=80=A6

=20
You are right. We hit something very similar not too long ago.

Post by Al Viro
=20
b) what stabilizes *dentryp->d_name in do_statahead_enter()?

Same as above.
All of that would go away once we get rid of the wrapper and unroll t=

Post by Drokin, Oleg
vfs ops.
=20
Hopefully this makes at least some sense to you.

OK... I'd gone and ripped some of that stuff out today; see if what's =
in
vfs.git#for-oleg makes sense...

Another question: what's wrong with d_splice_alias() or d_materialise_u=
nique()?
I.e. why do we need ll_splice_alias()? I have patches in local queue
(soon to show up in for-next) that merge d_splice_alias() and
d_materialise_unique(), essentially teaching the former to deal with on=
e
case d_materialise_unique() can handle while d_splice_alias() couldn't.
If you need something not covered by those, it would be interesting to
find out if it would make sense to fold _that_ into d_splice_alias() as=
well...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Al Viro

2014-10-21 19:30:33 UTC

Another question: what's wrong with d_splice_alias() or d_materialise_unique()?
I.e. why do we need ll_splice_alias()? I have patches in local queue
(soon to show up in for-next) that merge d_splice_alias() and
d_materialise_unique(), essentially teaching the former to deal with one
case d_materialise_unique() can handle while d_splice_alias() couldn't.
If you need something not covered by those, it would be interesting to
find out if it would make sense to fold _that_ into d_splice_alias() as well...

Next one: is there any codepath that could lead to ll_md_blocking_ast()
before we get ->s_root assigned? IOW, what's
inode->i_sb->s_root != NULL &&
in there about? Pure paranoia or something more serious? Because from the
look of the call chains leading to that, having them hit before the superblock
has been set up seems to be risky...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Drokin, Oleg

2014-10-22 01:49:02 UTC

I bet it's pure paranoia.
I tracked this bit of code all the way back into 2003 in this unchanged form.--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Drokin, Oleg

2014-10-21 13:34:57 UTC

Hello!

=20

e()

Post by Al Viro
halfway through that? None of the locks that would give an exclusi=

Post by Al Viro
against d_move() appear to be held=85

=20
You are right. We hit something very similar not too long ago.

Post by Al Viro
=20
b) what stabilizes *dentryp->d_name in do_statahead_enter()?

=20

Post by Drokin, Oleg
Same as above.
All of that would go away once we get rid of the wrapper and unroll =

the

Post by Drokin, Oleg
vfs ops.
=20
Hopefully this makes at least some sense to you.

=20
OK... I'd gone and ripped some of that stuff out today; see if what'=

s in

vfs.git#for-oleg makes sense=85

I think it makes sense indeed.
The "get rid of duplicate mountpoint checks" patch has a hunk at ll_lin=
k
that looks like it belongs to "kill ll_link_generic()" patch,
otherwise the intermediate result is somewhat strange.

Another question: what's wrong with d_splice_alias() or d_materialise=

_unique()?

I.e. why do we need ll_splice_alias()? I have patches in local queue

We used to need it to also find and reuse invalid aliases, so that the =
dentry
cache does not get out of control with constant dentry invalidations we=
have.

I checked the code and now that d_splice_alias is calling __d_find_any_=
alias
that seems to just pick the first alias off the list, we should be fine
with just using generic d_splice_alias.
Just to be sure I guess I need to try a testrun, but it's a bit less co=
nvenient
now as I am away from my usual test nodes on a pretty awful internet co=
nnection
too.
John, I wonder if you can give this idea a try soonish?

Bye,
Oleg--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Al Viro

2014-10-21 21:17:10 UTC

Post by Drokin, Oleg
I think it makes sense indeed.
The "get rid of duplicate mountpoint checks" patch has a hunk at ll_link
that looks like it belongs to "kill ll_link_generic()" patch,
otherwise the intermediate result is somewhat strange.

Thanks; fixed and force-pushed.

Another question: what's wrong with d_splice_alias() or d_materialise_unique()?
I.e. why do we need ll_splice_alias()? I have patches in local queue

We used to need it to also find and reuse invalid aliases, so that the dentry
cache does not get out of control with constant dentry invalidations we have.
I checked the code and now that d_splice_alias is calling __d_find_any_alias
that seems to just pick the first alias off the list, we should be fine

ITYM "the one and only". That's pretty much what d_splice_alias() is about -
we do not allow more than one dentry (be it hashed or not) for a directory
inode, so d_add() variant that can run into directory inode already with a
dentry _must_ use that existing dentry instead of what it's been given.
That's why we need it to be able to move a preexisting dentry in place of
the one it's been given and return a new reference to that preexisting
dentry. For non-directories and for directories that have no aliases it's
just d_add() and return NULL.

Post by Drokin, Oleg
with just using generic d_splice_alias.

BTW, mind if we kill ll_umode_t? This kind of compat wrappers is the wrong
way - it should be "supply an equivalent of more recent kernel stuff when
builds on older kernels".
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Drokin, Oleg

2014-10-22 01:48:48 UTC

Another question: what's wrong with d_splice_alias() or d_materialise_unique()?
I.e. why do we need ll_splice_alias()? I have patches in local queue

Ah! I see now.
We do the "pick alias from a list" for everything, not just for the directories.
So that difference is still missing and as such there is a fear if we
convert to d_splice_alias in its current form, dcache would explode
from all the gazillions of invalid and unhashed dentries we produce during
operations.

Post by Drokin, Oleg
with just using generic d_splice_alias.

BTW, mind if we kill ll_umode_t? This kind of compat wrappers is the wrong
way - it should be "supply an equivalent of more recent kernel stuff when
builds on older kernels".

Right.
I have no problems with killing it since such a compat define is not needed
in current kernel anyway.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Al Viro

2014-10-22 02:50:26 UTC

Post by Drokin, Oleg
Ah! I see now.
We do the "pick alias from a list" for everything, not just for the directories.
So that difference is still missing and as such there is a fear if we
convert to d_splice_alias in its current form, dcache would explode
from all the gazillions of invalid and unhashed dentries we produce during
operations.

Interesting... Where do those gazillions come from? AFAICS, your
->d_revalidate() does that only when you get LOOKUP_OPEN | LOOKUP_CREATE
in flags, i.e. on the final component of pathname at open() with O_CREAT.
Hmm... So basically you are trying to force them into ->atomic_open()
codepath? Fine, but... why not simply have ->open() pick what hadn't
gone through ->atomic_open()?

Check what nfs4_file_open() is doing; if you find out that the damn thing
*was* stale (i.e. that you really need a different inode, etc.), it's not
a problem - d_drop() and return -EOPENSTALE; VFS will repeat lookups.
Note that do_dentry_open() doesn't strip O_CREAT from file->f_flags until
after return from ->open(). So you can see O_CREAT in ->open() just
fine - this case is easy to distinguish there.

Incidentally, why the hell do you have separate ll_revalidate_nd() and
ll_revalidate_dentry()? I realize that it'll be inlined by compiler (the
only call of the latter is tail-call with identical arguments from the
former), but...

Another nasty question: is d_need_statahead() safe in RCU pathwalk mode?
When are ll_dentry_data and ll_inode_info freed? Ditto for ->lli_sai.
Sure, actual freeing of struct inode and struct dentry is RCU-delayed,
but from the quick glance it seems that freeing those guys isn't...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Drokin, Oleg

2014-10-22 09:30:18 UTC

Hello!

Post by Drokin, Oleg
Ah! I see now.
We do the "pick alias from a list" for everything, not just for the =

directories.

Post by Drokin, Oleg
So that difference is still missing and as such there is a fear if w=

Post by Drokin, Oleg
convert to d_splice_alias in its current form, dcache would explode
from all the gazillions of invalid and unhashed dentries we produce =

during

Post by Drokin, Oleg
operations.

Interesting... Where do those gazillions come from? AFAICS, your
->d_revalidate() does that only when you get LOOKUP_OPEN | LOOKUP_CRE=

ATE

Post by Al Viro
in flags, i.e. on the final component of pathname at open() with O_CR=

EAT.

Post by Al Viro
Hmm... So basically you are trying to force them into ->atomic_open(=

)

Post by Al Viro
codepath? Fine, but... why not simply have ->open() pick what hadn't
gone through ->atomic_open()?

It's not the d_revalidate open path that's causing them.
It's ll_md_blocking_ast()->ll_invalidate_aliases()
(and then subsequent check for d_lustre_invalid()).

The idea is like this:
When we get a valid dentry from server, it comes with a lock too.
As long as the lock is valid, the dentry remains valid indefinitely.
Now, once you have some operation on the server that invalidates the st=
ate
(and revokes the lock), the client gets the message and declares that d=
entry
invalid.
In the past lustre locking was really coarse, since lock covered an ino=
de
and everything (metadata) attached to it, including all the aliases.
It's better now, but still a sequence of say
"stat()-setattr()" or similar activity could cause a lot of reinstantia=
tions
of dentry to the same file and their subsequent rapid invalidation.
(and the parts of this could happen on different nodes of course).

As for the actual open path, there's another complication in there.
Lustre has a so called "open cache", basically a cache for open file
handles, since opens are really expensive (one RPC roundtrip to metadat=
a server,
and then another for the close) and when nfs is doing open/close
for every write/read, that really drags the whole lustre reexport perfo=
rmance
down (the biggest issue, but there might be other usecases for this, li=
ke
stupid apps that do a lot of open/closes for the same file).
We also don't want to do it for every file open because sometimes it's
too expensive to actually cache the handles in this way and might need
extra lock bouncing (like say open for exec vs open for write situation=
,
also create a bunch of files and then delete them right away benchmark=
s).
So the logic is - if we come into the ->open() handler directly with no
traces of previous lookup-obtained open handle - that means we are eith=
er
called directly from NFS without any lookup, or it's a file that's
experiencing particularly high level of opens/closes and therefore it's
worth caching the open handle, so we set a special flag and the server
gives us the open handle and also the lock that allows us to cache that
handle.
Now, if dentry revalidation matched a dentry - that would result in a
straight call into ->open() without preceeding lookup (naturally),
but we now cannot distinguish if this is a real direct call to
->open() from the kernel user like NFS, or not.
In the past we just replicated part of lookup code in ->revalidate, so
it sent around lookup RPCs and then stored returned status in dentry
to be reused by ->open later, but that was not really pretty,
so the other way around seems to be to just never report dentry
cache hits for any opens (this part is not submitted upstream yet,
but it's in the queue). It'll look like this: http://review.whamcloud.c=
om/11062

Post by Al Viro
Check what nfs4_file_open() is doing; if you find out that the damn t=

hing

Post by Al Viro
*was* stale (i.e. that you really need a different inode, etc.), it's=

not

Post by Al Viro
a problem - d_drop() and return -EOPENSTALE; VFS will repeat lookups.
Note that do_dentry_open() doesn't strip O_CREAT from file->f_flags u=

ntil

Post by Al Viro
after return from ->open(). So you can see O_CREAT in ->open() just
fine - this case is easy to distinguish there.

So in our invalid case, the thing is not STALE, we still need to repeat
the lookups and all, but in most cases underlying inode did not change =
and
so it makes no sense to have another dentry created for us, so we reuse
one of the stale ones we found.

Post by Al Viro
Incidentally, why the hell do you have separate ll_revalidate_nd() an=

Post by Al Viro
ll_revalidate_dentry()? I realize that it'll be inlined by compiler =

(the

Post by Al Viro
only call of the latter is tail-call with identical arguments from th=

Post by Al Viro
former), but=85

It's a compatibility artefact with older kernels that did not have atom=
ic_open()
and the revalidate_nd had two arguments: dentry and nameidata (hence th=
e name).
So we have two version of this ll_revalidate_nd, but only one in upstre=
am
kernel.

Post by Al Viro
Another nasty question: is d_need_statahead() safe in RCU pathwalk mo=

de?

Post by Al Viro
When are ll_dentry_data and ll_inode_info freed? Ditto for ->lli_sai=

=2E

Post by Al Viro
Sure, actual freeing of struct inode and struct dentry is RCU-delayed=

Post by Al Viro
but from the quick glance it seems that freeing those guys isn't...

Hm=85
This code was plagued by races since it's inception in it's previous fo=
rm,
and the later rewrite that's in the upstream kernel.
So I won't be totally surprised if there's more of it left.
We had some recent fixes in lli_sai department for the various races of
use vs cleanup that I was hitting regularly in my special race-promotin=
g
burn-in system. Once I get back home early next week, I should be able =
to
send them in and also see what they were.
ll_dentry_data is freed via ll_release's call to
call_rcu(&lld->lld_rcu_head, free_dentry_data);

the ll_inode_info is freed via ll_destroy_inode call to
call_rcu(&inode->i_rcu, ll_inode_destroy_callback);

so I think we should be fine on this front?

Bye,
Oleg--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Al Viro

2014-10-21 20:07:42 UTC

Post by Drokin, Oleg
Hello!
=20
=20

e()

Post by Al Viro
halfway through that? None of the locks that would give an exclusi=

Post by Al Viro
against d_move() appear to be held=E2=80=A6

=20
You are right. We hit something very similar not too long ago.

Umm... While we are at it - what's the story with ll_setxattr() handli=
ng
of "trusted.lov"? What happens on the protocol level and why do we nee=
d
a file name for that, while for directories we don't seem to need it?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Drokin, Oleg

2014-10-22 01:53:08 UTC

Post by Drokin, Oleg
Hello!
=20
=20

e()

Post by Al Viro
halfway through that? None of the locks that would give an exclusi=