Discussion:
[fuse-devel] FW: Problem with concurrent ls calls to a basic fuse device
Pielage, Fiona
2006-06-15 13:00:39 UTC
A colleague of mine is experiencing the following problem using FUSE.
Can anyone advise?

________________________________

From: Clark, Steven L
Sent: 15 June 2006 12:22
To: Pielage, Fiona
Subject: Problem with concurrent ls calls to a basic fuse device


Hi there,

I've encountered a problem when trying to do some concurrency testing of
a FUSE device I've been working on. The test in question creates two
threads and performs a set of ls calls to the device for several
iterations checking the results each time.
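
A test of this shape might look like the following minimal sketch. It is not the poster's actual test: the function names, the thread count, and the timeout are assumptions made for illustration, and `path` would be a directory on the mounted FUSE file system.

```python
import os
import threading


def list_repeatedly(path, iterations, expected, failures):
    """Stat and list `path` repeatedly, recording any unexpected listing."""
    for i in range(iterations):
        os.stat(path)                      # on a FUSE mount, typically a getattr
        names = sorted(os.listdir(path))   # roughly opendir followed by readdir
        if names != expected:
            failures.append((threading.current_thread().name, i, names))


def run_concurrent_ls(path, threads=2, iterations=100, timeout=30.0):
    """Run the listing loop in several threads; return (failures, hung)."""
    expected = sorted(os.listdir(path))
    failures = []
    workers = [
        threading.Thread(
            target=list_repeatedly,
            args=(path, iterations, expected, failures),
            daemon=True,  # a stuck worker must not block interpreter exit
        )
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join(timeout)
    # Threads still alive after the timeout are reported as hung.
    hung = [w.name for w in workers if w.is_alive()]
    return failures, hung
```

The timeout only detects a worker that is stuck while the rest of the machine keeps running; it cannot help if the whole box locks up, as described below.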

Most of the time the test seems to work fine, but from time to time the
test just hangs, and worse than that ends up hanging the entire box,
forcing a reboot. Initially I was concerned that the problem was within
my own code so I inserted lots of trace messages to see which of my
methods the hang was occurring in. I was surprised to discover that the
hang occurs completely outside my code, which implies that it is
occurring somewhere in the FUSE code.

What I'm seeing with the trace is that the hang occurs directly after a
call to getattr has finished. Running the device in debug mode shows
that the getattr has completed successfully, but it then doesn't go on
to call either opendir or readdir, and at this point the hang could
occur before either of those calls.

If I run the device in single threaded mode I can't reproduce the
problem at all which suggests that this is a multithreading issue.

My only concern is over the version of compiler we are using for our
development code. The linux kernel (SUSE SLES 9 SP2) is compiled using
gcc 3.3.3 as is the FUSE library. Our developed code is compiled using
gcc 3.4.5. Could this cause a problem? If so could someone please
explain what that problem is?

Thanks in advance for your time,

Steve Clark
***@goodrich.com
Miklos Szeredi
2006-06-15 13:22:23 UTC
Post by Pielage, Fiona
I've encountered a problem when trying to do some concurrency testing of
a FUSE device I've been working on. The test in question creates two
threads and performs a set of ls calls to the device for several
iterations checking the results each time.
Most of the time the test seems to work fine, but from time to time the
test just hangs, and worse than that ends up hanging the entire box,
forcing a reboot.
What kind of a hang is it? Unrelated applications don't respond
either? Does the machine respond to SysRq (e.g. Alt-SysRq-t)
commands?

What is the fuse version? What is the kernel version?
Post by Pielage, Fiona
Initially I was concerned that the problem was within
my own code so I inserted lots of trace messages to see which of my
methods the hang was occurring in. I was surprised to discover that the
hang occurs completely outside my code, which implies that it is
occurring somewhere in the FUSE code.
What I'm seeing with the trace is that the hang occurs directly after a
call to getattr has finished. Running the device in debug mode shows
that the getattr has completed successfully, but it then doesn't go on
to call either opendir or readdir, and at this point the hang could
occur before either of those calls.
If I run the device in single threaded mode I can't reproduce the
problem at all which suggests that this is a multithreading issue.
My only concern is over the version of compiler we are using for our
development code. The linux kernel (SUSE SLES 9 SP2) is compiled using
gcc 3.3.3 as is the FUSE library. Our developed code is compiled using
gcc 3.4.5. Could this cause a problem?
I don't think so. The only important thing is that the kernel and the
fuse module are compiled with exactly the same version of gcc.

Thanks,
Miklos
John Muir
2006-06-15 15:22:53 UTC
Post by Pielage, Fiona
What I'm seeing with the trace is that the hang occurs directly after a
call to getattr has finished. Running the device in debug mode shows
that the getattr has completed successfully, but it then doesn't go on
to call either opendir or readdir, and at this point the hang could
occur before either of those calls.
Interesting that there should be a problem with getattr. Are you using
SMP machines?

I have the attached series of patches against 2.5.3 which correct the
problems with FUSE and the kernel NFS server. The lock-ups in that
scenario are similar to those found with multiple concurrent 'ls' on the
same directory, in an SMP environment.

The problem occurs because FUSE modifies some of the inode data
structures in non-write operations such as getattr without taking the
inode semaphore, and this is a problem for the NFS server, and also
between threads running through fuse.

The patches attached are the result of debugging by myself, Sean
Kormilo, and Matt Maynard. We have not released them until now for a few
reasons:
1. We weren't confident that they solved all locking issues.
2. They don't apply against the CVS head.
3. I'm not sure that they are optimal; we may have been overzealous
with our locks.
4. I haven't had time to fix the above two problems (I will have time in
about 3 months).
5. Miklos has stated many times that he doesn't think knfsd and FUSE
should mix.

The first patch, 0150-fuse_ll_process.patch, is something that is
already included in 2.6.0, and used by the next patch.

(This message was initially too large for fuse-devel, so I split it into
two. The second posting will contain the 0150-fuse_ll_process.patch,
which was posted previously in this mailing list.)

The second patch, 0200-fuse-2.5.1-ilookup.patch, implements additional
functions required by the NFS server; lookup by inode number only, and
lookup an inode's parent. My file-system implementation is inode-based,
so we did not implement these functions within fuse.c. Also, given
that inode numbers returned by fuse.c are not consistent between mounts,
it doesn't make sense to implement them. For NFS to work correctly, a
file-system which returns consistent and unique inode numbers is required.

The third patch, 0300-fuse-inode-locking.patch, implements the inode
locking required to correct the problems with the NFS server, and ls.

Anyways, that said, there they are. If they don't solve your problem,
then you can safely ignore this e-mail. Otherwise, I will endeavor to
make these patches more acceptable to Miklos, and perhaps they can be
included in a future release.

John.
--
John Muir
NORTEL
***@nortel.com
Miklos Szeredi
2006-06-16 08:50:41 UTC
Post by John Muir
Post by Pielage, Fiona
What I'm seeing with the trace is that the hang occurs directly after a
call to getattr has finished. Running the device in debug mode shows
that the getattr has completed successfully, but it then doesn't go on
to call either opendir or readdir, and at this point the hang could
occur before either of those calls.
Interesting that there should be a problem with getattr. Are you using
SMP machines?
I have the attached series of patches against 2.5.3 which correct the
problems with FUSE and the kernel NFS server. The lock-ups in that
scenario are similar to those found with multiple concurrent 'ls' on the
same directory, in an SMP environment.
The problem occurs because FUSE modifies some of the inode data
structures in non-write operations such as getattr without taking the
inode semaphore, and this is a problem for the NFS server, and also
between threads running through fuse.
Interesting theory, but I'm not sure exactly what could cause problems
around setting the attributes. One suspect is i_size_read(), which
spins on SMP until i_size is stable. But this would only cause a hang
if i_size_write() were called in a loop, which I don't see happening
anywhere.
Post by John Muir
The patches attached are the result of debugging by myself, Sean
Kormilo, and Matt Maynard. We have not released them until now for a few
reasons:
1. We weren't confident that they solved all locking issues.
2. They don't apply against the CVS head.
3. I'm not sure that they are optimal; we may have been overzealous
with our locks.
Yes :)
Post by John Muir
4. I haven't had time to fix the above two problems (I will have time in
about 3 months).
5. Miklos has stated many times that he doesn't think knfsd and FUSE
should mix.
Yes, but current code should be fixed nonetheless.

Also this may not just be NFS related. I've got another lockup report
in which NFS is not involved at all. So there's probably something
fishy in there and SMP seems to be a common denominator.

Thanks,
Miklos
John Muir
2006-06-16 12:16:36 UTC
Post by Miklos Szeredi
Post by John Muir
The problem occurs because FUSE modifies some of the inode data
structures in non-write operations such as getattr without taking the
inode semaphore, and this is a problem for the NFS server, and also
between threads running through fuse.
Interesting theory, but I'm not sure exactly what could cause problems
around setting the attributes. One suspect is i_size_read(), which
spins on SMP until i_size is stable. But this would only cause a hang
if i_size_write() were called in a loop, which I don't see happening
anywhere.
We did observe that the lockups are occurring in this code.

Questions that come to mind:
How would i_size_read() be affected by caching?
Does i_size_write() not require a semaphore lock?
We are running 2.6.10. I wonder if there have been changes to
i_size_read() and company?
Post by Miklos Szeredi
Post by John Muir
5. Miklos has stated many times that he doesn't think knfsd and FUSE
should mix.
Yes, but current code should be fixed nonetheless.
Well, the 'ilookup' and 'lookupparent' functions that we have added are
essential to providing 'proper' NFS service, although as I had said in
my e-mail, it may not make much sense to add an implementation of those
to the fuse.c file-based library.
Post by Miklos Szeredi
Also this may not just be NFS related. I've got another lockup report
in which NFS is not involved at all. So there's probably something
fishy in there and SMP seems to be a common denominator.
Agreed, this is definitely not only NFS related. It's just that I'm
posting my series of patches as is, including the additional NFS
functionality, and separating the patches was more work than I wanted to
do at the moment.

John.
--
John Muir
NORTEL
***@nortel.com
Miklos Szeredi
2006-06-16 12:31:56 UTC
Post by John Muir
We did observe that the lockups are occurring in this code.
You mean it entered fuse_change_attributes() before the lockup, and
did not exit this function?
Post by John Muir
How would i_size_read() be affected by caching?
Does i_size_write() not require a semaphore lock?
No, the reason i_size_read/write exist is to be able to do atomic
(lockless) access to i_size on 32 bit archs.
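
For readers unfamiliar with the mechanism, the lockless scheme can be pictured as a sequence counter that the writer bumps before and after storing the value, while the reader retries until it sees the same even counter on both sides of its load. The sketch below is a toy, single-threaded Python model of that idea, not kernel code; the class, method, and function names are hypothetical. Its last function replays one hypothetical interleaving of two unserialized writers to show how a reader of this kind could end up spinning. Whether anything like that happened in the lockup reported here is not established in this thread.

```python
class SeqCountedSize:
    """Toy model of a value guarded by a sequence counter."""

    def __init__(self, size=0):
        self.seq = 0        # even: stable, odd: a write is in progress
        self.size = size

    def write(self, size):
        self.seq += 1       # writer begins: counter becomes odd
        self.size = size
        self.seq += 1       # writer ends: counter becomes even again

    def try_read(self):
        """One pass of the reader loop; None means 'retry'."""
        start = self.seq
        value = self.size
        if (start & 1) or self.seq != start:
            return None     # write in progress, or counter moved under us
        return value

    def read(self, max_spins=1000):
        # A real reader would loop without a bound; the bound here only
        # keeps the model from hanging.
        for _ in range(max_spins):
            value = self.try_read()
            if value is not None:
                return value
        raise RuntimeError("sequence counter never became stable")


def replay_lost_increment(s, new_size):
    """Replay two writers whose 'begin' increments overlap (hypothetical)."""
    a = s.seq               # writer A loads the counter
    b = s.seq               # writer B loads the same value
    s.seq = a + 1           # A stores
    s.seq = b + 1           # B stores the same number: one increment is lost
    s.size = new_size
    s.seq += 1              # A ends
    s.seq += 1              # B ends: counter is now odd and stays odd
```

In this model a normal `write()` always leaves the counter even, so `read()` returns on its first pass. After `replay_lost_increment()` the counter is left odd with no writer active, and every later `try_read()` asks for a retry.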
Post by John Muir
We are running 2.6.10. I wonder if there have been changes to
i_size_read() and company?
Dunno.

The other thing in there which is not just a plain assignment is
invalidate_inode_pages(), but that does its own locking and I don't
think it requires the caller to do any. But maybe I'm wrong.

It would be very useful if you could test where the lockup occurs
within fuse_change_attributes().
Post by John Muir
Post by Miklos Szeredi
Post by John Muir
5. Miklos has stated many times that he doesn't think knfsd and FUSE
should mix.
Yes, but current code should be fixed nonetheless.
Well, the 'ilookup' and 'lookupparent' functions that we have added are
essential to providing 'proper' NFS service, although as I had said in
my e-mail, it may not make much sense to add an implementation of those
to the fuse.c file-based library.
Post by Miklos Szeredi
Also this may not just be NFS related. I've got another lockup report
in which NFS is not involved at all. So there's probably something
fishy in there and SMP seems to be a common denominator.
Agreed, this is definitely not only NFS related. It's just that I'm
posting my series of patches as is, including the additional NFS
functionality, and separating the patches was more work than I wanted to
do at the moment.
Yeah, thanks for responding. If you can reproduce this lockup with or
without NFS it would be worth putting a bit more effort into finding
out the root cause.

Thanks,
Miklos

John Muir
2006-06-15 15:26:33 UTC
This message contains the missing patch for this thread.

John.
--
John Muir
NORTEL
***@nortel.com