2.6.36 io bring the system to its knees

Find attached screenshot ( latencytop_n_powertop.png ) which depicts
artifacts where the window manager froze at the time I was trying to
see a tab in Konsole where the powertop was running.

You seem to have forgotten to include the attachment.

I got it - it appears it was too large for lkml's ~500K mail size limit.

Aidar, mind sending a smaller image?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Pekka Enberg

2010-10-28 09:35:10 UTC

Ingo, didn't you have some nice script to capture system state? Maybe
that could shed some light to what's going on in Aidar's system?

Pekka

Pekka Enberg

2010-10-28 11:16:17 UTC

Looks mostly VFS to me. Aidar, does killing Picasa make things
smoother for you? If so, maybe the VFS scalability patches will help.

Pekka

Aidar Kultayev

2010-10-28 11:34:05 UTC

If it wasn't Picasa, it would have been something else. I mean, if I
kill Picasa (later on it finished indexing new pics anyway),
VirtualBox would thrash the IO instead. So, nope, getting rid of
Picasa doesn't help either. In general the system's responsiveness
or sluggishness is dominated by the IO operations going on: the dd
& cp, and probably VBox issuing a whole bunch of IO of its own.

Another way I see these delays is when I leave the system overnight
with ktorrent & juk (stopped) in the background. It takes some time for
the WM (kwin) to work out ALT+TAB the very next morning. But this might
be because the WM (kwin & its code) has been swapped out after a long
period of not being used.

But, in general, I have trouble with responsiveness when I try to
restore my VirtualBox image from a saved state. If there is a dd doing
its stuff while VirtualBox is restoring its image, I see those nasty
delays in kwin, the mouse pointer, etc.

thanks Aidar

PS: the good thing is, and I am getting used to it, I don't lose
data. I mean the system doesn't hang, it just freezes for a while :)

Pekka Enberg

2010-10-28 11:48:28 UTC

Post by Aidar Kultayev
If it wasn't Picasa, it would have been something else. I mean, if I
kill Picasa (later on it finished indexing new pics anyway),
VirtualBox would thrash the IO instead. So, nope, getting rid of
Picasa doesn't help either. In general the system's responsiveness
or sluggishness is dominated by the IO operations going on: the dd
& cp, and probably VBox issuing a whole bunch of IO of its own.

Do you still see high latencies in vfs_llseek() and vfs_fsync()? I'm
not a VFS expert but looking at your latencytop output, it seems that
fsync grabs ->i_mutex which blocks vfs_llseek(), for example. I'm not
sure why that causes high latencies though it's a mutex we're holding.
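
One way to check whether seek-plus-fsync contention is reproducible at all is a small userspace probe. The following is a minimal Python sketch and not something posted in this thread: one thread appends to a file and fsyncs it in a loop while the main thread times lseek() calls on the same descriptor. The function name `max_lseek_latency` and its parameters are made up for illustration. Whether lseek() waits behind fsync at all depends on the kernel version and the filesystem, and the measured time also includes Python's own thread-switching overhead, so treat the result as a hint only.

```python
import os
import threading
import time


def max_lseek_latency(path, duration=0.5, chunk=64 * 1024):
    """Return the worst lseek() latency in seconds while another thread fsyncs."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    stop = threading.Event()

    def writer():
        # Keep dirtying the file and forcing it to disk, like an fsync-heavy app.
        buf = b"\0" * chunk
        while not stop.is_set():
            os.write(fd, buf)
            os.fsync(fd)

    thread = threading.Thread(target=writer)
    thread.start()
    worst = 0.0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            start = time.monotonic()
            os.lseek(fd, 0, os.SEEK_CUR)  # query the offset without moving it
            worst = max(worst, time.monotonic() - start)
            time.sleep(0.001)  # leave the writer thread room to run
    finally:
        stop.set()
        thread.join()
        os.close(fd)
    return worst
```

Running it against a scratch file on the affected filesystem, once idle and once with the dd going, would show whether the worst-case lseek() time moves with the IO load.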

Post by Aidar Kultayev
Another way I see these delays is when I leave the system overnight
with ktorrent & juk (stopped) in the background. It takes some time for
the WM (kwin) to work out ALT+TAB the very next morning. But this might
be because the WM (kwin & its code) has been swapped out after a long
period of not being used.

Yeah, that's probably paging overhead.

P.S. Can you please upload latencytop output somewhere and post a URL
to it so other people can also see it?

Aidar Kultayev

2010-10-28 12:18:14 UTC

http://picasaweb.google.com/aidar.eiei/LinuxIo#5533068249408411698

I will look into latencytop output and will figure out a usage pattern
that is most annoying with regards to IO.
Will try to see what leads to that & if possible to make a screenshot
of what is going on.
The thing is, I don't think the program that captures the screenshots
does it in a meaningful way, because at the moment the system is
brought to its knees, I don't think that this particular program
(KSnapshot) can get away from being affected. I mean it might take a
snapshot which is not representative enough.

thanks, Aidar

Christoph Hellwig

2010-10-28 13:47:10 UTC

It does. But what workload does a lot of llseeks while fsyncing the
same file? I'd bet some application is doing really stupid things here.

Ingo Molnar

2010-10-28 13:55:07 UTC

Do you still see high latencies in vfs_llseek() and vfs_fsync()? I'm not a VFS
expert but looking at your latencytop output, it seems that fsync grabs
->i_mutex which blocks vfs_llseek(), for example. I'm not sure why that causes
high latencies though it's a mutex we're holding.

It does. But what workload does a lot of llseeks while fsyncing the same file?
I'd bet some application is doing really stupid things here.

Seeking in a file and fsync-ing it does not seem like an inherently bad or even
stupid thing to do - why do you claim that it is stupid?

If mixed seek()+fsync() is the reason for these latencies (which is just an
assumption right now) then it needs to be fixed in the kernel, not in apps.

Thanks,

Ingo

Ingo Molnar

2010-10-28 13:31:03 UTC

Looks mostly VFS to me. Aidar, does killing Picasa make things smoother for you?
If so, maybe the VFS scalability patches will help.

Hm, but the VFS scalability patches mostly decrease CPU usage, and do that mostly
on many-core systems.

How do I notice slowdowns? JuK lags so badly that it can't play any music,
the mouse pointer freezes, kwin effects freeze for a few seconds.
How can I make it much worse? I can try & run disk cleanup under XP, which is
running in VBox, with folder compression. On top of that, if I start copying big
files in Linux (700MB avis, etc.), GUI effects freeze and the mouse pointer
freezes for a few seconds.
And this is on 2.6.36, which is supposed to cure these "features". From this
perspective, 2.6.36 is no better than any previous stable kernel I've tried.
Probably as bad with regards to IO issues.

"Many seconds freezes" and slowdowns won't be fixed via the VFS scalability patches
I'm afraid.

This has the appearance of some really bad IO or VM latency problem. Unfixed and
present in stable kernel versions going from years ago all the way to v2.6.36.

Thanks,

Ingo

Christoph Hellwig

2010-10-28 13:47:44 UTC

Looks mostly VFS to me. Aidar, does killing Picasa make things smoother for you?
If so, maybe the VFS scalability patches will help.

Hm, but the VFS scalability patches mostly decrease CPU usage, and do that mostly
on many-core systems.

If you have i_mutex contention they are not going to change anything.

Ingo Molnar

2010-10-28 13:51:13 UTC

Post by Christoph Hellwig

Looks mostly VFS to me. Aidar, does killing Picasa make things smoother for you?
If so, maybe the VFS scalability patches will help.

Hm, but the VFS scalability patches mostly decrease CPU usage, and do that
mostly on many-core systems.

If you have i_mutex contention they are not going to change anything.

Yes, that was my point.

Thanks,

Ingo

Chris Mason

2010-10-28 17:05:02 UTC

Post by Ingo Molnar
"Many seconds freezes" and slowdowns won't be fixed via the VFS scalability patches
I'm afraid.
This has the appearance of some really bad IO or VM latency problem. Unfixed and
present in stable kernel versions going from years ago all the way to v2.6.36.

Hmmm, the workload you're describing here has two special parts. First
it dramatically overloads the disk, and then it has guis doing things
waiting for the disk.

The virtualbox part of the workload is probably filling the queue with
huge amounts of synchronous random IO (I'm assuming it is going in via
O_DIRECT), and this will defeat any attempts from the filesystem to tell
the elevator "hey look, my IO is synchronous, please do hurry".

So, I'd try mounting ext4 in data=writeback mode. I can't make ext4
stall fsyncs on non-fsync IO locally and it looks like they have solved
the ext3 data=ordered problem. But I'd still like to rule out old and
known issues before we dig into new things.

I'd also suggest something like the below patch which is entirely
untested and must be blessed by an actual ext4 developer. I think we
can make fsync faster if we put the mutex locking down in the FS, but
until then it should be ok to drop the mutex while we are doing the
expensive log commits:

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..1b7a637 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -114,6 +114,7 @@ int ext4_sync_file(struct file *file, int datasync)
 	if (ext4_should_journal_data(inode))
 		return ext4_force_commit(inode->i_sb);
 
+	mutex_unlock(&inode->i_mutex);
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
 	if (jbd2_log_start_commit(journal, commit_tid)) {
 		/*
@@ -133,5 +134,7 @@ int ext4_sync_file(struct file *file, int datasync)
 	} else if (journal->j_flags & JBD2_BARRIER)
 		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
 			BLKDEV_IFL_WAIT);
+
+	mutex_lock(&inode->i_mutex);
 	return ret;
 }

Pekka Enberg

2010-10-28 17:58:01 UTC

Don't we need to call ext4_should_writeback_data() before we drop the
lock? It pokes at ->i_mode which needs ->i_mutex AFAICT.

Ted Ts'o

2010-10-29 14:52:42 UTC

Post by Pekka Enberg
Don't we need to call ext4_should_writeback_data() before we drop the
lock? It pokes at ->i_mode which needs ->i_mutex AFAICT.

No, it should be fine. It's not like a file is going to change from
being a regular file to a directory or vice versa. :-)

From a quick inspection it looks OK, but I haven't had the time to
look more closely to be 100% sure, and of course I haven't run it
through a battery of regression tests. For normal usage it should be
fine though.

Aidar, if you'd be willing to try it with this patch applied, and with
the file system mounted data=writeback, and then let me know what the
latencytop reports, that would be useful. I'm fairly sure that fixing
llseek() probably won't make that much difference, since it will
probably spread things out to other places, but it would be good to
make the experiment.

We will probably also need to use the uninitialized bit for protecting
data from showing up after a crash for extent-based files, and turning
on data=writeback is a good way to simulate that. (Sorry, no way
we're going to make a change like that this merge cycle, but that
might be something we could do for 2.6.38.) But I am curious to see
what are the next things that come up as being problematic after that.

Thanks,

- Ted

Aidar Kultayev

2010-10-29 15:34:05 UTC

pulling the git now - I will try whatever you throw at me.

Ingo Molnar

2010-10-30 09:15:25 UTC

Post by Aidar Kultayev
pulling the git now - I will try whatever you throw at me.

Ted, I stuck that patch into tip:out-of-tree as:

22fd555f6c5f: <not for upstream> ext4: Relax i_mutex hold times

So that Aidar can test things more easily via:

http://people.redhat.com/mingo/tip.git/README

Thanks,

Ingo

Aidar Kultayev

2010-10-30 13:02:44 UTC

Hi,

here is what I have :

ext4 mounted with data=ordered
-tip tree ( uname -a gives : Linux pussy 2.6.36-tip+ )

here is the latencytop & powertop & top screenshot:

http://picasaweb.google.com/lh/photo/bMTgbVDoojwUeXtVdyvIKw?feat=directlink

the system is/was doing :
dd if=/dev/zero of=test.10g bs=1M count=10000;rm test.10g
netbeans
compiling gcc-4.5.1
running VBox, which wasn't doing any IO. The guest os was idle in other words
vlc
chromium
firefox
and bunch of other small stuff.

Even without dd running, the mouse cursor would occasionally
lag. The alt+tab effect in KWin would take 5+ seconds to work out.
When I ran dd on top of the workload, it consistently made the system much
laggier. The cursor would freeze much more frequently. It is as if
you drag your mouse physically, but the cursor on the screen
jumps discretely; in other words, there is no continuity.
Music would stop.

I am free to try out anything here.

thanks, Aidar

Chris Mason

2010-10-30 19:08:58 UTC

Post by Aidar Kultayev
Hi,
ext4 mounted with data=ordered
-tip tree ( uname -a gives : Linux pussy 2.6.36-tip+ )
http://picasaweb.google.com/lh/photo/bMTgbVDoojwUeXtVdyvIKw?feat=directlink

It's actually better, fsync is missing anyway. Please try ext4
data=writeback.

-chris

Ted Ts'o

2010-10-31 02:32:31 UTC

Post by Aidar Kultayev
dd if=/dev/zero of=test.10g bs=1M count=10000;rm test.10g
netbeans
compiling gcc-4.5.1
running VBox, which wasn't doing any IO. The guest os was idle in other words
vlc
chromium
firefox
and bunch of other small stuff.
Even without dd running, the mouse cursor would occasionally
lag. The alt+tab effect in KWin would take 5+ seconds to work out.
When I ran dd on top of the workload, it consistently made the system much
laggier. The cursor would freeze much more frequently. It is as if
you drag your mouse physically, but the cursor on the screen
jumps discretely; in other words, there is no continuity.
Music would stop.

If you start shutting down tasks, Vbox, netbeans, chromium, etc., at
what point does the cursor start tracking the system easily? Is the
system swapping? Do you know how to use tools like dstat or iostat to
see if the system is actively writing to the swap partition? (And are
you using a swap partition or a swap file?)

The fact that the cursor isn't tracking well even when the dd isn't running,
and presumably the only source of I/O is the gcc and vlc, makes me
suspect that you may be swapping pretty heavily. Have you tried
investigating that possibility, and made sure it has been ruled out?
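
A rough way to answer the swapping question without dstat or iostat is to watch the cumulative pswpin and pswpout counters in /proc/vmstat. Below is a minimal Python sketch, assuming a Linux /proc layout; the helper names (`parse_swap_counters`, `swap_rate`, `watch_swap`) are hypothetical and only for illustration.

```python
import time


def parse_swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout page counts from /proc/vmstat text."""
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in counters:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rate(before, after, seconds):
    """Pages swapped in/out per second between two samples."""
    return {key: (after[key] - before[key]) / seconds for key in before}


def watch_swap(interval=5.0, path="/proc/vmstat"):
    """Sample the counters twice and return the per-second rates."""
    with open(path) as f:
        first = parse_swap_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_swap_counters(f.read())
    return swap_rate(first, second, interval)
```

Calling `watch_swap()` while the cursor is lagging and seeing rates that stay near zero would argue against heavy swapping; sustained non-zero pswpout would support it.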

- Ted

Corrado Zoccolo

2010-10-31 17:49:51 UTC

Post by Ted Ts'o

If you start shutting down tasks, Vbox, netbeans, chromium, etc., at
what point does the cursor start tracking the system easily? Is the
system swapping? Do you know how to use tools like dstat or iostat to
see if the system is actively writing to the swap partition? (And are
you using a swap partition or a swap file?)
The fact that the cursor isn't tracking well even when the dd isn't running,
and presumably the only source of I/O is the gcc and vlc, makes me
suspect that you may be swapping pretty heavily. Have you tried
investigating that possibility, and made sure it has been ruled out?

Something to try is also to raise X cpu scheduling priority, since I
would be really surprised if we evict from memory the routine that
draws the cursor.
BTW, I've seen the cursor jumping problem even when not swapping, and
with minimal *real* disk activity (but with heavy usage of a fuse
filesystem providing remote resources), and high cpu activity.
Raising X priority solved the problem with the mouse pointer, but the
gui programs still didn't respond quickly...

Thanks
Corrado

Shaohua Li

2010-11-02 03:10:29 UTC

would you please try the vm_exec protect patch here?
http://www.spinics.net/lists/linux-mm/msg09617.html

Thanks,
Shaohua

Sanjoy Mahajan

2010-11-02 11:53:05 UTC

Post by Ingo Molnar
This has the appearance of some really bad IO or VM latency
problem. Unfixed and present in stable kernel versions going from
years ago all the way to v2.6.36.

Hmmm, the workload you're describing here has two special parts.
First it dramatically overloads the disk, and then it has guis doing
things waiting for the disk.

I think I see this same issue every few days when I back up my hard
drive to a USB hard drive using rsync. While the backup is running, the
interactive response is bad. A reproducible measurement of the badness
is starting an rxvt with F8 (bound to "rxvt &" in my .twmrc). Often it
takes 8 seconds for the window to appear (as it just did about 2 minutes
ago)! (Starting a subsequent rxvt is quick.)

The command for running the backup:

rsync -av --delete /etc /home /media/usbdrive/bak > /tmp/homebackup.log

The hardware is a T60 w/ Intel graphics and wireless, 1.5GB RAM, 5400rpm
160GB harddrive w/ ext3 filesystems, and it's running vanilla 2.6.36.
There's not much memory pressure. The swap is mostly empty, and there's
usually a Firefox eating 500MB of RAM. Even Emacs at 50MB is in the
noise compared to the Firefox.

Here's the 'free' output:

             total       used       free     shared    buffers     cached
Mem:       1545292    1500288      45004          0      92848     713988
-/+ buffers/cache:     693452     851840
Swap:      2000088      22680    1977408
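
The "-/+ buffers/cache" line can be re-derived from the "Mem:" line, which is a quick check that most of the "used" memory here is reclaimable cache rather than application memory. A small Python sketch using the numbers above (values in KiB, as free prints by default; the function name is only illustrative):

```python
def buffers_cache_line(used, free, buffers, cached):
    """Recompute free's '-/+ buffers/cache' line from the 'Mem:' line (KiB)."""
    used_by_apps = used - buffers - cached    # memory applications really hold
    reclaimable_or_free = free + buffers + cached
    return used_by_apps, reclaimable_or_free


apps, available = buffers_cache_line(
    used=1500288, free=45004, buffers=92848, cached=713988
)
print(apps, available)  # 693452 851840
```

Both values match the second line of the output, so roughly 830MB of the 1.5GB is either free or cache that the kernel can drop.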

What tests or probes are worth running when the problem reappears in
order to find the root cause?

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb

Chris Mason

2010-11-02 13:15:13 UTC

So this sounds like the backup is just thrashing your cache. Latencies
starting an app are less surprising than latencies where a running app
doesn't respond at all.

Does rsync have the option to do an fadvise DONTNEED?
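
For reference, this is roughly what an application-side fadvise DONTNEED looks like. It is a minimal Python sketch of the idea, not anything rsync does: the function name `copy_dropping_cache` is made up, and the advice is only a hint that the kernel is free to ignore. The destination is fsynced first because dirty pages cannot be dropped before they are written back.

```python
import os


def copy_dropping_cache(src, dst, chunk=1 << 20):
    """Copy src to dst, then hint that neither file needs to stay in page cache."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
        fout.flush()
        os.fsync(fout.fileno())  # dirty pages must hit the disk before they can be dropped
        if hasattr(os, "posix_fadvise"):
            for f in (fin, fout):
                try:
                    # offset 0, length 0 means "the whole file"
                    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
                except OSError:
                    pass  # advisory only; ignore filesystems that refuse it
    return copied
```

A backup tool built this way would still compete for disk bandwidth while it runs, but it should leave less of other programs' cached data evicted afterwards.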

-chris

Sanjoy Mahajan

2010-11-04 16:06:06 UTC

Post by Chris Mason
So this sounds like the backup is just thrashing your cache.

I think it's more than that. Starting an rxvt shouldn't take 8 seconds,
even with a cold cache. Actually, it does take a while, so you do have
a point. I just did

echo 3 > /proc/sys/vm/drop_caches

and then started rxvt. That takes about 3 seconds (which seems long,
but I don't know wherein that slowness lies), of which maybe 0.25
seconds is loading and running 'date':

$ time rxvt -e date
real 0m2.782s
user 0m0.148s
sys 0m0.032s

The 8-second delay during the rsync must have at least two causes: (1)
the cache is wiped out, and (2) the rxvt binary cannot be paged in
quickly because the disk is doing lots of other I/O.

Can the system somehow know that paging in the rxvt binary and shared
libraries is interactive I/O, because it was started by an interactive
process, and therefore should take priority over the rsync?

Post by Chris Mason
Does rsync have the option to do an fadvise DONTNEED?

I couldn't find one. It would be good to have a solution that is
independent of the backup app. (The 'locate' cron job does a similar
thrashing of the interactive response.)

-Sanjoy

Steven Barrett

2010-11-04 23:35:42 UTC

I'm definitely no expert in Linux' file cache management, but from what
I've experienced... isn't the real problem that the "interactive"
processes, like your web browser or file manager, lose their inode and
dentry cache when rsync runs? Then while rsync is busy reading and
writing to the disk, whenever you click on your interactive application,
it tries to read what it lost to rsync from the disk while rsync is
still thrashing your inode/dentry cache.

This is a major problem even when my system has lots of RAM (4 GB on this
laptop).

What has helped me, however, is reducing vm.vfs_cache_pressure to a
smaller value (25 here) so that Linux prefers to retain the current
inode / dentry cache rather than suddenly give it up for a new greedy
I/O type of program. The only side effect is that file copying is a
little slower than usual... totally worth it though.
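
For anyone who wants to try the same setting, the tunable lives at /proc/sys/vm/vfs_cache_pressure (kernel default 100). A minimal Python sketch for reading and changing it follows; the helper names are hypothetical, the `root` parameter exists only so the functions can be pointed at a scratch directory, and writing the real file needs root privileges.

```python
import os


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as 'vm.vfs_cache_pressure' to its file."""
    return os.path.join(root, *name.split("."))


def read_sysctl_int(name, root="/proc/sys"):
    with open(sysctl_path(name, root)) as f:
        return int(f.read().strip())


def write_sysctl_int(name, value, root="/proc/sys"):
    """Set the value and return the previous one so it can be restored later."""
    previous = read_sysctl_int(name, root)
    with open(sysctl_path(name, root), "w") as f:
        f.write(f"{value}\n")
    return previous
```

For example, `old = write_sysctl_int("vm.vfs_cache_pressure", 25)` applies the value mentioned above until the next reboot, and `write_sysctl_int("vm.vfs_cache_pressure", old)` puts the previous value back.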

Steven Barrett

Jesper Juhl

2010-11-04 23:55:45 UTC

Hmmm, the workload you're describing here has two special parts. First
it dramatically overloads the disk, and then it has guis doing things
waiting for the disk.

Just want to chime in with a 'me too'.

I see something similar on Arch Linux when doing 'pacman -Syyuv' and there
are many (as in more than 5-10) updates to apply. While the update is
running (even if that's all the system is doing) system responsiveness is
terrible - just starting 'chromium' which is usually instant (at least
less than 2 sec at worst) can take upwards of 10 seconds and the mouse
cursor in X starts to jump a bit as well and switching virtual desktops
noticeably lags when redrawing the new desktop if there's a full screen app
like gimp or OpenOffice open there. This is on a Lenovo Thinkpad R61i
which has a 'Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz' CPU, 2GB of
memory and 499996 kilobytes of swap.
--
Jesper Juhl <***@chaosbits.net> http://www.chaosbits.net/
Plain text mails only, please http://www.expita.com/nomime.html
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html


Jesper Juhl

2010-11-04 23:59:24 UTC

Hmmm, the workload you're describing here has two special parts. First
it dramatically overloads the disk, and then it has guis doing things
waiting for the disk.

Forgot to mention the kernel I currently experience this with :

[***@dragon ~]$ uname -a
Linux dragon 2.6.35-ARCH #1 SMP PREEMPT Sat Oct 30 21:22:26 CEST 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
--
Jesper Juhl <***@chaosbits.net> http://www.chaosbits.net/
Plain text mails only, please http://www.expita.com/nomime.html
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html


Dave Chinner

2010-11-05 01:45:16 UTC

Hmmm, the workload you're describing here has two special parts. First
it dramatically overloads the disk, and then it has guis doing things
waiting for the disk.

I think anyone reporting an interactivity problem also needs to
indicate what their filesystem is, what mount parameters they are
using, what their storage config is, whether barriers are active or
not, what elevator they are using, whether one or more of the
applications are issuing fsync() or sync() calls, and so on.

Basically, what we need to know is whether these problems are
isolated to a particular filesystem or storage type because
they may simply be known problems (e.g. the ext3 fsync-the-world
problem).
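
Most of that information can be gathered mechanically. A rough sketch, assuming the /proc/mounts field layout (device, mount point, type, options) and the sysfs convention of bracketing the active elevator in /sys/block/<dev>/queue/scheduler; the function names are invented here:

```python
def parse_mounts(text):
    """Parse /proc/mounts-style text into a list of dicts."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        entries.append({
            "device": device,
            "mountpoint": mountpoint,
            "fstype": fstype,
            "options": options.split(","),
        })
    return entries

def active_scheduler(text):
    """Return the bracketed elevator name, e.g. 'cfq' from 'noop deadline [cfq]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None  # no bracketed entry found

# Typical use on a live system:
#   parse_mounts(open("/proc/mounts").read())
#   active_scheduler(open("/sys/block/sda/queue/scheduler").read())
```

This covers filesystem type, mount options and elevator; barrier state and fsync() behaviour still have to come from dmesg and from tracing the applications.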

Cheers,

Dave.

--
Dave Chinner
***@fromorbit.com

Sanjoy Mahajan

2010-11-05 12:48:30 UTC

Post by Dave Chinner
I think anyone reporting an interactivity problem also needs to
indicate what their filesystem is, what mount parameters they are
using, what their storage config is, whether barriers are active or
not, what elevator they are using, whether one or more of the
applications are issuing fsync() or sync() calls, and so on.

Good idea.

The filesystems are all ext3 with default mount parameters. The dmesgs
say that the filesystems are mounted in ordered data mode and that
barriers are not enabled.

mount says:

/dev/sda2 on / type ext3 (rw,errors=remount-ro,commit=0)
/dev/sda1 on /boot type ext3 (rw,commit=0)
/dev/sda3 on /home type ext3 (rw,commit=0)

Post by Dave Chinner
storage config

Do you mean the partition sizes? Here's that:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        72G   52G   17G  77% /
tmpfs           755M  4.0K  755M   1% /lib/init/rw
udev            750M  212K  750M   1% /dev
tmpfs           755M     0  755M   0% /dev/shm
/dev/sda1       274M  117M  143M  45% /boot
/dev/sda3        74G   37G   33G  53% /home

Post by Dave Chinner
elevator

CFQ

Post by Dave Chinner
sync-related calls

I don't have a test from the time I ran rsync (but I'll check that
tonight), but I traced the currently running emacs and iceweasel
(a.k.a. firefox) with "strace -p PID 2>&1 | grep sync". That didn't
turn up any sync-related calls.

(I checked the firefox because I seem to remember that it used to do
fsync absurdly often, but I also seem to remember that the outcry made
them stop.)
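
A plain "grep sync" also matches file names and other arguments that merely contain the string. A slightly stricter filter might look like the sketch below, which assumes the usual strace line shape "name(args) = result", optionally prefixed with "[pid N]"; the function name and the set of syscalls are my own choice, and lines for interrupted calls ("<... fsync resumed>") are ignored.

```python
import re

SYNC_CALLS = {"sync", "fsync", "fdatasync", "msync", "sync_file_range", "syncfs"}

# Optional "[pid 1234] " prefix (present with strace -f), then the syscall name.
_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")

def sync_related(lines):
    """Return the strace output lines whose syscall is a sync-type call."""
    hits = []
    for line in lines:
        match = _LINE.match(line)
        if match and match.group(1) in SYNC_CALLS:
            hits.append(line)
    return hits

sample = [
    'write(5, "x", 1) = 1',
    'fsync(5) = 0',
    'open("/tmp/syncdir/a", O_RDONLY) = 3',   # grep would match this one too
]
print(sync_related(sample))   # ['fsync(5) = 0']
```

Note that without -f, "strace -p PID" follows only the one thread it attaches to, so sync calls issued from other threads would not show up.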

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb

dave b

2010-11-06 14:11:34 UTC

I have come to think that this problem is the kernel not keeping
track of readers vs writers properly, or not giving reading processes
as much time as the writing ones, which look like they are blocking
the system....

If you want to do a simple test do an unlimited dd (or two dd's of a
limited size, say 10gb) and a find /
Tell me how it goes :) ( the system will stall)
(obviously stop the dd after some time :) ).
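
For a repeatable version of this test, something along these lines could stand in for the dd and the find. It is only a sketch: write_junk and timed_walk are hypothetical helpers, the writer goes through the page cache like a plain dd, and the two would be run in separate processes against the disk under test.

```python
import os
import time

def write_junk(path, total_bytes, chunk_size=1 << 20):
    """Sequentially write total_bytes of zeros to path, like a bounded dd."""
    chunk = b"\0" * chunk_size
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            written += n
    return written

def timed_walk(top):
    """Walk a directory tree like find and return (entries_seen, seconds)."""
    start = time.monotonic()
    entries = 0
    for _dirpath, dirnames, filenames in os.walk(top):
        entries += len(dirnames) + len(filenames)
    return entries, time.monotonic() - start
```

Comparing the timed_walk result with and without the writer running (and with a cold cache each time) gives a number to report instead of "it stalls".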

http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
iirc can reproduce this on plain ext3.

Dave Chinner

2010-11-06 15:14:35 UTC

Could be anything from that description....

Post by dave b
If you want to do a simple test do an unlimited dd (or two dd's of a
limited size, say 10gb) and a find /
Tell me how it goes :)

The find runs at IO latency speed while the dd processes run at disk
bandwidth:

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
vda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00
vdb        0.00    0.00   58.00  1251.00   0.45  556.54    871.45     26.69  20.39   0.72  94.32
sda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00

That looks pretty normal to me for XFS and the noop IO scheduler,
and there are no signs of latency or interactive problems in
the system at all. Kill the dd's and:

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
vda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00
vdb        0.00    0.00  214.80     0.40   1.68    0.00     15.99      0.33   1.54   1.54  33.12
sda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00

And the find runs 3-4x faster, but ~200 iops is about the limit
I'd expect from 7200rpm SATA drives given a single thread issuing IO
(i.e. 5ms average seek time).
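
The arithmetic behind that estimate, as a small sketch (the function name and the `outstanding` parameter are illustrative, not from any tool): with one request in flight and roughly 5 ms per seek-bound read, the ceiling is 1000 / 5 = 200 IOPS.

```python
def iops_limit(latency_ms, outstanding=1):
    """Upper bound on IOPS when each request takes latency_ms and
    `outstanding` requests are in flight at once."""
    return outstanding * 1000.0 / latency_ms

# One thread, about 5 ms per seek-bound read:
print(iops_limit(5))                                  # 200.0

# Read rates for vdb from the two iostat samples:
reads_with_dd, reads_without_dd = 58.00, 214.80
print(round(reads_without_dd / reads_with_dd, 1))     # 3.7, the "3-4x faster"

# avgrq-sz is reported in 512-byte sectors, so find issues ~8 KB reads:
print(round(15.99 * 512 / 1024, 1))                   # 8.0
```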

Post by dave b
( the system will stall)

No, the system doesn't stall at all. It runs just fine. Sure,
anything that requires IO on the loaded filesystem is _slower_, but
if you're writing huge files to it that's pretty much expected. The
root drive (on a different spindle) is still perfectly responsive on
a cold cache:

$ sudo time find / -xdev > /dev/null
0.10user 1.87system 0:03.39elapsed 58%CPU (0avgtext+0avgdata 7008maxresident)k
0inputs+0outputs (1major+844minor)pagefaults 0swap

So what you describe is not a systemic problem, but a problem that
your system configuration triggers. That's why we need to know
_exactly_ how your storage subsystem is configured....

Post by dave b
http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
iirc can reproduce this on plain ext3.

You're pointing to a "fsync-tester" program that exercises a
well-known problem with ext3 (sync-the-world-on-fsync). Other
filesystems do not have that design flaw so don't suffer from
interactivity problems under these workloads. As it is, your above
dd workload example is not related to this fsync problem, either.

This is what I'm trying to point out - you need to describe in
significant detail your setup and what your applications are doing
so we can identify if you are seeing a known problem or not. If you
are seeing problems as a result of the above ext3 fsync problem,
then the simple answer is "don't use ext3".

Cheers,

Dave.

dave b

2010-11-07 06:07:04 UTC

Could be anything from that description....

Post by dave b
If you want to do a simple test do an unlimited dd (or two dd's of a
limited size, say 10gb) and a find /
Tell me how it goes :)

The find runs at IO latency speed while the dd processes run at disk
bandwidth:

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
vda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00
vdb        0.00    0.00   58.00  1251.00   0.45  556.54    871.45     26.69  20.39   0.72  94.32
sda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00

That looks pretty normal to me for XFS and the noop IO scheduler,
and there are no signs of latency or interactive problems in
the system at all. Kill the dd's and:

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
vda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00
vdb        0.00    0.00  214.80     0.40   1.68    0.00     15.99      0.33   1.54   1.54  33.12
sda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00

And the find runs 3-4x faster, but ~200 iops is about the limit
I'd expect from 7200rpm SATA drives given a single thread issuing IO
(i.e. 5ms average seek time).

Post by dave b
( the system will stall)

No, the system doesn't stall at all. It runs just fine. Sure,
anything that requires IO on the loaded filesystem is _slower_, but
if you're writing huge files to it that's pretty much expected. The
root drive (on a different spindle) is still perfectly responsive on
a cold cache:

$ sudo time find / -xdev > /dev/null
0.10user 1.87system 0:03.39elapsed 58%CPU (0avgtext+0avgdata 7008maxresident)k
0inputs+0outputs (1major+844minor)pagefaults 0swap

So what you describe is not a systemic problem, but a problem that
your system configuration triggers. That's why we need to know
_exactly_ how your storage subsystem is configured....

Post by dave b
http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
iirc can reproduce this on plain ext3.

You're pointing to a "fsync-tester" program that exercises a
well-known problem with ext3 (sync-the-world-on-fsync). Other
filesystems do not have that design flaw so don't suffer from
interactivity problems under these workloads. As it is, your above
dd workload example is not related to this fsync problem, either.
This is what I'm trying to point out - you need to describe in
significant detail your setup and what your applications are doing
so we can identify if you are seeing a known problem or not. If you
are seeing problems as a result of the above ext3 fsync problem,
then the simple answer is "don't use ext3".

Thank you for your reply.
Well I am not sure :)
Is the answer "don't use ext3" ?
If it is what should I really be using instead?

Jens Axboe

2010-11-07 12:08:39 UTC

Post by dave b
I have come to think that this problem is the kernel not keeping
track of readers vs writers properly, or not giving reading processes
as much time as the writing ones, which look like they are blocking
the system....
If you want to do a simple test do an unlimited dd (or two dd's of a
limited size, say 10gb) and a find /
Tell me how it goes :) ( the system will stall)
(obviously stop the dd after some time :) ).
http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
iirc can reproduce this on plain ext3.

As already mentioned, ext3 is just not a good choice for this sort of
thing. Did you have atimes enabled?

--
Jens Axboe


Linus Torvalds

2010-11-07 15:51:23 UTC

Post by Jens Axboe
As already mentioned, ext3 is just not a good choice for this sort of
thing. Did you have atimes enabled?

At least for ext3, more important than atimes is the "data=writeback"
setting. Especially since our atime default is sane these days (ie if
you don't specify anything, we end up using 'relatime').

If you compile your own kernel, answer "N" to the question

Default to 'data=ordered' in ext3?

at config time (CONFIG_EXT3_DEFAULTS_TO_ORDERED), or you can make sure
"data=writeback" is in the fstab (but I don't think everything honors
it for the root filesystem).
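
To check which mode a filesystem actually ended up in, one can look for an explicit data= mount option and, failing that, at the kernel log. A small sketch, assuming the "mounted filesystem with ... data mode" wording that ext3/ext4 print at mount time; both helper names are invented:

```python
import re

def data_mode_from_options(options):
    """Return the explicit data= mode in a mount option string, or None."""
    for opt in options.split(","):
        if opt.startswith("data="):
            return opt[len("data="):]
    return None  # not specified: the kernel's compiled-in default applies

_DMESG = re.compile(r"EXT[34]-fs.*mounted filesystem with (\w+) data mode")

def data_mode_from_dmesg(line):
    """Extract the mode from a kernel log line, or None if it does not match."""
    match = _DMESG.search(line)
    return match.group(1) if match else None

print(data_mode_from_options("rw,noatime,data=writeback"))   # writeback
print(data_mode_from_options("rw,commit=0"))                  # None
```

When no data= option is given, only the dmesg line tells you what the compiled-in default turned out to be.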

Linus

Dave Chinner

2010-11-10 01:35:14 UTC

Post by Jens Axboe
As already mentioned, ext3 is just not a good choice for this sort of
thing. Did you have atimes enabled?

Don't forget to mention data=writeback is not the default because if
your system crashes or you lose power running in this mode it will
*CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention
the significant security issues (e.g stale data exposure) that also
occur even if the filesystem is not corrupted by the crash. IOWs,
data=writeback is the "fast but I'll eat your data" option for ext3.

So I recommend that nobody follows this path because it only leads
to worse trouble down the road. Your best bet is to migrate away
from ext3 to a filesystem that doesn't have such inherent ordering
problems like ext4 or XFS....

Cheers,

Dave.

dave b

2010-11-10 02:02:21 UTC

OK, so all of us on ext3 who want to avoid these problems should just
up and move to ext4? ^ ^

Evgeniy Ivanov

2010-11-10 08:08:26 UTC

Post by Dave Chinner
Don't forget to mention data=writeback is not the default because if
your system crashes or you lose power running in this mode it will
*CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention
the significant security issues (e.g stale data exposure) that also
occur even if the filesystem is not corrupted by the crash. IOWs,
data=writeback is the "fast but I'll eat your data" option for ext3.
So I recommend that nobody follows this path because it only leads
to worse trouble down the road. Your best bet is to migrate away
from ext3 to a filesystem that doesn't have such inherent ordering
problems like ext4 or XFS....

Is it safe to use "data=writeback" with ext4? At least, are there
security issues?
Why do you say that the fs can be corrupted? Metadata is still
journalled, so only data might be corrupted, but the FS should still be
consistent.

--
Evgeniy Ivanov

Dave Chinner

2010-11-10 08:26:19 UTC

Post by Evgeniy Ivanov

Is it safe to use "data=writeback" with ext4?

I believe the same issues exist with data=writeback in ext4, but you
probably should have an ext4 developer answer that question for
certain.

Post by Evgeniy Ivanov
At least, are there security issues?
Why do you say that the fs can be corrupted? Metadata is still
journalled, so only data might be corrupted, but the FS should still be
consistent.

Data corruption is still a filesystem corruption.

Cheers,

Dave.

Pavel Machek

2010-11-10 14:22:33 UTC

Hi!

Data corruption is still a filesystem corruption.

As far as I understand, apps should not expect anything unless they
use fsync(). And fsync() still works in ext3...

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Pavel Machek

2010-11-10 14:20:51 UTC

Hi!

Post by Jens Axboe
As already mentioned, ext3 is just not a good choice for this sort of
thing. Did you have atimes enabled?

You will lose your data, but the filesystem should still be
consistent, right? Metadata are still journaled.

Post by Dave Chinner
the significant security issues (e.g stale data exposure) that also
occur even if the filesystem is not corrupted by the crash. IOWs,

I agree on security issues.
Pavel

Ingo Molnar

2010-11-10 14:28:02 UTC

Hi!

Post by Jens Axboe
As already mentioned, ext3 is just not a good choice for this sort of
thing. Did you have atimes enabled?

You will lose your data, but the filesystem should still be consistent, right?
Metadata are still journaled.

That is data that was freshly touched around the time the system went down, right?

I.e. data that was probably half-modified by user-space to begin with.

Ingo

Christoph Hellwig

2010-11-10 14:55:52 UTC

Post by Ingo Molnar
That is data that was freshly touched around the time the system went down, right?
I.e. data that was probably half-modified by user-space to begin with.

It's data that wasn't synced out yet, yes. Which isn't the problem per
se. With ext3/4 in ordered mode, or xfs, or btrfs the file size won't
be incremented until the data is written. In ext3/4 in writeback mode
(or various non-journaling filesystems), however, the inode size is
updated and metadata changes are logged even if the data has not been
written yet. Besides exposing stale data, which is a security risk in
multi-user systems, it also means the inode looks modified (by size
and timestamps) but contains data other than what was actually written.
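
To make the difference concrete, here is a deliberately tiny model of the two orderings. It is not how ext3 is implemented: the function, the single-block "disk" and the byte strings are all invented for illustration, and writeback mode is shown only in its unlucky ordering (it enforces no order, so the safe order can happen too).

```python
def contents_after_crash(mode, stale=b"SECRET", new=b"hello!"):
    """Toy model: append `new` to an empty file whose freshly allocated
    block still holds `stale` bytes, then crash after the first of the
    two updates has reached the disk. Returns what a reader sees."""
    if mode == "ordered":
        steps = ["data", "size"]      # data must hit disk before the size update
    elif mode == "writeback":
        steps = ["size", "data"]      # no ordering enforced; this is the bad case
    else:
        raise ValueError(f"unknown mode: {mode}")

    block, size = stale, 0            # on-disk state before the append
    first = steps[0]                  # only the first step survives the crash
    if first == "data":
        block = new
    else:
        size = len(new)
    return block[:size]

print(contents_after_crash("ordered"))    # b'' (new data lost, nothing exposed)
print(contents_after_crash("writeback"))  # b'SECRET' (stale bytes visible)
```

In both cases the freshly written data is gone; the difference is only whether the file claims a length that is backed by somebody else's old bytes.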


Pavel Machek

2010-11-10 19:09:47 UTC

Hi!

Post by Christoph Hellwig

Post by Ingo Molnar
That is data that was freshly touched around the time the system went down, right?
I.e. data that was probably half-modified by user-space to begin with.

Well, afaict that's traditional unix behaviour... while it is not user
friendly, I'd not call it 'corrupted filesystem'.
Pavel

Theodore Tso

2010-11-10 14:34:14 UTC

This is strictly speaking not true. Using data=writeback will not
cause you to lose any data --- at least, not any more than you would
without the feature. If you have applications that write files in an
unsafe way, that data is going to be lost, one way or another (i.e.,
with XFS in a similar situation you'll get a zero-length file). The
difference is that in the case of a system crash, there may be
unwritten data revealed if you use data=writeback. This could be a
security exposure, especially if you are using your system as a
time-sharing system, where you could see the contents of deleted
files belonging to another user.

So it is not an "eat your data" situation, but rather a "possibly
expose old data" one. Whether or not you care in a single-user
workstation situation is an individual judgement call. There's been
a lot of controversy about this.

The chance that this occurs using data=writeback in ext4 is much
less, BTW, because with delayed allocation we delay updating the
inode until right before we write the block. I have a plan for
changing things so that we write the data blocks *first* and then
update the metadata blocks second, which will mean that ext4
data=ordered will go away entirely, and we'll get the safety as well
as avoiding the forced data page writeouts during journal commits.

-- Ted


Christoph Hellwig

2010-11-10 14:57:26 UTC

Post by Theodore Tso
The chance that this occurs using data=writeback in ext4 is much
less, BTW, because with delayed allocation we delay updating the
inode until right before we write the block. I have a plan for
changing things so that we write the data blocks *first* and then
update the metadata blocks second, which will mean that ext4
data=ordered will go away entirely, and we'll get the safety as well
as avoiding the forced data page writeouts during journal commits.

That's the scheme used by XFS and btrfs in one form or another. Chris
also had a patch to implement it for ext3, which unfortunately fell
under the floor.


Chris Mason

2010-11-10 15:03:42 UTC

Post by Christoph Hellwig

That's the scheme used by XFS and btrfs in one form or another. Chris
also had a patch to implement it for ext3, which unfortunately fell
under the floor.

It probably still applies, but by the time I had it stable I realized
that ext4 was really a better place to fix this stuff. ext3 is what it
is (good and bad), and a big change like my data=guarded code probably
isn't the best way to help.

-chris

Dave Chinner

2010-11-10 23:38:29 UTC

Post by Theodore Tso

In theory, that's all that is _supposed_ to happen. However, my
recent experience is that massive ext3 filesystem corruption occurs
in data=writeback mode when the system crashes and that does not
happen in ordered mode.

Why do you think I posted the patches to change the default back to
ordered mode a few months back? I basically trashed the root ext3
partitions on three test machines (to the point where >5000 files
across /sbin, /bin, /lib and /usr were corrupted or missing and I
had to reinstall from scratch) when I'd forgotten to set the
ordered-is-default config option in the kernel I was testing. And
that is when the only thing being written to the root filesystems
was log files...

The worst part about this was that I also had ext3 filesystems
corrupted by crashes in such a way that e2fsck didn't detect it but
they would repeatedly trigger kernel crashes at runtime....

Post by Theodore Tso
So it is not an "eat your data" situation,

My experience says otherwise....

Cheers,

Dave.

Linus Torvalds

2010-11-10 16:05:17 UTC

You will lose data even with data=ordered. All the data that didn't
get logged before the crash is lost anyway.

So your argument is kind of dishonest. The thing is, if you have a
crash or power outage or whatever, the only data you can really rely
on is always going to be the data that you fsync'ed before the crash.
Everything else is just gravy.
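
In application terms, "the data that you fsync'ed" usually means something like the write-then-rename pattern below. This is a common idiom sketched under the assumption of a POSIX system; durable_write and the ".tmp" suffix are hypothetical names, not anything from this thread.

```python
import os

def durable_write(path, data):
    """Replace `path` with `data` so that after a crash the file holds
    either the complete old contents or the complete new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()                 # push Python's buffer to the kernel
        os.fsync(f.fileno())      # push the kernel's copy to the disk
    os.replace(tmp_path, path)    # atomic rename over the old file
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)          # make the rename itself durable
    finally:
        os.close(dir_fd)
```

On ext3 with data=ordered, that fsync() is exactly the call that can stall behind everyone else's dirty data, which is the latency cost being weighed against the guarantees here.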

Are there downsides to "data=writeback"? Absolutely. But anybody who
tries to push those downsides without taking the performance and
latency issues into account is just not thinking straight.

Too many people think that "correct" is somehow black-and-white. It's
not. "The correct answer too late" is not worth anything. Sane people
understand that "good enough" is important.

And quite frankly, "data=writeback" is not wonderful, but it's "good
enough". And it helps enormously with at least one class of serious
performance problems. Dismissing it because it doesn't have quite the
guarantees of "data=ordered" is like saying that you should never use
"pi=3.14" for any calculations because it's not as exact as
"pi=3.14159265". The thing is, for many things, three significant
digits (or even _one_ significant digit) is plenty.

ext3 [f]sync sucks. We know. All filesystems suck. They just tend to
do it in different dimensions.

Linus

Alexey Dobriyan

2010-11-10 16:46:30 UTC

On Wed, Nov 10, 2010 at 5:59 PM, Linus Torvalds

You will lose data even with data=ordered. All the data that didn't
get logged before the crash is lost anyway.

Linus, are you using with data=writeback?

Those of us, who did (without UPS), will never do it again.

Probability of non-trivial FS corruption becomes so much bigger.
From my experience, I believe the average number of crashes before
one loses the FS becomes a single-digit number.

With data=ordered, it's quite hard.

Linus Torvalds

2010-11-10 17:01:45 UTC

Post by Alexey Dobriyan

Post by Linus Torvalds
You will lose data even with data=ordered. All the data that didn't
get logged before the crash is lost anyway.

Linus, are you using with data=writeback?

I used to, indeed. But since I upgrade computers fairly regularly, and
all the distros have moved towards ext4, I'm no longer using ext3 at
all.

But yes, to me ext3 was totally unusable with rotational media and
"data=ordered". Not just bad. Total crap. Whenever the mail client
wanted to write something out, the whole machine basically stopped.

Of course, part of that was that long ago I used reiserfs back when
SuSE had it as the default. So I didn't think that the hiccups were
"normal" like a lot of people probably do. I knew better. So it was
"bad latency, and I know it's the filesystem that is total crap".

Post by Alexey Dobriyan
Those of us, who did (without UPS), will never do it again.

Before or after the change to make renaming on top of old files do the
IO flushing?

That made a big difference for some rather common cases.

Linus

Alexey Dobriyan

2010-11-10 17:10:24 UTC

On Wed, Nov 10, 2010 at 6:55 PM, Linus Torvalds

Post by Alexey Dobriyan
Those of us, who did (without UPS), will never do it again.

Before or after the change to make renaming on top of old files do the
IO flushing?

It was long ago, so before patch.

Post by Linus Torvalds
That made a big difference for some rather common cases.

That's good.
Maybe it's only an order of magnitude more likely to lose the FS now, instead of several.
:-)

Mark Lord

2010-11-10 18:55:20 UTC

Post by Alexey Dobriyan
On Wed, Nov 10, 2010 at 6:55 PM, Linus Torvalds

Post by Alexey Dobriyan
Those of us, who did (without UPS), will never do it again.

I've used ext2 and ext3 extensively on all of the boxes here,
ever since each first became available. I developed Linux IDE,
the first IDE DMA, lots of custom storage drivers, and more recently
worked on libata drivers. This meant a LOT of sudden and catastrophic
system failures, as the bugs and other kinks were worked on.

Never lost a nibble. Totally, utterly reliable stuff for everyday use.
*WITH* the write-caches all enabled on all of the drives, too.

Sure, sudden power-failures could have a better chance of corrupting data,
but those are really rare, and the few that have happened were again non-events
here.

That's the difference between theory and practice.

Cheers
-ml

Mike Galbraith

2010-11-10 18:27:58 UTC

Post by Alexey Dobriyan
On Wed, Nov 10, 2010 at 5:59 PM, Linus Torvalds

You will lose data even with data=ordered. All the data that didn't
get logged before the crash is lost anyway.

Linus, are you using with data=writeback?
Those of us, who did (without UPS), will never do it again.

I've been using it for a looong time on my desktop box. Yeah, you can
be bitten easier than ordered, and I have been, but it's never been
anything major. The risk for me is worth it, as data=ordered sucked
really bad.

If I didn't need to maintain compatibility with 30+ old kernels for
regression testing, I'd upgrade desktop to ext4, and likely be happy.

Post by Alexey Dobriyan
Probability of non-trivial FS corruption becomes so much bigger.
From my experience, I believe the average number of crashes before
one loses the FS becomes a single-digit number.

That's not my experience. I've yet to have to rebuild my ext3 fs since
upgrading box to shiny new opensuse 11.1 (however long ago and however
many explosions ago that was;)

Post by Alexey Dobriyan
With data=ordered, it's quite hard.

Dave Chinner

2010-11-10 23:44:51 UTC

I crash kernels tens of times every day doing filesystem testing.
With data=ordered I have not seen a corrupted root filesystem as a
result of normal testing and crashing as long as I can remember.
With data=writeback, I'll have corrupted root ext3 partitions in
under a day. Hardly what I'd call stable or something you'd want
to deploy in production.

Cheers,

Dave.

Arjan van de Ven

2010-11-06 19:11:20 UTC

On Fri, 5 Nov 2010 08:48:13 -0400

Post by Sanjoy Mahajan

Good idea.
The filesystems are all ext3 with default mount parameters. The
dmesgs say that the filesystems are mounted in ordered data mode and
that barriers are not enabled.

btw few more things to try (from my standard rc.local script):

echo 4096 > /sys/block/sda/queue/nr_requests

for i in `pidof kjournald` ; do ionice -c1 -p $i ; done

echo 75 > /proc/sys/vm/dirty_ratio

(replace sda with whatever your disk is of course)
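
The same three tweaks expressed as a dry-run plan, in case someone wants to review them before applying. This is only a sketch: tuning_plan and apply_plan are made-up names, the kjournald PIDs have to be supplied by the caller (e.g. from pidof), and applying the plan needs root.

```python
import subprocess

def tuning_plan(disk="sda", kjournald_pids=()):
    """Return (writes, commands): sysfs/procfs writes as (path, value) pairs
    and ionice invocations as argument lists. Nothing is executed here."""
    writes = [
        (f"/sys/block/{disk}/queue/nr_requests", "4096"),
        ("/proc/sys/vm/dirty_ratio", "75"),
    ]
    # ionice -c1 puts each kjournald thread in the realtime I/O class.
    commands = [["ionice", "-c1", "-p", str(pid)] for pid in kjournald_pids]
    return writes, commands

def apply_plan(writes, commands):
    """Apply the plan for real; needs root."""
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)
    for argv in commands:
        subprocess.run(argv, check=True)

writes, commands = tuning_plan("sdb", [311, 742])   # example disk and PIDs
for path, value in writes:
    print(f"echo {value} > {path}")
```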

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Jesper Juhl

2010-11-07 17:27:51 UTC

Hmmm, the workload you're describing here has two special parts. First
it dramatically overloads the disk, and then it has guis doing things
waiting for the disk.

Some details below.

[***@dragon ~]$ mount
proc on /proc type proc (rw,relatime)
sys on /sys type sysfs (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=10240k,nr_inodes=255749,mode=755)
/dev/disk/by-uuid/61d104a5-4f7b-40ef-a9c8-44ad2765513e on / type ext4 (rw,commit=0)
devpts on /dev/pts type devpts (rw)
shm on /dev/shm type tmpfs (rw,nosuid,nodev)

[***@dragon ~]# hdparm -v /dev/disk/by-uuid/61d104a5-4f7b-40ef-a9c8-44ad2765513e

/dev/disk/by-uuid/61d104a5-4f7b-40ef-a9c8-44ad2765513e:
multcount = 16 (on)
IO_support = 1 (32-bit)
readonly = 0 (off)
readahead = 256 (on)
geometry = 9729/255/63, sectors = 25220160, start = 119644560

[***@dragon ~]# dmesg | grep -i ext4
EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
EXT4-fs (sda4): re-mounted. Opts: (null)
EXT4-fs (sda4): re-mounted. Opts: (null)
EXT4-fs (sda4): re-mounted. Opts: commit=0

The elevator in use is CFQ.

The app that's causing the system to behave this way (the 'pacman' package
manager in Arch Linux) makes a few calls (2-4) to fsync() during its run,
but that's all.
--
Jesper Juhl <***@chaosbits.net> http://www.chaosbits.net/
Plain text mails only, please http://www.expita.com/nomime.html
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html


Evgeniy Ivanov

2010-11-09 19:53:50 UTC

I have almost the same problem (the system becomes less interactive, but it does not freeze).
Here are tests I use (written by Alexander Nekrasov):
logrotate.sh (hard writer): http://pastebin.com/PPnSvP2f
writetest (small writer): http://pastebin.com/616JvWEK

If you run "writetest 15 realtime", the timings will be OK, but if you also
run "logrotate.sh 300 3" you will see the RT processes start thrashing
(timings periodically increase from 50ms to 2000-4000ms).
I ran the tests on 2.6.31, but the same happens on 2.6.36. CFQ with default
settings is used. I've played with page-background.c and noticed that
writeback still works for RT processes (no write-through/disk wait). I
even tried increasing dirty_ratio for RT processes. I've also limited
the memory consumed by dd (logrotate.sh), since I had a situation where it
consumed too much and the kernel started to reclaim pages.

It doesn't work well on ext3 (compiled and mounted as Linus
suggested in this thread), but works fine on ext4 with
"data=writeback" and on XFS. I'm not sure whether this means the problem is in
ext3 and in journaling (in the case of ext4 without data=writeback).
I'm not sure if "data=writeback" (makes ext4 journaling similar to
XFS) really fixes the problem; it probably just increases FS bandwidth, so
we don't see the problem, but it may still be present.


Christoph Hellwig

2010-11-09 20:21:00 UTC

Post by Evgeniy Ivanov
I'm not sure if "data=writeback" (makes ext4 journaling similar to
XFS) really fixes the problem

It doesn't. XFS does not expose stale data after a crash, while ext3/4
data=writeback does.


Chris Mason

2010-11-09 21:23:19 UTC

[ the disks are slow for me too!!!!!!!!!!!!!! ]
I think anyone reporting an interactivity problem also needs to
indicate what their filesystem is, what mount parameters they are
using, what their storage config is, whether barriers are active or
not, what elevator they are using, whether one or more of the
applications are issuing fsync() or sync() calls, and so on.
Basically, what we need to know is whether these problems are
isolated to a particular filesystem or storage type because
they may simply be known problems (e.g. the ext3 fsync-the-world
problem).
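
Part of the checklist above (filesystem type and mount parameters) can be read from /proc/mounts, whose fields are device, mount point, filesystem type and options. A minimal sketch, assuming a POSIX shell with awk: fs_report is a hypothetical helper name, and the optional second argument is only there so a saved copy of the mounts table can be inspected instead of the live one.

```shell
# Hypothetical helper (not from the thread): report device, filesystem type
# and mount options for one mount point.
# Usage: fs_report MOUNTPOINT [MOUNTS_FILE]   (default: /proc/mounts)
fs_report() {
    mnt="$1"
    mounts="${2:-/proc/mounts}"
    # Fields in the mounts table: device, mount point, fstype, options, ...
    awk -v m="$mnt" '$2 == m {
        printf "device=%s fstype=%s options=%s\n", $1, $3, $4
    }' "$mounts"
}
```

Running "fs_report /" covers the first two items. The elevator, barrier state and fsync() behaviour of the applications still have to be reported separately.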

latencytop does help quite a lot in nailing down why we're waiting on
the disk, but the interface doesn't lend itself very well to remote
debugging. We end up asking for screen shots that may or may not really
nail down what is going on.

I've got a patch that adds latencytop -c, which you use like this:

latencytop -c >& out

It spits out latency info for all the procs every 10 seconds or so,
along with a short stack trace that often helps figure things out.

The patch is below and works properly with the current latencytop
git. If some of the people hitting bad latencies could try it, it might
help narrow things down.

From: Chris Mason <***@oracle.com>
Subject: [PATCH] Add latencytop -c to dump process information to the console

This adds something similar to vmstat 1 to latencytop, where
it simply does a text dump of all the process latency information
to the console every 10 seconds. Back traces are included in the
dump.

Signed-off-by: Chris Mason <***@oracle.com>
---
src/Makefile | 2 +-
src/latencytop.c | 38 +++++++---
src/latencytop.h | 1 +
src/text_dump.c | 199 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 227 insertions(+), 13 deletions(-)
create mode 100644 src/text_dump.c

diff --git a/src/Makefile b/src/Makefile
index de24551..1ff9740 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -6,7 +6,7 @@ SBINDIR = /usr/sbin
XCFLAGS = -W -g `pkg-config --cflags glib-2.0` -D_FORTIFY_SOURCE=2 -Wno-sign-compare
LDF = -Wl,--as-needed `pkg-config --libs glib-2.0` -lncursesw

-OBJS= latencytop.o text_display.o translate.o fsync.o
+OBJS= latencytop.o text_display.o text_dump.o translate.o fsync.o

ifdef HAS_GTK_GUI
XCFLAGS += `pkg-config --cflags gtk+-2.0` -DHAS_GTK_GUI
diff --git a/src/latencytop.c b/src/latencytop.c
index f516f53..fe252d0 100644
--- a/src/latencytop.c
+++ b/src/latencytop.c
@@ -111,6 +111,10 @@ static void fixup_reason(struct latency_line *line, char *c)
*(c2++) = 0;
} else
strncpy(line->reason, c2, 1024);
+
+ c2 = strchr(line->reason, '\n');
+ if (c2)
+ *c2=0;
}

void parse_global_list(void)
@@ -538,19 +542,13 @@ static void cleanup_sysctl(void)
int main(int argc, char **argv)
{
int i, use_gtk = 0;
+ int console_dump = 0;

enable_sysctl();
enable_fsync_tracer();
atexit(cleanup_sysctl);

-#ifdef HAS_GTK_GUI
- if (preinitialize_gtk_ui(&argc, &argv))
- use_gtk = 1;
-#endif
- if (!use_gtk)
- preinitialize_text_ui(&argc, &argv);
-
- for (i = 1; i < argc; i++)
+ for (i = 1; i < argc; i++) {
if (strcmp(argv[i],"-d") == 0) {
init_translations("latencytop.trans");
parse_global_list();
@@ -558,6 +556,17 @@ int main(int argc, char **argv)
dump_global_to_console();
return EXIT_SUCCESS;
}
+ if (strcmp(argv[i],"-c") == 0)
+ console_dump = 1;
+ }
+
+#ifdef HAS_GTK_GUI
+ if (!console_dump && preinitialize_gtk_ui(&argc, &argv))
+ use_gtk = 1;
+#endif
+ if (!console_dump && !use_gtk)
+ preinitialize_text_ui(&argc, &argv);
+
for (i = 1; i < argc; i++)
if (strcmp(argv[i], "--unknown") == 0) {
noui = 1;
@@ -579,12 +588,17 @@ int main(int argc, char **argv)
sleep(5);
fprintf(stderr, ".");
}
+
+ if (console_dump) {
+ start_text_dump();
+ } else {
#ifdef HAS_GTK_GUI
- if (use_gtk)
- start_gtk_ui();
- else
+ if (use_gtk)
+ start_gtk_ui();
+ else
#endif
- start_text_ui();
+ start_text_ui();
+ }

prune_unused_procs();
delete_list();
diff --git a/src/latencytop.h b/src/latencytop.h
index 79775ac..f3e0934 100644
--- a/src/latencytop.h
+++ b/src/latencytop.h
@@ -50,6 +50,7 @@ extern void start_gtk_ui(void);

extern void preinitialize_text_ui(int *argc, char ***argv);
extern void start_text_ui(void);
+extern void start_text_dump(void);

extern char *translate(char *line);
extern void init_translations(char *filename);
diff --git a/src/text_dump.c b/src/text_dump.c
new file mode 100644
index 0000000..76fc7b1
--- /dev/null
+++ b/src/text_dump.c
@@ -0,0 +1,199 @@
+/*
+ * Copyright 2008, Intel Corporation
+ *
+ * This file is part of LatencyTOP
+ *
+ * This program file is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program in a file named COPYING; if not, write to the
+ * Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301 USA
+ *
+ * Authors:
+ * Arjan van de Ven <***@linux.intel.com>
+ * Chris Mason <***@oracle.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/time.h>
+#include <dirent.h>
+#include <time.h>
+#include <wchar.h>
+#include <ctype.h>
+
+#include <glib.h>
+
+#include "latencytop.h"
+
+static GList *cursor_e = NULL;
+static int done = 0;
+
+static void print_global_list(void)
+{
+	GList *item;
+	struct latency_line *line;
+	int i = 1;
+
+	printf("Globals: Cause Maximum Percentage\n");
+	item = g_list_first(lines);
+	while (item && i < 10) {
+		line = item->data;
+		item = g_list_next(item);
+
+		if (line->max*0.001 < 0.1)
+			continue;
+		printf("%s", line->reason);
+		printf("\t%5.1f msec %5.1f %%\n",
+			line->max * 0.001,
+			(line->time * 100 +0.0001) / total_time);
+		i++;
+	}
+}
+
+static void print_one_backtrace(char *trace)
+{
+	char *p;
+	int pos;
+	int after;
+	int tabs = 0;
+
+	if (!trace || !trace[0])
+		return;
+	pos = 16;
+	while(*trace && *trace == ' ')
+		trace++;
+
+	if (!trace[0])
+		return;
+
+	while(*trace) {
+		p = strchr(trace, ' ');
+		if (p) {
+			pos += p - trace + 1;
+			*p = '\0';
+		}
+		if (!tabs) {
+			/* we haven't printed anything yet */
+			printf("\t\t");
+			tabs = 1;
+		} else if (pos > 79) {
+			/*
+			 * we have printed something our line is going to be
+			 * long
+			 */
+			printf("\n\t\t");
+			pos = 16 + p - trace + 1;
+		}
+		printf("%s ", trace);
+		if (!p)
+			break;
+
+		trace = p + 1;
+		if (trace && pos > 70) {
+			printf("\n");
+			tabs = 0;
+			pos = 16;
+		}
+	}
+	printf("\n");
+}
+
+static void print_procs()
+{
+	struct process *proc;
+	GList *item;
+	double total;
+
+	printf("Process details:\n");
+	item = g_list_first(procs);
+	while (item) {
+		int printit = 0;
+		GList *item2;
+		struct latency_line *line;
+		proc = item->data;
+		item = g_list_next(item);
+
+		total = 0.0;
+
+		item2 = g_list_first(proc->latencies);
+		while (item2) {
+			line = item2->data;
+			item2 = g_list_next(item2);
+			total = total + line->time;
+		}
+		item2 = g_list_first(proc->latencies);
+		while (item2) {
+			char *p;
+			char *backtrace;
+			line = item2->data;
+			item2 = g_list_next(item2);
+			if (line->max*0.001 < 0.1)
+				continue;
+			if (!printit) {
+				printf("Process %s (%i) ", proc->name, proc->pid);
+				printf("Total: %5.1f msec\n", total*0.001);
+				printit = 1;
+			}
+			printf("\t%s", line->reason);
+			printf("\t%5.1f msec %5.1f %%\n",
+				line->max * 0.001,
+				(line->time * 100 +0.0001) / total
+				);
+			print_one_backtrace(line->backtrace);
+		}
+
+	}
+}
+
+static int done_yet(int time, struct timeval *p1)
+{
+	int seconds;
+	int usecs;
+	struct timeval p2;
+	gettimeofday(&p2, NULL);
+	seconds = p2.tv_sec - p1->tv_sec;
+	usecs = p2.tv_usec - p1->tv_usec;
+
+	usecs += seconds * 1000000;
+	if (usecs > time * 1000000)
+		return 1;
+	return 0;
+}
+
+void signal_func(int foobie)
+{
+	done = 1;
+}
+
+void start_text_dump(void)
+{
+	struct timeval now;
+	struct tm *tm;
+	signal(SIGINT, signal_func);
+	signal(SIGTERM, signal_func);
+
+	while (!done) {
+		gettimeofday(&now, NULL);
+		printf("=============== %s", asctime(localtime(&now.tv_sec)));
+		update_list();
+		print_global_list();
+		print_procs();
+		if (done)
+			break;
+		sleep(10);
+	}
+}
+

--
1.6.5.2

Wu Fengguang

2010-10-31 01:22:41 UTC

Hi Aidar,

QUOTE:***
And yes, we'd very much like to fix such slowdowns via heuristics as
well (detecting large sequential IO and not letting it poison the
existing cache), so good bugreports and reproducing testcases sent to
experimental kernel patches would definitely be welcome.
Thanks,
Ingo
*** http://ask.slashdot.org/story/10/10/23/1828251/The-State-of-Linux-IO-Scheduling-For-the-Desktop#commentlisting
I'll be rather quick & to the point here.
I get & run stable kernels the same day they appear on kernel.org in
hope to get away from these annoying, ignored, neglected slowdowns.
.config attached - I have Lenovo ThinkPad T400, Core2Duo T9400, 4Gb
DDR2, w/integrated GM45 - xf86-video-intel, iwlagn for the intel 5300
wifi, CFS, ext2 for
swap partition - 4Gb, ext3 for boot, ext4 - 400Gb for everything else.

If possible, I'd suggest turning off swap and checking whether it helps.
Some people report(*) desktop responsiveness problems that can be
worked around, poor-man's style, by disabling swap.

(*) https://bugzilla.kernel.org/show_bug.cgi?id=12309

All the hardware I have runs linux natively.
No kernel helped me from the days of 2.6.28.x upto 2.6.36. The dubbed
slowdown fixes never worked for me.

There are multiple causes of slowdown. 2.6.36 includes some of the easy fixes.
The swap problem has been (maybe partly) root-caused(**), but it will need a
rather complex and intrusive patch to fix.

(**) http://www.spinics.net/lists/linux-fsdevel/msg35397.html

Thanks,
Fengguang

The kernel config choices are rather typical : NO_HZ, I don't go crazy for
1000Hz and use 100 or 250Hz and voluntary preemption.
Love choices, hence nothing but Gentoo + KDE4. Multilib. Some relevant
==============================================================================================
emerge --info
Portage 2.1.8.3 (default/linux/amd64/10.0/desktop, gcc-4.5.1,
glibc-2.11.2-r0, 2.6.36 x86_64)
=================================================================
Timestamp of tree: Tue, 26 Oct 2010 10:30:01 +0000
app-shells/bash: 4.1_p7
dev-java/java-config: 2.1.11
dev-lang/python: 2.5.4-r4, 2.6.5-r3, 3.1.2-r4
dev-util/cmake: 2.8.1-r2
sys-apps/baselayout: 1.12.13
sys-apps/sandbox: 2.3-r1
sys-devel/autoconf: 2.13, 2.65-r1
sys-devel/automake: 1.7.9-r1, 1.8.5-r4, 1.9.6-r3, 1.10.3, 1.11.1
sys-devel/binutils: 2.20.1-r1
sys-devel/gcc: 4.5.1
sys-devel/gcc-config: 1.4.1
sys-devel/libtool: 2.2.10
sys-devel/make: 3.81-r2
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe -march=native"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/X11/xkb /usr/share/config /var/lib/hsqldb"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d
/etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf
/etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/
/etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/sandbox.d
/etc/terminfo"
CXXFLAGS="-O2 -pipe -march=native"
==============================================================================================
Now, I know, Ingo said he wants : "good bugreports and reproducing
testcases" and my testcase is very real life and rather replicates my
- VirtualBox running XP only to look at some 2007 ppts ( the Ooo3
doesn't cut it )
- JuK ( or VLC ) KDE's music player - some music in the background
- Chromium browser, with bunch of tabs with J2EE/J2SE javadocs, eats
out some significant swap space
- bash terminals
- ktorrent
- PDFs opened in okular, Adobe reader
- sync'ing portage tree & emerging new ebuilds ( usually with gentoo )
- Netbeans, Eclipse, apache, vsftd, sshd, tomcat and the whole 9 yards.
How do I notice slowdowns ? The JuK lags so badly that it can't play
any music, the mouse pointer freezes, kwin effects freeze for few
seconds.
How can I make it much worse ? I can try & run disk clean up under XP,
that is running in VBox, with folder compression. On top of it if I
start copying big files in linux ( 700MB avis, etc ), GUI effects
freeze, mouse pointer freezes for few seconds.
And this is on 2.6.36 that is supposed to cure these "features". From
this perspective, 2.6.36 is no better than any previous stable kernel
I've tried. Probably as bad with regards to IO issues.
Find attached screenshot ( latencytop_n_powertop.png ) which depicts
artifacts where the window manager froze at the time I was trying to
see a tab in Konsole where the powertop was running.
.dd if=/dev/zero of=test.10g bs=1M count=10000;rm test.10g
.cp /home/ak/1.distr/Linux/openSUSE-11.2-DVD-x86_64.iso
/home/lameruser/;rm /home/lameruser/openSUSE-11.2-DVD-x86_64.iso;
.dd if=/dev/zero of=test.10g bs=1M count=10000;rm test.10g
.cp /home/ak/funeral.avi /home/ak/0.junk/;rm /home/ak/0.junk/funeral.avi
.the XP under VBox was compacting its old files.
the iso is about 4Gb, the avi is about 700Mb
https://bugzilla.kernel.org/show_bug.cgi?id=12309
This is a monumental failure for kernel development project and FLOSS
in general.
Poor management, no leadership/championship, no responsibility, neglect

Wu Fengguang

2010-10-31 01:51:45 UTC

It may also help to lower the dirty ratio.

echo 5 > /proc/sys/vm/dirty_ratio

Memory pressure + heavy write can easily hurt responsiveness.

- heavy writes eat up to 20% of memory (the default value for dirty_ratio)
with dirty pages, which increases memory pressure and the amount of swap IO

- the file copy makes the device write-congested, so pageout() easily
blocks in get_request_wait()

As a result, every application may be slowed down by the heavy swap IO
on page faults, and may also block when allocating memory (which
may go into direct reclaim and then call pageout()).
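
To put rough numbers on this for the 4 GB machine in the original report, the sketch below multiplies total memory by the ratio. This is a simplification and an assumption on my part: the kernel applies dirty_ratio to available (free plus reclaimable) memory, not to total RAM, so the real ceiling is lower. dirty_limit_mb is a hypothetical helper name.

```shell
# Hypothetical helper: approximate ceiling on dirty page cache, in MB.
# Usage: dirty_limit_mb TOTAL_MB RATIO_PERCENT
# Simplification: uses total RAM, whereas the kernel uses available memory.
dirty_limit_mb() {
    echo $(( $1 * $2 / 100 ))
}

dirty_limit_mb 4096 20   # default dirty_ratio: prints 819
dirty_limit_mb 4096 5    # suggested dirty_ratio: prints 204
```

On these assumptions, lowering dirty_ratio from 20 to 5 cuts the worst-case dirty memory from roughly 800 MB to roughly 200 MB.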

Thanks,
Fengguang

Dimitrios Apostolou

2010-11-01 20:25:14 UTC

Hello,

Post by Wu Fengguang
It may also help to lower the dirty ratio.
echo 5 > /proc/sys/vm/dirty_ratio
Memory pressure + heavy write can easily hurt responsiveness.
- eats up to 20% (the default value for dirty_ratio) memory with dirty
pages and hence increase the memory pressure and number of swap IO

My experience with that has been different. Wouldn't it make more sense
to _increase_ dirty_ratio (to 50, let's say) and at the same time decrease
dirty_background_ratio? That way writing to disk starts early, but the
related apps stall waiting for I/O only when dirty_ratio is reached.
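
The gap between the two thresholds can be sketched the same way. The 50 is the value proposed above. The background value of 5 is an assumption chosen only for illustration, since the message names no number, and total RAM again stands in for the available memory the kernel really uses. writeback_window_mb is a hypothetical helper name.

```shell
# Hypothetical helper: show where background writeback would start, where
# writers would be throttled, and the room between the two, in MB.
# Usage: writeback_window_mb TOTAL_MB BACKGROUND_RATIO DIRTY_RATIO
writeback_window_mb() {
    bg=$(( $1 * $2 / 100 ))
    hard=$(( $1 * $3 / 100 ))
    echo "background writeback near ${bg} MB, writers throttled near ${hard} MB, window $(( hard - bg )) MB"
}

writeback_window_mb 4096 5 50
```

A wider window gives the flusher more time to drain dirty pages before writers stall, at the cost of more memory held by dirty pages, which is the trade-off being debated here.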

Thanks,
Dimitris


Wu Fengguang

2010-11-02 01:20:17 UTC