Discussion:
Heavy write IO for no apparent reason
Peter Wood
2013-01-17 03:25:33 UTC
Permalink
I'm running OpenIndiana 151a5 on two identical Supermicro boxes (hardware
details at the bottom of the message). Pretty clean installation. No extra
services running. Main purpose is to be storage servers and nothing else.
No dedup.

Today I started migrating file systems from some old OpenSolaris servers
to these Supermicro boxes and noticed the transfer to one of them was going
10x slower than to the other one (about 10GB/hour).
I was using zfs send/receive over ssh.
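For the record, the transfer is essentially the following (the dataset and
host names below are placeholders, not the real ones):

# zfs snapshot tank/somefs@migrate1
# zfs send tank/somefs@migrate1 | ssh newbox zfs receive -F tank/somefs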

This server is doing pretty much nothing. It exports 3 file systems via
NFS, and only one of them is somewhat in use.

Running zpool iostat -v (attachment zpool-IOStat.png) shows 1.22K write
operations on the drives and 661 on the ZIL. Compared to the other server
(which is in way heavier use than this one) these numbers are extremely high.

Any idea how to debug any further?

Supermicro SC847
MB X9DRH-iF
Xeon E5-2620 2GHz 6-Core
LSI SAS9211-8i HBA
32GB RAM
Storage pool drives: 3TB Hitachi SAS 7.2K
ZIL mirror and L2ARC: 80GB Intel SSD SATA 3Gb/s

Thank you,

-- Peter
Bob Friesenhahn
2013-01-17 03:49:04 UTC
Permalink
Running zpool iostat -v (attachment zpool-IOStat.png) shows 1.22K write operations on the drives and 661 on the
ZIL. Compared to the other server (which is in way heavier use than this one) these numbers are extremely high.
Any idea how to debug any further?
Do some filesystems contain many snapshots? Do some filesystems use
small zfs block sizes? Have the servers been used in the same way?
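For a first check, something along these lines on each box would show it
(the dataset name below is only a placeholder):

# zfs list -H -t snapshot | wc -l          (rough snapshot count)
# zfs get recordsize yourpool/yourfs       (zfs block size in use)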

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Peter Wood
2013-01-17 23:33:06 UTC
Permalink
I have a script that rotates hourly, daily and monthly snapshots. Each
filesystem has about 40 snapshots (zfsList.png: output of 'zfs list | grep
-v home/'; the home directory datasets are snipped from the output, 4
users in total).

I noticed that the hourly snapshots on the most heavily used filesystem are
about 1.2GB in size, whereas on the other system the regular NFS-exported
filesystem has about 60MB snapshots (gallerySnapshots.png: output of
'zfs list -t snapshot -r pool01/utils/gallery').
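To see which snapshots are actually the big ones I'm sorting them by space
used, roughly like this (I believe '-s used' is supported on 151a5):

# zfs list -t snapshot -o name,used,refer -s used -r pool01/utils/gallery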

I know that the gallery FS is in heavier use than normal, but I was told it
would be mostly reading, and based on the iostat output it seems there is
heavy writing too.

I guess I'll schedule some downtime and disable the gallery export to see
if that will affect the number of write operations and performance in general.

Unless there is some other way to test what/where these write operations
are applied.

The 'zpool iostat -v' output is uncomfortably static. The values of
read/write operations and bandwidth are the same for hours and even days.
I'd expect at least some variations between morning and night. The load on
the servers is different for sure. Any input?

Thanks,

-- Peter


Bob Friesenhahn
2013-01-17 23:42:39 UTC
Permalink
Unless there is some other way to test what/where these write operations are applied.
You can install Brendan Gregg's DTraceToolkit and use it to find out
who and what is doing all the writing. 1.2GB in an hour is quite a
lot of writing. If this is going on continuously, then it may be causing
more fragmentation in conjunction with your snapshots.

See "http://www.brendangregg.com/dtrace.html".

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Peter Wood
2013-01-17 23:51:36 UTC
Permalink
Thanks Bob. Will do.

I forgot the attachments. Here they are, hopefully bringing more clarity to
the problem.


Jim Klimov
2013-01-18 00:34:31 UTC
Permalink
You can install Brendan Gregg's DTraceToolkit and use it to find out who
and what is doing all the writing. 1.2GB in an hour is quite a lot of
writing. If this is going on continuously, then it may be causing more
fragmentation in conjunction with your snapshots.
As a moderately wild guess, since you're speaking of galleries:
are these problematic filesystems read often? By default ZFS
updates the last access time of files it reads, as many other
filesystems do, and this causes avalanches of metadata updates:
likely sync writes, as well as fragmentation. It may also show up
as poorly traceable but considerable "used" space in frequent
snapshots. You can verify (and unset) this behaviour with the
ZFS dataset property "atime", i.e.:

# zfs get atime pond/export/home
NAME              PROPERTY  VALUE  SOURCE
pond/export/home  atime     off    inherited from pond
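and to switch it off (descendant datasets inherit the setting unless they
override it locally):

# zfs set atime=off pond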

On the other hand, check where your software keeps its temporary
files (e.g. during uploads, as may be the case with galleries). Again,
if this is a frequently snapshotted dataset (though 1 hour is not
really that frequent), then needless temp files can be held by
those older snapshots. Moving such temporary work to a different
dataset with a different snapshot schedule, and/or to a different
pool (to keep the related fragmentation contained), may prove useful.
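For example, something along these lines (the dataset name is only an
illustration, and you would also tell your snapshot script to skip it):

# zfs create -o atime=off pool01/uploads-tmp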

HTH,
//Jim Klimov
Peter Wood
2013-01-18 01:05:21 UTC
Permalink
Great points, Jim. I have requested more information on how the gallery
share is being used, and any temporary data will be moved out of there.

About atime: it is set to "on" right now and I've considered turning it off,
but I wasn't sure if this would affect incremental zfs send/receive.

'zfs send -i snapshot0 snapshot1' doesn't rely on the atime, right?
Bob Friesenhahn
2013-01-18 02:24:28 UTC
Permalink
Great points, Jim. I have requested more information on how the gallery share is being used, and any temporary data
will be moved out of there.
About atime: it is set to "on" right now and I've considered turning it off, but I wasn't sure if this would affect
incremental zfs send/receive.
'zfs send -i snapshot0 snapshot1' doesn't rely on the atime, right?
zfs send does not care about atime. The access time is useless other
than as a way to see how long it has been since a file was accessed.

For local access (not true for NFS), ZFS is lazy about updating atime
on disk: it may not be written out until the next transaction group
commits (e.g. up to 5 seconds later), so it does not represent much
actual load. Without this behavior, the system could become unusable.
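If you want to confirm the commit interval on your OpenIndiana box, I
believe the illumos tunable is still called zfs_txg_timeout and can be read
with mdb (treat this as a sketch and check against your release):

# echo zfs_txg_timeout/D | mdb -k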

For NFS you should disable atime on the NFS client mounts.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2013-01-18 02:39:10 UTC
Permalink
Post by Bob Friesenhahn
For NFS you should disable atime on the NFS client mounts.
This advice was wrong. It needs to be done on the server side.
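That is, on the storage server itself, something like this (using your own
pool name; the second command just confirms that the datasets inherit it):

# zfs set atime=off pool01
# zfs get -r atime pool01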

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Timothy Coalson
2013-01-18 00:18:25 UTC
Permalink
Post by Peter Wood
The 'zpool iostat -v' output is uncomfortably static. The values of
read/write operations and bandwidth are the same for hours and even days.
I'd expect at least some variations between morning and night. The load on
the servers is different for sure. Any input?
Without a repetition interval, zpool iostat will print exactly once
and exit, and the output is an average from kernel boot to "now", just like
iostat; this is why it seems so static. If you want to know the activity
over 5-second intervals, use something like "zpool iostat -v 5" (repeat
every 5 seconds) and wait for the second and later blocks; those are
averages from the previous report until "now". I generally use 5-second
intervals to match the 5-second commit interval on my pools.
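For example (substituting your pool name), skipping past the first block:

$ zpool iostat -v pool01 5

or, with a count, to print a dozen 5-second reports and then exit:

$ zpool iostat -v pool01 5 12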

Tim
Peter Wood
2013-01-18 00:53:10 UTC
Permalink
Right on Tim. Thanks. I didn't know that. I'm sure it's documented
somewhere and I should have read it so double thanks for explaining it.
Peter Blajev
2013-01-18 00:48:34 UTC
Permalink
Right on Tim. Thanks. I didn't know that. I'm sure it's documented
somewhere and I should have read it so double thanks for explaining it.


--
Peter Blajev
IT Manager, TAAZ Inc.
Office: 858-597-0512 x125
Freddie Cash
2013-01-18 22:55:54 UTC
Permalink
Post by Peter Wood
Right on Tim. Thanks. I didn't know that. I'm sure it's documented
somewhere and I should have read it so double thanks for explaining it.
When in doubt, always check the man page first:
man zpool

It's listed in the section on the "iostat" sub-command:
zpool iostat [-T d|u] [-v] [pool] ... [interval [count]]

    Displays I/O statistics for the given pools. When given an
    interval, the statistics are printed every interval seconds
    until Ctrl-C is pressed. If no pools are specified, statistics
    for every pool in the system is shown. If count is specified,
    the command exits after count reports are printed.

:D
--
Freddie Cash
***@gmail.com
Timothy Coalson
2013-01-18 23:05:23 UTC
Permalink
Post by Freddie Cash
Post by Peter Wood
Right on Tim. Thanks. I didn't know that. I'm sure it's documented
somewhere and I should have read it so double thanks for explaining it.
man zpool
zpool iostat [-T d|u] [-v] [pool] ... [interval [count]]
    Displays I/O statistics for the given pools. When given an
    interval, the statistics are printed every interval seconds
    until Ctrl-C is pressed. If no pools are specified, statistics
    for every pool in the system is shown. If count is specified,
    the command exits after count reports are printed.
To my eye, that doesn't actually explain what the output is, only how to
get it to repeat. It seems to assume that one is familiar with iostat, and
expects this to work the same way. So, can't really fault someone for
being confused by the output in this case (perhaps the manpage could use
some clarification).

Tim

Ray Arachelian
2013-01-17 16:55:18 UTC
Permalink
Post by Peter Wood
Today I started migrating file systems from some old OpenSolaris
servers to these Supermicro boxes and noticed the transfer to one of
them was going 10x slower than to the other one (about 10GB/hour).
What does "dladm show-link" show? I'm guessing one of your links is at
100mbps or at half duplex.
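On OpenIndiana, something like this should show it (I'm going from memory
on the exact output columns, so treat it as a sketch):

# dladm show-link
# dladm show-phys

show-phys is where I'd expect to see the negotiated speed and duplex.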