Discussion:
Abysmal ISCSI / ZFS Performance
Brian E. Imhoff
2010-02-10 22:06:05 UTC
Permalink
I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN box, and am experiencing absolutely poor / unusable performance.

Where to begin...

The Hardware setup:
Supermicro 4U 24 Drive Bay Chassis
Supermicro X8DT3 Server Motherboard
2x Xeon E5520 Nehalem 2.26 Quad Core CPUs
4GB Memory
Intel EXPI9404PT 4-port gigabit server network card (used for iSCSI traffic only)
Adaptec 52445 28 Port SATA/SAS Raid Controller connected to
24x Western Digital WD1002FBYS 1TB Enterprise drives.

I have configured the 24 drives as single simple volumes in the Adaptec RAID BIOS, and am presenting them to the OS as such.

I then create a zpool using raidz2 across all 24 drives, with 1 as a hot spare:
zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d0

Then create a volume store:
zfs create -o canmount=off tank/volumes

Then create a 10 TB volume to be presented to our file server:
zfs create -V 10TB -o shareiscsi=on tank/volumes/fsrv1data
From here, I discover the iscsi target on our Windows server 2008 R2 File server, and see the disk is attached in Disk Management. I initialize the 10TB disk fine, and begin to quick format it. Here is where I begin to see the poor performance issue. The Quick Format took about 45 minutes. And once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.
I have no clue what I could be doing wrong. To my knowledge, I followed the documentation for setting this up correctly, though I have not looked at any tuning guides beyond the first line, which says you shouldn't need to do any of this because the people who picked these defaults know more about them than you do.

Jumbo frames are enabled on both sides of the iSCSI path, as well as on the switch, and rx/tx buffers are increased to 2048 on both sides as well. I know this is not a hardware / iSCSI network issue. As another test, I installed Openfiler in a similar configuration (using hardware RAID) on this box, and was getting 350-450 MB/s from our file server.

An "iostat -xndz 1" readout of the "%b" column during a file copy to the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2 seconds at 100, and repeats.

Is there anything I need to do to get this usable? Or any additional information I can provide to help solve this problem? As nice as Openfiler is, it doesn't have ZFS, which is necessary to achieve our final goal.
--
This message posted from opensolaris.org
Will Murnane
2010-02-10 22:28:15 UTC
Permalink
Post by Brian E. Imhoff
I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN box, and am experiencing absolutely poor / unusable performance.
zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00
Create several smaller raidz2 vdevs, and consider adding a log device
and/or cache devices. A single raidz2 vdev has about as many IOs per
second as a single disk, which could really hurt iSCSI performance.
zpool create tank raidz2 c1t0d0 c1t1d0 ... \
raidz2 c1t5d0 c1t6d0 ... \
etc
You might try, say, four 5-wide stripes with a spare, a mirrored log
device, and a cache device. More memory wouldn't hurt anything,
either.
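For illustration, a full command along those lines might look like the
following (a sketch only: the device names just continue the c1tNd0
pattern from the original post, and in practice the log mirror and the
cache device would be SSDs rather than three more of the same spindles):

zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
    raidz2 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
    raidz2 c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 \
    raidz2 c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0 \
    spare c1t20d0 \
    log mirror c1t21d0 c1t22d0 \
    cache c1t23d0

Random small-block IOPS scale roughly with the number of top-level
vdevs, so four raidz2 vdevs should give about four times the IOPS of
one wide raidz2.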

Will
David Dyer-Bennet
2010-02-10 22:34:25 UTC
Permalink
Post by Will Murnane
Post by Brian E. Imhoff
I am in the proof-of-concept phase of building a large ZFS/Solaris based
SAN box, and am experiencing absolutely poor / unusable performance.
zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00
Create several smaller raidz2 vdevs, and consider adding a log device
and/or cache devices. A single raidz2 vdev has about as many IOs per
second as a single disk, which could really hurt iSCSI performance.
zpool create tank raidz2 c1t0d0 c1t1d0 ... \
raidz2 c1t5d0 c1t6d0 ... \
etc
You might try, say, four 5-wide stripes with a spare, a mirrored log
device, and a cache device. More memory wouldn't hurt anything,
either.
That's useful general advice for increasing I/O, I think, but he clearly
has something other than a "general" problem. Did you read the numbers he
gave on his iSCSI performance? I don't think that can be explained just by
overly-large RAIDZ groups.
--
David Dyer-Bennet, dd-***@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Frank Cusack
2010-02-10 22:29:25 UTC
Permalink
Post by Brian E. Imhoff
I then, Create a zpool, using raidz2, using all 24 drives, 1 as a
hotspare: zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare
c1t23d00
Well there's one problem anyway. That's going to be horribly slow no
matter what.
Bob Friesenhahn
2010-02-10 22:53:26 UTC
Permalink
Post by Frank Cusack
Post by Brian E. Imhoff
I then, Create a zpool, using raidz2, using all 24 drives, 1 as a
hotspare: zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare
c1t23d00
Well there's one problem anyway. That's going to be horribly slow no
matter what.
The other three commonly mentioned issues are:

- Disable the Nagle algorithm on the Windows clients.

- Set the volume block size so that it matches the client filesystem
block size (default is 128K!).

- Check for an abnormally slow disk drive using 'iostat -xe'.
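For what it's worth, volblocksize can only be set when a zvol is
created, so matching it to the client filesystem means recreating the
volume. A sketch, assuming the Windows server formats the LUN with
NTFS's default 4 KiB clusters (dataset name taken from the original
post; destroying it obviously discards whatever is on the LUN):

zfs destroy tank/volumes/fsrv1data
zfs create -V 10TB -o volblocksize=4K -o shareiscsi=on tank/volumes/fsrv1data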

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Kjetil Torgrim Homme
2010-02-10 23:05:37 UTC
Permalink
Post by Bob Friesenhahn
- Disable the naggle algorithm on the windows clients.
for iSCSI? shouldn't be necessary.
Post by Bob Friesenhahn
- Set the volume block size so that it matches the client filesystem
block size (default is 128K!).
default for a zvol is 8 KiB.
Post by Bob Friesenhahn
- Check for an abnormally slow disk drive using 'iostat -xe'.
his problem is "lazy" ZFS; notice how it gathers up data for 15 seconds
before flushing the data to disk. Tweaking the flush interval down
might help.
Post by Bob Friesenhahn
Post by Brian E. Imhoff
An "iostat -xndz 1" readout of the "%b% coloum during a file copy to
the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
seconds of 100, and repeats.
what are the other values? ie., number of ops and actual amount of data
read/written.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Marc Nicholas
2010-02-10 23:12:35 UTC
Permalink
How does lowering the flush interval help? If he can't ingress data
fast enough, faster flushing is a Bad Thing(tm).

-marc
Post by Kjetil Torgrim Homme
Post by Bob Friesenhahn
- Disable the naggle algorithm on the windows clients.
for iSCSI? shouldn't be necessary.
Post by Bob Friesenhahn
- Set the volume block size so that it matches the client filesystem
block size (default is 128K!).
default for a zvol is 8 KiB.
Post by Bob Friesenhahn
- Check for an abnormally slow disk drive using 'iostat -xe'.
his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds
before flushing the data to disk. tweaking the flush interval down
might help.
Post by Bob Friesenhahn
Post by Brian E. Imhoff
An "iostat -xndz 1" readout of the "%b% coloum during a file copy to
the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
seconds of 100, and repeats.
what are the other values? ie., number of ops and actual amount of data
read/written.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Sent from my mobile device
Brent Jones
2010-02-11 00:05:14 UTC
Permalink
Post by Marc Nicholas
How does lowering the flush interval help? If he can't ingress data
fast enough, faster flushing is a Bad Thibg(tm).
-marc
 - Disable the naggle algorithm on the windows clients.
for iSCSI?  shouldn't be necessary.
 - Set the volume block size so that it matches the client filesystem
   block size (default is 128K!).
default for a zvol is 8 KiB.
 - Check for an abnormally slow disk drive using 'iostat -xe'.
his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds
before flushing the data to disk.  tweaking the flush interval down
might help.
Post by Brian E. Imhoff
An "iostat -xndz 1" readout of the "%b% coloum during a file copy to
the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
seconds of 100, and repeats.
what are the other values?  ie., number of ops and actual amount of data
read/written.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Sent from my mobile device
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
ZIL performance issues? Is writecache enabled on the LUNs?
--
Brent Jones
***@servuhome.net
Marc Nicholas
2010-02-11 00:12:55 UTC
Permalink
This is a Windows box, not a DB that flushes every write.

The drives are capable of over 2000 IOPS (albeit with high latency, as
it's NCQ that gets you there), which would mean, even with sync flushes,
8-9MB/sec.

-marc
Post by Brent Jones
Post by Marc Nicholas
How does lowering the flush interval help? If he can't ingress data
fast enough, faster flushing is a Bad Thibg(tm).
-marc
 - Disable the naggle algorithm on the windows clients.
for iSCSI?  shouldn't be necessary.
 - Set the volume block size so that it matches the client filesystem
   block size (default is 128K!).
default for a zvol is 8 KiB.
 - Check for an abnormally slow disk drive using 'iostat -xe'.
his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds
before flushing the data to disk.  tweaking the flush interval down
might help.
Post by Brian E. Imhoff
An "iostat -xndz 1" readout of the "%b% coloum during a file copy to
the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
seconds of 100, and repeats.
what are the other values?  ie., number of ops and actual amount of data
read/written.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Sent from my mobile device
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
ZIL performance issues? Is writecache enabled on the LUNs?
--
Brent Jones
--
Sent from my mobile device
Kjetil Torgrim Homme
2010-02-11 07:15:07 UTC
Permalink
[please don't top-post, please remove CC's, please trim quotes. it's
really tedious to clean up your post to make it readable.]
Post by Marc Nicholas
Post by Brent Jones
Post by Marc Nicholas
Post by Kjetil Torgrim Homme
his problem is "lazy" ZFS, notice how it gathers up data for 15
seconds before flushing the data to disk.  tweaking the flush
interval down might help.
How does lowering the flush interval help? If he can't ingress data
fast enough, faster flushing is a Bad Thibg(tm).
if network traffic is blocked during the flush, you can experience
back-off on both the TCP and iSCSI level.
Post by Marc Nicholas
Post by Brent Jones
Post by Marc Nicholas
Post by Kjetil Torgrim Homme
what are the other values?  ie., number of ops and actual amount of
data read/written.
this remained unanswered.
Post by Marc Nicholas
Post by Brent Jones
ZIL performance issues? Is writecache enabled on the LUNs?
This is a Windows box, not a DB that flushes every write.
have you checked whether the iSCSI traffic is synchronous or not? I don't
use Windows, but other reports on the list have indicated that at least
the NTFS format operation *is* synchronous. Use zilstat to see.
Post by Marc Nicholas
The drives are capable of over 2000 IOPS (albeit with high latency as
its NCQ that gets you there) which would mean, even with sync flushes,
8-9MB/sec.
2000 IOPS is the aggregate, but the disks are set up as *one* RAID-Z2!
NCQ doesn't help much, since the write operations issued by ZFS are
already ordered correctly.

the OP may also want to try tweaking metaslab_df_free_pct, this helped
linear write performance on our Linux clients a lot:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6869229
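For reference, that tunable lives in the zfs module and can be set in
/etc/system; the value below is only a placeholder, since what is
appropriate is workload-dependent and discussed in the bug report:

set zfs:metaslab_df_free_pct = 4

or poked into a running kernel for a quick test:

echo metaslab_df_free_pct/W0t4 | mdb -kw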
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Brent Jones
2010-02-11 00:46:44 UTC
Permalink
Post by Brent Jones
Post by Marc Nicholas
How does lowering the flush interval help? If he can't ingress data
fast enough, faster flushing is a Bad Thibg(tm).
-marc
 - Disable the naggle algorithm on the windows clients.
for iSCSI?  shouldn't be necessary.
 - Set the volume block size so that it matches the client filesystem
   block size (default is 128K!).
default for a zvol is 8 KiB.
 - Check for an abnormally slow disk drive using 'iostat -xe'.
his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds
before flushing the data to disk.  tweaking the flush interval down
might help.
Post by Brian E. Imhoff
An "iostat -xndz 1" readout of the "%b% coloum during a file copy to
the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
seconds of 100, and repeats.
what are the other values?  ie., number of ops and actual amount of data
read/written.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Sent from my mobile device
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
ZIL performance issues? Is writecache enabled on the LUNs?
--
Brent Jones
Also, are you using rdsk based iSCSI LUNs, or file-based LUNs?
--
Brent Jones
***@servuhome.net
Tim Cook
2010-02-10 22:35:52 UTC
Permalink
Post by Brian E. Imhoff
I am in the proof-of-concept phase of building a large ZFS/Solaris based
SAN box, and am experiencing absolutely poor / unusable performance.
Where to begin...
Supermicro 4U 24 Drive Bay Chassis
Supermicro X8DT3 Server Motherboard
2x Xeon E5520 Nehalem 2.26 Quad Core CPUs
4GB Memory
Intel EXPI9404PT 4 port 1000GB Server Network Card (used for ISCSI traffic only)
Adaptec 52445 28 Port SATA/SAS Raid Controller connected to
24x Western Digital WD1002FBYS 1TB Enterprise drives.
I have configured the 24 drives as single simple volumes in the Adeptec
RAID BIOS , and are presenting them to the OS as such.
zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00
zfs create -o canmount=off tank/volumes
zfs create -V 10TB -o shareiscsi=on tank/volumes/fsrv1data
From here, I discover the iscsi target on our Windows server 2008 R2 File
server, and see the disk is attached in Disk Management. I initialize the
10TB disk fine, and begin to quick format it. Here is where I begin to see
the poor performance issue. The Quick Format took about 45 minutes. And
once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.
I have no clue what I could be doing wrong. To my knowledge, I followed
the documentation for setting this up correctly, though I have not looked at
any tuning guides beyond the first line saying you shouldn't need to do any
of this as the people who picked these defaults know more about it then you.
Jumbo Frames are enabled on both sides of the iscsi path, as well as on the
switch, and rx/tx buffers increased to 2048 on both sides as well. I know
this is not a hardware / iscsi network issue. As another test, I installed
Openfiler in a similar configuration (using hardware raid) on this box, and
was getting 350-450 MB/S from our fileserver,
An "iostat -xndz 1" readout of the "%b% coloum during a file copy to the
LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2 seconds of
100, and repeats.
Is there anything I need to do to get this usable? Or any additional
information I can provide to help solve this problem? As nice as Openfiler
is, it doesn't have ZFS, which is necessary to achieve our final goal.
You're extremely light on RAM for a system with 24TB of storage and two
E5520s. I don't think it's the entire source of your issue, but I'd
strongly suggest considering doubling what you have as a starting point.

What version of opensolaris are you using? Have you considered using
COMSTAR as your iSCSI target?
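If you do try COMSTAR, the rough sequence is something like the
following (a sketch, not a tested recipe: the package name varies
between builds, the GUID placeholder is whatever sbdadm prints, and
shareiscsi=on only applies to the old iscsitgt daemon, so it is no
longer needed):

pkg install SUNWiscsit
svcadm enable -r svc:/network/iscsi/target:default
sbdadm create-lu /dev/zvol/rdsk/tank/volumes/fsrv1data
stmfadm add-view <GUID printed by sbdadm>
itadm create-target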

--Tim
Marc Nicholas
2010-02-10 22:58:44 UTC
Permalink
Definitely use Comstar as Tim says.

At home I'm using 4*WD Caviar Blacks on an AMD Phenom x4 @ 1.Ghz and
only 2GB of RAM. I'm running snv_132. No HBA - onboard SB700 SATA
ports.

I can, with IOmeter, saturate GigE from my WinXP laptop via iSCSI.

Can you toss the RAID controller aside and use motherboard SATA ports
with just a few drives? That could help highlight whether it's the RAID
controller or not, and even one drive has better throughput than you're
seeing.

Cache, ZIL, and vdev tweaks are great - but you're not seeing any of
those bottlenecks, I can assure you.

-marc
On Wed, Feb 10, 2010 at 4:06 PM, Brian E. Imhoff
Post by Brian E. Imhoff
I am in the proof-of-concept phase of building a large ZFS/Solaris based
SAN box, and am experiencing absolutely poor / unusable performance.
Where to begin...
Supermicro 4U 24 Drive Bay Chassis
Supermicro X8DT3 Server Motherboard
2x Xeon E5520 Nehalem 2.26 Quad Core CPUs
4GB Memory
Intel EXPI9404PT 4 port 1000GB Server Network Card (used for ISCSI traffic only)
Adaptec 52445 28 Port SATA/SAS Raid Controller connected to
24x Western Digital WD1002FBYS 1TB Enterprise drives.
I have configured the 24 drives as single simple volumes in the Adeptec
RAID BIOS , and are presenting them to the OS as such.
zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00
zfs create -o canmount=off tank/volumes
zfs create -V 10TB -o shareiscsi=on tank/volumes/fsrv1data
From here, I discover the iscsi target on our Windows server 2008 R2 File
server, and see the disk is attached in Disk Management. I initialize the
10TB disk fine, and begin to quick format it. Here is where I begin to see
the poor performance issue. The Quick Format took about 45 minutes. And
once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.
I have no clue what I could be doing wrong. To my knowledge, I followed
the documentation for setting this up correctly, though I have not looked at
any tuning guides beyond the first line saying you shouldn't need to do any
of this as the people who picked these defaults know more about it then you.
Jumbo Frames are enabled on both sides of the iscsi path, as well as on the
switch, and rx/tx buffers increased to 2048 on both sides as well. I know
this is not a hardware / iscsi network issue. As another test, I installed
Openfiler in a similar configuration (using hardware raid) on this box, and
was getting 350-450 MB/S from our fileserver,
An "iostat -xndz 1" readout of the "%b% coloum during a file copy to the
LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2 seconds of
100, and repeats.
Is there anything I need to do to get this usable? Or any additional
information I can provide to help solve this problem? As nice as Openfiler
is, it doesn't have ZFS, which is necessary to achieve our final goal.
You're extremely light on ram for a system with 24TB of storage and two
E5520's. I don't think it's the entire source of your issue, but I'd
strongly suggest considering doubling what you have as a starting point.
What version of opensolaris are you using? Have you considered using
COMSTAR as your iSCSI target?
--Tim
--
Sent from my mobile device
Peter Tribble
2010-02-15 21:53:51 UTC
Permalink
Post by Brian E. Imhoff
I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN box, and am experiencing absolutely poor / unusable performance.
...
Post by Brian E. Imhoff
From here, I discover the iscsi target on our Windows server 2008 R2 File server, and see the disk is attached in Disk Management.  I initialize the 10TB disk fine, and begin to quick format it.  Here is where I begin to see the poor performance issue.   The Quick Format took about 45 minutes. And once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.
Did you actually make any progress on this?

I've seen exactly the same thing. Basically, terrible transfer rates
with Windows
and the server sitting there completely idle. We had support cases open with
both Sun and Microsoft, which got nowhere.

This seems to me to be more a case of working out where the impedance
mismatch is rather than a straightforward performance issue. In my case
I could saturate the network from a Solaris client, but only maybe 2% from
a Windows box. Yes, tweaking Nagle got us to almost 3%. Still nowhere
near enough to make replacing our FC SAN with X4540s an attractive
proposition.

(And I see that most of the other replies simply asserted that your zfs
configuration was bad, without either having experienced this scenario
or worked out that the actual delivered performance was an order of
magnitude or two short of what even an admittedly sub-optimal configuration
ought to have delivered.)
--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Bob Beverage
2010-02-15 22:33:21 UTC
Permalink
On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff
I've seen exactly the same thing. Basically, terrible
transfer rates
with Windows
and the server sitting there completely idle.
I am also seeing this behaviour. It started somewhere around snv_111, but I am not sure exactly when. I used to get 30-40MB/s transfers over CIFS, but at some point that dropped to roughly 7.5MB/s.
--
This message posted from opensolaris.org
Ragnar Sundblad
2010-02-16 07:34:15 UTC
Permalink
Post by Bob Beverage
On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff
I've seen exactly the same thing. Basically, terrible
transfer rates
with Windows
and the server sitting there completely idle.
I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.
Wasn't zvol changed a while ago from asynchronous to
synchronous? Could that be it?

I don't understand that change at all - of course a zvol, with or
without iSCSI in front of it, should behave exactly like a (not broken)
disk, strictly obeying the protocol for write cache, cache flush, etc.
Having it entirely synchronous is in many cases almost as useless
as having it asynchronous.

Just as zfs itself demands this from its disks, I believe it should
provide the same when used as storage for others. To me it seems that
the zvol+iscsi functionality is not ready for production and needs
more work. If anyone has a better explanation, please share it with me!

I guess a good slog could help a bit, especially if you have a bursty
write load.

/ragge
Richard Elling
2010-02-16 15:41:46 UTC
Permalink
Post by Ragnar Sundblad
Post by Bob Beverage
On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff
I've seen exactly the same thing. Basically, terrible
transfer rates
with Windows
and the server sitting there completely idle.
I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.
Wasn't zvol changed a while ago from asynchronous to
synchronous? Could that be it?
Yes.
Post by Ragnar Sundblad
I don't understand that change at all - of course a zvol with or
without iscsi to access it should behave exactly as a (not broken)
disk, strictly obeying the protocol for write cache. cache flush etc.
Having it entirely synchronous is in many cases almost as useless
as having it asynchronous.
There are two changes at work here, and OpenSolaris 2009.06 is
in the middle of them -- and therefore is at the least optimal spot.
You have the choice of moving to a later build, after b113, which
has the proper fix.
Post by Ragnar Sundblad
Just as much as zfs itself should demands this from it's disks, as it
does, I believe it should provide this itself when used as storage
for others. To me it seems that the zvol+iscsi functionality seems not
ready for production and needs more work. If anyone has any better
explanation, please share it with me!
The fix is in Solaris 10 10/09 and the OpenStorage software. For some
reason, this fix is not available in the OpenSolaris supported bug fixes.
Perhaps someone from Oracle can shed light on that (non)decision?
So until next month, you will need to use an OpenSolaris dev release
after b113.
Post by Ragnar Sundblad
I guess a good slog could help a bit, especially if you have a bursty
write load.
Yes.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
Brian E. Imhoff
2010-02-16 17:44:08 UTC
Permalink
Some more back story. I initially started with Solaris 10 u8, and was getting 40ish MB/s reads and 65-70MB/s writes, which was still a far cry from the performance I was getting with OpenFiler. I decided to try OpenSolaris 2009.06, thinking that since it was more "state of the art & up to date" than mainline Solaris, perhaps there would be some performance tweaks or bug fixes which might bring performance closer to what I saw with OpenFiler. But then, on an untouched clean install of OpenSolaris 2009.06, I ran into something...else...apparently causing this far, far worse performance.

But, at the end of the day, this is quite a bomb: "A single raidz2 vdev has about as many IOs per second as a single disk, which could really hurt iSCSI performance."

If I have to break 24 disks up into multiple vdevs to get the expected performance, that might be a deal breaker. To keep raidz2 redundancy, I would have to lose almost half of the available storage to get reasonable IO speeds.

Now knowing about vdev IO limitations, I believe the speeds I saw with Solaris 10u8 are in line with those limitations, and instead of fighting with whatever issue I have with this clean install of OpenSolaris, I reverted back to 10u8. I guess I'll just have to see if the speeds that Solaris iSCSI w/ZFS is capable of are workable for what I want to do, and where the size-sacrifice/performance acceptability point sits.

Thanks for all the responses and help. First time posting here, and this looks like an excellent community.
--
This message posted from opensolaris.org
Richard Elling
2010-02-17 01:35:40 UTC
Permalink
Post by Brian E. Imhoff
Some more back story. I initially started with Solaris 10 u8, and was getting 40ish MB/s reads, and 65-70MB/s writes, which was still a far cry from the performance I was getting with OpenFiler. I decided to try Opensolaris 2009.06, thinking that since it was more "state of the art & up to date" then main Solaris. Perhaps there would be some performance tweaks or bug fixes which might bring performance closer to what I saw with OpenFiler. But, then on an untouched clean install of OpenSolaris 2009.06, ran into something...else...apparently causing this far far far worse performance.
You thought a release dated 2009.06 was further along than a release dated
2009.10? :-) CR 6794730 was fixed in April, 2009, after the freeze for the 2009.06
release, but before the freeze for 2009.10.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6794730

The schedule is published here, so you can see that there is a freeze now for
the 2010.03 OpenSolaris release.
http://hub.opensolaris.org/bin/view/Community+Group+on/schedule

As they say in comedy, timing is everything :-(
Post by Brian E. Imhoff
But, at the end of the day, this is quite a bomb: "A single raidz2 vdev has about as many IOs per second as a single disk, which could really hurt iSCSI performance."
The context for this statement is for small, random reads. 40 MB/sec of 8KB
reads is 5,000 IOPS, or about 50 HDDs worth of small random reads @ 100 IOPS/disk,
or one decent SSD.
Post by Brian E. Imhoff
If I have to break 24 disks up in to multiple vdevs to get the expected performance might be a deal breaker. To keep raidz2 redundancy, I would have to lose..almost half of the available storage to get reasonable IO speeds.
Are your requirements for bandwidth or IOPS?
Post by Brian E. Imhoff
Now knowing about vdev IO limitations, I believe the speeds I saw with Solaris 10u8 are inline with those limitations, and instead of fighting with whatever issue I have with this clean install of OpenSolaris, I reverted back to 10u8. I guess I'll just have to see if the speeds that Solaris ISCSI w/ZFS is capable of, is workable for what I want to do, and what the size sacrifice/performace acceptability point is at.
In Solaris 10 you are stuck with the legacy iSCSI target code. In OpenSolaris, you
have the option of using COMSTAR which performs and scales better, as Roch
describes here:
http://blogs.sun.com/roch/entry/iscsi_unleashed
Post by Brian E. Imhoff
Thanks for all the responses and help. First time posting here, and this looks like an excellent community.
We try hard, and welcome the challenges :-)
-- richard


ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
Eric D. Mudama
2010-02-17 05:05:17 UTC
Permalink
But, at the end of the day, this is quite a bomb: "A single raidz2
vdev has about as many IOs per second as a single disk, which could
really hurt iSCSI performance."
If I have to break 24 disks up in to multiple vdevs to get the
expected performance might be a deal breaker. To keep raidz2
redundancy, I would have to lose..almost half of the available
storage to get reasonable IO speeds.
ZFS is quite flexible. You can put multiple vdevs in a pool, and dial
your performance/redundancy just about wherever you want them.

24 disks could be:

12x mirrored vdevs (best random IO, 50% capacity, any 1 failure absorbed, up to 12 w/ limits)
6x 4-disk raidz vdevs (75% capacity, any 1 failure absorbed, up to 6 with limits)
4x 6-disk raidz vdevs (~83% capacity, any 1 failure absorbed, up to 4 with limits)
4x 6-disk raidz2 vdevs (~66% capacity, any 2 failures absorbed, up to 8 with limits)
1x 24-disk raidz2 vdev (~92% capacity, any 2 failures absorbed, worst random IO perf)
etc.

I think the 4x 6-disk raidz2 vdev setup is quite commonly used with 24
disks available, but each application is different. We use mirrored
vdevs at work, with a separate box as a "live" backup using raidz of
larger SATA drives.

--eric
--
Eric D. Mudama
***@mail.bounceswoosh.org
Matt
2010-02-18 06:42:50 UTC
Permalink
[message body not shown in the archive]
Brent Jones
2010-02-18 06:50:49 UTC
Permalink
Post by Matt
I've got a very similar rig to the OP showing up next week (plus an InfiniBand card). I'd love to get this performing up to GbE speeds; otherwise I may have to abandon the iSCSI project if I can't get it to perform.
Do you have an SSD log device? If not, try disabling the ZIL
temporarily to see if that helps. Your workload will likely benefit
from a log device.
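For what it's worth, the usual test-only way to do that on builds of
this vintage is the zil_disable tunable, e.g.

echo zil_disable/W0t1 | mdb -kw      (revert with /W0t0)

or "set zfs:zil_disable = 1" in /etc/system. It generally needs the
dataset remounted (or a reboot) to take effect, it applies to every
pool on the box, and it trades away integrity on power loss, so treat
it purely as a way of narrowing down the bottleneck.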
--
Brent Jones
***@servuhome.net
Matt
2010-02-18 07:03:59 UTC
Permalink
No SSD Log device yet. I also tried disabling the ZIL, with no effect on performance.

Also - what's the best way to test local performance? I'm _somewhat_ dumb as far as opensolaris goes, so if you could provide me with an exact command line for testing my current setup (exactly as it appears above) I'd love to report the local I/O readings.
--
This message posted from opensolaris.org
Brent Jones
2010-02-18 08:16:17 UTC
Permalink
No SSD Log device yet.  I also tried disabling the ZIL, with no effect on performance.
Also - what's the best way to test local performance?  I'm _somewhat_ dumb as far as opensolaris goes, so if you could provide me with an exact command line for testing my current setup (exactly as it appears above) I'd love to report the local I/O readings.
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
No one has said if they're using dsk, rdsk, or file-backed COMSTAR LUNs yet.
I'm using file-backed COMSTAR LUNs, with ZIL currently disabled.
I can get between 100-200MB/sec, depending on random/sequential and block sizes.

Using dsk/rdsk, I was not able to see that level of performance at all.
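For comparison, both backings go through the same sbdadm step and only
the path differs (paths below are illustrative):

sbdadm create-lu /dev/zvol/rdsk/tank/volumes/fsrv1data     <- zvol-backed
mkfile 100g /tank/luns/lun0
sbdadm create-lu /tank/luns/lun0                           <- file-backed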
--
Brent Jones
***@servuhome.net
Markus Kovero
2010-02-18 08:37:35 UTC
Permalink
Post by Brent Jones
No one has said if they're using dks, rdsk, or file-backed COMSTAR LUNs yet.
I'm using file-backed COMSTAR LUNs, with ZIL currently disabled.
I can get between 100-200MB/sec, depending on random/sequential and block sizes.
Using dsk/rdsk, I was not able to see that level of performance at all.
--
Brent Jones
Hi, I find COMSTAR performance very low when using zvols under dsk; somehow using them under rdsk and letting COMSTAR handle the cache makes performance really good (disks/NICs become the limiting factor).

Yours
Markus Kovero
Nigel Smith
2010-02-18 10:07:24 UTC
Permalink
Hi Matt
Are you seeing low speeds on writes only, or on both read AND write?

Are you seeing low speed just with iSCSI or also with NFS or CIFS?
Post by Matt
I've tried updating to COMSTAR
(although I'm not certain that I'm actually using it)
To check, do this:

# svcs -a | grep iscsi

If 'svc:/system/iscsitgt:default' is online,
you are using the old & mature 'user mode' iscsi target.

If 'svc:/network/iscsi/target:default' is online,
then you are using the new 'kernel mode' comstar iscsi target.

For another good way to monitor disk i/o, try:

# iostat -xndz 1

http://docs.sun.com/app/docs/doc/819-2240/iostat-1m?a=view

Don't just assume that your Ethernet & IP & TCP layer
are performing to the optimum - check it.

I often use 'iperf' or 'netperf' to do this:

http://blogs.sun.com/observatory/entry/netperf

(Iperf is available by installing the SUNWiperf package.
A package for netperf is in the contrib repository.)

The last time I checked, the default values used
in the OpenSolaris TCP stack are not optimum
for Gigabit speed, and need to be adjusted.
Here is some advice, I found with Google, but
there are others:

http://serverfault.com/questions/13190/what-are-good-speeds-for-iscsi-and-nfs-over-1gb-ethernet
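For example (the 1 MB values below are only illustrative, they do not
survive a reboot, and tcp_max_buf must be at least as large as the
hiwat values):

ndd /dev/tcp tcp_xmit_hiwat
ndd /dev/tcp tcp_recv_hiwat
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_xmit_hiwat 1048576
ndd -set /dev/tcp tcp_recv_hiwat 1048576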

BTW, what sort of network card are you using,
as this can make a difference.

Regards
Nigel Smith
--
This message posted from opensolaris.org
Matt
2010-02-18 15:49:39 UTC
Permalink
Post by Nigel Smith
Hi Matt
Are the seeing low speeds on writes only or on both
read AND write?
Low speeds both reading and writing.
Post by Nigel Smith
Are you seeing low speed just with iSCSI or also with
NFS or CIFS?
Haven't gotten NFS or CIFS to work properly. Maybe I'm just too dumb to figure it out, but I'm ending up with permissions errors that don't let me do much. All testing so far has been with iSCSI.
Post by Nigel Smith
# svcs -a | grep iscsi
If 'svc:/system/iscsitgt:default' is online,
you are using the old & mature 'user mode' iscsi
target.
If 'svc:/network/iscsi/target:default' is online,
then you are using the new 'kernel mode' comstar
iscsi target.
It shows that I'm using the COMSTAR target.
Post by Nigel Smith
# iostat -xndz 1
Here's IOStat while doing writes :

r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
1.0 256.9 3.0 2242.9 0.3 0.1 1.3 0.5 11 12 c0t0d0
0.0 253.9 0.0 2242.9 0.3 0.1 1.0 0.4 10 11 c0t1d0
1.0 253.9 2.5 2234.4 0.2 0.1 0.9 0.4 9 11 c1t0d0
1.0 258.9 2.5 2228.9 0.3 0.1 1.3 0.5 12 13 c1t1d0

This shows about a 10-12% utilization of my gigabit network, as reported by Task Manager in Windows 7.


Here's IOStat when doing reads :

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
554.1 0.0 11256.8 0.0 3.8 0.7 6.8 1.3 68 70 c0t0d0
749.1 0.0 11003.7 0.0 2.8 0.5 3.8 0.7 51 54 c0t1d0
742.1 0.0 11333.4 0.0 2.9 0.5 3.9 0.7 51 49 c1t0d0
736.1 0.0 11045.9 0.0 2.8 0.5 3.8 0.7 53 53 c1t1d0


Which gives me about 30% utilization.

Another copy to the SAN yielded this result :

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
15.1 314.2 883.9 4106.2 0.9 0.3 2.9 0.9 28 30 c0t0d0
15.1 321.2 854.3 4106.2 0.9 0.3 2.7 0.8 26 26 c0t1d0
28.1 315.2 916.5 4101.2 0.8 0.2 2.2 0.7 22 25 c1t0d0
14.1 316.2 895.4 4097.2 0.9 0.3 2.7 0.8 26 27 c1t1d0


Which looks like writes held up at nearly 30% (doing multiple streams of data). Still not gigabit, but getting better. It also seems to be very hit-or-miss. It'll sustain 10-12% gigabit for a few minutes, have a little dip, jump up to 15% for a while, then back to 10%, then up to 20%, then up to 30%, then back down. I can't really make heads or tails of it.
Post by Nigel Smith
Don't just assume that your Ethernet & IP & TCP
layer
are performing to the optimum - check it.
http://blogs.sun.com/observatory/entry/netperf
(Iperf is available by installing the SUNWiperf
package.
A package for netperf is in the contrib repository.)
I'll look in to this, I don't have either installed right now.
Post by Nigel Smith
The last time I checked, the default values used
in the OpenSolaris TCP stack are not optimum
for Gigabit speed, and need to be adjusted.
Here is some advice, I found with Google, but
http://serverfault.com/questions/13190/what-are-good-speeds-for-iscsi-and-nfs-over-1gb-ethernet
BTW, what sort of network card are you using,
as this can make a difference.
Current NIC is an integrated NIC on an Abit Fatality motherboard. Just your generic fare gigabit network card. I can't imagine that it would be holding me back that much though.
--
This message posted from opensolaris.org
Marc Nicholas
2010-02-18 16:01:01 UTC
Permalink
Post by Matt
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
1.0 256.9 3.0 2242.9 0.3 0.1 1.3 0.5 11 12 c0t0d0
0.0 253.9 0.0 2242.9 0.3 0.1 1.0 0.4 10 11 c0t1d0
1.0 253.9 2.5 2234.4 0.2 0.1 0.9 0.4 9 11 c1t0d0
1.0 258.9 2.5 2228.9 0.3 0.1 1.3 0.5 12 13 c1t1d0
This shows about a 10-12% utilization of my gigabit network, as reported by
Task Manager in Windows 7.
Unless you are using SSDs (which I believe you're not), you're IOPS-bound on
the drives IMHO. Writes are a better test of this than reads for cache
reasons.

-marc
Nigel Smith
2010-02-18 17:04:29 UTC
Permalink
Hi Matt
Post by Matt
Haven't gotten NFS or CIFS to work properly.
Maybe I'm just too dumb to figure it out,
but I'm ending up with permissions errors that don't let me do much.
All testing so far has been with iSCSI.
So until you can test NFS or CIFS, we don't know if it's a
general performance problem, or just an iSCSI problem.

To get CIFS working, try this:

http://blogs.sun.com/observatory/entry/accessing_opensolaris_shares_from_windows
You're getting >1000 kr/s & kw/s, so add the iostat 'M' option
to display throughput in megabytes per second.
Post by Matt
It'll sustain 10-12% gigabit for a few minutes, have a little dip,
I'd still be interested to see the size of the TCP buffers.
What does this report:

# ndd /dev/tcp tcp_xmit_hiwat
# ndd /dev/tcp tcp_recv_hiwat
# ndd /dev/tcp tcp_conn_req_max_q
# ndd /dev/tcp tcp_conn_req_max_q0
Post by Matt
Current NIC is an integrated NIC on an Abit Fatality motherboard.
Just your generic fare gigabit network card.
I can't imagine that it would be holding me back that much though.
Well there are sometimes bugs in the device drivers:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913756
http://sigtar.com/2009/02/12/opensolaris-rtl81118168b-issues/

That's why I say don't just assume the network is performing to the optimum.

To do a local test, direct to the hard drives, you could try 'dd',
with various transfer sizes. Some advice from BenR, here:

http://www.cuddletech.com/blog/pivot/entry.php?id=820
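Something along these lines is a common rough sequential test (the
path and sizes are arbitrary - substitute a directory in your pool -
and the read-back number will be inflated by the ARC unless the file
is much larger than RAM):

dd if=/dev/zero of=/tank/ddtest bs=1048576 count=8192     (writes ~8 GB)
dd if=/tank/ddtest of=/dev/null bs=1048576                (reads it back)
rm /tank/ddtest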

Regards
Nigel Smith
--
This message posted from opensolaris.org
Nigel Smith
2010-02-18 17:19:43 UTC
Permalink
Another thing you could check, which has been reported to
cause problems, is whether network or disk drivers share an interrupt
with a slow device, like say a USB device. So try:

# echo ::interrupts -d | mdb -k

... and look for multiple driver names on an INT#.
Regards
Nigel Smith
--
This message posted from opensolaris.org
Matt
2010-02-18 07:21:07 UTC
Permalink
Just out of curiosity - what Supermicro chassis did you get? I've got the following items shipping to me right now, with SSD drives and 2TB main drives coming as soon as the system boots and performs normally (using 8 extra 500GB Barracuda ES.2 drives as test drives).


http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043
http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4
http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187
http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002
--
This message posted from opensolaris.org
Günther
2010-02-18 11:09:23 UTC
Permalink
hello

there is a new beta v. 0.220 of napp-it, the free webgui for nexenta(core) 3

new:
-bonnie benchmarks included (see screenshot: http://www.napp-it.org/bench.png)
-bug fixes

if you look at the benchmark screenshot:
-pool daten: zfs3 of 7 x wd 2TB raid edition (WD2002FYPS), dedup and compress enabled
-pool z3ssdcache: zfs3 of 4 sas Seagate 15k (ST3146855SS), dedup and compress enabled + ssd read cache (supertalent ultradrive 64GB)

i was surprised about the sequential write/rewrite result:
the wd 2TB drives perform very well only in sequential character writes but are horribly bad in blockwise write/rewrite.
the 15k sas drives with ssd read cache perform 20x better (10 MB/s -> 200 MB/s)!

download:
http://www.napp-it.org

howto setup:
http://www.napp-it.org/napp-it.pdf

gea
--
This message posted from opensolaris.org
Tomas Ögren
2010-02-18 11:16:41 UTC
Permalink
Post by Günther
hello
there is a new beta v. 0.220 of napp-it, the free webgui for nexenta(core) 3
new:
-bonnie benchmarks included (see screenshot: http://www.napp-it.org/bench.png)
-bug fixes
if you look at the benchmark screenshot:
-pool daten: zfs3 of 7 x wd 2TB raid edition (WD2002FYPS), dedup and compress enabled
-pool z3ssdcache: zfs3 of 4 sas Seagate 15k (ST3146855SS), dedup and compress enabled + ssd read cache (supertalent ultradrive 64GB)
i was surprised about the sequential write/rewrite result:
the wd 2TB drives perform very well only in sequential character writes but are horribly bad in blockwise write/rewrite.
the 15k sas drives with ssd read cache perform 20x better (10 MB/s -> 200 MB/s)!
Most probably due to lack of RAM to hold the dedup tables, which your
second version "fixes" with an L2ARC.

Try the same test without dedup, or the same L2ARC in both, instead of
comparing apples to canoes.
Post by Günther
download:
http://www.napp-it.org

howto setup:
http://www.napp-it.org/napp-it.pdf
gea
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
/Tomas
--
Tomas Ögren, ***@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Günther
2010-02-18 12:22:04 UTC
Permalink
hello

my intention was to show how you can tune up a pool of drives
(how much you can gain when using sas compared to 2 TB high-capacity drives)

and now the other results, with the same config and sas drives:

wd 2TB x 7, z3, dedup and compress on, no ssd:
daten       12.6T  start 2010.02.17  8G  202 MB/s 83   10 MB/s  4  4.436 MB/s  5  135 MB/s 87  761 MB/s

sas 15k, 146GB x 4, z3, dedup and compress off, no ssd:
z3nocache    544G  start 2010.02.18  8G   71 MB/s 31   84 MB/s 15     47 MB/s 13   87 MB/s 55  113 MB/s

sas 15k, 146GB x 4, z3, dedup and compress on, no ssd:
z3nocache    544G  start 2010.02.18  8G  218 MB/s 99  410 MB/s 92    171 MB/s 50  148 MB/s 92  578 MB/s

sas 15k, 146GB x 4, z3, dedup and compress on + ssd read cache:
z3cache      544G  start 2010.02.17  8G  172 MB/s 77  205 MB/s 40     95 MB/s 27  141 MB/s 90  546 MB/s

##################### result ##################################
all pools are zfs z3
sas drives are Seagate 15k rpm, 146 GB

                           seq-write-ch  seq-write-block   rewrite    read-char  read-block

wd 2TB x7                      202 MB/s         10 MB/s    4.4 MB/s    135 MB/s    761 MB/s

sas 15k x 4, no dedup:          71 MB/s         84 MB/s     47 MB/s     87 MB/s    113 MB/s
sas 15k x 4 +dedup+comp:       218 MB/s        410 MB/s    171 MB/s    148 MB/s    578 MB/s
sas 15k x 4 +dedup+ssd:        172 MB/s        205 MB/s     95 MB/s    141 MB/s    546 MB/s

conclusion:
if you need performance:
- use fast sas drives
- activate dedup and compress (if you have enough cpu power)
- an ssd read cache is not important in the bonnie test

high-capacity drives do very well at reading and sequential writing
--
This message posted from opensolaris.org
Bob Friesenhahn
2010-02-18 16:05:44 UTC
Permalink
Post by Günther
i was surprised about the seqential write/ rewrite result.
the wd 2 TB drives performs very well only in sequential write of characters but are horrible bad in blockwise write/ rewrite
the 15k sas drives with ssd read cache performs 20 x better (10MB/s -> 200 MB/s) !!!!
Usually very poor re-write performance is an indication of
insufficient RAM for caching combined with imperfect alignment
between the written block size and the underlying zfs block size.
There is no doubt that an enterprise SAS drive will smoke a
high-capacity SATA "green" drive when it comes to update performance.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Eugen Leitl
2010-02-18 12:22:29 UTC
Permalink
Post by Matt
Just out of curiosity - what Supermicro chassis did you get? I've got the following items shipping to me right now, with SSD drives and 2TB main drives coming as soon as the system boots and performs normally (using 8 extra 500GB Barracuda ES.2 drives as test drives).
That looks like a sane combination. Please report how this particular
setup performs, I'm quite curious.
Post by Matt
http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043
http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4
Just this one SAS adaptor? Are you connecting to the drive
backplane with one cable for the 4 internal SAS connectors?
Are you using SAS or SATA drives? Will you be filling up 24
slots with 2 TByte drives, and are you sure you won't be
oversubscribed with just 4x SAS? And SSD, which drives are you
using and in which mounts (internal or external caddies)?
Post by Matt
http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187
http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002
--
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Matt
2010-02-18 16:00:10 UTC
Permalink
Post by Eugen Leitl
Just this one SAS adaptor? Are you connecting to the
drive
backplane with one cable for the 4 internal SAS
connectors?
Are you using SAS or SATA drives? Will you be filling
up 24
slots with 2 TByte drives, and are you sure you won't
be
oversubscribed with just 4x SAS? And SSD, which
drives are you
using and in which mounts (internal or external
caddies)?
I'm just going to use the single 4x SAS. 1200MB/sec should be plenty for 24 drives total. I'm going to be mounting 2x SSD for the ZIL and 2x SSD for the L2ARC, then 20 2TB drives. I'm guessing that with a random I/O workload, I'll never hit the 1200MB/sec peak that the 4x SAS can sustain.

Also - for the ZIL I will be using 2x 32GB Intel X25-E SLC drives, and for the L2ARC I'll be using 2x 160GB Intel X25-M MLC drives. I'm hoping that the cache will allow me to saturate gigabit and eventually InfiniBand.
--
This message posted from opensolaris.org
Phil Harman
2010-02-18 12:55:24 UTC
Permalink
This discussion is very timely, but I don't think we're done yet. I've
been working on using NexentaStor with Sun's VDI stack. The demo I've
been playing with glues SunRays to VirtualBox instances using ZFS zvols
over iSCSI for the boot image, with all the associated ZFS
snapshot/clone goodness we all love so well.

The supported config for the ZFS storage server is Solaris 10u7 or 10u8.
When I eventually got VDI going with NexentaStor (my value add), I found
that some operations which only took 10 minutes with Solaris 10u8 were
taking over an hour with NexentaStor. Using pfiles I found that
iscsitgtd has the zvol open O_SYNC.
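(That check is easy to repeat elsewhere; assuming a single iscsitgtd
process, something like

pfiles `pgrep iscsitgtd` | grep O_SYNC

will show whether any of its open files carry the O_SYNC flag.)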

My hope is that COMSTAR is a lot more intelligent, and that it does
indeed support DKIOCFLUSHWRITECACHE. However, if your iSCSI client
expects all writes to be flushed synchronously, all the debate we've
seen on this list about the new wcd=false option for rdsk zvols is moot
(as using the option, when it is available, could result in data loss).
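(For reference, on builds whose COMSTAR exposes it, that option is a
per-LU property; the GUID is whatever stmfadm list-lu reports, and the
exact syntax should be checked against stmfadm(1M) on your build, but
it is roughly:

stmfadm modify-lu -p wcd=false <lu-GUID>
)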

When you do iSCSI to other big brand storage appliances, you generally
have the benefit of NVRAM cacheing. As we all know, the same can be
achieved with ZFS and an SSD "Logzilla". I didn't have one at hand, and
I didn't think of disabling the ZIL (although some have reported that
this only seems to help ZFS hosted files, not zvols). Instead, since I
didn't mind losing my data, for the sake of the experiment, I added a TMPFS
"Logzilla" ...

# mkfile 4g /tmp/zilla
# zpool add vdipool log /tmp/zilla

WARNING: DON'T TRY THIS ON ZPOOLS YOU CARE ABOUT! However, for the
purposes of my experiment, it worked a treat, proving to me that an SSD
"Logzilla" was the way ahead.

I think a lot of the angst in this thread is because "it used to work"
(i.e. we used to get great iSCSI performance from zvols). But then Sun
fixed a glaring bug (i.e. that zvols were unsafe for synchronous writes)
and our world fell apart.

Whilst the latest bug fixes put the world to rights again with respect
to correctness, it may be that some of our performance workarounds are
still unsafe (i.e. if my iSCSI client assumes all writes are
synchronised to nonvolatile storage, I'd better be pretty sure of the
failure modes before I work around that).

Right now, it seems like an SSD "Logzilla" is needed if you want
correctness and performance.

Phil Harman
Harman Holistix - focusing on the detail and the big picture
Our holistic services include: performance health checks, system tuning,
DTrace training, coding advice, developer assassinations

http://blogs.sun.com/pgdh (mothballed)
http://harmanholistix.com/mt (current)
http://linkedin.com/in/philharman
Ragnar Sundblad
2010-02-19 21:57:14 UTC
Permalink
On 18 feb 2010, at 13.55, Phil Harman wrote:

...
Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workaround are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
But are there any clients that assume that an iSCSI volume is synchronous?

Isn't an iSCSI target supposed to behave like any other SCSI disk
(pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
With that I mean: A disk which understands SCSI commands with an
optional write cache that could be turned off, with cache sync
command, and all those things.
Put another way, isn't it the OS's/filesystem's responsibility to
use the SCSI disk responsibly, regardless of the underlying
protocol?

/ragge
Ross Walker
2010-02-19 22:20:03 UTC
Permalink
Post by Ragnar Sundblad
...
Post by Phil Harman
Whilst the latest bug fixes put the world to rights again with
respect to correctness, it may be that some of our performance
workaround are still unsafe (i.e. if my iSCSI client assumes all
writes are synchronised to nonvolatile storage, I'd better be
pretty sure of the failure modes before I work around that).
But are there any clients that assume that an iSCSI volume is
synchronous?
Isn't an iSCSI target supposed to behave like any other SCSI disk
(pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
With that I mean: A disk which understands SCSI commands with an
optional write cache that could be turned off, with cache sync
command, and all those things.
Put in another way, isn't is the OS/file systems responsibility to
use the SCSI disk responsibly regardless of the underlying
protocol?
That was my argument a while back.

If you use /dev/dsk then all writes should be asynchronous, WCE should
be on, and the initiator should issue a 'sync' to make sure data is
in NV storage; if you use /dev/rdsk, all writes should be synchronous
and WCE should be off. RCD should be off in all cases and the ARC
should cache all it can.

Making COMSTAR always start with /dev/rdsk and flip to /dev/dsk if the
initiator flags write cache is the wrong way to go about it. It's more
complicated than it needs to be and it leaves setting the storage
policy up to the system admin rather than the storage admin.

It would be better to put effort into supporting FUA and DPO options
in the target than dynamically changing a volume's cache policy from
the initiator side.

-Ross
Ragnar Sundblad
2010-02-20 01:41:20 UTC
Permalink
Post by Ross Walker
Post by Ragnar Sundblad
...
Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workaround are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
But are there any clients that assume that an iSCSI volume is synchronous?
Isn't an iSCSI target supposed to behave like any other SCSI disk
(pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
With that I mean: A disk which understands SCSI commands with an
optional write cache that could be turned off, with cache sync
command, and all those things.
Put in another way, isn't is the OS/file systems responsibility to
use the SCSI disk responsibly regardless of the underlying
protocol?
That was my argument a while back.
If you use /dev/dsk then all writes should be asynchronous and WCE should be on and the initiator should issue a 'sync' to make sure it's in NV storage, if you use /dev/rdsk all writes should be synchronous and WCE should be off. RCD should be off in all cases and the ARC should cache all it can.
Making COMSTAR always start with /dev/rdsk and flip to /dev/dsk if the initiator flags write cache is the wrong way to go about it. It's more complicated then it needs to be and it leaves setting the storage policy up to the system admin rather then the storage admin.
It would be better to put effort into supporting FUA and DPO options in the target then dynamically changing a volume's cache policy from the initiator side.
But wouldn't the most disk like behavior then be to implement all the
FUA, DPO, cache mode page, flush cache, etc, etc, have COMSTAR implement
a cache just like disks do, maybe have a user knob to set the cache size
(typically 32 MB or so on modern disks, could probably be used here too
as a default), and still use /dev/rdsk devices?

That could seem, in my naive limited little mind and humble opinion, as
a pretty good approximation of how real disks work, and no OS should have
to be more surprised than usual of how a SCSI disk works.

Maybe COMSTAR already does this, or parts of it?

Or am I wrong?

/ragge
Phil Harman
2010-02-19 22:22:40 UTC
Permalink
Post by Ragnar Sundblad
Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workaround are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
But are there any clients that assume that an iSCSI volume is synchronous?
Isn't an iSCSI target supposed to behave like any other SCSI disk
(pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
With that I mean: A disk which understands SCSI commands with an
optional write cache that could be turned off, with cache sync
command, and all those things.
Put in another way, isn't is the OS/file systems responsibility to
use the SCSI disk responsibly regardless of the underlying
protocol?
/ragge
Yes, that would be nice, wouldn't it? But the world is seldom that
simple, is it? For example, Sun's first implementation of zvol was
unsafe by default, with no cache flush option either.

A few years back we used to note that one of the reasons Solaris was
slower than Linux at filesystem microbenchmarks was that Linux ran
with the write caches on (whereas we would never be that foolhardy).

And then this seems to claim that NTFS may not be that smart either ...

http://blogs.sun.com/roch/entry/iscsi_unleashed

(see the WCE Settings paragraph)

I'm only going on what I've read.

Cheers,
Phil
Ragnar Sundblad
2010-02-20 02:05:56 UTC
Permalink
Post by Ragnar Sundblad
Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
But are there any clients that assume that an iSCSI volume is synchronous?
Isn't an iSCSI target supposed to behave like any other SCSI disk
(pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
With that I mean: A disk which understands SCSI commands with an
optional write cache that could be turned off, with cache sync
command, and all those things.
Put another way, isn't it the OS/file system's responsibility to
use the SCSI disk responsibly regardless of the underlying
protocol?
/ragge
Yes, that would be nice wouldn't it? But the world is seldom that simple, is it? For example, Sun's first implementation of zvol was unsafe by default, with no cache flush option either.
A few years back we used to note that one of the reasons Solaris was slower than Linux at filesystem microbenchmarks was that Linux ran with the write caches on (whereas we would never be that foolhardy).
(Exactly, and there is more of that "better fast than safe" evilness in that OS too, especially in the file system area. That is why I never use it for anything that is supposed to store anything.)
And then this seems to claim that NTFS may not be that smart either ...
http://blogs.sun.com/roch/entry/iscsi_unleashed
(see the WCE Settings paragraph)
I'm only going on what I've read.
But - all normal disks come with write caching enabled, so in both the Linux case and the NTFS case this is how they always operate, with all disks, so why should an iSCSI LUN behave any differently?

If they can't handle the write cache (handle syncing, barriers, ordering and all that), they should turn the cache off, just as Solaris does in almost all cases except when you use an entire disk for ZFS (I believe because Solaris UFS was never really adapted to write caches). And they should do that for all SCSI disks.

(I seem to recall that in the bad old days you had to disable the write cache yourself if you wanted to use a disk on SunOS, but that was probably because it wasn't standardized, and you did it with a jumper on the controller board.)

So - I just do not understand why an iSCSI LUN should not try to emulate, as closely as possible, how all other SCSI disks work. This must be the most compatible mode of operation, or am I wrong?

/ragge
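
For reference, this is the sort of thing meant by "turn the cache off": on Solaris the per-disk write cache of a SCSI/SAS disk can be inspected and toggled from format's expert mode. A sketch, assuming a disk at c1t0d0 (a hypothetical device; the exact menus vary by release and device type):

format -e -d c1t0d0
format> cache
cache> write_cache
write_cache> display      # show whether the write cache is currently enabled
write_cache> disable      # turn it off if the OS/filesystem can't handle it safely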
Miles Nordin
2010-02-22 20:28:10 UTC
Permalink
Ragnar Sundblad
2010-02-22 21:58:03 UTC
Permalink
Post by Miles Nordin
rs> But are there any clients that assume that an iSCSI volume is
rs> synchronous?
there will probably be clients that might seem to implicitly make this
assumption by mishandling the case where an iSCSI target goes away and
then comes back (but comes back less whatever writes were in its write
cache). Handling that case for NFS was complicated, and I bet such
complexity is just missing without any equivalent from the iSCSI spec,
but I could be wrong. I'd love to be educated.
Yes, this area may very well be a minefield of bugs. But this is
not a new phenomenon; it is the same with SAS, FC, USB, hot-plug
disks, and even eSATA (and, with CD/DVD drives, I guess also with
SCSI with ATAPI (or rather SATAPI (does it have a name?))).

I believe the correct way of handling this, in all those cases, would
be to have the old device instance fail, have the file system told
about it, and have all current operations and all open files fail.
When the disk comes back, it should get a new device instance and
have to be remounted, and all files would have to be reopened. I hope
no driver will just attach it again and happily continue without
telling anyone or anything. But then again, crazier things have been
coded...
Post by Miles Nordin
Even if there is some magical thing in iSCSI to handle it, the magic
will be rarely used and often wrong until people learn how to test it,
which they haven't yet the way they have with NFS.
I am not sure there is anything really magic or unusual about this,
but I certainly agree that it is a typical thing that might
not have been tested thoroughly enough.

/ragge
Kjetil Torgrim Homme
2010-02-23 00:20:09 UTC
Permalink
There will probably be clients that might seem to implicitly make this
assumption by mishandling the case where an iSCSI target goes away and
then comes back (but comes back less whatever writes were in its write
cache). Handling that case for NFS was complicated, and I bet such
complexity is just missing without any equivalent from the iSCSI spec,
but I could be wrong. I'd love to be educated.
Even if there is some magical thing in iSCSI to handle it, the magic
will be rarely used and often wrong until people learn how to test it,
which they haven't yet the way they have with NFS.
I decided I needed to read up on this and found RFC 3783, which is very
readable and highly recommended:

http://tools.ietf.org/html/rfc3783

Basically, iSCSI just defines a reliable channel for SCSI; the SCSI
layer handles the replaying of operations after a reboot or connection
failure. As far as I understand it, anyway.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Miles Nordin
2010-02-23 07:58:55 UTC
Permalink
kth> basically iSCSI just defines a reliable channel for SCSI.

pft.

AIUI a lot of the complexity in real stacks is ancient protocol
arcana for supporting multiple initiators and TCQ regardless of
whether the physical target supports these things, multiple paths
between a single target/initiator pair, and their weird SCTP-like
notion that several physical SCSI targets ought to be combined into
multiple LUNs of a single virtual iSCSI target. I think the mapping
from iSCSI to SCSI is not usually very direct. I have not dug into it,
though.

kth> the SCSI layer handles the replaying of operations after a
kth> reboot or connection failure.

how?

I do not think it is handled by the SCSI layer, neither for SAS nor for iSCSI.

Also, remember that a write command that goes into the write cache is a
SCSI command that has succeeded, even though it is not actually on disk
for sure unless you can complete a cache sync command successfully, and
do so with no errors nor ``protocol events'' in the gap between the
successful write and the successful sync. A facility to replay failed
commands won't help, because when a drive with its write cache on reboots,
successful writes are rolled back.
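
To make that gap concrete, a sketch from a Linux initiator with sg3_utils, against a hypothetical scratch LUN at /dev/sdb whose write cache is enabled (neither command line comes from this thread, and the first one overwrites block 0):

# the write returns success as soon as the data is in the target's write cache
sg_dd if=/dev/zero of=/dev/sdb bs=512 count=1 oflag=sgio
# only if this flush also completes cleanly (with no reset or protocol event
# between the two commands) is the block known to be on stable storage
sg_sync /dev/sdb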
Kjetil Torgrim Homme
2010-02-23 10:19:14 UTC
Permalink
Post by Miles Nordin
kth> the SCSI layer handles the replaying of operations after a
kth> reboot or connection failure.
how?
I do not think it is handled by SCSI layers, not for SAS nor iSCSI.
Sorry, I was inaccurate. Error reporting is done by the SCSI layer, and
the filesystem handles it by retrying whatever outstanding operations it
has.
Post by Miles Nordin
Also, remember a write command that goes into the write cache is a
SCSI command that's succeeded, even though it's not actually on disk
for sure unless you can complete a sync cache command successfully and
do so with no errors nor ``protocol events'' in the gap between the
successful write and the successful sync. A facility to replay failed
commands won't help because when a drive with write cache on reboots,
successful writes are rolled back.
This is true, sorry about my lack of precision. The SCSI layer can't do
this on its own.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Richard Elling
2010-02-20 19:21:32 UTC
Permalink
This discussion is very timely, but I don't think we're done yet. I've been working on using NexentaStor with Sun's VDI stack. The demo I've been playing with glues SunRays to VirtualBox instances using ZFS zvols over iSCSI for the boot image, with all the associated ZFS snapshot/clone goodness we all love so well.
The supported config for the ZFS storage server is Solaris 10u7 or 10u8. When I eventually got VDI going with NexentaStor (my value add), I found that some operations which took only 10 minutes with Solaris 10u8 were taking over an hour with NexentaStor. Using pfiles I found that iscsitgtd had the zvol open with O_SYNC.
You need the COMSTAR plugin for NexentaStor (no need to beat the dead horse :-)
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
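
The pfiles check Richard mentions is easy to repeat on the storage server. A sketch, assuming the legacy iscsitgt target is what is running (the daemon name and output details will differ on a COMSTAR setup):

# list iscsitgtd's open files with their open(2) flags; a zvol opened
# synchronously shows O_SYNC among the flags of its file descriptor
pfiles `pgrep iscsitgtd` | egrep -i 'zvol|o_sync'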
Matt
2010-02-18 16:05:18 UTC
Permalink
Also - I'm still looking for the best way to test local performance. I'd love to make sure that the volume can actually perform well enough locally to saturate gigabit. If it can't do that internally, why should I expect it to work over GbE?
--
This message posted from opensolaris.org
Marc Nicholas
2010-02-18 16:30:27 UTC
Permalink
Run Bonnie++. You can install it with the Sun package manager and it'll
appear under /usr/benchmarks/bonnie++

Look for the command line I posted a couple of days back for a decent set of
flags to truly rate performance (using sync writes).
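
(The exact flags from that earlier post aren't reproduced here, but a Bonnie++ invocation in that spirit might look roughly like the sketch below; the directory, size and user are placeholders, with the size chosen to be well beyond the 4GB of RAM in this box.)

# -d: directory on the pool under test, -s: working-set size in MB,
# -u: user to run as, -b: no write buffering, i.e. fsync() after every write
bonnie++ -d /tank/bench -s 16384 -u root -b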

-marc
Post by Matt
Also - still looking for the best way to test local performance - I'd love
to make sure that the volume is actually able to perform at a level locally
to saturate gigabit. If it can't do it internally, why should I expect it
to work over GbE?
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss