Discussion:
ZFS very slow under xVM
Martin
2007-11-01 20:36:06 UTC
Hello

I've got Solaris Express Community Edition build 75 (75a) installed on an Asus P5K-E/WiFi-AP (IP35/ICH9R based) board. CPU=Q6700, RAM=8GB, disk=Samsung HD501LJ and (older) Maxtor 6H500F0.

When the O/S is running on bare metal, ie no xVM/Xen hypervisor, then everything is fine.

When it's booted up running xVM and the hypervisor, then, unlike plain disk I/O and unlike SVM volumes, ZFS is around 20 times slower.

Specifically, with either a plain UFS on a raw/block disk device, or UFS on an SVM metadevice, a command such as dd if=/dev/zero of=2g.5ish.dat bs=16k count=150000 takes less than a minute, with an I/O rate of around 30-50MB/s.

Similarly, when running on bare metal, output to a ZFS volume, as reported by zpool iostat, shows a similarly high output rate (it also takes less than a minute to complete).

But when running under xVM and the hypervisor, although the UFS rates are still good, the ZFS rate drops after around 500MB.

For instance, if a window is left running zpool iostat 1 1000, then after the "dd" command above has been run, there are about 7 lines showing a rate of 70MB/s, then the rate drops to around 2.5MB/s until the entire file is written. Since the dd command initially completes and returns control back to the shell in around 5 seconds, the 2 gig of data is cached and is being written out. It's similar with either the Samsung or Maxtor disks (though the Samsung is slightly faster).

Although previous releases (with xVM/Xen) have been fine when running on bare metal, the same problem exists with the earlier b66-0624-xen drop of OpenSolaris.
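
For completeness, the repro sequence is roughly this (a sketch rather than a literal transcript; the pool name "tank" and its mountpoint are placeholders for whatever you have):

# window 1: per-pool throughput
zpool iostat 1 1000
# window 2: per-device throughput on the raw disks (-z hides idle devices)
iostat -xnz 1
# window 3: write ~2.4GB of zeros into a filesystem on the pool
cd /tank
time dd if=/dev/zero of=2g.5ish.dat bs=16k count=150000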


Nathan Kroenert
2007-11-01 22:48:11 UTC
I observed something like this a while ago, but assumed it was something
I did. (It usually is... ;)

Tell me - If you watch with an iostat -x 1, do you see bursts of I/O
then periods of nothing, or just a slow stream of data?
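
Something like this in a second window while the dd runs should make the difference obvious (a sketch; the output path is a placeholder, and -z suppresses idle devices, so the stalls show up as intervals where the disks simply disappear from the output):

dd if=/dev/zero of=/tank/burst-test.dat bs=16k count=150000 &
iostat -xnz 1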

I was seeing intermittent stoppages in I/O, with bursts of data on
occasion...

Maybe it's not just me... Unfortunately, I'm still running old nv and
xen bits, so I can't speak to the 'current' situation...

Cheers.

Nathan.
Paul Kraus
2007-11-02 11:27:27 UTC
Post by Nathan Kroenert
Tell me - If you watch with an iostat -x 1, do you see bursts of I/O
then periods of nothing, or just a slow stream of data?
I was seeing intermittent stoppages in I/O, with bursts of data on
occasion...
I have seen this with ZFS under 10U3, both SPARC and x86,
although the cycle rate differed. Basically, no i/o reported via zpool
iostat 1 or iostat -xn 1 (to the raw devices) for a period of time
followed by a second of ramp up, one or more seconds of excellent
throughput (given the underlying disk systems), a second of slow down,
then more samples with no i/o. The period between peaks was 10 seconds
in one case and 7 in the other. I forget which was SPARC and which was
x86.

I assumed this had to do with ZFS caching i/o until it had a
large enough block to be worth writing. In some cases the data was
coming in via the network (NFS in one case, SMB in the other), but
in neither case was the network interface saturated (in fact, I saw
similar periods of no activity on the network), and there did not
seem to be a CPU limitation (load was low and idle time high). I
have also seen this with local disk-to-disk copies (from UFS to ZFS
or ZFS to ZFS).
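
A rough way to watch the same pattern from the kernel side is a DTrace one-liner on the io provider (a sketch, run as root in the global zone; it just totals the bytes handed to the disk driver per 5-second window, per process name):

dtrace -n 'io:::start { @[execname] = sum(args[0]->b_bcount); }
tick-5s { normalize(@, 1048576); printa("%-16s %@d MB\n", @); trunc(@); }'

Most of the ZFS writes will show up under kernel threads rather than the process doing the copy, but the per-interval totals still show whether the I/O is bursty or a steady trickle.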
--
Paul Kraus
Albacon 2008 Facilities
Jürgen Keil
2007-11-02 12:18:04 UTC
Post by Martin
I've got Solaris Express Community Edition build 75
(75a) installed on an Asus P5K-E/WiFI-AP (ip35/ICH9R
based) board. CPU=Q6700, RAM=8Gb, disk=Samsung
HD501LJ and (older) Maxtor 6H500F0.
When the O/S is running on bare metal, ie no xVM/Xen
hypervisor, then everything is fine.
When it's booted up running xVM and the hypervisor,
then unlike plain disk I/O, and unlike svm volumes,
zfs is around 20 time slower.
Just a wild guess, but since we're currently seeing a similar
strange performance problem on an Intel quad-core system
with 8GB of memory...


Can you try removing some of the RAM, so that the
system runs on 4GB instead of 8GB? Or use xen /
solaris boot options to restrict physical memory usage to
the low 4GB range?
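
One way to try the latter without pulling DIMMs is the physmem tunable, as a sketch (the value assumes 4K pages, so 0x100000 pages = 4GB, and it caps how much memory Solaris will use rather than strictly where that memory sits; remember to take it out again afterwards):

* /etc/system
set physmem = 0x100000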


It seems that on certain mainboards [*] the BIOS is unable to
set up MTRR cacheable ranges for all of the 8GB of system RAM,
and when some important stuff ends up in uncacheable RAM,
performance gets *really* bad.

[*] http://lkml.org/lkml/2007/6/1/231


Martin
2007-11-02 17:46:56 UTC
I've removed half the memory, leaving 4GB, and rebooted into "Solaris xVM", and re-tried under Dom0. Sadly, I still get a similar problem. With "dd if=/dev/zero of=myfile bs=16k count=150000" the command returns in 15 seconds, and "zpool iostat 1 1000" shows 22 records with an IO rate of around 80MB/s, then 209 records of about 2.5MB/s (pretty consistent), then the final 11 records climbing to 2.82, 3.29, 3.05, 3.32, 3.17, 3.20, 3.33, 4.41, 5.44, 8.11.

regards

Martin


Gary Pennington
2007-11-02 18:21:12 UTC
Hmm, I just repeated this test on my system:

bash-3.2# uname -a
SunOS soe-x4200m2-6 5.11 onnv-gate:2007-11-02 i86pc i386 i86xpv

bash-3.2# prtconf | more
System Configuration: Sun Microsystems i86pc
Memory size: 7945 Megabytes

bash-3.2# prtdiag | more
System Configuration: Sun Microsystems Sun Fire X4200 M2
BIOS Configuration: American Megatrends Inc. 080012 02/02/2007
BMC Configuration: IPMI 1.5 (KCS: Keyboard Controller Style)

bash-3.2# ptime dd if=/dev/zero of=/xen/myfile bs=16k count=150000
150000+0 records in
150000+0 records out

real 31.927
user 0.689
sys 15.750

bash-3.2# zpool iostat 1

capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
xen 15.3G 121G 0 261 0 32.7M
xen 15.3G 121G 0 350 0 43.8M
xen 15.3G 121G 0 392 0 48.9M
xen 15.3G 121G 0 631 0 79.0M
xen 15.5G 121G 0 532 0 60.1M
xen 15.6G 120G 0 570 0 65.1M
xen 15.6G 120G 0 645 0 80.7M
xen 15.6G 120G 0 516 0 63.6M
xen 15.7G 120G 0 403 0 39.9M
xen 15.7G 120G 0 585 0 73.1M
xen 15.7G 120G 0 573 0 71.7M
xen 15.7G 120G 0 579 0 72.4M
xen 15.7G 120G 0 583 0 72.9M
xen 15.7G 120G 0 568 0 71.1M
xen 16.1G 120G 0 400 0 39.0M
xen 16.1G 120G 0 584 0 73.0M
xen 16.1G 120G 0 568 0 71.0M
xen 16.1G 120G 0 585 0 73.1M
xen 16.1G 120G 0 583 0 72.8M
xen 16.1G 120G 0 665 0 83.2M
xen 16.1G 120G 0 643 0 80.4M
xen 16.1G 120G 0 603 0 75.0M
xen 16.1G 120G 5 526 320K 64.9M
xen 16.7G 119G 0 582 0 68.0M
xen 16.7G 119G 0 639 0 78.5M
xen 16.7G 119G 0 641 0 80.2M
xen 16.7G 119G 0 664 0 83.0M
xen 16.7G 119G 0 629 0 78.5M
xen 16.7G 119G 0 654 0 81.7M
xen 17.2G 119G 0 563 63.4K 63.5M
xen 17.3G 119G 0 525 0 59.2M
xen 17.3G 119G 0 619 0 71.4M
xen 17.4G 119G 0 7 0 448K
xen 17.4G 119G 0 0 0 0
xen 17.4G 119G 0 408 0 51.1M
xen 17.4G 119G 0 618 0 76.5M
xen 17.6G 118G 0 264 0 27.4M
xen 17.6G 118G 0 0 0 0
xen 17.6G 118G 0 0 0 0
xen 17.6G 118G 0 0 0 0
...<ad infinitum>

I don't seem to be experiencing the same result as yourself.

The behaviour of ZFS might vary between invocations, but I don't think that
is related to xVM. Can you get the results to vary when just booting under
"bare metal"?

Gary
--
Gary Pennington
Solaris Core OS
Sun Microsystems
***@sun.com
Martin
2007-11-03 09:15:07 UTC
Post by Gary Pennington
The behaviour of ZFS might vary between invocations, but I don't think that
is related to xVM. Can you get the results to vary when just booting under
"bare metal"?
It pretty consistently displays the behaviour of good IO (approx 60MB/s - 80MB/s) for about 10-20 seconds, then always drops to approx 2.5MB/s for virtually all of the rest of the output. It always displays this when running under xVM/Xen in Dom0, and never on bare metal when xVM/Xen isn't booted.


Erblichs
2007-11-03 21:46:29 UTC
Martin,

This is a shot in the dark, but this seems to be an IO scheduling
issue.

Since I am late to this thread, what is the characteristic of
the IO: mostly reads, appending writes, read-modify-write,
sequential, random, single large file, multiple files?

And have you tracked whether any IO ages much beyond 30
seconds, if we are talking about writes?

If we were talking about Xen by itself, I am sure there is
some type of scheduler involvement that COULD slow down your
IO due to fairness or some specified weight against other
processes/threads/tasks.

Can you boost the scheduling of the IO task, by making it
realtime or giving it a niceness, in an experimental
environment, and compare stats?
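
For example, something along these lines (a sketch; the RT class needs root, the priority value is arbitrary, and "myfile" is just the earlier test file):

priocntl -e -c RT -p 10 dd if=/dev/zero of=myfile bs=16k count=150000

or simply a strong nice value for comparison:

nice -n -19 dd if=/dev/zero of=myfile bs=16k count=150000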

Whether this is the bottleneck of your problem would take
a closer examination of the various metrics of the system.

Mitchell Erblich
-----------------
Martin
2007-11-04 09:49:25 UTC
Mitchell

The problem seems to occur with various IO patterns. I first noticed it after using ZFS-based storage for a disk image for an xVM/Xen virtual domain, and then, while tracking it down, observed that either a "cp" of a large .iso disk image would reproduce the problem, and later that a single "dd if=/dev/zero of=myfile bs=16k count=150000" would too. So I guess this latter case is a mostly-write pattern to the disk, especially as the command returns after around 5 seconds, leaving the rest buffered in memory.

best regards

Martin


Kugutsumen
2007-11-06 05:25:19 UTC
I had a similar problem on a quad-core AMD box with 8 gig of RAM...
The performance was nice for a few minutes but then the system would
crawl to a halt.
The problem was that the Areca SATA driver couldn't do DMA when the
dom0 memory wasn't kept at 3 gig or lower.
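
A quick way to see how much memory Domain-0 is actually holding (a sketch; the xm tool ships with the xVM bits):

xm list
xm info

xm list shows how much memory each domain (including Domain-0) currently holds; xm info shows total_memory and free_memory for the whole box.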
Martin
2007-11-06 13:33:03 UTC
Kugutsumen

I tried with just 4GB in the system, and had the same issue. I'll try 2GB tomorrow and see if it's any better. (PS: how did you determine that was the problem in your case?)

cheers

Martin


K
2007-11-27 10:17:11 UTC
Post by Martin
kugutsum
I tried with just 4Gb in the system, and the same issue. I'll try
2Gb tomorrow and see if any better. (ps, how did you determine
that was the problem in your case)
sorry, I wasn't monitoring this list for a while. My machine has 8GB
of RAM and I remembered that some drivers had issues doing DMA access
over the 4GB limit. Someone noted that even if you have exactly 4GB
you will still run into this issue because a lot of address space is
already mapped or reserved by Xen.

Mark Johnson said that the xen team should try to do a better job
allocating low memory.

What you should do is set the dom0_mem to 1 or 2 gig and you can still
use the rest of the memory in your domU.
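
Concretely that means adding dom0_mem to the xen.gz line of the xVM entry in /boot/grub/menu.lst, something like the following (a sketch; the module paths are the usual b7x ones, so keep whatever your existing entry already has):

title Solaris xVM (dom0 limited to 2GB)
kernel$ /boot/$ISADIR/xen.gz dom0_mem=2048M
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
module$ /platform/i86pc/$ISADIR/boot_archive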

Mark Johnson suggested limiting the amount of memory ZFS uses,
e.g. set the following in /etc/system:

set zfs:zfs_arc_max = 0x10000000
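
(0x10000000 is 256MB. After rebooting you can confirm the cap took effect with kstat, e.g.:

kstat -p zfs:0:arcstats:c_max
kstat -p zfs:0:arcstats:size

c_max should report 268435456, and size is the current ARC usage in bytes.)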
Martin
2007-11-29 11:26:32 UTC
I set /etc/system's zfs:zfs_arc_max = 0x10000000 and it seems better now.

I had previously tried setting it to 2GB rather than 256MB as above without success... I should have tried much lower!

It "seems" that when I perform I/O though a WindowsXP hvm, I get a "reasonable" I/O rate, but I'm not sure at this point in time. When a write is made from within the hvm VM, would I expect for the same DMA issue to arise? (I can't really tell either way aty the moment because it's not super fast anyway)


James Dickens
2007-12-26 01:21:03 UTC
Post by K
Mark Johnson said that the xen team should try to do a better job
allocating low memory.
What you should do is set the dom0_mem to 1 or 2 gig and you can
still use the rest of the memory in your domU.
Mark Johnson suggested limiting the amount of memory ZFS uses,
e.g. set the following in /etc/system:
set zfs:zfs_arc_max = 0x10000000
Has anyone tried boosting this number? 256MB seems pretty low; perhaps someone has tested with 512 or 768MB, since someone mentions having tried 1GB and it didn't work. The machine feels sluggish with this setting. Has anyone tested the 1 or 2GB Domain-0 limit, and was it better than the zfs_arc fix? I am hoping to use my machine as an xVM/ZFS server.
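
For reference, if anyone wants to experiment with larger caps, the corresponding /etc/system values are just multiples of the same number (plain arithmetic, not a recommendation of any particular size):

512MB: set zfs:zfs_arc_max = 0x20000000
768MB: set zfs:zfs_arc_max = 0x30000000
1GB:   set zfs:zfs_arc_max = 0x40000000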

Is this being tracked by a bug report, so we can get more information about the root cause and find out when it is fixed, so we can remove the workaround from our systems?

If any engineers need help debugging this, I'm handy with dtrace and the machine is not in use, so I would be more than happy to investigate any ideas or fixes.

James Dickens
uadmin.blogspot.com
Has anyone tried boosting the ZIL cache number?


Menno Lageman
2007-12-26 09:07:03 UTC
Post by James Dickens
Has anyone tested the 1 or 2GB Domain-0 limit, and was it better than
the zfs_arc fix? I am hoping to use my machine as an xVM/ZFS server.
When I upgraded from 2 GB to 8GB, I found that my system became very
sluggish when doing disk intensive jobs (such as building ON). Limiting
Dom0 to 2 GB seems to have fixed that.

Menno
--
Menno Lageman - Sun Microsystems - http://blogs.sun.com/menno
Martin
2007-11-08 14:02:55 UTC
Well, I've tried the latest OpenSolaris snv_76 release, and it displays the same symptoms.
(so b66-0624-xen, 75a and 76 all have the same problem)

But the good news is that it behaves well if there is only 2GB of memory in the system.

So, in summary

The command time dd if=/dev/zero of=myfile.dat bs=16k count=150000

...takes around 30 seconds if running on "bare metal" (ie when the Grub menu does *not* select xVM/Xen ... ie when not running under Dom0)

...takes around 30 seconds in Dom0 when the Grub boot selected Xen (but only if 2Gb memory)

...takes "forever", with the IO rate dropping from an initial 70Mb/s to around 1M/s, if booted under Xen, and executed within Dom0, and there is either 4Gb (2 DIMMs, single channel), 4Gb (2 Dimmon, dual channel), or 8Gb (dual channel).

Anyone else using an IP35 based board and >2GB memory?

Anyone using 8GB memory with Xen on a "retail" motherboard?


Franco Barber
2007-11-11 18:28:28 UTC
Martin,
I've got the same Asus board as you do and 4GB of RAM, but I haven't gotten to the point of using ZFS or Xen/xVM yet, largely because I've been sidetracked getting the Marvell ethernet to work under b75a; I keep having errors getting the myk driver to load.
A few weeks ago I had the myk driver working under b70b, so I was surprised to have this problem.

What ethernet are you using, and if it's the one built in to the P5K-E motherboard, what driver?

Thanks,
Franco


Martin
2007-11-12 12:13:07 UTC
In this PC, I'm using the PCI card http://www.intel.com/network/connectivity/products/pro1000gt_desktop_adapter.htm , but more recently I'm using the PCI Express card http://www.intel.com/network/connectivity/products/pro1000pt_desktop_adapter.htm

Note that the latter didn't have PXE and the boot ROM enabled (for JumpStart), contrary to the documentation, and I had to download the DOS program from the Intel site to enable it. (Please ask if anyone needs the URL.)

...so, for an easy life, I recommend the Intel PRO/1000 GT Desktop adapter.


James Dickens
2007-12-29 03:24:08 UTC
one more time....
I use the following on my snv_77 system with 2
internal SATA drives
that show up with the 'ahci' driver.
Thanks for the tip! I saw my BIOS had a setting for
SATA mode, but the selections are "IDE or RAID". It
was in IDE and I figured RAID mode just enabled one
of those silly low-performance 0/1 settings... Didn't
know it kicked it into AHCI... But it did!
Unfortunately my drives aren't recognized now... I've
asked over in the device list what's up...
That's the expected behaviour :-/ The physical device path
for the root disk changed when the S-ATA controller was
switched between P-ATA emulation and AHCI mode, and for that
reason the root disk now uses a different device name
(e.g. /dev/dsk/c2t0d0s0 instead of /dev/dsk/c0d0s0).
The old device name found in /etc/vfstab isn't valid any more.
If you have no separate /usr filesystem, this can be fixed:
- start a boot from the hdd; this fails when trying to remount the
root filesystem in read/write mode, and offers a single-user login
- login with the root password
- remount the root filesystem read/write, using the physical
device path for the disk from the /devices/... filesystem.
The "mount" command should show you the physical device
path that was used to mount the "/" filesystem read only.
- Run "devfsadm -v" to create new /dev links for the disks (on
the ahci controller)
- run "format"; the "AVAILABLE DISK SELECTIONS" menu should show
you the new device name for the disk
# format
Searching for disks...done
0. c2t0d0 <DEFAULT cyl 48638 alt 2 hd 255 sec 63>
- now that we know the new disk device name, edit /etc/vfstab
and update all entries that reference the old name with the new
name
- reboot
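In command form the recovery above is roughly this (a sketch; the /devices path is a placeholder, use what "mount" reports for "/"):

mount -o remount,rw /devices/<physical-path-of-root-disk>:a /
devfsadm -v
format

then edit /etc/vfstab, replacing the old name (e.g. c0d0s0) with the new one (e.g. c2t0d0s0), and reboot.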
Excellent tip; it appears to have solved my problem, though my motherboard
had a 3rd option that was AHCI mode; the RAID mode didn't work with Solaris.
See my blog for system details.
James Dickens
uadmin.blogspot.com