Discussion:
[zfs-discuss] Using ZFS on Linux for video edit storage
Alex Gardiner
2015-01-01 17:36:04 UTC
Permalink
Hello list,

Is anybody else using ZFS on Linux to store and stream video files?

In my case I am serving half a dozen streams between 100-200Mbps and offering them to my lab over SMB/AFP.

Although traditionally I have used hardware solutions by 3ware and HP, both of which boast cache profiles and firmware that claim to be optimised for this use case, I wonder if there are any equivalent approaches for ZFS?

From the HP documentation about "Video on Demand":
http://h20565.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c00687518

“Decreasing the maximum latency of block requests to the Smart Array is one of the key goals for VOD…. Other improvements include changing the cache ratio to be 0 % read and 100 % write. Since VOD operations are 99% random, any read-ahead operation would penalize performance. You want to post the writes so that they have the least impact on the reads”

Spec-wise, my lab box is a 2U Supermicro 825TQ running an 8-drive RAIDZ2 (4TB WD Red/LSI 9211-8i/16GB ECC). Actually, the performance is pretty impressive right out of the box, it just feels too easy :-)

From what I can see video streaming workloads do not seem to benefit all that much from adding lots of ARC (at least not much beyond what I already have). Similarly due to the shape of this kind of workload, L2ARC/SLOG do not seem to be especially effective, although can save some IOPS with smaller files.
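For what it's worth, the way I have been sanity checking this is just to watch the ARC hit/miss counters that ZoL exposes, rather than guessing - something roughly like the following (field names as per the 0.6.x arcstats on my box, so please treat it as a sketch rather than gospel):

# Rough ARC hit-rate check on ZFS on Linux (assumes the zfs module is loaded)
awk '/^hits/ {h=$3} /^misses/ {m=$3} END {printf "ARC hit rate: %.1f%%\n", h*100/(h+m)}' \
    /proc/spl/kstat/zfs/arcstats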

Ultimately what I am asking is should I just relax and trust ZFS to be smart, or can you think of any specific tunings that may be useful?

Many thanks and all the best for 2015.

Alex

PS. I have not been brave enough to use SATA disks with a SAS port expander due to reading a few horror stories. Am I still right to avoid this as much as possible?

Gordan Bobic
2015-01-01 18:12:48 UTC
Permalink
Post by Alex Gardiner
Hello list,
Is anybody else using ZFS on Linux to store and stream video files?
I am using zfs-fuse on my QNAP TS-421 running RedSleeve Linux, acting as my
Plex server holding a four-figure number of DVD extracts (shelf-space
requirements for DVDs became prohibitive quite some time ago).
Post by Alex Gardiner
In my case I am serving half a dozen streams between 100-200Mbps and
offering them to my lab over SMB/AFP.
100-200Mbits _each_?
Post by Alex Gardiner
Although traditionally I have used hardware solutions by 3ware and HP,
both of which boast cache profiles and firmware that claim to be optimised
for this use case, I wonder if there are any equivalent approaches for ZFS?
Any "optimization" for bulk linear reads is, IMO, marketing more than
substance. Most solutions will do well at this kind of load, even if you
might want to crank up prefetching by a lot for multiple concurrent streams
to minimize the seek:read ratio.
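On ZoL the prefetcher is tunable via module parameters, so a starting point might look something like the snippet below - the exact tunable names and sensible values vary between releases, so treat it as an assumption to verify against your version rather than a recipe:

# /etc/modprobe.d/zfs.conf - hypothetical starting point for many concurrent streams
# Keep prefetch enabled (it is by default)
options zfs zfs_prefetch_disable=0
# Allow more concurrent prefetch streams, one per client stream plus some headroom
options zfs zfetch_max_streams=16

The same parameters can be poked at runtime under /sys/module/zfs/parameters/ before committing anything to the modprobe config.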
Post by Alex Gardiner
http://h20565.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c00687518
“Decreasing the maximum latency of block requests to the Smart Array is
one of the key goals for VOD…. Other improvements include changing the
cache ratio to be 0 % read and 100 % write. Since VOD operations are 99%
random, any read-ahead operation would penalize performance. You want to
post the writes so that they have the least impact on the reads”
That seems completely backwards to me. Surely VOD is a read-only operation,
and unless you are using solid state storage you want to crank up prefetch
sky high to ensure that linear reads dominate over seeking between
streams.
Post by Alex Gardiner
Spec wise my lab box is a 2U Supermicro 825TQ running an 8 drive Z2 (4TB
WD RED/LSI 9211-8i/16GB ECC). Actually, the performance is pretty
impressive right out of the box, it just feels too easy :-)
Aren't WD Red drives "IntelliPower" (WD's marketing speak for "painfully
slow")?
Post by Alex Gardiner
From what I can see video streaming workloads do not seem to benefit all
that much from adding lots of ARC (at least not much beyond what I already
have). Similarly due to the shape of this kind of workload, L2ARC/SLOG do
not seem to be especially effective, although can save some IOPS with
smaller files.
It depends on whether you have a clear divide between hot and cold data.
When I was working for a large broadcaster we used a huge SSD-only
stripe-of-mirrors pool that was manually (well, using scripts) primed with
the latest, highest demand data, every day, and an enormous mechanical
drive pool that contained everything. The front end would check if a file
is on the SSD pool path and if so, serve it from there, and if it isn't, it
would serve it from the HDD pool path.


Post by Alex Gardiner
PS. I have not been brave enough to use SATA disks with a SAS port expander
due to reading a few horror stories. Am I still right to avoid this as much as
possible?
My experience is that this is purely superstition and the problems arise
between disks and expanders regardless of the type of disks. Unless you are
paying a big multiple in price for the disks "certified" by the chassis
manufacturer, results will vary wildly, even if you are using SAS disks.
Some chassis even contain disk firmware images for the disks they recognize,
and if the model number matches they'll blow away the firmware on the disk
without any intervention to make it the version they expect (yes, this is
eyewateringly dangerous, but you aren't expected to put anything but
certified disks in them). We went through just about every make and model
of SSDs when we were testing what was compatible with the chassis we used,
and eventually the only thing that worked at the time was Kingstons (and
they did work flawlessly) - everything else would crash the expander bus
within seconds of a load test. This was a couple of years ago, so you would
have to re-test with the current disks to make sure.

As I said here previously, by far the most trouble-free operation I have
had to date has been with SATA disks and SATA PMPs. That generally "just
works" once you find a combination that works well (for me it's Marvell
88SX7042 SATA controllers and SIL7326 5-port PMPs). Every SAS based
solution I have tried has been quirky in its own peculiar way, some more
livable with than others.

Alex Gardiner
2015-01-01 22:07:47 UTC
Permalink
Thanks for the detailed reply Gordan.
Post by Gordan Bobic
100-200Mbits _each_?
Yep.

In TV/documentary post-production we often work with the raw footage acquired by the production.

It is not uncommon for streams to be as big as 220Mbps.
Post by Gordan Bobic
Post by Alex Gardiner
“Decreasing the maximum latency of block requests to the Smart Array is one of the key goals for VOD…. Other improvements include changing the cache ratio to be 0 % read and 100 % write. Since VOD operations are 99% random, any read-ahead operation would penalize performance. You want to post the writes so that they have the least impact on the reads”
That seems completely backwards to me. Surely VOD is a read-only operation and unless you are using solid state storage you want to crank up prefetch sky high to ensure that linear reads are dominant to seeking between streams.
That's a good point - I suppose it does make sense to grab more data once you've accepted the time it takes to move the heads into place.

I wonder what HP are banging on about...
Post by Gordan Bobic
Aren't WD Red drives "IntelliPower" (WD's marketing speak for "painfully slow")?
I admit REDs are not the fastest, but they run cool and for this kind of linear workload they seem to perform pretty well.
Post by Gordan Bobic
It depends on whether you have a clear divide between hot and cold data. When I was working for a large broadcaster we used a huge SSD-only stripe-of-mirrors pool that was manually (well, using scripts) primed with the latest, highest demand data, every day, and an enormous mechanical drive pool that contained everything. The front end would check if a file is on the SSD pool path and if so, serve it from there, and if it isn't, it would serve it from the HDD pool path.
In this case it is probably not possible to separate hot/cold data.

If you had a way to predict the files/media required, as you say - maybe in some kind of play-out application - then I can see how that might work, but in most edit workflows (imagine making a documentary or TV show) you will probably want everything on hand (for creative reasons).

Furthermore, many NLE editing applications (and at the very least their operators) file things without much attention to organisation, so it's often the case that you need access to a wide pool of media in order to get at the little you really need.

As such I fear this would not be an option...
Post by Gordan Bobic
Post by Alex Gardiner
PS. I have not been brave enough to use SATA disks with a SAS port expander due to reading a few horror stories. Am I still right to avoid this as much as possible?
My experience is that this is purely superstition and the problems arise between disks and expanders regardless of the type of disks. Unless you are paying a big multiple in price for the disks "certified" by the chassis manufacturer, results will vary wildly, even if you are using SAS disks. Some chassis even contain disk firmware images for the disks they recognize, and if the model number matches they'll blow away the firmware on the disk without any intervention to make it the version they expect (yes, this is eyewateringly dangerous, but you aren't expected to put anything but certified disks in them). We went through just about every make and model of SSDs when we were testing what was compatible with the chassis we used, and eventually the only thing that worked at the time was Kingstons (and they did work flawlessly) - everything else would crash the expander bus within seconds of a load test. This was a couple of years ago, so you would have to re-test with the current disks to make sure.
As I said here previously, by far the most trouble-free operation I have had to date has been with SATA disks and SATA PMPs. That generally "just works" once you find a combination that works well (for me it's Marvell 88SX7042 SATA controllers and SIL7326 5-port PMPs). Every SAS based solution I have tried has been quirky in its own peculiar way, some more livable with than others.
Thanks for the heads up on those, although I would probably want more than 4/5 drives and probably don’t have the space for multiple cards - I’ll take a look though.

To be more specific I wondered about using something like a Supermicro chassis (such as the 826TQ, which is a 12-drive 2U unit).

These have a built-in expander, so it is very tempting to go with an HBA like the LSI 9211-4i (running IT firmware) and simply populate the chassis with cheap SATA drives. Looking around there appear to be lots of people using this kind of setup, but the likes of the following really make me question if it's a route I want to look at:
http://garrett.damore.org/2010/08/why-sas-sata-is-not-such-great-idea.html

Maybe the message is just to go SAS and sleep easy?

Cheers,

Alex

Gordan Bobic
2015-01-02 12:12:35 UTC
Permalink
Post by Alex Gardiner
Thanks for the heads up on those, although I would probably want more than
4/5 drives and probably don’t have the space for multiple cards - I’ll take
a look though.
With a 5:1 PMP and a 4-port card you need 1 PCIe slot for 20 disks. You do
need somewhere to put the PMPs, of course, but I find that using thick
double-sided foam tape works very well for just sticking them to the side
of the chassis where they are out of the way and don't use up any slot
space.
Post by Alex Gardiner
To be more specific I wondered about using something like a supermicro
chassis (such as the 826TQ, which is a 12 drive 2U unit).
These have a built in expander, so it is very tempting to go with a HBA
like the LSI 9211-4i (running IT Firmware) and simply populate the chassis
with cheap SATA drives. Looking around there appear to be lots of people
using this kind of setup, but the likes of the following really make me
http://garrett.damore.org/2010/08/why-sas-sata-is-not-such-great-idea.html
This set off my "magazine educated 'expert'" sense quite early on.
There's a lot of hand-waving and nebulosity with very little real substance
in the article.

My basis for disagreement is that I have seen as many compatibility issues
between SAS drives and SAS chassis, especially big brand ones, as I have
between similar chassis and SATA drives. The problem arises from the
fact that such chassis tend to be tested with a much narrower selection of
drives. When the vendor finds a combination that works properly, that is
what they will sell, and charge you a huge premium for having tested it.
With anything else you are on your own. With more generic white-label
chassis/expanders there is generally more testing that takes place because
having a 25% retail return rate would be disastrous, so more
incompatibilities and bugs get caught during testing. With a big brand
vendor, incompatibility with cheap components works in their favour because
once you have bought a chassis at an inflated price, they want you to go
and buy disks from them at inflated prices, too.
Post by Alex Gardiner
Maybe the message is just to go SAS and sleep easy?
Sure, enjoy your placebo. :)

Alex Gardiner
2015-01-04 18:35:48 UTC
Permalink
With a 5:1 PMP and a 4-port card you need 1 PCIe slot for 20 disks. You do need somewhere to put the PMPs, of course, but I find that using thick double-sided foam tape works very well for just sticking them to the side of the chassis where they are out of the way and don't use up any slot space.
Actually this looks pretty cost effective (thanks for the hint), but also means a lot of extra cabling...

I guess I was attracted to the option of a chassis with a built in SAS/SATA expander (it seems neat), but was a little worried to read about others coming unstuck.
This set off my "magazine educated 'expert'" sense quite early on. There's a lot of hand-waving and nebulosity with very little real substance in the article.
My basis for disagreement is that I have seen as many compatibility issues between SAS drives and SAS chassis, especially big brand ones, as I have between similar chassis and SATA drives. The problem arises from the fact that such chassis tend to be tested with a much narrower selection of drives. When the vendor finds a combination that works properly, that is what they will sell, and charge you a huge premium for having tested it. With anything else you are on your own. With more generic white-label chassis/expanders there is generally more testing that takes place because having a 25% retail return rate would be disastrous, so more incompatibilities and bugs get caught during testing. With a big brand vendor, incompatibility with cheap components works in their favour because once you have bought a chassis at an inflated price, they want you to go and buy disks from them at inflated prices, too.
The main concern was to sound out what you guys thought of the situation - folks here are knowledgeable and level-headed on such things :-)

That said I've previously always used AHCI with my Gen8 MicroServer and never had an issue - it's obviously just a different story when you start looking at bigger solutions/more than a few drives.
Post by Alex Gardiner
Maybe the message is just to go SAS and sleep easy?
Sure, enjoy your placebo. :)
hehe, well I’d rather use SATA drives - they are somewhat cheaper and my budget for home/lab use is relatively modest.

Gordan Bobic
2015-01-05 09:45:10 UTC
Permalink
Post by Gordan Bobic
With a 5:1 PMP and a 4-port card you need 1 PCIe slot for 20 disks. You
do need somewhere to put the PMPs, of course, but I find that using thick
double-sided foam tape works very well for just sticking them to the side
of the chassis where they are out of the way and don't use up any slot
space.
Actually this looks pretty cost effective (thanks for the hint), but also
means a lot of extra cabling...
It is a little cheaper per port than, for example, a 2nd hand 3ware 16 or
24 port SAS card. The advantage, however, is that unlike on the 3ware card
all the usual tools like hdparm and smartctl "just work". On the 3ware card
you can just about manage the most basic of things that hdparm lets you do
using sdparm (useful for little more than enabling write cache on the
disks), and you need vendor provided utilities, often binary-only, which
also provide the necessary device nodes to make smartctl work with the disks
attached to the card. And replacing disks without rebooting is just as
problematic - you have to use 3ware utilities to rescan the buses, then
manually get udev to create the /dev/disk/* symlinks. And you never get
wwn-* nodes.

All this compared to a SATA+PMP solution that is cheaper to begin with and
on which all of the above "just works".
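To be clear, the sort of per-disk housekeeping I am talking about is nothing more exotic than this (device names are obviously just examples):

# Enable the on-disk write cache and check drive health directly - no vendor tools needed
hdparm -W 1 /dev/sdb
smartctl -a /dev/sdb
# Stable names for building the pool
ls -l /dev/disk/by-id/ | grep -v part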

Philip Seeger
2015-01-06 17:30:02 UTC
Permalink
I'm not streaming HD movies from my ZFS storage (just a few self-made
ones, which can be streamed over the local network just fine), nor do I
do any video editing - so this might be somewhat off-topic - but I
wanted to comment on the port multiplier issue (how to connect the drives).

I have a zpool with 10 drives, 5 drives per hot-swap enclosure, each
enclosure has a SATA port multiplier, so they're each connected to a
single eSATA port (using no more than 2 eSATA ports on the host
computer). Those are the kind of affordable (home-use) RAID enclosures,
but I've disabled their hardware RAID functionality, passing all the
drives through to the host, because I want ZFS to be able to fix errors
(I've tried the hardware RAID feature, good thing I had a backup when
the first drive started failing and the enclosure just went off).

This setup would work for many months, until one drive started
failing/misbehaving. In most cases, that caused read/write errors on all
5 drives in the affected enclosure, even though the remaining 4 drives
were actually fine. Of course, r/w errors on 5 drives (in a RAIDZ-3
pool) froze the pool to protect it from further damage. After a reboot,
everything was working again (ZFS could fix errors thanks to sufficient
redundancy) and after replacing the bad drive, the system would be
running for another year until the next drive would go bad.

It might be just a bad/cheap port multiplier in the enclosures that I'm
using, but at least in my case, I've come to the conclusion that I'd
rather avoid one (sort of a single point of failure) and try to connect
all the drives individually.
Lately, I've been experimenting with SAS cards (connecting the SATA
drives directly, which are in a different kind of enclosure with one
SATA connector for each drive). I've connected bad drives that I've
taken out of my old system previously (these drives caused crashes, as
described above), put the test system under load and I've not yet had
anything unexpected happen to me so far (no crashes). Single drives
(with smart errors, for example) would suddenly be listed as "UNAVAIL"
or "too many errors" (and positive error counts), but not affecting any
of the good drives. This is ultimately better, because a single bad
drive (which is failing, corrupting data) should obviously not crash a
RAID system (with enough redundancy).
smartctl works fine too. And the /dev/disk/by-id/ata* symlinks also have
the exact same names as with the port multiplier setup (they usually
contain the drive's serial, which makes identifying/replacing drives
really easy).

(Also, someone mentioned fixing the drives with foam tape, but I still
prefer hot-swap enclosures - just in this case, ones with one data port
per drive.)

So, again, maybe my old enclosures are just too cheap and this doesn't
happen with other port multipliers. But I wanted to mention that using
them can sometimes cause problems, compared to connecting the drives
individually.
Gordan Bobic
2015-01-06 18:25:23 UTC
Permalink
Post by Philip Seeger
This setup works for many months, until one drive starts
failing/misbehaving. In most cases, that caused read/write errors on all 5
drives in the affected enclosure, even though the remaining 4 drives were
actually fine. Of course, r/w errors on 5 drives (in a RAIDZ-3 pool) froze
the pool to protect it from further damage. After a reboot, everything was
working again (ZFS could fix errors thanks to sufficient redundancy) and
after replacing the bad drive, the system would be running for another year
until the next drive would go bad.
I have seen the same thing happen with SAS cards, particularly where
expanders are involved.

The key ingredients to preventing this kind of issue are:
1) Make sure your drives support TLER (all of mine do, and these days I
refuse to buy Seagate and WD)
2) Make sure your PMPs and SATA controllers support FIS based switching

In my experience most eSATA ports on motherboards are supplied by
controllers that do not support FIS based switching.
Post by Philip Seeger
It might be just a bad/cheap port multiplier in the enclosures that I'm
using, but at least in my case, I've come to the conclusion that I'd rather
avoid one (sort of a single point of failure) and try to connect all the
drives individually.
It could just as likely be a SATA controller that doesn't support FBS.
Post by Philip Seeger
Lately, I've been experimenting with SAS cards (connecting the SATA drives
directly, which are in a different kind of enclosure with one SATA
connector for each drive). I've connected bad drives that I've taken out of
my old system previously (these drives caused crashes, as described above),
put the test system under load and I've not yet had anything unexpected
happen to me so far (no crashes). Single drives (with smart errors, for
example) would suddenly be listed as "UNAVAIL" or "too many errors" (and
positive error counts), but not affecting any of the good drives. This is
eventually better, because a single bad drive (which is failing, corrupting
data), should obviously not crash a RAID system (with enough redundancy).
smartctl works fine too. And the /dev/disk/by-id/ata* symlinks also have
the exact same names as with the port multiplier setup (they usually
contain the drive's serial, which makes identifying/replacing drives really
easy).
Some SAS cards are good like that. Most are not. Cards with on-board caches
are generally the bad sort (I have a caching LSI card and a caching Adaptec
card that fit this description). But just because a card is a straight HBA
doesn't mean the kernel driver will act in a way that provides seamless
operation with smartctl and hdparm (3ware cards are a typical example of
this).

Also, if you can, try to avoid cards that are based on PCI-X ASICs with
PCI-X to PCIe bridges. Most SAS cards that aren't of the very latest
generation are like that, and this regularly plays havoc when you try to
use virtualization (e.g. I have such an LSI card and all the disks become
inaccessible as soon as I boot with intel_iommu=on).

I have never seen such issues with SATA+PMP solutions.
Post by Philip Seeger
(Also, someone mentioned fixing the drives with foam tape, but I still
prefer hot-swap enclosures - just in this case, ones with one data port per
drive.)
I mentioned double sided foam tape, but not for attaching drives - I was
referring to sticking PMPs to the side of the case so they aren't using up
slots.
Post by Philip Seeger
So, again, maybe my old enclosures are just too cheap and this doesn't
happen with other port multipliers. But I wanted to mention that it might
be a problem to use them sometimes, rather than connecting the drives
individually.
It's important to be very careful with anecdotal "evidence" like this.
Simply saying "SATA+PMP is problematic" without listing the exact SATA
controller and PMP chip models (and sometimes firmware versions are
relevant, too) is not particularly useful. As I explained above, I have a
stack of SAS cards, all of which are sub-optimal or otherwise problematic in
many ways.

Philip Seeger
2015-01-06 20:08:22 UTC
Permalink
Post by Gordan Bobic
1) Make sure your drives support TLER (all of mine do, and these days
I refuse to buy Seagate and WD)
2) Make sure your PMPs and SATA controllers support FIS based switching
In my experience most eSATA ports on motherboards are supplied by
controllers that do not support FIS based switching.
Thanks for these tips. TLER/SCT is indeed important and I've only
recently started using a script that sets the timeout value on the
drives that support it (smartctl -l scterc,70,70). My WD RED drives
support this, my Seagate drives don't. I've started replacing those
drives one by one.
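The script itself is nothing fancy - roughly the following (the 7-second value and the by-id glob are just what I happen to use; drives without SCT ERC support will simply report an error and keep their default behaviour):

#!/bin/sh
# Set SCT ERC (TLER-style) read/write timeouts to 7.0 seconds on all ATA drives
for d in /dev/disk/by-id/ata-*; do
    case "$d" in *-part*) continue ;; esac  # skip partition symlinks
    smartctl -l scterc,70,70 "$d"
done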
As for FIS-based switching - thanks for this important tip. I definitely
have to look into this.
Post by Gordan Bobic
Some SAS cards are good like that. Most are not. Cards with on-board
caches are generally the bad sort (I have a caching LSI card and a
caching Adaptec card that fit this description). But just because a
card is a straight HBA doesn't mean the kernel driver will act in a
way that provides seamless operation with smartctl and hdparm (3ware
cards are a typical example of this).
I'm currently testing these two cards: Supermicro AOC-SASLP-MV8 and LSI
9201-16i
Post by Gordan Bobic
It's important to be very careful with anecdotal "evidence" like this.
Simply saying "SATA+PMP" is problematic without listing the exact SATA
controller and PMP chip models (and sometimes firmware versions are
relevant, too) is not particularly useful. As I explained above, I
have a stack of SAS cards all of which are sub-optimal or otherwise
problematic in many ways.
You are right and I certainly don't want to tell people to avoid
"SATA+PMP" - I'm just experiencing this behavior in my specific setup.
The chip in the enclosure is a JMicron JMB394 (if I'm not mistaken).
It's connected to the onboard eSATA ports; the controller is a JMicron
JMB362 which, as it turns out, uses command-based switching. The
behavior might be different with a SATA controller that uses FIS-based
switching.



Gordan Bobic
2015-01-06 23:06:04 UTC
Permalink
Post by Gordan Bobic
1) Make sure your drives support TLER (all of mine do, and these days I
refuse to buy Seagate and WD)
2) Make sure your PMPs and SATA controllers support FIS based switching
In my experience most eSATA ports on motherboards are supplied by
controllers that do not support FIS based switching.
Thanks for these tips. TLER/SCT is indeed important and I've only recently
started using a script that sets the timeout value on the drives that
support it (smartctl -l scterc,70,70). My WD RED drives support this, my
Seagate drives don't. I've started replacing those drives one by one.
As for FIS-based switching - thanks for this important tip. I definitely
have to look into this.
FBS is important for two reasons:
1) Without it NCQ doesn't work, which means that as you add more drives the
performance scales inversely.
2) If one disk locks up on a command, no other disks will be accessible
until that times out.

Given that disks without TLER will hang the bus until they time out (no, I
am not talking about the SCSI timeout here, the TLER-crippled disks go away
hanging the SATA bus for potentially minutes at a time), you can probably
see how the combination of a lack of TLER and FBS will result in the entire
bus going away. Having just one is probably sufficient to avoid a complete
bus loss when a disk starts to go south, but having both is better.
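As a quick sanity check under Linux, the ahci driver prints its capability flags when it binds to the controller, and "fbs" shows up in that list on the controllers I have seen that support it - so something along these lines is usually enough to tell:

# Look for "fbs" among the AHCI capability flags reported at probe time
dmesg | grep -i ahci | grep -i flags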
Post by Gordan Bobic
Some SAS cards are good like that. Most are not. Cards with on-board
caches are generally the bad sort (I have a caching LSI card and a caching
Adaptec card that fit this description). But just because a card is a
straight HBA doesn't mean the kernel driver will act in a way that provides
seamless operation with smartctl and hdparm (3ware cards are a typical
example of this).
I'm currently testing these two cards: Supermicro AOC-SASLP-MV8 and LSI
9201-16i
I have neither of those two.
Post by Gordan Bobic
It's important to be very careful with anecdotal "evidence" like this.
Simply saying "SATA+PMP" is problematic without listing the exact SATA
controller and PMP chip models (and sometimes firmware versions are
relevant, too) is not particularly useful. As I explained above, I have a
stack of SAS cards all of which are sub-optimal or otherwise problematic in
many ways.
You are right and I certainly don't want to tell people to avoid
"SATA+PMP" - i'm just experiencing this behavior in my specific setup. The
chip in the enclosure is a JMicron JMB394 (if I'm not mistaken). It's
connected to the onboard esata ports, the controller is a JMicron JMB362
which, as it turns out, uses command-based switching. The behavior might be
different with a SATA controller that uses FIS-based switching.
Indeed. JMicron controllers don't support FBS, which is almost certainly a
part of the cause of the problem you are seeing.

Alex Gardiner
2015-01-06 22:18:27 UTC
Permalink
It's important to be very careful with anecdotal "evidence" like this. Simply saying "SATA+PMP is problematic" without listing the exact SATA controller and PMP chip models (and sometimes firmware versions are relevant, too) is not particularly useful. As I explained above, I have a stack of SAS cards, all of which are sub-optimal or otherwise problematic in many ways.
Gordan, could I ask which brand/model of SATA PMP do you prefer?
Liam Slusser
2015-01-06 22:37:00 UTC
Permalink
I have a ZFS server which we use to store audio. The audio is generally
stored raw and uncompressed, so some of the files are quite large -
100-200 MB in size. They are shared out to a server farm via NFS.
Currently our filer is 665T in size, using all Dell hardware. Dell and
Nexenta have some reference hardware specs, so I configured my servers
similar to what they recommended. So far, so good - performance is great.

Generally though it's all 4T SAS drives in Dell MD1200 enclosures plugged
into LSI 9207-8e cards.

thanks,
liam
Tamas Papp
2015-01-06 22:43:21 UTC
Permalink
Post by Liam Slusser
I have a ZFS server which we use to store audio. The audio is
generally stored in raw and uncompressed so some of the files are
quite large 100-200megs in size. They are shared out to a server farm
via NFS. Currently our filer is 665T in size using all Dell
hardware. Dell and Nexenta have some reference hardware specs so I
configured my servers similar to what they recommended. So far, so
good, performance is great.
Generally though it's all 4T SAS drives in Dell MD1200 enclosures
plugged into LSI 9207-8e cards.
Is it one single server with many disks?
May I ask, how many clients?

10x
tamas

Liam Slusser
2015-01-06 23:11:01 UTC
Permalink
Tamas -

Yes, it's a single server. The server is a 2u Dell r720xd with 5 LSI
9207-8e cards, 20 Dell MD1200 all connected via SAS, and each MD1200 has 12
disks. So 240 disks. Most are 4T SAS drives however some of the earlier
drives are 3T SAS. The actual r720 server has two 250g boot disks
(mirrored) and the front 12 disks have been removed and replaced with 2 x
Samsung 840 PRO 256g for SLOG and 2 x Samsung 840 PRO 512g for l2arc cache.

It's one large zpool array. Each MD1200 is a 12 disk raidz2. Current
volume size is 665T and is at 91% capacity - I need to order more drives
this month. :-)

It's connected to a server farm of about 10 servers that handle audio
transcoding. We ingest about 20T of audio per month - so we write about
20T and read that 20T back and write it back elsewhere to another storage
system in many different codecs. It's basically write once read once -
although sometimes we go back and transcode audio into another codec so
we're reread everything.

We have an identical ZFS server (same number of drives) that we replicate
(zfs send | zfs recv) to every few minutes, which we use as a backup in case
something bad happens to our master.
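The replication itself is just the usual incremental snapshot dance, conceptually something like the following (names made up for illustration - in practice it is wrapped in a script with locking and error handling):

# On the master: snapshot, then send only the delta since the previous snapshot
zfs snapshot tank@2015-01-06_2300
zfs send -i tank@2015-01-06_2250 tank@2015-01-06_2300 | \
    ssh backup-host zfs recv -F tank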

thanks,
liam
Post by Tamas Papp
Is it one single server with many disks?
May I ask, how many clients?
10x
tamas
Tamas Papp
2015-01-07 08:11:01 UTC
Permalink
Post by Liam Slusser
Tamas -
Yes, it's a single server. The server is a 2u Dell r720xd with 5 LSI
9207-8e cards, 20 Dell MD1200 all connected via SAS, and each MD1200
has 12 disks. So 240 disks. Most are 4T SAS drives however some of
the earlier drives are 3T SAS. The actual r720 server has two 250g
boot disks (mirrored) and the front 12 disks have been removed and
replaced with 2 x Samsung 840 PRO 256g for SLOG and 2 x Samsung 840
PRO 512g for l2arc cache.
It's one large zpool array. Each MD1200 is a 12 disk raidz2. Current
volume size is 665T and is at 91% capacity - I need to order more
drives this month. :-)
It's connected to a server farm of about 10 servers that handle audio
transcoding. We ingest about 20T of audio per month - so we write
about 20T and read that 20T back and write it back elsewhere to
another storage system in many different codecs. It's basically write
once read once - although sometimes we go back and transcode audio
into another codec so we're reread everything.
We have an identical ZFS server (same number of drives) that we
replicate (zfs send | zfs recv) to every few minutes that we use as
backup incase something bad happens to our master.
How much RAM do you have? What properties did you set?
Your workload is completely different from ours, though. There are about
20 clients (Linux, Windows, Mac). It's a VFX company and they are
working online on the files.
Currently it's a glusterfs cluster with 5 servers, but there is always
something wrong, and I'm wondering how I could replace it with a more
reliable solution.

Thanks for the info!

Cheers,
tamas

Liam Slusser
2015-01-07 09:10:40 UTC
Permalink
We have 64g of ram, which is probably hugely overkill. Everything is
random access, caching doesn't really help our workload any. Our large ZFS
server replaced a glusterfs cluster with 4 servers - which, like you have,
there was always something wrong. We had XFS file corruption issues,
hugely long rebuild times, split-brain issues constantly, and on top of all
that performance issues. Once we got everything on ZFS and got the bugs
out, it just works.

We're running OpenIndiana 151a8 (I've been meaning to upgrade to OmniOS,
but you know there is always something else to do).

Here are our zfs settings:

set zfs:zfs_arc_max=51539607552
set zfs:zfs_arc_meta_limit=34359738368
set zfs:zfs_prefetch_disable=1
set zfs:zfs_txg_synctime_ms=15000
set sd:sd_io_time=5
set zfs:zfs_resilver_delay = 0

The txg_synctime was the silver bullet that fixed our NFS performance
problems. It took us a few days with dtrace watching NFS usage to
figure that one out.

Beyond that, the rest of the settings are pretty out of the box.

Let me know if you'd like any other info.

thanks,
liam
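For the ZoL users on the list, the rough equivalents would be module options rather than /etc/system entries - we are on OpenIndiana, so take this mapping as an approximation to verify (names and units differ, and a couple of the knobs have no direct counterpart):

# /etc/modprobe.d/zfs.conf - approximate ZoL equivalents (values in bytes)
options zfs zfs_arc_max=51539607552
options zfs zfs_arc_meta_limit=34359738368
options zfs zfs_prefetch_disable=1
# no direct zfs_txg_synctime_ms on ZoL; zfs_txg_timeout (seconds) is the nearest knob
options zfs zfs_txg_timeout=15
# sd_io_time maps loosely onto the per-device SCSI timeout:
#   echo 5 > /sys/block/sdX/device/timeout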
Tamas Papp
2015-01-07 09:18:17 UTC
Permalink
Post by Liam Slusser
We have 64g of ram, which is probably hugely overkill. Everything is
random access, caching doesn't really help our workload any. Our
large ZFS server replaced a glusterfs cluster with 4 servers - which
like you have, there was always something wrong. We had XFS file
corruption issues, hugely long rebuild times, split-brain issues
constantly, and on top of all that performance issues. Once we got
everything on ZFS and got the bugs out, it just works.
We're running OpenIndiana 151a8 (I've been meaning to upgrade to
OmniOS, but you know there is always something else to do)
Here are our zfs settings
set zfs:zfs_arc_max=51539607552
set zfs:zfs_arc_meta_limit=34359738368
set zfs:zfs_prefetch_disable=1
set zfs:zfs_txg_synctime_ms=15000
set sd:sd_io_time=5
set zfs:zfs_resilver_delay = 0
The txg_synctime was the silver bullet that fixed our NFS performance
problems. It took us a few days with dtrace watching NFS usage to
figure that one out.
Beyond that, the rest of the settings are pretty out of the box.
Let me know if you'd like any other info.
It's very helpful, even though it's not a ZoL setup.

Thanks very much!

tamas

Alex Gardiner
2015-01-07 09:32:25 UTC
Permalink
Liam thanks for sharing your experience. It sounds like you (and a few others on
this thread) are using ZFS in a much larger post-production context than myself.
I will only have a handful of editors running online DNxHD 120-185.
Post by Liam Slusser
We have 64g of ram, which is probably hugely overkill. Everything is
random access, caching doesn't really help our workload any.
Beyond that, the rest of the settings are pretty out of the box.
I'm especially interested to read these two remarks - aside from some minor
tweaks, it sounds like you are letting ZFS take care of things.

In my use case we do not use NFS, but do have lots of SMB and AFP shares.

Furthermore most of the access will be async, so I now expect (after some help
from all here) that caching devices would not help with the streaming. However,
I did wonder about implementing them to catch things such as media databases,
which Avid MC maintains quite heavily, and also project files/bins. The
latter have their permissions checked quite regularly in our setup; it's just the
way the storage software works. As such I think that SLOG/L2ARC might help us
gain IOPS that should hopefully avoid situations where we have to interrupt
streaming to service these smaller files.

One thing I am slightly unclear on is if SMB/AFP are always async - but I can no
doubt find that information somewhere else :-)
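My working assumption (happy to be corrected) is that on the ZFS side the sync property is the main control over how synchronous requests are honoured, and on the Samba side it comes down to whether client flush requests are passed through at all - roughly:

# Per-dataset control of synchronous write handling (dataset name is just an example)
zfs get sync tank/media
zfs set sync=standard tank/media   # default: honour client sync/flush requests

# smb.conf: whether client flush requests reach the filesystem at all
#   strict sync = no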

At the end of the day I think the most important features of video edit storage
are:

- streaming is ideally not interrupted by other demands on the file system
(dropped frames really annoy editors)
- there is minimal latency when starting playback in the NLE. I believe this may
be why companies like HP tend to bias cache towards write, instead of read - but
our specific use case is seldom examined online.

Best,

Alex
--
Indiestor Director/System Engineer
Open Source Avid Project Sharing
Mobile: +447961751453

Liam Slusser
2015-01-07 11:12:56 UTC
Permalink
Alex - Yep, ZFS is a great filesystem right out of the box. You will
notice that SMB on ZFS is very different than NFS on ZFS from a performance
standpoint. ZFS and NFS can be tricky because NFS on every write will ask
for a flush of the ZFS ZIL, while SMB won't. We saw this same behavior.
When we were migrating from our old storage to our new ZFS storage, I was
copying data over the network at around 3gigabit/sec. During this copy
time I was doing NFS testing to make sure permissions etc were all
correct. I noticed that whenever I saved a file via NFS, even touching a
zero-byte file on the filesystem, it took upwards of 5+ seconds for the
operation to complete, while doing the same thing from a local command
prompt was almost instantaneous. Richard Elling wrote an excellent tool
called zilstat which showed exactly what I thought was happening - whenever
an NFS flush operation happened the ZIL would dump a whole bunch of data.
This is where a super fast SLOG helps.

Do search around the interweb on ZFS ZIL interactions with SMB/NFS etc.
It's good knowledge to have.
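For completeness, adding a fast mirrored SLOG after the fact is a single command - device names below are made up, but the shape of the command is the important bit:

# Attach a mirrored pair of SSDs as a dedicated log (SLOG) device
zpool add tank log mirror \
    /dev/disk/by-id/ata-Samsung_SSD_840_PRO_AAAA \
    /dev/disk/by-id/ata-Samsung_SSD_840_PRO_BBBB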

Another interesting option that might help you is the Logbias ZFS option.
Basically the default, logbias=latency, is to use a separate ZIL, like a
SLOG device, which improves latency for synchronous writes for the
application. If you set logbias=throughput, then no separate ZIL log is
used. Instead, ZFS will allocate intent log blocks from the main pool,
writing data immediately and spreading the load across the pool's devices.
This improves bandwidth at the expense of latency for synchronous writes.
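It is a per-dataset property, so you can experiment on one filesystem without touching the rest of the pool (dataset name hypothetical):

# Trade synchronous-write latency for throughput on one dataset only
zfs set logbias=throughput tank/scratch
zfs get logbias tank/scratch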

thanks,
liam
Post by Alex Gardiner
Liam, thanks for sharing your experience. It sounds like you (and a few
others on this thread) are using ZFS in a much larger post-production
context than myself. I will only have a handful of editors running online
DNxHD 120-185.
Post by Liam Slusser
We have 64g of ram, which is probably hugely overkill. Everything is
random access, caching doesn't really help our workload any.
Beyond that, the rest of the settings are pretty out of the box.
I'm especially interested to read these two remarks - aside from some minor
tweaks, it sounds like you are letting ZFS take care of things.
In my use case we do not use NFS, but we do have lots of SMB and AFP shares.
Furthermore, most of the access will be async, so I now expect (after some
help from all here) that caching devices would not help with the streaming.
However, I did wonder about implementing them to catch things such as media
databases, which can be updated quite a lot (Avid MC), and also project
files/bins. The latter have their permissions checked quite regularly in
our setup; it's just the way the storage software works. As such, I think
that SLOG/L2ARC might help us gain IOPS and hopefully avoid situations
where we have to interrupt streaming to service these smaller files.
One thing I am slightly unclear on is whether SMB/AFP are always async -
but I can no doubt find that information somewhere else :-)
At the end of the day I think the most important features of video edit
storage are:
- streaming is ideally not interrupted by other demands on the file system
(dropped frames really annoy editors)
- there is minimal latency when starting playback in the NLE. I believe
this may be why companies like HP tend to bias cache towards write instead
of read - but our specific use case is seldom examined online.
Best,
Alex
Post by Liam Slusser
We have 64g of ram, which is probably hugely overkill. Everything is
random access, caching doesn't really help our workload any. Our large ZFS
server replaced a glusterfs cluster with 4 servers - which, like yours,
always had something wrong with it. We had XFS file corruption issues,
hugely long rebuild times, split-brain issues constantly, and on top of all
that performance issues. Once we got everything on ZFS and got the bugs
out, it just works.
We're running OpenIndiana 151a8 (I've been meaning to upgrade to OmniOS,
but you know there is always something else to do).
Here are our zfs settings
set zfs:zfs_arc_max=51539607552
set zfs:zfs_arc_meta_limit=34359738368
set zfs:zfs_prefetch_disable=1
set zfs:zfs_txg_synctime_ms=15000
set sd:sd_io_time=5
set zfs:zfs_resilver_delay = 0
The txg_synctime was the silver bullet that fixed our NFS performance
problems. It took us a few days with dtrace watching NFS usage to figure
that one out.
Beyond that, the rest of the settings are pretty out of the box.
Let me know if you'd like any other info.
thanks,
liam
Post by Tamas Papp
Post by Liam Slusser
Tamas -
Yes, it's a single server. The server is a 2u Dell r720xd with 5 LSI
9207-8e cards, 20 Dell MD1200 all connected via SAS, and each MD1200 has
12 disks. So 240 disks. Most are 4T SAS drives, however some of the
earlier drives are 3T SAS. The actual r720 server has two 250g boot disks
(mirrored), and the front 12 disks have been removed and replaced with
2 x Samsung 840 PRO 256g for SLOG and 2 x Samsung 840 PRO 512g for l2arc
cache.
It's one large zpool array. Each MD1200 is a 12 disk raidz2. Current
volume size is 665T and is at 91% capacity - I need to order more drives
this month. :-)
It's connected to a server farm of about 10 servers that handle audio
transcoding. We ingest about 20T of audio per month - so we write about
20T and read that 20T back and write it back elsewhere to another storage
system in many different codecs. It's basically write once, read once -
although sometimes we go back and transcode audio into another codec, so
we reread everything.
We have an identical ZFS server (same number of drives) that we replicate
(zfs send | zfs recv) to every few minutes, which we use as a backup in
case something bad happens to our master.
Post by Tamas Papp
How much ram do you have? What properties did you set?
Your workload is completely different from ours, though. There are about
~20 clients (linux, windows, mac). It's a vfx company and they are working
online on the files.
Currently it's a glusterfs cluster with 5 servers, but there is always
something wrong and I'm wondering how I could replace it with a more
reliable solution.
Thanks for the info!
Cheers,
tamas
Gordan Bobic
2015-01-07 16:13:44 UTC
Permalink
From what is being described, it seems that NFS with the async export
option would be a more like-for-like comparison to SMB. Is that not the case?
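For reference, a minimal sketch of such an export on the Linux side (the
path and client network below are placeholders):

# /etc/exports
/tank/video  192.168.1.0/24(rw,async,no_subtree_check)

# re-export without restarting nfsd
exportfs -ra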
Alex Gardiner
2015-01-09 08:22:40 UTC
Permalink
Hello guys,

just to briefly touch on this again…
Liam, thanks again for those settings - I’m giving them a test on my little test system.

In line with this discussion, the following Samba options look to be of
interest:

aio read size = 16384   # use asynchronous I/O for reads bigger than a 16KB request size
aio write size = 16384  # use asynchronous I/O for writes bigger than a 16KB request size

Samba has to be compiled with AIO support, which I checked with smbd -b
(Debian Wheezy). It's worth noting I configure smb.conf directly rather
than offering shares via ZFS's sharesmb property (yet to try this).
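For context, a minimal share definition using those options might look
something like the following (share name and path are placeholders):

[video]
    path = /tank/video
    read only = no
    aio read size = 16384
    aio write size = 16384

Running testparm -s afterwards is a quick way to confirm the installed
build recognises the options.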

NB: I note that Samba4 performs far better with recent OSX clients.

I have no significant data from ZFS testing yet, but hopefully soon.
Post by r***@gmail.com
This happens when you hit the old write throttle. Also, on an
OmniOS/Solaris/illumos box with other OOB settings, each throttle will cost
10ms, which really kills streaming performance. I'm pretty sure we'll see
an end to tuning zfs_txg_synctime_ms in modern systems and the new write
throttle ;-)
Richard, thanks for expanding on this one again - it kind of went over my head the first time.

I’ve spoken to a couple of folks who seem to also recommend this tweak in a streaming environment.

I also wonder about tuning zfs_vdev_max_pending. Reports seem to suggest using a value of “1” helps with SATA disks (which I have).
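For anyone wanting to experiment with it, a sketch of how such a module
tunable would be set on ZFS on Linux (assuming a release old enough to
still expose zfs_vdev_max_pending):

# /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_max_pending=1

# or at runtime, if the parameter exists on your build
echo 1 > /sys/module/zfs/parameters/zfs_vdev_max_pending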
r***@gmail.com
2015-01-10 05:19:34 UTC
Permalink
Post by Alex Gardiner
I also wonder about tuning zfs_vdev_max_pending. Reports seem to suggest
using a value of “1” helps with SATA disks (which I have).
I've found a range of 2-4 works well for most modern SATA HDDs. That said,
this tunable is gone in favor of a rewritten I/O scheduler (part of the new
write throttle project). So you shouldn't need to adjust this tunable from
0.6.3-stable onward.
-- richard
Gordan Bobic
2015-01-07 16:11:57 UTC
Permalink
Just out of interest, is increasing the txg_synctime_ms preferable to just
running nfsd with the async export option?
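For what it's worth, a quick way to check which behaviour an existing
export is actually using (standard nfs-utils tooling, nothing ZFS-specific):

# lists each export with its effective options, including sync/async
exportfs -v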
r***@gmail.com
2015-01-08 04:44:14 UTC
Permalink
Thanks for sharing, Liam!
Post by Liam Slusser
set zfs:zfs_txg_synctime_ms=15000
The txg_synctime was the silver bullet that fixed our NFS performance
problems. It took us a few days with dtrace watching NFS usage to figure
that one out.
This happens when you hit the old write throttle. Also, on an
OmniOS/Solaris/illumos box with other OOB settings, each throttle will cost
10ms, which really kills streaming performance.
I'm pretty sure we'll see an end to tuning zfs_txg_synctime_ms in modern
systems and the new write throttle ;-)
-- richard
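For anyone curious what replaces it, on ZFS on Linux 0.6.3+ the new write
throttle is driven by module parameters rather than txg sync time; a quick,
hedged way to inspect a few of them (parameter names taken from the OpenZFS
write throttle work and may vary by platform and release):

# print current dirty-data limits and delay behaviour
grep . /sys/module/zfs/parameters/zfs_dirty_data_max \
       /sys/module/zfs/parameters/zfs_delay_min_dirty_percent \
       /sys/module/zfs/parameters/zfs_delay_scale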
Gordan Bobic
2015-01-06 23:07:52 UTC
Permalink
Post by Gordan Bobic
Post by Gordan Bobic
It's important to be very careful with anecdotal "evidence" like this.
Simply saying "SATA+PMP" without listing the exact SATA controller and PMP
chip models (and sometimes firmware versions are relevant, too) is not
particularly useful. As I explained above, I have a stack of SAS cards, all
of which are sub-optimal or otherwise problematic in many ways.
Gordan, could I ask which brand/model of SATA PMP do you prefer?
Anything based on SIL3726 chips has worked very well for me so far.

a***@whisperpc.com
2015-01-06 22:37:06 UTC
Permalink
Gordan,
Post by Gordan Bobic
1) Make sure your drives support TLER (all of mine do, and these days I
refuse to buy Seagate and WD)
I've been using Seagate "NS" series (Nearline SATA) drives for years, and
had very few problems. Which Seagate drives have you had problems with?

Thank you.

Peter Ashford

Gordan Bobic
2015-01-06 23:24:31 UTC
Permalink
Post by a***@whisperpc.com
Gordan,
Post by Gordan Bobic
1) Make sure your drives support TLER (all of mine do, and these days I
refuse to buy Seagate and WD)
I've been using Seagate "NS" series (Nearline SATA) drives for years, and
had very few problems. Which Seagate drives have you had problems with?
Off the top of my head at least 4 different models of Barracuda 7200.11,
7200.12 bought over the past 6 years, all 1TB in size. I have had a failure
rate on them well in excess of 100% within the warranty period (i.e. for n
drives bought m drives replaced, where m > n). While a few have outlived
their warranty period, others were replaced multiple times (i.e.
replacements had replacements, and often those replacements had
replacements).

And the issue is not vibration, chassis or controller related since I have
HGST and Samsung drives of the same rpm randomly mixed in, and I have had 0
of those fail over the same period, with similar numbers of units of each
make deployed.

As a consequence I can say that Seagate's warranty service is extremely
good. I've never had the turnaround take more than 4 days.

Samsung's was good 6-7 years ago when it was handled by Rexo in UK, they
had some dreadfully bad models back then, I think I got as far as 800%
failure rate under warranty with the 500GB Samsungs back then. Not had to
try them since.

I haven't seen a dead IBM/Hitachi/HGST disk since 2002, and that was a single
DOA drive, promptly exchanged by the retailer. I recently gave away all of
my 125GB IBM/Hitachi IDEs from back in 2002, still in fully operational
order, and they were used pretty much 24/7 since then. They outlived 3 PSUs
and two motherboards in that server.

Luke Olson
2015-01-05 15:03:46 UTC
Permalink
I've had a similar positive experience with ZFS on Linux used in a video
production environment. I'm using it at a television station so most of the
videos are lower bit rate like 40 Mbps (XDCAM, H.264, etc.) but there are
quite a few editors.

http://whenpicsfly.com/getting-to-know-zfs/

In my experience all of the defaults seem to be working great. The pool was
created with 4K sector alignment (ashift=12) and LZ4 compression is enabled
on the file systems. The SMB and NFS exports have sync writes disabled.
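For anyone wanting to start from the same point, a minimal sketch (pool,
device and dataset names below are placeholders):

zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf sdg sdh
zfs set compression=lz4 tank
zfs create tank/video
# equivalent of "sync writes disabled" on the shared dataset; trades the
# last few seconds of acknowledged writes on power loss for lower latency
zfs set sync=disabled tank/video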

Luke