Discussion:
[zfs-discuss] First pool/vdev HGST drives/ (newbie) questions
Fabio Zetaeffesse
2015-02-23 11:05:44 UTC
Permalink
Hi there,

I've just subscribed and have read many posts on this list because I'd like
to build my own pool/vdev. I'm still learning, but I'd like to get my hands
on ZFS.
That's why I'd like to go for a RAIDZ2 with 5 units, and I'd like to go for
HGST disks after having read the Backblaze report (see my question 2). I
prefer reliability over performance; after all, ZFS was born with that in
mind, no?

My pool won't have to perform much, and I'd rather go for the quietest and
least power-hungry drives. Basically, what people would normally call THE
BACKUP pool will, for me, be THE pool :)

I have some questions; I apologize in advance if most of them look
newbie-level :) :

1) Do I really need SSDs for the cache/log? Can I share SSDs for both? I
see some people practice this. Looking at some models, I read about
compressible (ATTO, used by Kingston for instance) vs. non-compressible
(AS-SSD and CrystalDiskMark, again Kingston's terms) transfer rates. While
the first is very high (400-450 MB/s), the second is low, even lower than a
hard disk. I'm a bit confused :-) because I know SSDs are faster than HDDs.
Which parameters do you take into account when choosing an SSD?

2) Though outdated, I found the HDS722020ALA330 for 80€ each. Its claimed
transfer rate is 200 MB/s, whilst the transfer rate of newer disks with a
6 Gbit/s interface and 4TB or 6TB capacity is just a bit higher (230 MB/s,
and their price is double). So I won't benefit from buying models with
6 Gbit/s. Am I correct?

3) While transferring a big file (say 40-50 GB), might the ZFS cache be a
limit? I mean, the transfer rate towards the cache is the same as towards
the disk, but via the cache I have two steps and I double the flows through
the controller, am I right? If I'm right, is there a way to bypass the
cache for certain transfers?

4) If I have to replace a disk (from what I've read, detecting such a
faulty disk is not as easy as it might seem), I'll have to replace it and
resilver, but what if I have to do the same for a cache/log? Normally, for
the cache, I'd choose a mirrored setup.

Thanks,

Alex

Fajar A. Nugraha
2015-02-23 12:04:04 UTC
Permalink
Post by Fabio Zetaeffesse
1) do I really need SSDs for the cache/log?
No
Post by Fabio Zetaeffesse
Can I share SSDs for both?
Yes
Post by Fabio Zetaeffesse
Which parameters do you take into
account when choosing a SSD?
In order of importance: persistence (e.g. doesn't lie about "sync", or has
supercaps), reliability, price, and then speed
Post by Fabio Zetaeffesse
3) While transferring a big file (say 40-50Gb) might the ZFS cache be a
limit? I mean the transfer rate toward the cache is as the same as towards
the disk but via the cache I have two steps and I double the flows through
the controller, am I right?
No
Post by Fabio Zetaeffesse
If I'm right is there a way to bypass the cache
for certain transfers?
IIRC by default zfs should bypass cache for sequential reads
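If you want to force the issue, caching can also be tuned per dataset. A
minimal sketch (the dataset name "tank/media" is just an example):

# zfs get primarycache,secondarycache tank/media
# zfs set secondarycache=none tank/media
# zfs set primarycache=metadata tank/media

The first setting keeps that dataset out of L2ARC entirely; the second
caches only its metadata (not file data) in ARC.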
Post by Fabio Zetaeffesse
4) if I have to replace a disk (from what I've read it's not as easy as it
might seem detecting such faulty disk) I'll have to replace it and resilver
it, but what if I have to do the same for a cache/log? Normally for the
cache I'd choose for the mirror model.
Apparently you don't even need a mirror for cache and slog. The pool can
function normally without data loss as long as slog failure and power
failure do NOT occur at the SAME time. I don't have the link handy
right now; IIRC it was Delphix who did just that.
--
Fajar

Gordan Bobic
2015-02-23 12:06:32 UTC
Permalink
Hang on - SLOG takes the sync writes. Those writes no longer exist in RAM
once they are flushed to SLOG. If it turns out that data has gotten
corrupted (SLOG failure) when the commit thread reaches it, then surely
those transactions will get lost, will they not?
Post by Fajar A. Nugraha
Post by Fabio Zetaeffesse
1) do I really need SSDs for the cache/log?
No
Post by Fabio Zetaeffesse
Can I share SSDs for both?
Yes
Post by Fabio Zetaeffesse
Which parameters do you take into
account when choosing a SSD?
In order of importance: persistence (e.g. doesn't lie about "sync", or
have supercaps), reliable, price, and then speed
Post by Fabio Zetaeffesse
3) While transferring a big file (say 40-50Gb) might the ZFS cache be a
limit? I mean the transfer rate toward the cache is as the same as towards
the disk but via the cache I have two steps and I double the flows through
the controller, am I right?
No
Post by Fabio Zetaeffesse
If I'm right is there a way to bypass the cache
for certain transfers?
IIRC by default zfs should bypass cache for sequential reads
Post by Fabio Zetaeffesse
4) if I have to replace a disk (from what I've read it's not as easy as it
might seem detecting such faulty disk) I'll have to replace it and resilver
it, but what if I have to do the same for a cache/log? Normally for the
cache I'd choose for the mirror model.
Apparently you don't even need mirror for cache and slog. The pool can
function normally without data loss as long as slog failure and power
failure does NOT occur at the SAME time. I don't have the link handy
right now, IIRC it was Delphix who did just that.
--
Fajar
Uncle Stoat
2015-02-23 12:17:58 UTC
Permalink
Post by Gordan Bobic
Hang on - SLOG takes the sync writes. Those writes no longer exist in
RAM once they are flushed to SLOG.
Incorrect. They're only flushed from RAM (and SLOG) after the pending
data is committed to the main array.

The SLOG is a circular, write-always/read-never buffer area which is only
ever read from in the event of a restart.

Remember: ZFS is paranoid.
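You can watch that behaviour for yourself; a sketch (pool name "tank" is
just an example):

# zpool iostat -v tank 5

Under normal operation the log device in that output shows a steady
trickle of writes and essentially zero reads - it only gets replayed after
an unclean shutdown.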


Gordan Bobic
2015-02-23 12:47:58 UTC
Permalink
Interesting. Thanks for explaining that, I thought it was more like a
write-back cache for sync writes.
Post by Gordan Bobic
Hang on - SLOG takes the sync writes. Those writes no longer exist in
RAM once they are flushed to SLOG.
Incorrect. They're only flushed from ram (and SLOG) after the pending data
is committed to the main array.
The SLOG is a circular write-always,read-never buffer area which is only
ever read from in the event of a restart.
Remember: ZFS is paranoid.
r***@gmail.com
2015-02-24 04:31:37 UTC
Permalink
Post by Gordan Bobic
Hang on - SLOG takes the sync writes. Those writes no longer exist in RAM
once they are flushed to SLOG. If it turns out that data has gotten
corrupted (SLOG failure) when the commit thread reaches it, then surely
those transactions will get lost, will they not?
Not correct. ZIL is an Intent Log. http://en.wikipedia.org/wiki/Intent_log
All of the data written to ZIL (either pool or in separate logs) is also in
ARC.
Transaction group commit to the pool takes the data from ARC. Separate
logs are rarely read (they exist primarily for crash recovery of data --
when you import a pool that was not cleanly exported).
-- richard

Uncle Stoat
2015-02-23 12:13:22 UTC
Permalink
Post by Fabio Zetaeffesse
That's why I'd like to go for a RAIDZ2 with 5 units
The best geometries are powers of 2 for the data, plus redundancy:

i.e. (2, 4, 8, 16 or 32) data disks + (1, 2 or 3) parity disks

Other layouts will work, but have lower performance. In some cases
that's substantially lower performance (I saw a ~35% speedup by moving
to the "right" geometry)
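For example, a 4-data-disk RAIDZ2 following that rule would be built from
6 drives; a rough sketch (the device names are placeholders):

# zpool create tank raidz2 \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 \
    /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6

A 5-drive RAIDZ2 (3 data + 2 parity) will still work, just less
efficiently.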
Post by Fabio Zetaeffesse
1) do I really need SSDs for the cache/log?
No, but it helps if you're reading the same areas repeatedly (little
advantage if you're mostly doing sequential reads and asynchronous writes)
Post by Fabio Zetaeffesse
Can I share SSDs for both?
Yes, but you're increasing the write requirements (see below)
Post by Fabio Zetaeffesse
I
see some practice this. Looking at some models I read compressible (ATTO
used by Kingston for instance) vs. non-compressible (S-SSD and
CrystalDiskMark yet Kingston's terms) transfer rate. whilst the first is
very high (400-450 Mb/s) the second is low even lower of a hard disk.
Cheap SSDs have poor write rates and will die quickly in any case. Don't
buy them. You will regret it.

Even though they're cache drives and if they die ZFS will simply drop
them, it's still not worth the hassle.
Post by Fabio Zetaeffesse
I'm a bit confused :-) 'cause I know SSD are faster then HDs.
You "know" wrong. Seriously. Some SSDs are utter garbage.

Cheap ones often read 40-50 times faster than a HDD, but have write speed
slower than a HDD (I have some PATA laptop/industrial SSDs which only
read at 5 times HDD speed, but their seek-equivalent speed is 0.1ms vs
14ms, so they're generally faster under normal operations - however their
write speed is _very_ slow)

This has changed a lot in the last few years, but the best advice is to
use a good quality drive such as a Samsung 840 Pro or better (avoid
850 Pros - according to various reports they have a tendency to lock up -
and EVOs, which are likely to die quickly), or if you have the money, HGST
S840 enterprise SAS SSDs.
Post by Fabio Zetaeffesse
Which
parameters do you take into account when choosing a SSD?
I/O performance, longevity, write speed (IOPS), warranty.
Post by Fabio Zetaeffesse
2) though outdated I found HDS722020ALA330 for 80€ each.Its claimed
transfer rate is 200 Mb/s whilst the transfer rate of newer disks with
6Gbits/s interface and 4TB or 6TB capacity is just a bit higher (230
Mb/sec and their price is double) So I won't benefit in buying models
with 6Gbit/sec. Am I correct?
Quoted speed for spinning media is _burst_ speed. 7200rpm drives won't
exceed ~120MB/sec sequential or about 120 IOPS random.

Warranty is important - there's a reason Seagate and WDC have dropped to
12 months on most of their range. Make sure these drives are fully
warrantied - the HDS722020ALA330s I see online only have 12 months on
them, which isn't encouraging.

Bear in mind that 7k2000s are noisy and (relatively) hot - which is
important if they're going into an office or residential space.
Post by Fabio Zetaeffesse
3) While transferring a big file (say 40-50Gb) might the ZFS cache be a
limit?
No, the drives will be your issue. Write speed per-vdev is limited to
that of a single drive (at best). Read speed is that of the data drives
- both of these can easily be badly compromised by poor choice of
array geometry.
Post by Fabio Zetaeffesse
I mean the transfer rate toward the cache is as the same as
towards the disk but via the cache I have two steps and I double the
flows through the controller, am I right? If I'm right is there a way to
bypass the cache for certain transfers?
You misunderstand how ZFS cache works.

The "write cache" isn't a write cache - it's a write intent log and only
used if the power goes off before data is committed to the disk from
RAM. Under normal circumstances it's a circular write-only buffer.

The "read cache" is filled in parallel with disk reads and only matters
if you're hitting the same data repeatedly. Sequential reads (such as
large file xfers) aren't put into the cache by default.

In any case, unless you're using slow controllers or a mountain of SAS
expanders, the bus will vastly outrun your spinning media. Don't use SATA
expanders. You _will_ regret it.
Post by Fabio Zetaeffesse
4) if I have to replace a disk (from what I've read it's not as easy as
it might seem detecting such faulty disk) I'll have to replace it and
resilver it
It's trivial to detect and replace a faulty drive, although this isn't
automated inside ZFS but can be done using a bit of shell logic. There
are good arguments not to immediately pick up a hot spare when a disk
fails - as others have noted an incremental backup is a good move before
starting resilvering.


It's a little trickier to force a drive which has developed a few bad
sectors to swap them out, but even then ZFS will handle that kind of
loss and try to repair the loss. (To force the sectors out by hand, take
drive offline, repair sectors, bring drive back online, scrub.)
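The replacement itself is only a handful of commands; a sketch with
placeholder pool/device names:

# zpool status -x                               (shows which pool/device is unhealthy)
# zpool offline tank ata-OLD_DISK
# zpool replace tank ata-OLD_DISK ata-NEW_DISK
# zpool status tank                             (watch the resilver progress)
# zpool scrub tank                              (optional verification pass afterwards)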

Contemporary HDDs have a few thousand spare sectors, so even having a
couple of hundred mapped out isn't a major issue unless they're all
starting to happen at once. Continuous SMART monitoring (smartd) is a
very good idea - it doesn't always catch failing drives but it's better
than no warning at all.

Make sure you can set the SCT ERC timer on the disk down to ~7 seconds
(enterprise mode) or the entire ZFS vdev will STOP for several minutes
when it hits a bad sector.

Desktop drives will try very hard to retrieve data from a potentially
bad sector (up to 3 minutes), whilst raid array drives need to drop it
quickly and trust that the raid system will take care of the issue.
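As a rough sketch (device name and mail address are placeholders): a line
like

/dev/sda -a -d sat -m admin@example.org

in smartd.conf gives you continuous monitoring with mail alerts, and on
drives that support SCT ERC the timer can be checked and set with

# smartctl -l scterc /dev/sda
# smartctl -l scterc,70,70 /dev/sda             (70 = 7.0 seconds, read/write)

Many drives forget the setting on power cycle, so script it at boot.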
Post by Fabio Zetaeffesse
but what if I have to do the same for a cache/log? Normally
for the cache I'd choose for the mirror model.
Why bother with mirrors?

If a cache dies, drop it and replace later. It's a _cache_ and can be
refilled. For that matter it's emptied out at every restart.

If you're paranoid, mirror the write intent log, but you're working on
the slim-to-negligible chance of losing the log device AND power at the
same time - and under normal circumstances you'd only lose pending
synchronous writes, not the entire FS.

If the write intent log drive dies and is removed, ZFS will keep the
write intent log on the data drives instead, so there is no major issue
other than performance loss. As noted above, the write intent log is ONLY
used for synchronous writes unless you explicitly configure otherwise;
async writes are held in memory and may be lost if there's a power loss,
but this does not damage the filesystem.
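Handling a dead log device is equally simple; a sketch (names are
placeholders):

# zpool status tank                 (a dead SLOG shows up as FAULTED/UNAVAIL)
# zpool remove tank ata-OLD_SLOG
# zpool add tank log ata-NEW_SLOG

In the meantime the ZIL simply lives on the main pool disks.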

The prime users of write intent log space are databases, as these tend to
use synchronous writes (MySQL and PostgreSQL use async by default), and
some types of NFS mount.

I hope that helps.


Gordan Bobic
2015-02-23 12:46:52 UTC
Permalink
Post by Uncle Stoat
Post by Fabio Zetaeffesse
I see some practice this. Looking at some models I read compressible (ATTO
used by Kingston for instance) vs. non-compressible (S-SSD and
CrystalDiskMark yet Kingston's terms) transfer rate. whilst the first is
very high (400-450 Mb/s) the second is low even lower of a hard disk.
Cheap SSDs have poor write rates and will die quickly in any case. Don't
buy them. You will regret it.
When you say "cheap", what are you referring to? The cheapest SSDs on the
market (per GB)
are currently Crucial and Samsung.

Is there a considerably cheaper, less well known brand?
Post by Uncle Stoat
I'm a bit confused :-) 'cause I know SSD are faster then HDs.
You "know" wrong. Seriously. Some SSDs are utter garbage.
Cheap ones often read 40-50times faster than a HDD, but have write speed
slower than a HDD (I have some PATA laptop/industrial SSDs which only read
at 5 times HDD speed, but their seek equivalent speed is 0.1ms vs 14ms, so
they're genrally faster under normal operations - however their write speed
is _very_ slow)
Again - what are the cheap ones?
IME just about every SATA SSD of the most recent two generations is very
similar in price and performance.
Once you stray away from the mass consumed SSDs things fall apart pretty
quickly. For example, various
industrial devices or ones with less common interfaces, and SD/CF/USB flash
devices have atrocious
random write performance. But they are not cheap! They are typically some
significant multiple more
expensive than the fast SATA ones.
Post by Uncle Stoat
This has changed a lot in the last few years, but the best advice is to
use a good quality drive such as a samsung 840Pro or better (Avoid 850pros
- according to various reports they have a tendency to lockup, and evos,
they're likely to die quickly) or if you have the money, HGST S840
enterprise SAS SSDs.
Wasn't it the Samsung 840 that recently made the news for suffering
inexplicable massive performance degradation?

I have had good experience with Intel and Kingston drives. The ones I have
are quite similar
(both use Sandforce controller and Intel NAND), but I specifically mix them
together in
pools to reduce the probability of multiple simultaneous failures.

Crucial are generally well regarded, too, and they are also among the
cheapest at the moment.


Post by Uncle Stoat
Post by Fabio Zetaeffesse
2) though outdated I found HDS722020ALA330 for 80€ each. Its claimed
transfer rate is 200 Mb/s whilst the transfer rate of newer disks with
6Gbits/s interface and 4TB or 6TB capacity is just a bit higher (230
Mb/sec and their price is double) So I won't benefit in buying models
with 6Gbit/sec. Am I correct?
Quooted speed for spinning media is _burst_ speed. 7200rpm drives won't
exceed 120MB/sec sequential or about 120IOPS random.

Newer drives can get near 200MB/s sustained from the outer tracks. But the
performance when reading from the inner-most tracks is typically half of
that on the outermost (e.g. if you are getting 200MB/s on sustained reads
starting at sector 0, you'll get about 100MB/s from the end of the disk).
Post by Uncle Stoat
Bear in mind that 7k2000s are noisy and (relatively) hot - which is
important if they're going into an office or residential space.

HGST drives are, unfortunately, noisy and most seem to lack acoustic
management functionality.


Post by Uncle Stoat
Post by Fabio Zetaeffesse
3) While transferring a big file (say 40-50Gb) might the ZFS cache be a
limit?
No, the drives will be your issue. Write speed per-vdev is limited to
that of a single drive (at best). Read speed is that of the data drives -
both of these can easily be badly compromised by poor choice of
array geometry.
I don't think that is quite true. I regularly get sequential throughput
that vastly exceeds that
of a single drive in my pool (if the data isn't fragmented).

In terms of IOPS:

Random writes will tend to have the performance of a single disk (so for a
pool consisting of 7200rpm disks, you'll get 120 write IOPS).
But, for very small writes, it may be better, e.g. if you have writes that
are <= ashift, then one write will consume one disk's IOPS + the
redundancy disks' IOPS. So if you have, say, a 6-disk RAIDZ2, that will
use 3 disks' worth of IOPS, thus leaving you with 3 disks free to serve
other data requests.

Random reads will have up to performance of the number of disks in your
pool, depending on how your data is stored.

Redundancy is distributed among the disks, and the redundant blocks don't
get checked on regular reads if the block checksum matches. So the
redundancy blocks only get read when a read error is encountered.

The point I'm making here is that the performance will depend heavily on
the kind of data you have and its fragmentation.
Post by Uncle Stoat
The prime user of write intent logspace is databases as these tend to use
synchronous writes (mysql and pgsql use async by default) and some types
of nfs mount.
MySQL writes most definitely do benefit from sync=disabled, though. It
makes quite a huge difference, but I wouldn't recommend that unless you
have a multi-node cluster to ensure that not all nodes will fail at the
same time.
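For reference, it's a one-liner per dataset and easy to flip back; a
sketch (the dataset name is an example):

# zfs set sync=disabled tank/mysql
# zfs get sync tank/mysql
# zfs set sync=standard tank/mysql              (back to honouring fsync)

Just be clear that with it disabled, an outage can silently throw away the
last few seconds of "committed" transactions.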

Uncle Stoat
2015-02-23 13:54:41 UTC
Permalink
Post by Gordan Bobic
Wasn't it the Samsung 840 that recently made the news for suffering
inexplicable massive performance degradation?
No, it was the 850Pro - which I read about just as I was about to order
a dozen of the things.

I've got 840pros in my systems and they're working well (one has almost
reached lifetime write expiry after 4 years but it does get the snot
hammered out of it - and the experience has directed us to HGST S840
drives for the next generation on that system)
Post by Gordan Bobic
Post by Uncle Stoat
No, the drives will be your issue. Write speed per-vdev is limited to
that of a single drive (at best). Read speed is that of the data drives -
both of these can easily be badly compromised by poor choice of array
geometry.
I don't think that is quite true. I regularly get sequential throughput
that vastly exceeds that
of a single drive in my pool (if the data isn't fragmented).
That's what I wrote, although not clearly enough.

Write speed = that of the slowest drive in the vdev.
Read speed (sequential) = sum of the data drives in the vdev.

These are both best-case scenarios. Reads may happen faster/lower
latency if the data is cached but you're not likely to see that on
sequential activity.
Post by Gordan Bobic
Random writes will tend have performance of a single disk (so for a pool
consisting of 7200rpm disks, you'll get 120 write IOPS).
Under normal circumstances (a multi-user environment) the performance of
a single drive tends to turn to s**t rapidly as it seeks itself to
death. I benchmarked RAID0 arrays used for backup caching a while back
and the degradation in throughput when simultaneous streams were fed
to/read from them was mind-boggling(*) - but it justified replacing the
drives with 6 RAID0 Intel E25s.

(*) Think of an inverse exponential curve asymptotically approaching
zero overall throughput and you won't be far off.

Because ZFS batches writes it tends to fare better, but it will still
fall well short of 90-120 IOPS / 100MB/s unless that write is the only
thing going on. Under normal circumstances you'll probably see half that
rate at best for writes larger than the available cache space (RAM or
SLOG, depending on sync or async).
Post by Gordan Bobic
Post by Uncle Stoat
The prime user of write intent logspace is databases as these tend to
use synchronous writes (mysql and pgsql use async by default) and some
types of nfs mount.
MySQL writes most definitely do benefit from sync=disabled, though.
It's disabled by default (or used to be at any rate) and that "benefit"
comes with the risk of massive data loss if the power goes off for any
reason.

I benchmarked Oracle 8i vs MySQL vs PostgreSQL about a decade ago and with
fsync disabled they all performed at about the same speed over 200
million INSERTs - but you have to understand (and be willing to accept)
the risks that go with that mode. I'd only ever do it whilst populating
a new system.

If the DB is on a ZFS filesystem then the SSD SLOG will pick up the
writes and fsync activity is generally extremely fast, which is a win/win


Gordan Bobic
2015-02-23 14:44:38 UTC
Permalink
Post by Gordan Bobic
Wasn't it the Samsung 840 that recently made the news for suffering
inexplicable massive performance degradation?
No, it was the 850Pro - which I read about just as I was about to order a
dozen of the things.
http://www.theregister.co.uk/2015/02/22/samsung_in_second_ssd_slowdown_snafu/
Post by Gordan Bobic
I've got 840pros in my systems and they're working well (one has almost
reached lifetime write expiry after 4 years but it does get the snot
hammered out of it - and the experience has directed us to HGST S840 drives
for the next generation on that system)
Post by Uncle Stoat
No, the drives will be your issue. Write speed per-vdev is limited to
that of a single drive (at best). Read speed is that of the data drives -
both of these can easily be badly compromised by poor choice of array
geometry.
I don't think that is quite true. I regularly get sequential throughput
that vastly exceeds that
of a single drive in my pool (if the data isn't fragmented).
That's what I wrote, although not clearly enough.
Write speed = that of the slowest drive in the vdev.
I'm still not convinced that's right. On small operations the data
occupancy will deteriorate, but the IOPS will be higher than a single drive
as one IOP won't consume an IOP on each drive, only a subset. On large
writes it will consume an IOP from all drives, but in that scenario the
throughput will be more sequential than IOPS bound, which means you could,
on large enough streaming writes, get the same performance as the combined
sequential throughput of all the data bearing drives in the pool. So in
both scenarios the theoretical maximum is much bigger than what a single
drive can provide.
Post by Gordan Bobic
Read speed (sequential) = sum of the data drives in the vdev.
These are both best-case scenarios. Reads may happen faster/lower latency
if the data is cached but you're not likely to see that on sequential
activity.
Post by Gordan Bobic
Random writes will tend have performance of a single disk (so for a pool
consisting of 7200rpm disks, you'll get 120 write IOPS).
Under normal circumstances (a multi-user envoronment) the performance of a
single drive tends to turn to s**t rapidly as it seeks itself to death. I
benchmarked raid0 arrays used for backup caching a while back and the
degradation in throughput when simultaneous streams were fed to/read from
was mind-boggling(*) - but justified replacing the drives with 6 raid0
Intel E25s.
This is where matching the application level block size to the RAID chunk
size helps a lot. The problem with ZFS is that the closest equivalent to
this we have is ashift, which can be abused to approximate a chunk size of
up to 8KB, which is woefully inadequate for many workloads. Above that, the
block gets striped across multiple disks, thus consuming way more IOPS than
it should.

In case any developers are listening, it would be really handy if there
were a block allocator setting on ZFS that allowed setting the minimum
per-disk block size as a multiple of ashift. That way we wouldn't have
to abuse the ashift setting, and we could still achieve sane block
alignment to optimize for the application stack above without bothering
more disks than necessary for each operation due to needless striping.
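For what it's worth, ashift itself is set per vdev at creation time and
can't be changed afterwards; a sketch with placeholder device names:

# zpool create -o ashift=12 tank raidz2 disk1 disk2 disk3 disk4 disk5 disk6

ashift=12 forces 4KiB minimum blocks; ashift=13 would give the 8KiB
"chunk" mentioned above.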
Post by Uncle Stoat
The prime user of write intent logspace is databases as these tend to
use synchronous writes (mysql and pgsql use async by default) and some
types of nfs mount.
MySQL writes most definitely do benefit from sync=disabled, though.
It's disabled by default (or used to be at any rate) and that "benefit"
comes with the risk of massive data loss if the power goes off for any
reason.
Are you talking about MySQL or ZFS level async?
MySQL may use async writes by default, but it also issues a flush at every
commit, which is what keeps the data safe. But most people use O_DIRECT on
InnoDB anyway to avoid pointless double-caching.

On ZFS there's no O_DIRECT, so it's back to async writes (except you also
have to disable native async I/O) with explicit flushing. Hence why
setting sync=disabled helps (those explicit flushes get ignored).
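As a rough sketch of the MySQL side of that (these are commonly suggested
InnoDB-on-ZFS settings, not gospel):

[mysqld]
innodb_doublewrite = 0          # ZFS copy-on-write already gives atomic page writes
innodb_flush_method = fsync     # no O_DIRECT on ZFS, as noted above
innodb_use_native_aio = 0       # the "disable native async I/O" bit

and on the ZFS side, recordsize=16k on the data dataset to match InnoDB's
page size.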

Definitely not something to do on a production single-node server, but if
you have something like a Galera cluster and take the view that you won't
lose _all_ the nodes at the same time, sync=disabled becomes a viable
proposition.

Uncle Stoat
2015-02-23 14:46:41 UTC
Permalink
Post by Uncle Stoat
No, it was the 850Pro - which I read about just as I was about to
order a dozen of the things.
http://www.theregister.co.uk/2015/02/22/samsung_in_second_ssd_slowdown_snafu/
840Pro != 840Evo - and I did say not to use those.


Omen Wild
2015-02-23 20:25:37 UTC
Permalink
The "read cache" is filled in parallel with disk reads and only matters if
you're hitting the same data repeatedly. Sequential reads (such as large
file xfers) aren't put into the cache by default.
I'm singling this piece out because I want to clarify my understanding of
ZFS. I thought that the L2ARC (read cache) was filled by entries evicted
from the ARC (in RAM), and that the only way for data to enter the L2ARC
was to have first passed through the ARC.

This is important because if the primarycache is limited to metadata,
then the secondarycache can only ever hold metadata.

Thanks
--
Law of Selective Gravity: An object will fall so as to do the most damage.

Uncle Stoatwarbler
2015-02-23 21:44:28 UTC
Permalink
Post by Omen Wild
The "read cache" is filled in parallel with disk reads and only matters if
you're hitting the same data repeatedly. Sequential reads (such as large
file xfers) aren't put into the cache by default.
I'm singling this piece out because I want to clarify my understanding of
ZFS. I thought that the L2ARC (read cache) was filled by entries evicted
from the ARC (in RAM), and that the only way for data to enter the L2ARC
was to have first passed through the ARC.
Correct.
Post by Omen Wild
This is important because if the primarycache is limited to metadata,
then the secondarycache can only ever hold metadata.
Also correct.

Metadata-only isn't enabled by default, and for the most part, if you do
enable it on the primarycache (ARC), you may as well forget about having
an L2ARC (secondarycache).
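On ZFS on Linux you can sanity-check what the ARC and L2ARC are actually
doing from the kstats, e.g.:

# grep -E '^(hits|misses|l2_hits|l2_misses|l2_size) ' /proc/spl/kstat/zfs/arcstats

If l2_hits never moves, the L2ARC is just wearing out an SSD for nothing.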


r***@gmail.com
2015-02-24 04:40:36 UTC
Permalink
Post by Uncle Stoat
Post by Fabio Zetaeffesse
That's why I'd like to go for a RAIDZ2 with 5 units
ie, (2 4 8 16 32) + (1,2,3)
In general, this is not true, because most people compress data and
therefore it does not align nicely to your expectation of block
allocation. Also, writes are coalesced, so you think you write 4k, but the
disk sees 128k.

The best approach is to marry your expected workload against
compressibility, physical block size, and redundancy.
Post by Uncle Stoat
Other layouts will work, but have lower performance. In some cases
that's substantially lower performance (I saw a ~35% speedup by moving
to the "right" geometry)
YMMV, but I'll bet a steak dinner your results do not translate to
compressible workloads :-)
Post by Uncle Stoat
Post by Fabio Zetaeffesse
1) do I really need SSDs for the cache/log?
No, but it helps if you're reading the same areas repeatedly (little
advantage if you're mostly doing sequential reads and asynchronous writes)
Post by Fabio Zetaeffesse
Can I share SSDs for both?
Yes, but you're increasing the write requirements (see below)
Post by Fabio Zetaeffesse
I
see some practice this. Looking at some models I read compressible (ATTO
used by Kingston for instance) vs. non-compressible (S-SSD and
CrystalDiskMark yet Kingston's terms) transfer rate. whilst the first is
very high (400-450 Mb/s) the second is low even lower of a hard disk.
Cheap SSDs have poor write rates and will die quickly in any case. Don't
buy them. You will regret it.
Even though they're cache drives and if they die ZFS will simply drop
them, it's still not worth the hassle.
Post by Fabio Zetaeffesse
I'm a bit confused :-) 'cause I know SSD are faster then HDs.
You "know" wrong. Seriously. Some SSDs are utter garbage.
Cheap ones often read 40-50times faster than a HDD, but have write speed
slower than a HDD (I have some PATA laptop/industrial SSDs which only
read at 5 times HDD speed, but their seek equivalent speed is 0.1ms vs
14ms, so they're genrally faster under normal operations - however their
write speed is _very_ slow)
This has changed a lot in the last few years, but the best advice is to
use a good quality drive such as a samsung 840Pro or better (Avoid
850pros - according to various reports they have a tendency to lockup,
and evos, they're likely to die quickly) or if you have the money, HGST
S840 enterprise SAS SSDs.
Post by Fabio Zetaeffesse
Which
parameters do you take into account when choosing a SSD?
IO performance, longevity, write speed (iops), warranty.
Post by Fabio Zetaeffesse
2) though outdated I found HDS722020ALA330 for 80€ each.Its claimed
transfer rate is 200 Mb/s whilst the transfer rate of newer disks with
6Gbits/s interface and 4TB or 6TB capacity is just a bit higher (230
Mb/sec and their price is double) So I won't benefit in buying models
with 6Gbit/sec. Am I correct?
Quooted speed for spinning media is _burst_ speed. 7200rpm drives won't
exceed 120MB/sec sequential or about 120IOPS random.
Warranty is important - there's a reason seagate and WDC have dropped to
12months on most of their range. Make sure these drives are fully
warrantied - the HDS722020ALA330s I see online only have 12 months on
them which isn't encouraging.
Bear in mind that 7k2000s are noisy and (relatively) hot - which is
important if they're going into an office or residential space.
Post by Fabio Zetaeffesse
3) While transferring a big file (say 40-50Gb) might the ZFS cache be a
limit?
No, the drives will be your issue. Write speed per-vdev is limited to
that of a single drive (at best). Read speed is that of the data drives
- both of these can easily be badly compromised by poor choice of
array geometry.
Post by Fabio Zetaeffesse
I mean the transfer rate toward the cache is as the same as
towards the disk but via the cache I have two steps and I double the
flows through the controller, am I right? If I'm right is there a way to
bypass the cache for certain transfers?
You misunderstand how ZFS cache works.
The "write cache" isn't a write cache - it's a write intent log and only
used if the power goes off before data is committed to the disk from
ram. Under normal circumstances it's a circular write-only buffer.
The "read cache" is filled in parallel with disk reads and only matters
if you're hitting the same data repeatedly. Sequential reads (such as
large file xfers) aren't put into the cache by default.
Cache devices (aka L2ARC) take data that would be evicted from ARC.
In parsing the above, I interpret "read cache" as the ARC, but the questions
are about cache devices.

Also, all data goes through the ARC, so everything is cached. This is true
for all modern file systems.
Post by Uncle Stoat
In any case unless you're using slow controllers or a mounatin of sas
expanders the bus will vastly outrun your spinning media. Don't use sata
expanders. You _will_ regret it.
Post by Fabio Zetaeffesse
4) if I have to replace a disk (from what I've read it's not as easy as
it might seem detecting such faulty disk) I'll have to replace it and
resilver it
It's trivial to detect and replace a faulty drive, although this isn't
automated inside ZFS but can be done using a bit of shell logic. There
are good arguments not to immediately pick up a hot spare when a disk
fails - as others have noted an incremental backup is a good move before
starting resilvering.
It's a little trickier to force a drive which has developed a few bad
sectors to swap them out, but even then ZFS will handle that kind of
loss and try to repair the loss. (To force the sectors out by hand, take
drive offline, repair sectors, bring drive back online, scrub.)
Contemporary HDDs have a few thousand spare sectors, so even having a
couple of hundred mapped out isn't a major issue unless they're all
starting to happen at once. Continuous SMART monitoring (smartd) is a
very good idea - it doesn't always catch failing drives but it's better
than no warning at all.
I can tell you from direct experience that the max size of the grown
defects list for a well-known 4TB SAS drive is 1,821. This is easily
attainable :-(
-- richard
Post by Uncle Stoat
Make sure you can set the seterc timer on the disk down to ~7 seconds
(enterprise mode) or the entire ZFS vdev will STOP for several minutes
when it hits a bad sector.
Desktop drives will try very hard to retrieve data from a potentially
bad sector (up to 3 minutes), whilst raid array drives need to drop it
quickly and trust that the raid system will take care of the issue.
Post by Fabio Zetaeffesse
but what if I have to do the same for a cache/log? Normally
for the cache I'd choose for the mirror model.
Why bother with mirrors?
If a cache dies, drop it and replace later. It's a _cache_ and can be
refilled. For that matter it's emptied out at every restart.
If you're paranoid, mirror the write intent log but you're working on
the slim-to-negligible basis of losing the cache AND power at the same
time - and under normal circumstances you'd only lose pending
synchronous writes not the entire FS.
If the write intent log drive dies and is removed ZFS will stripe the
write-intent across the data drives, so there is no major issue other
than performance loss - as noted above the write intent log is ONLY used
for synchronous writes unless you explicitly configure otherwise - async
writes are held in memory and may be lost if there's a power loss but
this does not damage the filesystem.
The prime user of write intent logspace is databases as these tend to
use synchronous writes (mysql and pgsql use async by default) and some
types of nfs mount.
I hope that helps.
Gordan Bobic
2015-02-24 09:17:00 UTC
Permalink
Post by r***@gmail.com
Post by Uncle Stoat
Post by Fabio Zetaeffesse
That's why I'd like to go for a RAIDZ2 with 5 units
ie, (2 4 8 16 32) + (1,2,3)
In general, this is not true because most people compress data and
therefore it
does not align nicely to your expectation of block allocation. Also writes
are
coalesced, so you think you write 4k, but the disk sees 128k.
The best approach is to marry your expected workload against
compressibility,
physical block size, and redundancy.
Post by Uncle Stoat
Other layouts will work, but have lower performance. In some cases
that's substantially lower performance (I saw a ~35% speedup by moving
to the "right" geometry)
YMMV, but I'll bet a steak dinner your results do no translate to
compressible
workloads :-)
I already explained in a previous post that sometimes using compression
is the only way to ensure with any degree of regularity that the total
IOPS on the workload improve by a factor of 2.

The attitude that the FS will magically do the right thing whatever the
operator does is at best misguided.
Post by r***@gmail.com
Contemporary HDDs have a few thousand spare sectors, so even having a
couple of hundred mapped out isn't a major issue unless they're all
starting to happen at once. Continuous SMART monitoring (smartd) is a
very good idea - it doesn't always catch failing drives but it's better
than no warning at all.
I can tell you from direct experience that the max size of the grown
defects list for a well-known 4TB SAS drive is 1,821. This is easily
attainable :-(
It varies from drive to drive. The 1TB Seagates I had expire due to
reallocation
space exhaustion seemed to max out at about double that figure.

Uncle Stoatwarbler
2015-02-24 10:20:19 UTC
Permalink
Post by Uncle Stoat
ie, (2 4 8 16 32) + (1,2,3)
In general, this is not true because most people compress data
In general there's little point in doing so.

Unless you're running a corner case(*), most large data is incompressible
and the advantages with small files are outweighed by the speed
penalties (~30MB/s per thread with current CPUs even with lz4) and added
fiddliness.

(*) In my experience, one of the few areas where compression seems
worthwhile is a mail spool, but how many people run mail servers or
other systems where what's stored is mainly plaintext?

In almost all cases on my home and work systems, running compression
tests shows an overall advantage of less than 15-20%

This is underscored by our backup tape consumption (LTO tape drives have
built-in compression), which sees an overall compression ratio of 1.09
over about 20PB of backups.

Yes, OS/program and home directory areas are highly compressible, but
they account for so little disk space that focussing on them is
counterproductive.

The space-eaters are things like image and audio files, etc - these are
invariably already compressed.
Post by Uncle Stoat
and therefore it does not align nicely to your expectation of block
allocation.
There was discussion of this a few years back. The issue is the way the
parity data ends up being written, not block allocation. The
recommendation of 2^n came from various devs.
Post by Uncle Stoat
YMMV, but I'll bet a steak dinner your results do not translate to
compressible workloads :-)
As stated above - compressible workloads are such a minor part of the
overall picture that they may as well not exist.

And that's even with filesystems such as the one I'm dealing with at the
moment that have 200-300 million files in 500Gb (deep galactic survey
imaging files averaging 50kB apiece)
Post by Uncle Stoat
I already explained in a previous post that sometimes using compression
is the only way to ensure with any degree of regularity that the total
IOPS on the workload improve by a factor of 2.
You're substituting disk load for CPU load. That isn't always a good
thing to do. There are _always_ tradeoffs.
Post by Uncle Stoat
The attitude that the FS will magically do the right thing whatever the
operator does is at best misguided.
I spend most of my time dealing with "what the operator does". 9 times
out of 10 the FS gets it a lot more right than the user. Attempting to
tune requires intimate knowledge of what's going onto the disk and
additional knowledge that it's unlikely to change.
Post by Uncle Stoat
I can tell you from direct experience that the max size of the grown
defects list for a well-known 4TB SAS drive is 1,821. This is easily
attainable :-(
That wouldn't surprise me given the reduction in warranties but in most
of our 2TB drives it's more than double that (also much higher in HGST
drives) - however it's better to buy reliable drives in the first place
and ensure they arrive well packed.

The fact that "it's easily attainable" is a good indicator for "avoid
that device" and/or "that shipper"

The reason I bang on about packaging is simple: drives which arrive
packed in 2 inches of foam or the OEM packaging last a lot longer than
ones which come with lesser packaging.

Similarly, complete systems which arrive with the packaging having
sustained transit damage invariably have drive failure sooner than the
ones that don't. You can tell Inwards Goods staff to refuse damaged
shipments until you're blue in the face but they'll still do it and it's
extremely hard to return them or dispute conditions after they've been
signed for.
Post by Uncle Stoat
It varies from drive to drive. The 1TB Seagates I had expire due to
reallocation space exhaustion seemed to max out at about double that
figure.
If I was to keep drives until this happened, I'd be looking for a new
job. We change drives at the end of their warranty period. (not that any
of the 1Tb Seagate Constellation SAS drives we use have actually lasted
to the end of their warranty....)

Even at home I run conservatively. Drives displaying increasing defect
counts get replaced. Past experience has shown that these usually happen
as cascades, so once you get an initial large number in a short period
it's time for a changeout.


Gordan Bobic
2015-02-24 10:39:32 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Uncle Stoat
ie, (2 4 8 16 32) + (1,2,3)
In general, this is not true because most people compress data
In general there's little point in doing so.
Unless you're running a corner case(*), most large data is incompresible
and the advantages with small files are outweighed by the speed
penalties (~30MB/thread with current CPUs even with lz4) and added
fiddliness.
(*) In my experience, one of the few areas where compression seems
worthwhile is a mail spool, but how many people run mail servers or
other systems where what's stored is mainly plaintext?
In almost all cases on my home and work systems, running compression
tests shows an overall advantage of less than 15-20%
I agree on the whole - most huge data is already compressed.
One case where compression generally seems to work quite
well is databases. Most of my MySQL data sets compress well
over 2:1 using LZ4.
Post by Uncle Stoatwarbler
As stated above - compressible workloads are such a minor part of the
overall picture that they may as well not exist.
And that's even with filesystems such as the one I'm dealing with at the
moment that have 2-300 million files in 500Gb (deep galactic survey
imaging files averaging 50kB apiece)
Post by Uncle Stoat
I already explained in a previous post that sometimes using compression
is the
only way to ensure with any degree of regularity that the total IOPS on
the workload
improve by a factor of 2.
You're substituting disk load for CPU load. That isn't always a good
thing to do. There are _always_ tradeoffs.
True - there are tradeoffs. But I usually test at saturation load (many
concurrent
threads, enough to saturate the CPU most of the time), and even on that
kind of a CPU heavy load I find that compressing MySQL data sets yields
higher overall throughput.

Doubling the IOPS through compression by preventing application level
pages straddling disk boundaries certainly helps more than the cost
of CPU time for compressing/decompressing the data in this case.

And you may consider that this is a relatively small space hog, but
some of the databases I work with are many TBs in size. Although
the primary motivation is improving performance, halving the space
usage is a nice freebie on top of everything going faster. :-)
Post by Uncle Stoatwarbler
Post by Uncle Stoat
The attitude that the FS will magically do the right thing whatever the
operator
does is at best misguided.
I spend most of my time dealing with "what the operator does". 9 times
out of 10 the FS gets it a lot more right than the user. Attempting to
tune requires intimate knowledge of what's going onto the disk and
additional knowledge that it's unlikely to change.
Sure - that's why I wouldn't hire somebody who didn't have that
degree of insight.
Post by Uncle Stoatwarbler
Post by Uncle Stoat
It varies from drive to drive. The 1TB Seagates I had expire due to
reallocation
space exhaustion seemed to max out at about double that figure.
If I was to keep drives until this happened, I'd be looking for a new
job. We change drives at the end of their warranty period. (not that any
of the 1Tb Seagate Constellation SAS drives we use have actually lasted
to the end of their warranty....)
Who said anything about those Seagates I mentioned outlasting their
warranty period? They came with 5 years warranty when I bought
them and by the time the 5 years were up, all but a handful were
replaced, some more than once.
Post by Uncle Stoatwarbler
Even at home I run conservatively. Drives displaying increasing defect
counts get replaced. Past experience has shown that they usually happen
as cascades, so once you get an initial large number in a short period
its time to changeout.
In my experience, more often than not, disks fail without much of a
warning in terms of pending or reallocated sectors. Yes, when the
pending/reallocation counts start to go up, it usually indicates that
the disk is about to fail completely, but I have had many drives
outright fail without any immediately preceding defect growth.

Fajar A. Nugraha
2015-02-24 10:51:26 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Uncle Stoat
ie, (2 4 8 16 32) + (1,2,3)
In general, this is not true because most people compress data
In general there's little point in doing so.
Unless you're running a corner case(*), most large data is incompresible
and the advantages with small files are outweighed by the speed
penalties (~30MB/thread with current CPUs even with lz4) and added
fiddliness.
(*) In my experience, one of the few areas where compression seems
worthwhile is a mail spool, but how many people run mail servers or
other systems where what's stored is mainly plaintext?
In almost all cases on my home and work systems, running compression
tests shows an overall advantage of less than 15-20%
This is my laptop:
# zfs get compression,compressratio,logicalused rpool
NAME   PROPERTY       VALUE  SOURCE
rpool  compression    lz4    local
rpool  compressratio  1.55x  -
rpool  logicalused    296G   -

This is one of my servers at work:
# zfs get compression,compressratio,logicalused data
NAME  PROPERTY       VALUE  SOURCE
data  compression    gzip   local
data  compressratio  1.32x  -
data  logicalused    506G   -

This is my backup server:
# zfs get compression,compressratio,logicalused data
NAME  PROPERTY       VALUE  SOURCE
data  compression    gzip   local
data  compressratio  4.79x  -
data  logicalused    13.0T  -


Definitely more than just a 20% gain, definitely worth it.
--
Fajar

Uncle Stoat
2015-02-24 16:15:21 UTC
Permalink
In the first two cases you're saving far less than 1TB and in the third
you've got highly repetitive data

I assume the first one is ZFS single disk or mirror?

What about the others? The discussion was really about RAIDZn type
arrays and equations vary for mirror/singledrive use.
Post by Fajar A. Nugraha
Definitely more than just 20% gain, definitely worthed.
I'm running 30TB on my home server and hundreds on the work ones.

If I could save 20% with compression then I'd consider it, but the
reality is that it's more like 2-5% overall, with some areas (as
mentioned - OS, personal data areas) having savings. But the question is
whether it's worth it vs the extra heat, compared to the price of a larger
disk (or padding out the array to 2^n + R).




Omen Wild
2015-02-24 17:47:18 UTC
Permalink
If I could save 20% with compression then I'd consider it, but the reality
is that it's more like 2-5% overall, with some areas (as mentioned - OS,
personal data areas) having savings but the question is if it's worth it vs
the extra heat, compared to the price of a larger disk (or padding out the
array to 2^n + R)
It all depends on your data. One of our OpenIndiana backup servers gets:

NAME      PROPERTY       VALUE  SOURCE
backups1  compression    lz4    local
backups1  compressratio  2.28x  -
backups1  logicalused    9.30T  -
[ This one would be 21 TB without compression. ]

backups2  compression    lz4    local
backups2  compressratio  1.60x  -
backups2  logicalused    74.7T  -
[ This one would be 120 TB without compression! ]

backups3  compression    lz4    local
backups3  compressratio  1.43x  -
backups3  logicalused    58.5T  -
[ This one would only be 84 TB without compression. ]

That's on top of dedup for backups2 and backups3
backups2: dedup = 2.00, compress = 1.61, copies = 1.10, dedup * compress / copies = 2.92
backups3: dedup = 1.62, compress = 1.43, copies = 1.06, dedup * compress / copies = 2.19

We did bump the RAM to 256GB and installed a decent SSD for L2ARC/dedup
spill. It looks like metadata has peaked at 170GB, which is a significant
amount of RAM, but with dedup and compression giving such huge
savings, it seems worth it so far.
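(For anyone wanting to reproduce those numbers on their own pool - the
dedup/compress/copies summary lines above come from zdb, and the rest is
plain property output; "backups2" below is just our pool name:

# zdb -D backups2
# zpool list -o name,size,alloc,cap,dedupratio backups2
# zfs get compressratio,logicalused backups2

All of these are read-only queries.)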
--
You cannot tell which way the train went by looking at the track.

Durval Menezes
2015-02-24 11:13:46 UTC
Permalink
Hello Uncle,
Post by Uncle Stoatwarbler
Post by Uncle Stoat
ie, (2 4 8 16 32) + (1,2,3)
In general, this is not true because most people compress data
In general there's little point in doing so.
Unless you're running a corner case(*), most large data is incompresible
and the advantages with small files are outweighed by the speed
penalties (~30MB/thread with current CPUs even with lz4) and added
fiddliness.
I think you may be off by 1.5 orders of magnitude here.
Quoting from https://code.google.com/p/lz4 :
"single thread, Core i5-3340M @2.7GHz

Name         Ratio   C.speed   D.speed
                     (MB/s)    (MB/s)
LZ4 (r101)   2.084   422       1820"

According to my own experience LZ4 is so cheap CPU-wise that I always turn
it on, and it manages to gain some free space even on incompressible
datasets due to filesystem metadata compression, especially on picture
datasets, which tend to have a higher number-of-files to number-of-bytes
ratio.
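Turning it on is as simple as (pool/dataset name is an example):

# zfs set compression=lz4 tank
# zfs get compression,compressratio tank

bearing in mind it only applies to data written from that point on;
existing blocks stay as they are until rewritten.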
Post by Uncle Stoatwarbler
(*) In my experience, one of the few areas where compression seems
worthwhile is a mail spool, but how many people run mail servers or
other systems where what's stored is mainly plaintext?
Other things that benefit highly from compression, in no particular order:

wikis

websites

databases

log directories (these are massively compressible)


Cheers,
--
Durval.

Uncle Stoat
2015-02-24 16:23:56 UTC
Permalink
Post by Durval Menezes
Post by Uncle Stoatwarbler
(*) In my experience, one of the few areas where compression seems
worthwhile is a mail spool, but how many people run mail servers or
other systems where what's stored is mainly plaintext?
wikis
websites
databases
log directories (these are massively compressible)
How much space do these use?

If you have enough space taken up that you're considering putting a
bunch of drives together to make a RAIDZ[123] array most of these items
are already down in the "noise" as far as disk space consumption is
concerned.

It may be that creating a separate ZFS dataset would be worthwhile for
those areas, but the counterargument is that if there's negligible overall
benefit from compression across 30-60TB or larger, why bother?

(BTW, logfiles are one of the few areas where I think dedupe might
actually be worthwhile. Smaller database spaces also benefit - IF you
have enough RAM and CPU)

CPU requirements aren't just about the filesystem. Try quantifying how
much NFSD uses, for instance (it's surprisingly large). The parasitic
loss of lz4 may well be enough to require a larger system when you have
150 NFS clients to service, as I do here, and faced with a cost increment
of a few thousand dollars vs a few hundred for extra space, I'll take the
space.



Michael Kjörling
2015-02-23 12:37:24 UTC
Permalink
Post by Fabio Zetaeffesse
1) do I really need SSDs for the cache/log?
No, and no. Note that cache and log are VERY different.

A SSD SLOG is only really useful if you have lots of synchronous
writes, which most workloads don't. And even then, it's only an issue
if you need higher performance than the main storage devices can
offer; if there is no SLOG device in the pool, a portion of the main
storage devices are used for that purpose. You can always add a SLOG
later.

A SSD cache (L2ARC) may be useful if you have lots and _lots_ of
random reads. For a single user system with a decent amount of RAM
(which is recommended for ZFS anyway -- oh, and _make sure_ you have
ECC RAM and that ECC is working) the cost of adding this likely
outweighs the benefit. And again, you can add a cache device later if
you want to.
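Adding (or dropping) one later really is a one-liner; a sketch with
made-up device names:

# zpool add tank cache /dev/disk/by-id/ata-SOME_SSD
# zpool remove tank ata-SOME_SSD

Cache devices can be added and removed at any time without affecting the
pool's data.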
Post by Fabio Zetaeffesse
Can I share SSDs for both?
You can.
Post by Fabio Zetaeffesse
I
see some practice this. Looking at some models I read compressible (ATTO
used by Kingston for instance) vs. non-compressible (S-SSD and
CrystalDiskMark yet Kingston's terms) transfer rate. whilst the first is
very high (400-450 Mb/s) the second is low even lower of a hard disk. I'm a
bit confused :-) 'cause I know SSD are faster then HDs. Which parameters do
you take into account when choosing a SSD?
Besides what has been said already, keep in mind that there are two
different figures for storage performance: throughput (measured in
megabytes per second or some similar unit) and latency (measured in
either ms or IOPS, I/O Operations Per Second). SSDs are good for
throughput but where they _really_ shine is in terms of latency/IOPS.
A spinning platter HDD will, in theory, give you at most one IOP per
revolution, so a 7200 rpm drive will top out at 7200 / 60 = 120 IOPS
(corresponding to a latency of 8.3 ms).
Post by Fabio Zetaeffesse
2) though outdated I found HDS722020ALA330 for 80€ each.Its claimed
transfer rate is 200 Mb/s whilst the transfer rate of newer disks with
6Gbits/s interface and 4TB or 6TB capacity is just a bit higher (230 Mb/sec
and their price is double) So I won't benefit in buying models with
6Gbit/sec. Am I correct?
You are comparing apples and oranges. 1.5 Gbit/s, 3 Gbit/s and 6
Gbit/s refer to the SATA interconnect data rate. The HDD's quoted data
rate (which is more likely to be 200 MB/s than 200 Mb/s) is a largely
theoretical data point that tells you how fast it can work the
platters in a purely sequential workload. Keep in mind that with CoW
in general, and ZFS in particular, purely sequential workloads
essentially don't exist.

Most spinning platter HDDs will top out at around 100-120 MB/s for
7200 rpm drives anyway, and going beyond 7200 rpm means gobbling huge
amounts of power for relatively little gain (in your case). That's
about 1 Gb/s. So going 6 Gb/s SATA on the HDD side doesn't
significantly improve performance, although it might on the
*controller* (or in case your setup uses port multipliers, PMPs).
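A quick way to see the difference for yourself (the device name is a
placeholder):

# hdparm -t /dev/sdX      (sustained reads from the platters, typically 100-200 MB/s)
# hdparm -T /dev/sdX      (reads straight from the buffer cache, no disk involved)

The first number is the one that matters for your pool.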
Post by Fabio Zetaeffesse
3) While transferring a big file (say 40-50Gb) might the ZFS cache be a
limit? I mean the transfer rate toward the cache is as the same as towards
the disk but via the cache I have two steps and I double the flows through
the controller, am I right? If I'm right is there a way to bypass the cache
for certain transfers?
Even if the cache gets used for the copy (which I don't think it is), once
the cache fills up, the throughput through the cache will be limited to
the throughput of the underlying storage. So it's all about how much
cache you have versus how fast the storage is at writing.
Post by Fabio Zetaeffesse
4) if I have to replace a disk (from what I've read it's not as easy as it
might seem detecting such faulty disk) I'll have to replace it and resilver
it, but what if I have to do the same for a cache/log? Normally for the
cache I'd choose for the mirror model.
Don't bother with redundancy for cache; cache is expendable, and will
be repopulated automatically if errors are detected in the cache data.
So for L2ARC, if you need it at all (depends on your workload), just
go with a fast (IOPS wise) SSD. The worst that can happen with poorly
functioning cache is that you don't get the full speed advantage out
of it.

If you are looking at SLOG, you might consider a mirror for
redundancy, but most personal systems (which I get the feeling is what
you are building) aren't built to anywhere near that level of
redundancy anyway. Under normal operations, the log is write-only; it
is read _only_ when the system gets shut down while there are
outstanding transaction groups not yet committed to final storage.
Thus the only situation in which SLOG redundancy matters is if log
device failure happens _at the same time_ as an improper system
shutdown (e.g. a power outage).
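If you do decide a mirrored SLOG is worth it, it's just (device names are
placeholders):

# zpool add tank log mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B

and a plain non-mirrored one is the same command without the "mirror"
keyword.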

I would probably rather take the money saved on not mirroring the SLOG
and get a UPS instead, with enough run time to gracefully shut the
system down in case of a power failure. Even if it triggers a
"shutdown -h now" after the power being out for just a few seconds,
that remains a lot less bad compared to an improper power loss.
--
Michael Kjörling • https://michael.kjorling.se • ***@kjorling.se
OpenPGP B501AC6429EF4514 https://michael.kjorling.se/public-keys/pgp
“People who think they know everything really annoy
those of us who know we don’t.” (Bjarne Stroustrup)
