Discussion:
[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?
Emmanuel Noobadmin
2012-04-05 20:02:17 UTC
Permalink
I'm trying to improve the setup of our Dovecot/Exim mail servers to
handle increasingly huge accounts (everybody treats their mailbox like
Gmail's seemingly infinite storage and keeps everything forever) by
changing from Maildir to mdbox, and to take advantage of offloading
older emails to alternative networked storage nodes.

The question now is whether a single large server or a number of 1U
servers with the same total capacity would be better? We'll be
using RAID 1 pairs, likely XFS, based on reading Hoeppner's
recommendations on this and the mdadm list.


Currently, I'm leaning towards multiple small servers because I think
it should be better in terms of performance. At the very least, even if
one node gets jammed up, the rest should still be able to serve up the
emails for other accounts -- that is, unless Dovecot gets locked up by
the jammed transaction. Also, I could possibly arrange them in a sort
of network RAID 1 to gain redundancy against single machine failure.

Would I be correct in these or do actual experiences say otherwise?
Stan Hoeppner
2012-04-07 10:19:46 UTC
Permalink
On 4/5/2012 3:02 PM, Emmanuel Noobadmin wrote:

Hi Emmanuel,
Post by Emmanuel Noobadmin
I'm trying to improve the setup of our Dovecot/Exim mail servers to
handle increasingly huge accounts (everybody treats their mailbox like
Gmail's seemingly infinite storage and keeps everything forever) by
changing from Maildir to mdbox, and to take advantage of offloading
older emails to alternative networked storage nodes.
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in
which case you'd have said "SAN".
Post by Emmanuel Noobadmin
The question now is whether a single large server or a number of 1U
servers with the same total capacity would be better?
Less complexity and cost is always better. CPU throughput isn't a
factor in mail workloads--it's all about IO latency. A 1U NFS server
with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I came up with the following system that should be close to suitable,
for ~$10k USD. The 4 node system runs ~$12k USD. At $2k more, that's
not substantially higher. But when we double the storage of each
architecture we're at ~$19k, vs ~$26k for an 8 node cluster, a
difference of ~$7k. That's $1k shy of another 12 disk JBOD. Since CPU
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage. Ok, so here's the baseline config I threw
together:

http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/15351-15351-3328412-241644-3328421-4091396-4158470-4158440.html?dnr=1
8 cores is plenty, 2 boot drives mirrored on B110i, 16GB (4x4GB)
http://www.lsi.com/products/storagecomponents/Pages/LSISAS9205-8e.aspx
http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/12169-304616-3930445-3930445-3930445-3954787-4021626-4021628.html?dnr=1
w/ 12 2TB 7.2K SATA drives, configured as md concat+RAID1 pairs with 12
allocation groups, 12TB usable. Format the md device with the defaults:

$ mkfs.xfs /dev/md0

Mount with inode64. No XFS stripe alignment to monkey with. No md
chunk size or anything else to worry about. XFS' allocation group
design is pure elegance here.
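
For reference, the whole layered setup is only a handful of commands.
A minimal sketch, assuming the 12 JBOD drives show up as sdb-sdm
(device names illustrative):

$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
  ... repeat for the remaining four pairs (md3-md6) ...
$ mdadm --create /dev/md0 --level=linear --raid-devices=6 \
    /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6
$ mkfs.xfs /dev/md0
$ mount -o inode64 /dev/md0 /srv/mail

With 12TB in the linear device, the mkfs.xfs defaults land at roughly
1TB per allocation group, which is where the 12 AGs above come from.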

If 12 TB isn't sufficient, or if you need more space later, you can
daisy chain up to 3 additional D2600 JBODs for ~$8500 USD each, just add
cables. This quadruples IOPS, throughput, and capacity--96TB total,
48TB net. Simply create 6 more mdraid1 devices and grow the linear
array with them. Then do an xfs_growfs to bring the extra 12TB of free
space into the filesystem.
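
A sketch of the grow, again with illustrative device names:

$ mdadm --create /dev/md7 --level=1 --raid-devices=2 /dev/sdn /dev/sdo
  ... build the other five new pairs the same way ...
$ mdadm --grow /dev/md0 --add /dev/md7    # repeat for each new pair
$ xfs_growfs /srv/mail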

If you're budget conscious and/or simply prefer quality inexpensive
whitebox/DIY type gear, as I do, you can get 24 x 2TB drives in one JBOD
chassis for $7400 USD. That's twice the drives, capacity, and IOPS, for
~$2500 less than the HP JBOD. And unlike the HP 'enterprise SATA'
drives, the 2TB WD Black series have a 5 year warranty, and work great
with mdraid. Chassis and drives at Newegg:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
http://www.newegg.com/Product/Product.aspx?Item=N82E16822136792

You can daisy chain 3 of these off one HBA SFF8088 port, 6 total on our
LSI 9205-8e above, for a total of 144 2TB drives, 72 effective spindles
in our concat+RAID1 setup, 144TB net space.
Post by Emmanuel Noobadmin
We'll be
using RAID 1 pairs, likely XFS, based on reading Hoeppner's
recommendations on this and the mdadm list.
To be clear, the XFS configuration I recommend/promote for mailbox
storage is very specific and layered. The layers must all be used
together to get the performance. These layers consist of using multiple
hardware or software RAID1 pairs and concatenating them with an md
linear array. You then format that md device with the XFS defaults, or
a specific agcount if you know how to precisely tune AG layout based on
disk size and your anticipated concurrency level of writers.

Putting XFS on a single RAID1 pair, as you seem to be describing above
for the multiple "thin" node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created. The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key. By
spreading AGs 'horizontally' across the disks in a concat, instead of
'vertically' down a striped array, you accomplish two important things:

1. You dramatically reduce disk head seeking by using the concat array.
With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern. Each user mailbox is stored in a different directory.
Each directory was created in a different AG. So if you have 96 users
writing their Dovecot indexes concurrently, you have at worst a
minimum of 192 head movements occurring back and forth across the entire
platter of each disk, likely not well optimized by TCQ/NCQ. Why 192
instead of 96? The modification time in the directory metadata must be
updated for each index file, among other things.

2. Because we decrease seeks dramatically we also decrease response
latency significantly. With the RAID1+concat+XFS we have 12 disks each
with only 2 AGs spaced evenly down each platter. We have the same 4
user mail dirs in each AG, but in this case only 8 user mail dirs are
contained on each disk instead of portions of all 96. With the same 96
concurrent writes to indexes, we end up with only 16 seeks per
drive--again, one to update each index file and one to update the metadata.

Assuming these drives have a max seek rate of 150 seeks/sec, about
average for 7.2k drives, it will take 192/150 = 1.28 seconds for these
operations on the RAID10 array. With the concat array it will only take
16/150 = 0.11 seconds. Extrapolating from that, the concat array can
handle 1.28/0.11 = 11.6, and 11.6 * 96 = ~1,114 concurrent user index
updates in the same time as the RAID10 array, just over 10 times
more users. Granted, these are rough theoretical numbers--an index plus
metadata update isn't always going to cause a seek on every chunk in a
stripe, etc. But this does paint a very accurate picture of the
differences in mailbox workload disk seek patterns between XFS on concat
and RAID10 with the same hardware. In production one should be able to
handle at minimum 2x more users, probably many more, with the
RAID1+concat+XFS vs RAID10+XFS setup on the same hardware.
Post by Emmanuel Noobadmin
Currently, I'm leaning towards multiple small servers because I think
it should be better in terms of performance.
This usually isn't the case with mail. It's impossible to split up the
user files across the storage nodes in a way that balances block usage
on each node and user access to those blocks. Hotspots are inevitable
in both categories. You may achieve the same total performance as a
single server, maybe slightly surpass it depending on user load, but you
end up spending extra money on building resources that are idle most of
the time, in the case of CPU and NICs, or under/over utilized, in the
case of disk capacity in each node. Switch ports aren't horribly
expensive today, but you're still wasting some with the farm setup.
Post by Emmanuel Noobadmin
At the very least, even if
one node gets jammed up, the rest should still be able to serve up the
emails for other accounts -- that is, unless Dovecot gets locked up by
the jammed transaction.
Some host failure redundancy is about all you'd gain from the farm
setup. Dovecot shouldn't barf due to one NFS node being down, only
hiccup. I.e. only imap processes accessing files on the downed node would
have trouble.
Post by Emmanuel Noobadmin
Also, I could possibly arrange them in a sort
of network RAID 1 to gain redundancy against single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you
are, and brush your hair away from your forehead. I'm coming over with
my branding iron that says "K.I.S.S"
Post by Emmanuel Noobadmin
Would I be correct in these or do actual experiences say otherwise?
Oracles on Mount Interweb profess that 2^5 nodes wide scale out is the
holy grail. IBM's mainframe evangelists tell us to put 5 million mail
users on a SystemZ with hundreds of Linux VMs.

I think bliss for most of us is found somewhere in the middle.
--
Stan
Emmanuel Noobadmin
2012-04-07 14:43:09 UTC
Permalink
On 4/7/12, Stan Hoeppner <stan at hardwarefreak.com> wrote:

Firstly, thanks for the comprehensive reply. :)
Post by Stan Hoeppner
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in
which case you'd have said "SAN".
I haven't decided on that but it would either be NFS or iSCSI over
Gigabit. I don't exactly get a big budget for this. iSCSI because I
planned to do md/mpath over two separate switches so that if one
switch explodes, the email service would still work.
Post by Stan Hoeppner
Less complexity and cost is always better. CPU throughput isn't a
factor in mail workloads--it's all about IO latency. A 1U NFS server
with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
My worry is that if that one server dies, everything is dead. With at
least a pair of servers, I could keep it running, or if necessary,
restore the accounts on the dead servers from backup, make some config
changes and have everything back running while waiting for replacement
hardware.
Post by Stan Hoeppner
I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I'm embarrassed to admit I don't have hard numbers on the user load
except the rapidly dwindling disk space count and the fact when the
web-based mail application try to list and check disk quota, it can
bring the servers to a crawl. My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.
Post by Stan Hoeppner
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage. Ok, so here's the baseline config I threw
One of my concerns is that heavy IO on the same server slows the overall
performance even though the theoretical IOPS of the total drives are
the same on 1 and on X servers. Right now, the servers are usually
screeching to a halt, to the point of even locking out SSH access due
to IOWait sending the load in top to triple digits.
Post by Stan Hoeppner
Some host failure redundancy is about all you'd gain from the farm
setup. Dovecot shouldn't barf due to one NFS node being down, only
hiccup. I.e. only imap processes accessing files on the downed node would
have trouble.
But if I only have one big storage node and that went down, Dovecot
would barf, wouldn't it?
Or would the mdbox format mean Dovecot would still use the local
storage, just that users can't access the offloaded messages?
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
Also, I could possibly arrange them in a sort
of network RAID 1 to gain redundancy against single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you
are, and brush your hair away from your forehead. I'm coming over with
my branding iron that says "K.I.S.S"
Lol, I have no idea who Charles is, but I always feel safer if there
was some kind of backup. Especially since I don't have the time to
dedicate myself to server administration, by the time I notice
something is bad, it might be too late for anything but the backup.

Of course management and clients don't agree with me since
backup/redundancy costs money. :)
Stan Hoeppner
2012-04-08 18:21:47 UTC
Permalink
Post by Emmanuel Noobadmin
Firstly, thanks for the comprehensive reply. :)
Post by Stan Hoeppner
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in
which case you'd have said "SAN".
I haven't decided on that but it would either be NFS or iSCSI over
Gigabit. I don't exactly get a big budget for this. iSCSI because I
planned to do md/mpath over two separate switches so that if one
switch explodes, the email service would still work.
So it seems you have two courses of action:

1. Identify individual current choke points and add individual systems
and storage to eliminate those choke points.

2. Analyze your entire workflow and all systems, identifying all choke
points, then design a completely new well integrated storage
architecture that solves all current problems and addresses future needs.

Adding an NFS server and moving infrequently accessed (old) files to
alternate storage will alleviate your space problems. But it will
probably not fix some of the other problems you mention, such as servers
bogging down and becoming unresponsive, as that's not a space issue.
The cause of that would likely be an IOPS issue, meaning you don't have
enough storage spindles to service requests in a timely manner.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Less complexity and cost is always better. CPU throughput isn't a
factor in mail workloads--it's all about IO latency. A 1U NFS server
with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
My worry is that if that one server dies, everything is dead. With at
least a pair of servers, I could keep it running, or if necessary,
restore the accounts on the dead servers from backup, make some config
changes and have everything back running while waiting for replacement
hardware.
You are a perfect candidate for VMware ESX. The HA feature will do
exactly what you want. If one physical node in the cluster dies, HA
automatically restarts the dead VMs on other nodes, transparently.
Clients will have to reestablish connections, but everything else
will pretty much be intact. Worst case scenario will possibly be a few
corrupted mailboxes that were being written when the hardware crashed.

A SAN is required for such a setup. I had extensive experience with ESX
and HA about 5 years ago and it works as advertised. After 5 years it
can only have improved. It's not "cheap" but usually pays for itself
due to being able to consolidate the workload of dozens of physical
servers into just 2 or 3 boxes.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I'm embarrassed to admit I don't have hard numbers on the user load
except the rapidly dwindling disk space count and the fact that when the
web-based mail application tries to list and check disk quota, it can
bring the servers to a crawl.
Maybe a description of your current hardware setup and total
number of users/mailboxes would be a good starting point. How
many servers do you have, what storage is connected to each, percent of
MUA POP/IMAP connections from user PCs versus those from webmail
applications, etc, etc.

Probably the single most important piece of information would be the
hardware specs of your current Dovecot server, CPUs/RAM/disk array, etc,
and what version of Dovecot you're running.

The focus of your email is building a storage server strictly to offload
old mail and free up space on the Dovecot server. From the sound of
things, this may not be sufficient to solve all your problems.
Post by Emmanuel Noobadmin
My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.
Baptism by fire. Ouch. What doesn't kill you makes you stronger. ;)
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage. Ok, so here's the baseline config I threw
One of my concerns is that heavy IO on the same server slows the overall
performance even though the theoretical IOPS of the total drives are
the same on 1 and on X servers. Right now, the servers are usually
screeching to a halt, to the point of even locking out SSH access due
to IOWait sending the load in top to triple digits.
If multiple servers are screeching to a halt due to iowait, either all
of your servers' individual disks are overloaded, or you already have
shared storage. We really need more info on your current architecture.
Right now we don't know if we're talking about 4 servers or 40, 100
users or 10,000.
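
A few minutes of iostat output from each box would settle the iowait
question. A minimal sketch (sysstat package; interval and count
illustrative):

$ iostat -x 5 12
# %util pinned near 100 with await in the tens or hundreds of ms on
# the mail spindles means you're out of disk IOPS, not CPU
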
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Some host failure redundancy is about all you'd gain from the farm
setup. Dovecot shouldn't barf due to one NFS node being down, only
hiccup. I.e. only imap processes accessing files on the downed node would
have trouble.
But if I only have one big storage node and that went down, Dovecot
would barf, wouldn't it?
Or would the mdbox format mean Dovecot would still use the local
storage, just that users can't access the offloaded messages?
If the big storage node is strictly alt storage, and it dies, Dovecot
will still access its main mdbox storage just fine. It simply wouldn't
be able to access the alt storage and would log errors for those requests.
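
For reference, the main/alt split in mdbox is a one-liner in the mail
location, and old mail is pushed to alt storage with doveadm. A sketch
assuming Dovecot 2.x, with illustrative paths, username and cutoff:

mail_location = mdbox:~/mdbox:ALT=/mnt/nfs/alt/%u

$ doveadm altmove -u someuser savedbefore 30d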

If you design a whole new architecture from scratch, going with ESX and
an iSCSI SAN, this whole line of thinking is moot as there is no SPOF.
This can be done with as little as two physical servers and one iSCSI
SAN array which has dual redundant controllers in the base config.
Depending on your actual IOPS needs, you could possibly consolidate
everything you have now into two physical servers and one iSCSI SAN
array, for between $30-40K USD in hardware and $8-10K in ESX licenses.
That's just a guess on ESX as I don't know the current pricing. Even if
it's that "high", it's well worth the price given the capability.

Such a setup allows you to run all of your Exim, webmail, Dovecot, etc
servers on two machines, and you usually get much better performance
than with individual boxes, especially if you manually place the VMs on
the nodes for lowest network latency. For instance, if you place your
webmail server VM on the same host as the Dovecot VM, TCP packet latency
drops from the high micro/low millisecond range into the mid nanosecond
range--a 1000x decrease in latency. Why? The packet transfer is now a
memory-to-memory copy through the hypervisor. The packets never reach a
physical network interface. You can do some of these things with free
Linux hypervisors, but AFAIK the poor management interfaces for them
make the price of ESX seem like a bargain.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
Also, I could possibly arrange them in a sort
of network RAID 1 to gain redundancy against single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you
are, and brush your hair away from your forehead. I'm coming over with
my branding iron that says "K.I.S.S"
Lol, I have no idea who Charles is, but I always feel safer if there
was some kind of backup. Especially since I don't have the time to
dedicate myself to server administration, by the time I notice
something is bad, it might be too late for anything but the backup.
Search the list archives for Charles' thread about bringing up a 2nd
office site. His desire was/is to duplicate machines at the 2nd site
for redundancy, when the proper thing to do is duplicate them at the
primary site, and simply duplicate the network links between sites. My
point to you and Charles is that you never add complexity for the sake
of adding complexity.
Post by Emmanuel Noobadmin
Of course management and clients don't agree with me since
backup/redundancy costs money. :)
So does gasoline, but even as the price has more than doubled in 3 years
in the States, people haven't stopped buying it. Why? They have to
have it. The case is the same for certain levels of redundancy. You
simply have to have it. Your job is properly explaining that need. Ask
the CEO/CFO how much money the company will lose in productivity if
nobody has email for 1 workday, as that is how long it will take to
rebuild it from scratch and restore all the data when it fails. Then
ask what the cost is if all the email is completely lost because they
were too cheap to pay for a backup solution?

To executives, money in the bank is like the family jewels in their
trousers. Kicking the family jewels and generating that level of pain
seriously gets their attention. Likewise, if a failed server plus
rebuild/restore costs $50K in lost productivity, spending $20K on a
solution to prevent that from happening is a good investment. Explain
it in terms execs understand. Have industry data to back your position.
There's plenty of it available.
--
Stan
Emmanuel Noobadmin
2012-04-09 19:15:02 UTC
Permalink
Post by Stan Hoeppner
1. Identify individual current choke points and add individual systems
and storage to eliminate those choke points.
2. Analyze your entire workflow and all systems, identifying all choke
points, then design a completely new well integrated storage
architecture that solves all current problems and addresses future needs.
I started to do this and realized I have a serious mess on my hands that
makes delving into other people's uncommented source code seem like a
joy :D

Management added to this by deciding that if we're going to offload
the email storage to networked storage, we might as well consolidate
everything into that shared storage system so we don't have TBs of
un-utilized space. So I might not even be able to use your tested XFS
+ concat solution since it may not be optimal for VM images and
databases.

As the requirements have changed, I'll stop asking here as it's no longer
really relevant just for Dovecot purposes.
Post by Stan Hoeppner
You are a perfect candidate for VMware ESX. The HA feature will do
exactly what you want. If one physical node in the cluster dies, HA
automatically restarts the dead VMs on other nodes, transparently.
Clients will have to reestablish connections, but everything else
will pretty much be intact. Worst case scenario will possibly be a few
corrupted mailboxes that were being written when the hardware crashed.
A SAN is required for such a setup.
Thanks for the suggestion, I will need to find some time to look into
this, although we've mostly been using KVM for virtualization so far.
However, the "SAN" part will probably prevent this from being accepted
due to cost.
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.
Baptism by fire. Ouch. What doesn't kill you makes you stronger. ;)
True, but I'd hate to be the customer who gets to pick up the pieces
when things explode due to unintended negligence by a dev trying to
level up by multi-classing as an admin.
Post by Stan Hoeppner
physical network interface. You can do some of these things with free
Linux hypervisors, but AFAIK the poor management interfaces for them
make the price of ESX seem like a bargain.
Unfortunately, with the usual kind of customers we have here, spending that
kind of budget isn't justifiable. The only reason we're providing
email services is because customers wanted freebies and they felt
there was no reason why we can't give them email on our servers; they
are all "servers" after all.

So I have to make do with OTS commodity parts and free software for
the most part.
Stan Hoeppner
2012-04-10 05:00:19 UTC
Permalink
Post by Emmanuel Noobadmin
Unfortunately, with the usual kind of customers we have here, spending that
kind of budget isn't justifiable. The only reason we're providing
email services is because customers wanted freebies and they felt
there was no reason why we can't give them email on our servers; they
are all "servers" after all.
So I have to make do with OTS commodity parts and free software for
the most part.
OTS meaning you build your own systems from components? Too few in the
business realm do so today. :(

It sounds like budget overrides redundancy then. You can do an NFS
cluster with SAN and GFS2, or two servers with their own storage and
DRBD mirroring. Here's how to do the latter:
http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat
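
The DRBD half of that howto boils down to one small resource file plus
a one-time init. A minimal sketch, with illustrative hostnames and IPs:

# /etc/drbd.d/mail.res
resource mail {
  protocol C;
  on nfs1 {
    device    /dev/drbd0;
    disk      /dev/md0;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on nfs2 {
    device    /dev/drbd0;
    disk      /dev/md0;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

$ drbdadm create-md mail    # run on both nodes
$ drbdadm up mail           # run on both nodes
$ drbdadm -- --overwrite-data-of-peer primary mail   # first node only

You then put XFS on /dev/drbd0 and let heartbeat decide which node
exports it via NFS.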

The total cost is about the same for each solution as an iSCSI SAN array
of drive count X is about the same cost as two JBOD disk arrays of count
X*2. Redundancy in this case is expensive no matter the method. Given
how infrequent host failures are, and the fact your storage is
redundant, it may make more sense to simply keep spare components on
hand and swap what fails--PSU, mobo, etc.

Interestingly, I designed a COTS server back in January to handle at
least 5k concurrent IMAP users, using best of breed components. If you
or someone there has the necessary hardware skills, you could assemble
this system and simply use it for NFS instead of Dovecot. The parts list:
secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985

In case the link doesn't work, the core components are:

SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
All other required parts are in the Wish List. I've not written
assembly instructions. I figure anyone who would build this knows what
s/he is doing.

Price today: $5,376.62

Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
you a 10TB net Linux device and 10 stripe spindles of IOPS and
bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read
throughput, however parallel write throughput will be at least 3-6x
slower than RAID10, which is why nobody uses RAID6 for transactional
workloads.

If you need more transactional throughput you could use 20 WD6000HLHX
600GB 10K RPM WD Raptor drives. You'll get 40% more throughput and 6TB
net space with RAID10. They'll cost you $1200 more, or $6,576.62 total.
Well worth the $1200 for 40% more throughput, if 6TB is enough.

Both of the drives I've mentioned here are enterprise class drives,
feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
list. The price of the 600GB Raptor has come down considerably since I
designed this system, or I'd have used them instead.

Anyway, lots of options out there. But $6,500 is pretty damn cheap for a
quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
drives.

The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional
$6,410 you could add an external Norco SAS expander JBOD chassis and 24
more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
10k spindles of IOPS performance from 44 total drives. That's $13K for
a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, IBM
solution of similar specs, each of which will set you back at least 20
large.

Note the chassis I've spec'd have single PSUs, not the dual or triple
redundant supplies you'll see on branded hardware. With a relatively
stable climate controlled environment and a good UPS with filtering,
quality single supplies are fine. In fact, in the 4U form factor single
supplies are usually more reliable due to superior IC packaging and
airflow through the heatsinks, not to mention much quieter.
--
Stan
Emmanuel Noobadmin
2012-04-10 06:09:18 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
So I have to make do with OTS commodity parts and free software for
the most part.
OTS meaning you build your own systems from components? Too few in the
business realm do so today. :(
For the inhouse stuff and budget customers yes, in fact both the email
servers are on seconded hardware that started life as something else.
I spec HP servers for our app servers to customers who are willing to
pay for their own colocated or onsite servers, but still there are
customers who balk at the cost and so go OTS or virtualized.
Post by Stan Hoeppner
SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
All other required parts are in the Wish List. I've not written
assembly instructions. I figure anyone who would build this knows what
s/he is doing.
Price today: $5,376.62
This price looks like something I might be able to push through
although I'll probably have to go SATA instead of SAS due to cost of
keeping spares.
Post by Stan Hoeppner
Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
you a 10TB net Linux device and 10 stripe spindles of IOPS and
bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read
throughput, however parallel write throughput will be at least 3-6x
slower than RAID10, which is why nobody uses RAID6 for transactional
workloads.
Not likely to go with RAID 5 or 6 due to concerns about the
uncorrectable read error risk on rebuild with large arrays. Is the
MegaRAID being used as the actual RAID controller or just as an HBA?

I have been avoiding hardware RAID because of a really bad experience
with RAID 5 on an obsolete controller that eventually died without
replacement and couldn't be recovered. Since then, it's always been
RAID 1 and, after I discovered mdraid, using controllers purely as HBAs
with mdraid for the flexibility of being able to just pull the drives into
a new system if necessary without having to worry about the
controller.
Post by Stan Hoeppner
Both of the drives I've mentioned here are enterprise class drives,
feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
list. The price of the 600GB Raptor has come down considerably since I
designed this system, or I'd have used them instead.
Anyway, lots of options out there. But $6,500 is pretty damn cheap for a
quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
drives.
The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional
$6,410 you could add an external Norco SAS expander JBOD chassis and 24
more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
10k spindles of IOPS performance from 44 total drives. That's $13K for
a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, IBM
solution of similar specs, each of which will set you back at least 20
large.
Would this setup work well too for serving up VM images? I've been
trying to find a solution for the virtualized app server images as
well, but the distributed FSes currently all seem bad at random
reads/writes. XFS seems to be good with large files like db
and VM images with random internal writes/reads, so given my time
constraints, it would be nice to have a single configuration that
works generally well for all the needs I have to oversee.
Post by Stan Hoeppner
Note the chassis I've spec'd have single PSUs, not the dual or triple
redundant supplies you'll see on branded hardware. With a relatively
stable climate controlled environment and a good UPS with filtering,
quality single supplies are fine. In fact, in the 4U form factor single
supplies are usually more reliable due to superior IC packaging and
airflow through the heatsinks, not to mention much quieter.
Same reason I do my best to avoid 1U servers; the space/heat issues
worry me. Yes, I'm guilty of worrying too much, but that has saved me
on several occasions.
Stan Hoeppner
2012-04-11 07:18:49 UTC
Permalink
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
All other required parts are in the Wish List. I've not written
assembly instructions. I figure anyone who would build this knows what
s/he is doing.
Price today: $5,376.62
This price looks like something I might be able to push through
It's pretty phenomenally low considering what all you get, especially 20
enterprise class drives.
Post by Emmanuel Noobadmin
although I'll probably have to go SATA instead of SAS due to cost of
keeping spares.
The 10K drives I mentioned are SATA not SAS. WD's 7.2k RE and 10k
Raptor series drives are both SATA but have RAID specific firmware,
better reliability, longer warranties, etc. The RAID specific firmware
is why both are tested and certified by LSI with their RAID cards.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
you a 10TB net Linux device and 10 stripe spindles of IOPS and
bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read
throughput, however parallel write throughput will be at least 3-6x
slower than RAID10, which is why nobody uses RAID6 for transactional
workloads.
Not likely to go with RAID 5 or 6 due to concerns about the
uncorrectable read error risk on rebuild with large arrays. Is the
Not to mention rebuild times for large width RAID5/6.
Post by Emmanuel Noobadmin
MegaRAID being used as the actual RAID controller or just as an HBA?
It's a top shelf RAID controller, 512MB cache, up to 240 drives, SSD
support, the works. It's an LSI "Feature Line" card:
http://www.lsi.com/products/storagecomponents/Pages/6GBSATA_SASRAIDCards.aspx

The specs:
http://www.lsi.com/products/storagecomponents/Pages/MegaRAIDSAS9280-4i4e.aspx

You'll need the cache battery module for safe write caching, which I
forgot in the wish list (now added), $160:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118163&Tpk=LSIiBBU08

With your workload and RAID10 you should run with all 512MB configured
as write cache. Linux caches all reads so using any controller cache
for reads is a waste. Using all 512MB for write cache will increase
random write IOPS.
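
With MegaCli that's roughly the following (exact flag spelling varies
a bit between MegaCli versions):

$ MegaCli -LDSetProp WB -LAll -aAll      # write-back, needs healthy BBU
$ MegaCli -LDSetProp NORA -LAll -aAll    # no read-ahead
$ MegaCli -LDSetProp Direct -LAll -aAll  # don't cache reads in the controller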

Note the 9280 allows up to 64 LUNs, so you can do tiered storage within
this 20 bay chassis. For spares management you'd probably not want to
bother with two different sized drives.

I didn't mention the 300GB 10K Raptors previously due to their limited
capacity. Note they're only $15 more apiece than the 1TB RE4 drives in
the original parts list. For a total of $300 more you get the same 40%
increase in IOPS of the 600GB model, but you'll only have 3TB net space
after RAID10. If 3TB is sufficient space for your needs, that extra 40%
IOPS makes this config a no brainer. The decreased latency of the 10K
drives will give a nice boost to VM read performance, especially when
using NFS. Write performance probably won't be much different due to
the generous 512MB write cache on the controller. I also forgot to
mention that with BBWC enabled you can turn off XFS barriers, which will
dramatically speed up Exim queues and Dovecot writes, all writes actually.
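
That's one line in fstab once you've verified the BBU is healthy
(device and mount point illustrative):

/dev/sda1  /srv/mail  xfs  noatime,inode64,nobarrier  0  0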

Again, you probably don't want the spares management overhead of two
different disk types on the shelf, but you could stick these 10K 300s in
the first 16 slots, and put the 2TB RE4 drives in the last 4 slots,
RAID10 on the 10K drives, RAID5 on the 2TB drives. This yields an 8
spindle high IOPS RAID10 of 2.4TB and a lower performance RAID5 of 6TB
for near line storage such as your Dovecot alt storage, VM templates,
etc, 8.4TB net, 1.6TB less than the original 10TB setup. Total
additional cost is $920 for this setup. You'd have two XFS filesystems
(with quite different mkfs parameters).
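
Roughly, assuming a 64KB controller stripe unit (illustrative):

$ mkfs.xfs -d su=64k,sw=8 /dev/sda   # 16-drive RAID10 = 8 stripe spindles
$ mkfs.xfs -d su=64k,sw=3 /dev/sdb   # 4-drive RAID5 = 3 data spindles
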
Post by Emmanuel Noobadmin
I have been avoiding hardware RAID because of a really bad experience
with RAID 5 on an obsolete controller that eventually died without
replacement and couldn't be recovered. Since then, it's always been
RAID 1 and, after I discovered mdraid, using controllers purely as HBAs
with mdraid for the flexibility of being able to just pull the drives into
a new system if necessary without having to worry about the
controller.
Assuming you have the right connector configuration for your
drive/enclosure on the replacement card, you can usually swap out one
LSI RAID card with any other LSI RAID card in the same, or newer,
generation. It'll read the configuration metadata from the disks and be
up and running in minutes. This feature has been around all the way back
to the AMI/Mylex cards of the late 1990s. LSI acquired both companies,
who were #1 and #2 in RAID, which is why LSI is so successful today.
Back in those days LSI simply supplied the ASICs to AMI and Mylex. I
have an AMI MegaRAID 428, top of the line in 1998, lying around
somewhere. Still working when I retired it many years ago.

FYI, LSI is the OEM provider of RAID and SAS/SATA HBA ASIC silicon for
the tier 1 HBA and mobo markets. Dell, HP, IBM, Intel, Oracle
(Sun), Siemens/Fujitsu, all use LSI silicon and firmware. Some simply
rebadge OEM LSI cards with their own model and part numbers. IBM and
Dell specifically have been doing this rebadging for well over a decade,
long before LSI acquired Mylex and AMI. The Dell PERC/2 is a rebadged
AMI MegaRAID 428.

Software and hardware RAID each have their pros and cons. I prefer
hardware RAID for write cache performance and many administrative
reasons, including SAF-TE enclosure management (fault LEDs, alarms, etc)
so you know at a glance which drive has failed and needs replacing,
email and SNMP notification of events, automatic rebuild, configurable
rebuild priority, etc, etc, and good performance with striping and
mirroring. Parity RAID performance often lags behind md with heavy
workloads but not with light/medium. FWIW I rarely use parity RAID, due
to the myriad performance downsides.

For ultra high random IOPS workloads, or when I need a single filesystem
space larger than the drive limit or practical limit for one RAID HBA,
I'll stitch hardware RAID1 or small stripe width RAID 10 arrays (4-8
drives, 2-4 spindles) together with md RAID 0 or 1.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Both of the drives I've mentioned here are enterprise class drives,
feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
list. The price of the 600GB Raptor has come down considerably since I
designed this system, or I'd have used them instead.
Anyway, lots of option out there. But $6,500 is pretty damn cheap for a
quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
drives.
The MegaRAID 9280-4i4e has an external SFF8088 port For an additional
$6,410 you could add an external Norco SAS expander JBOD chassis and 24
more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
10k spindles of IOPS performance from 44 total drives. That's $13K for
a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, IBM
solution of similar specs, each of which will set you back at least 20
large.
Would this setup work well too for serving up VM images? I've been
trying to find a solution for the virtualized app server images as
well, but the distributed FSes currently all seem bad at random
reads/writes. XFS seems to be good with large files like db
and VM images with random internal writes/reads, so given my time
constraints, it would be nice to have a single configuration that
works generally well for all the needs I have to oversee.
Absolutely. If you set up these 20 drives as a single RAID10, soft/hard
or hybrid, with the LSI cache set to 100% write-back, with a single XFS
filesystem with 10 allocation groups and proper stripe alignment, you'll
get maximum performance for pretty much any conceivable workload.
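
For example, again assuming a 64KB controller stripe unit
(illustrative):

$ mkfs.xfs -d agcount=10,su=64k,sw=10 /dev/sda   # 20 drives = 10 spindles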

Your only limitations will be possible NFS or TCP tuning issues, and
maybe having only two GbE ports. For small random IOPS such as Exim
queues, Dovecot store, VM image IO, etc, the two GbE ports are plenty.
But if you add any large NFS file copies into the mix, such as copying
new VM templates or ISO images over, etc, or do backups over NFS instead
of directly on the host machine at the XFS level, then two bonded GbE
ports might prove a bottleneck.

The mobo has 2 PCIe x8 slots and one x4 slot. One of the x8 slots is an
x16 physical connector. You'll put the LSI card in the x16 slot. If
you mount the Intel SAS expander to the chassis as I do instead of in a
slot, you have one free x8 and one free x4 slot. Given the $250 price,
I'd simply add an Intel quad port GbE NIC to the order. Link aggregate
all 4 ports on day one and use one IP address for the NFS traffic. Use
the two on board ports for management etc. This should give you a
theoretical 400MB/s of peak NFS throughput, which should be plenty no
matter what workload you throw at it.
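
A minimal bonding sketch, with illustrative interface names (the
switch ports must be configured for LACP):

# /etc/modprobe.d/bonding.conf
options bonding mode=802.3ad miimon=100

$ modprobe bonding
$ ifconfig bond0 10.0.0.10 netmask 255.255.255.0 up
$ ifenslave bond0 eth2 eth3 eth4 eth5
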
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Note the chassis I've spec'd have single PSUs, not the dual or triple
redundant supplies you'll see on branded hardware. With a relatively
stable climate controlled environment and a good UPS with filtering,
quality single supplies are fine. In fact, in the 4U form factor single
supplies are usually more reliable due to superior IC packaging and
airflow through the heatsinks, not to mention much quieter.
Same reason I do my best to avoid 1U servers; the space/heat issues
worry me. Yes, I'm guilty of worrying too much, but that has saved me
on several occasions.
Just about every 1U server I've seen that's been racked for 3 or more
years has warped under its own weight. I even saw an HPQ 2U that was
warped this way, badly warped. In this instance the slide rail bolts
had never been tightened down to the rack--could spin them by hand.
Since the chassis side panels weren't secured, and there was lateral
play, the weight of the 6 drives caused the side walls of the case to
fold into a mild trapezoid, which allowed the bottom and top panels to
bow. Let this be a lesson boys and girls: always tighten your rack
bolts. :)
--
Stan
Ed W
2012-04-11 16:50:09 UTC
Permalink
Re XFS. Have you been watching BTRFS recently?

I will concede that despite the authors considering it production ready
I won't be using it for my servers just yet. However, it benchmarks
fairly similarly to XFS on single-disk tests and in certain cases
(multi-threaded performance) can be somewhat better. I haven't yet seen
any benchmarks on larger disk arrays, eg 6+ disks, so no idea how it
scales up. Basically what I have seen seems "competitive".

I don't have such hardware spare to benchmark, but I would be interested
to hear from someone who benchmarks your RAID1+linear+XFS suggestion,
especially if they have compared a cutting edge btrfs kernel on the same
array?

One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
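
From what I can tell, md scrubbing is driven through sysfs, and on
RAID1 a mismatch is "repaired" by copying one mirror over the other;
md has no checksum to know which copy was actually good:

$ echo check > /sys/block/md0/md/sync_action    # scrub, count mismatches
$ cat /sys/block/md0/md/mismatch_cnt
$ echo repair > /sys/block/md0/md/sync_action   # rewrite mismatched blocks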

Regards

Ed W
Adrian Minta
2012-04-11 20:48:00 UTC
Permalink
On 04/11/12 19:50, Ed W wrote:

...
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in
the event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
RAID6 is very slow for write operations. That's why it's the worst choice
for maildir.
Charles Marcus
2012-04-11 18:50:11 UTC
Permalink
Post by Adrian Minta
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in
the event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
RAID6 is very slow for write operations. That's why it's the worst choice
for maildir.
He did say "For *low* *performance* requirements..." ... ;)
--
Best regards,

Charles
Stan Hoeppner
2012-04-12 01:18:08 UTC
Permalink
Post by Ed W
Re XFS. Have you been watching BTRFS recently?
I will concede that despite the authors considering it production ready
I won't be using it for my servers just yet. However, it benchmarks
fairly similarly to XFS on single-disk tests and in certain cases
(multi-threaded performance) can be somewhat better. I haven't yet seen
any benchmarks on larger disk arrays, eg 6+ disks, so no idea how it
scales up. Basically what I have seen seems "competitive".
Links?
Post by Ed W
I don't have such hardware spare to benchmark, but I would be interested
to hear from someone who benchmarks your RAID1+linear+XFS suggestion,
especially if they have compared a cutting edge btrfs kernel on the same
array?
http://btrfs.boxacle.net/repository/raid/history/History_Mail_server_simulation._num_threads=128.html

This is with an 8-wide LVM stripe over eight 17-drive hardware RAID0 arrays.
If the disks had been setup as a concat of 68 RAID1 pairs, XFS would
have turned in numbers significantly higher, anywhere from a 100%
increase to 500%. It's hard to say because the Boxacle folks didn't
show the XFS AG config they used. The concat+RAID1 setup can decrease
disk seeks by many orders of magnitude vs striping. Everyone knows as
seeks go down IOPS go up. Even with this very suboptimal disk setup,
XFS still trounces everything but JFS which is a close 2nd. BTRFS is
way down in the pack. It would be nice to see these folks update these
results with a 3.2.6 kernel, as both BTRFS and XFS have improved
significantly since 2.6.35. EXT4 and JFS have seen little performance
work since. In fact JFS has seen no commits but bug fixes and changes
to allow compiling with recent kernels.
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the
controller takes care of sector integrity. RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.
--
Stan
Emmanuel Noobadmin
2012-04-12 02:23:19 UTC
Permalink
Post by Stan Hoeppner
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the
controller takes care of sector integrity. RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.
How would the controller ensure sector integrity unless it is writing
additional checksum information to disk? I thought only a few
filesystems like ZFS do sector checksums to detect if any data
corruption occurred. I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other, but it
wouldn't know which is the accurate copy, so that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention, I would think.
Stan Hoeppner
2012-04-12 10:20:31 UTC
Permalink
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the
controller takes care of sector integrity. RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.
How would the controller ensure sector integrity unless it is writing
additional checksum information to disk? I thought only a few
filesystems like ZFS do sector checksums to detect if any data
corruption occurred. I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other, but it
wouldn't know which is the accurate copy, so that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention, I would think.
When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it. Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years. The mindset here is that anyone would rather spend $150-$2500
on a replacement drive than take a chance with his/her valuable
data.

Yes I typed $2500. EMC charges over $2000 for a single Seagate disk
drive with an EMC label and serial# on it. The serial number is what
prevents one from taking the same off the shelf Seagate drive at $300
and mounting it in a $250,000 EMC array chassis. The controller
firmware reads the S/N from each connected drive and will not allow
foreign drives to be used. HP, IBM, Oracle/Sun, etc do this as well.
Which is why they make lots of profit, and is why I prefer open storage
systems.
--
Stan
Ed W
2012-04-12 10:58:52 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the
controller takes care of sector integrity. RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.
How would the controller ensure sector integrity unless it is writing
additional checksum information to disk? I thought only a few
filesystems like ZFS do sector checksums to detect if any data
corruption occurred. I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other, but it
wouldn't know which is the accurate copy, so that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention, I would think.
When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it. Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years. The mindset here is that anyone would rather spend $150-$2500
on a replacement drive than take a chance with his/her valuable
data.
I'm asking a subtly different question.

The claim by ZFS/BTRFS authors and others is that data silently "bit
rots" on it's own. The claim is therefore that you can have a raid1 pair
where neither drive reports a hardware failure, but each gives you
different data? I can't personally claim to have observed this, so it
remains someone else's theory... (for background my experience is
simply: RAID10 for high performance arrays and RAID6 for all my personal
data - I intend to investigate your linear raid idea in the future though)

I do agree that if one drive reports a read error, then it's quite easy
to guess which member of the pair is wrong...

Just as an aside, I don't have a lot of failure experience. However,
the few events I have had (perhaps 6-8 now) show a massive
correlation in failure time with RAID1, eg one pair I had lasted perhaps
2 years and then both failed within 6 hours of each other. I also had a
bad experience with RAID 5 that wasn't being scrubbed regularly and when
one drive started reporting errors (ie lack of monitoring meant it had
been bad for a while), the rest of the array turned out to be a
patchwork of read errors - linux raid then turns out to be quite fragile
in the presence of a small number of read failures and it's extremely
difficult to salvage the 99% of the array which is ok due to the disks
getting kicked out... (of course regular scrubs would have prevented
getting so deep into that situation - it was a small cheap nas box
without such features)

Ed W
Timo Sirainen
2012-04-12 11:09:31 UTC
Permalink
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data?
That's one reason why I planned on adding a checksum to each message in dbox. But I forgot to actually do that. I guess I could add it for new messages in some upcoming version. Then Dovecot could optionally verify the checksum before returning the message to client, and if it detects corruption perhaps automatically read it from some alternative location (e.g. if dsync replication is enabled ask from another replica). And Dovecot index files really should have had some small (8/16/32bit) checksums of stuff as well..
Ed W
2012-04-12 12:10:20 UTC
Permalink
Post by Timo Sirainen
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data?
That's one reason why I planned on adding a checksum to each message in dbox. But I forgot to actually do that. I guess I could add it for new messages in some upcoming version. Then Dovecot could optionally verify the checksum before returning the message to client, and if it detects corruption perhaps automatically read it from some alternative location (e.g. if dsync replication is enabled ask from another replica). And Dovecot index files really should have had some small (8/16/32bit) checksums of stuff as well..
I have to say - I haven't actually seen this happen... Do any of your
big mailstore contacts observe this, eg rackspace, etc?

I think it's worth thinking about the failure cases before implementing
something to be honest? Just sticking in a checksum possibly doesn't
help anyone unless it's on the right stuff and in the right place?

Off the top of my head:
- Someone butchers the file on disk (disk error or someone edits it with vi)
- Restore of some files goes subtly wrong, eg tool tries to be clever
and fails, snapshot taken mid-write, etc?
- Filesystem crash (sudden power loss), how to deal with partial writes?


Things I might like to do *if* there were some suitable "checksums"
available:
- Use the checksum as some kind of guid either for the whole message,
the message minus the headers, or individual mime sections
- Use the checksums to assist with replication speed/efficiency (dsync
or custom imap commands)
- File RFCs for new imap features along the "lemonade" lines which allow
clients to have faster recovery from corrupted offline states...
- Single instance storage (presumably already done, and of course this
has some subtleties in the face of deliberate attack)
- Possibly duplicate email suppression (but really this is an LDA
problem...)
- Storage backends where emails are redundantly stored and might not ALL
be on a single server (find me the closest copy of email X) -
derivations of this might be interesting for compliance archiving of
messages?
- Fancy key-value storage backends might use checksums as part of the
key value (either for the whole or parts of the message)

The mail server has always looked like a kind of key-value store to my
eye. However, traditional key-value isn't usually optimised for
"streaming reads", hence dovecot seems like a "key value store,
optimised for sequential high speed streaming access to the key
values"... Whilst it seems increasingly unlikely that a traditional
key-value store will work well to replace say mdbox, I wonder if it's
not worth looking at the replication strategies of key-value stores to
see if those ideas couldn't lead to new features for mdbox?

Cheers

Ed W
Dirk Jahnke-Zumbusch
2012-04-12 14:08:31 UTC
Permalink
Hi there,
Post by Ed W
I have to say - I haven't actually seen this happen... Do any of your
big mailstore contacts observe this, eg rackspace, etc?
Just to throw into the discussion: with (silent) data corruption, not
only "the disk" is involved but many other parts of your system. So
perhaps you would like to have a look at

https://indico.desy.de/getFile.py/access?contribId=65&sessionId=42&resId=0&materialId=slides&confId=257

http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797

The documents are from 2007 but the principles are still the same.

Kind regards
Dirk
Timo Sirainen
2012-04-13 11:51:06 UTC
Permalink
Post by Timo Sirainen
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data?
That's one reason why I planned on adding a checksum to each message in dbox. But I forgot to actually do that. I guess I could add it for new messages in some upcoming version. Then Dovecot could optionally verify the checksum before returning the message to the client, and if it detects corruption perhaps automatically read it from some alternative location (e.g. if dsync replication is enabled ask another replica). And Dovecot index files really should have had some small (8/16/32 bit) checksums of stuff as well..
I have to say - I haven't actually seen this happen... Do any of your big mailstore contacts observe this, eg rackspace, etc?
I haven't heard. But then again people don't necessarily notice if it has.
- Use the checksum as some kind of guid either for the whole message, the message minus the headers, or individual mime sections
Messages already have a GUID. And the rest of that is kind of done with the single instance storage stuff.. I was thinking of using the SHA1 of the entire message with headers as the checksum, and saving it into a dbox metadata field. I also thought about checksumming the metadata fields as well, but that would need another checksum, as the first one can have other uses besides verifying the message integrity.
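(To make that concrete: the digest described is just a standard SHA1
over the raw message bytes, headers included, so for a message dumped to
a plain file - msg.eml is an illustrative name, not Dovecot's on-disk
layout - the stored metadata value would be the 40-hex-digit output of:

$ sha1sum msg.eml

and verify-on-read amounts to recomputing that and comparing it with the
stored field.)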
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
It would be of some use with dbox index rebuilding. I don't think it would help with dsync.
- File RFCs for new imap features along the "lemonade" lines which allow clients to have faster recovery from corrupted offline states...
Too much trouble, no one would implement it :)
- Storage backends where emails are redundantly stored and might not ALL be on a single server (find me the closest copy of email X) - derivations of this might be interesting for compliance archiving of messages?
- Fancy key-value storage backends might use checksums as part of the key value (either for the whole or parts of the message)
GUID would work for these as well, without the possibility of a hash collision.
Ed W
2012-04-13 12:17:19 UTC
Permalink
Post by Timo Sirainen
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
It would be of some use with dbox index rebuilding. I don't think it would help with dsync.
..
Post by Timo Sirainen
- File RFCs for new imap features along the "lemonade" lines which allow clients to have faster recovery from corrupted offline states...
Too much trouble, no one would implement it :)
I presume you have seen that cyrus is working on various distributed
options? Standardising this through imap might work if they also buy
into it?
Post by Timo Sirainen
- Storage backends where emails are redundantly stored and might not ALL be on a single server (find me the closest copy of email X) - derivations of this might be interesting for compliance archiving of messages?
- Fancy key-value storage backends might use checksums as part of the key value (either for the whole or parts of the message)
GUID would work for these as well, without the possibility of a hash collision.
I was thinking that the win for key-value store as a backend is if you
can reduce the storage requirements or do better placement of the data
(mail text replicated widely, attachments stored on higher latency
storage?). Hence whilst I don't see this being a win with current
options, if it were done then it would almost certainly be "per mime
part", eg storing all large attachments in one place and the rest of the
message somewhere else, perhaps with different redundancy levels per type

OK, this is all completely pie in the sky. Please don't build it! All
I meant was that these are the kind of things that someone might one day
desire to do and hence they would have competing requirements for what
to checksum...

Cheers

Ed W
Timo Sirainen
2012-04-13 12:21:49 UTC
Permalink
Post by Timo Sirainen
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
It would be of some use with dbox index rebuilding. I don't think it would help with dsync.
..
Post by Timo Sirainen
- File RFCs for new imap features along the "lemonade" lines which allow clients to have faster recovery from corrupted offline states...
Too much trouble, no one would implement it :)
I presume you have seen that cyrus is working on various distributed options? Standardising this through imap might work if they also buy into it?
Probably more trouble than it's worth. I doubt anyone would want to run a cross-Dovecot/Cyrus cluster.
Post by Timo Sirainen
- Storage backends where emails are redundantly stored and might not ALL be on a single server (find me the closest copy of email X) - derivations of this might be interesting for compliance archiving of messages?
- Fancy key-value storage backends might use checksums as part of the key value (either for the whole or parts of the message)
GUID would work for these as well, without the possibility of a hash collision.
I was thinking that the win for key-value store as a backend is if you can reduce the storage requirements or do better placement of the data (mail text replicated widely, attachments stored on higher latency storage?). Hence whilst I don't see this being a win with current options, if it were done then it would almost certainly be "per mime part", eg storing all large attachments in one place and the rest of the message somewhere else, perhaps with different redundancy levels per type
OK, this is all completely pie in the sky. Please don't build it! All I meant was that these are the kind of things that someone might one day desire to do and hence they would have competing requirements for what to checksum...
That can almost be done already .. the attachments are saved and accessed via a lib-fs API. It wouldn't be difficult to write a backend for some key-value databases. So with about one day's coding you could already have Dovecot save all message attachments to a key-value db, and you can configure redundancy in the db's configs.
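(For reference, the filesystem-backed version of this attachment
offloading is already plain dovecot.conf material; a minimal sketch,
with an illustrative path and threshold - "sis posix" being the stock
single-instance-storage filesystem backend, not a key-value db:

mail_attachment_dir = /srv/attachments
mail_attachment_min_size = 128k
mail_attachment_fs = sis posix
mail_attachment_hash = %{sha1}

A key-value backend would slot in behind the same lib-fs API in place of
"posix".)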
Ed W
2012-04-13 14:04:17 UTC
Permalink
Post by Timo Sirainen
Post by Timo Sirainen
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
It would be of some use with dbox index rebuilding. I don't think it would help with dsync.
..
Post by Timo Sirainen
- File RFCs for new imap features along the "lemonade" lines which allow clients to have faster recovery from corrupted offline states...
Too much trouble, no one would implement it :)
I presume you have seen that cyrus is working on various distributed options? Standardising this through imap might work if they also buy into it?
Probably more trouble than it's worth. I doubt anyone would want to run a cross-Dovecot/Cyrus cluster.
No, definitely not. Sorry, I just meant that you are both working on
similar things. Standardising the basics that each uses might be useful
in the future.
Post by Timo Sirainen
That can almost be done already .. the attachments are saved and accessed via a lib-fs API. It wouldn't be difficult to write a backend for some key-value databases. So with about one day's coding you could already have Dovecot save all message attachments to a key-value db, and you can configure redundancy in the db's configs.
Hmm, super.

Ed W
Stan Hoeppner
2012-04-13 05:29:52 UTC
Permalink
Post by Ed W
The claim by ZFS/BTRFS authors and others is that data silently "bit
rots" on it's own. The claim is therefore that you can have a raid1 pair
where neither drive reports a hardware failure, but each gives you
different data?
You need to read those articles again very carefully. If you don't
understand what they mean by "1 in 10^15 bits non-recoverable read error
rate" and combined probability, let me know.

And this has zero bearing on RAID1. And RAID1 reads don't work the way
you describe above. I explained this in some detail recently.
Post by Ed W
I do agree that if one drive reports a read error, then it's quite easy
to guess which half of the mirror pair is wrong...
Been working that way for more than 2 decades Ed. :) Note that "RAID1"
has that "1" for a reason. It was the first RAID level. It was in
production for many many years before parity RAID hit the market. It is
the most well understood of all RAID levels, and the simplest.
--
Stan
Ed W
2012-04-13 15:09:31 UTC
Permalink
Post by Stan Hoeppner
Post by Ed W
The claim by ZFS/BTRFS authors and others is that data silently "bit
rots" on it's own. The claim is therefore that you can have a raid1 pair
where neither drive reports a hardware failure, but each gives you
different data?
You need to read those articles again very carefully. If you don't
understand what they mean by "1 in 10^15 bits non-recoverable read error
rate" and combined probability, let me know.
OK, I'll bite. I only have an honours degree in mathematics from a
well-known university, so I'd be grateful if you could dumb it down
appropriately?

Let's start with which "those articles" you are referring to? I don't
see any articles if I go literally up the chain from this email, but you
might be talking about any of the many other emails in this thread or
even some other email thread?

Wikipedia has its faults, but it dumbs the "silent corruption" claim
down to:
http://en.wikipedia.org/wiki/ZFS
"an undetected error for every 67TB"

And a CERN study apparently claims "far higher than one in every 10^16 bits"

Now, I'm NOT professing any experience or axe to grind here. I'm simply
asking by what feature you believe either software or hardware RAID1
is capable of detecting which copy is correct when both halves of a
RAID1 pair return different results and there is no hardware failure to
clue us in that one half suffered a read error? Please don't respond
with a maths pissing competition, it's an innocent question about what
levels of data checking are done at each stage of the hardware chain. My
(probably flawed) understanding is that popular RAID1 implementations
don't add any additional sector checksums over and above what the
drives/filesystem/etc already offer - is this the case?
Post by Stan Hoeppner
And this has zero bearing on RAID1. And RAID1 reads don't work the way
you describe above. I explained this in some detail recently.
Where?
Post by Stan Hoeppner
Been working that way for more than 2 decades Ed. :) Note that "RAID1"
has that "1" for a reason. It was the first RAID level.
What should I make of RAID0 then?

Incidentally do you disagree with the history of RAID evolution on
Wikipedia?
http://en.wikipedia.org/wiki/RAID


Regards

Ed W
Emmanuel Noobadmin
2012-04-13 06:12:48 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other but it
wouldn't know which is the accurate copy but that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention I would think.
When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it. Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years.
What I meant wasn't the drive throwing uncorrectable read errors, but
the drives returning different data that each thinks is correct, or
both sending the correct data but one of the copies getting corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.

If the controller simply returns the fastest result, it could be
returning the bad sector, and that doesn't protect the integrity of the
data, right?

If the controller gets the 1st half from one drive and the 2nd half from
the other drive to speed up performance, we could still get the
corrupted half, and the controller itself still can't tell whether the
sector it got was corrupted, can it?

If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong, but there isn't any way for it to
know which of the sectors was a good read and which wasn't, or is
there?
Stan Hoeppner
2012-04-13 12:33:19 UTC
Permalink
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other but it
wouldn't know which is the accurate copy but that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention I would think.
When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it. Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years.
What I meant wasn't the drive throwing uncorrectable read errors, but
the drives returning different data that each thinks is correct, or
both sending the correct data but one of the copies getting corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
Post by Emmanuel Noobadmin
If the controller simply returns the fastest result, it could be
returning the bad sector, and that doesn't protect the integrity of the
data, right?
I already answered this in a previous post.
Post by Emmanuel Noobadmin
If the controller gets the 1st half from one drive and the 2nd half from
the other drive to speed up performance, we could still get the
corrupted half, and the controller itself still can't tell whether the
sector it got was corrupted, can it?
No, this is not correct.
Post by Emmanuel Noobadmin
If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong, but there isn't any way for it to
know which of the sectors was a good read and which wasn't, or is
there?
Yes it can, and it does.

Emmanuel, Ed, we're at a point where I simply don't have the time or the
inclination to continue answering these basic questions about the
base-level functions of storage hardware. You both have serious
misconceptions about how many things work. To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.

I don't mind, and actually enjoy, passing on knowledge. But the amount
that seems to be required here to bring you up to speed is about 2^15
times above and beyond the scope of a mailing list conversation.

In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry,
and we'd all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
--
Stan
Jim Lawson
2012-04-13 13:12:02 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors, but
the drives returning different data that each thinks is correct, or
both sending the correct data but one of the copies getting corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
?! Stan, are you really saying that silent data corruption "simply
can't happen"? People who have been studying this have been talking
about it for years now. It can happen in the same way that Emmanuel
describes.

USENIX FAST08:

http://static.usenix.org/event/fast08/tech/bairavasundaram.html

CERN:

http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf

LANL:

http://institute.lanl.gov/resilience/conferences/2009/HPCResilience09_Michalak.pdf

There are others if you search for them. This problem has been well-known
in large (petabyte+) data storage systems for some time.

Jim
Stan Hoeppner
2012-04-13 14:20:29 UTC
Permalink
Post by Jim Lawson
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors, but
the drives returning different data that each thinks is correct, or
both sending the correct data but one of the copies getting corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
?! Stan, are you really saying that silent data corruption "simply
can't happen"?
Yes, I did. Did you read the context in which I made that statement?
Post by Jim Lawson
People who have been studying this have been talking
about it for years now.
Yes, they have. Did you miss the paragraph where I stated exactly that?
Did you also miss the part about the probability of such being dictated by
total storage system size and access rate?
Post by Jim Lawson
It can happen in the same way that Emmanuel
describes.
No, it can't. Not in the way Emmanuel described. I already stated the
reason, and all of this research backs my statement. You won't see this
with a 2 drive mirror, or a 20 drive RAID10. Not until each drive has a
capacity in the 15TB+ range, if not more, and again, depending on the
total system size. This doesn't address the "RAID5", better known as
"parity RAID" write hole, which is a separate issue. Which is also one
of the reasons I don't use it.

Barring an actual controller firmware bug, or an mdraid or lvm bug,
you'll never see this on small scale systems.
Post by Jim Lawson
http://static.usenix.org/event/fast08/tech/bairavasundaram.html
http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf
http://institute.lanl.gov/resilience/conferences/2009/HPCResilience09_Michalak.pdf
There are others if you search for them. This problem has been well-known
in large (petabyte+) data storage systems for some time.
And again, this is the crux of it. One doesn't see this problem until
one hits extreme scale, which I spent at least a paragraph or two
explaining, referencing the same research. Please re-read my post at
least twice, critically. Then tell me if I've stated anything
substantively different from what any of these researchers have.

The statements "shouldn't", "wouldn't" and "can't" are based on
probabilities. "Can't" or "won't" does not need to equal probability 0.
The probability of this type of silent data corruption occurring on a 2
disk or 20 disk array of today's drives is not zero over 10 years, but
it is so low that the effective statement is "can't" or "won't" see this
corruption. As I said, when we reach 15-30TB+ disk drives, this may
change for small count arrays.
--
Stan
Ed W
2012-04-13 15:31:35 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors, but
the drives returning different data that each thinks is correct, or
both sending the correct data but one of the copies getting corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
It quite clearly can??!

Just grab your drive, lever the connector off a little bit until it's a
bit flaky and off you go? *THIS* type of problem I have heard of, and
you can find easy examples with a quick google search of any hobbyist
storage board. Other very common examples are problems due to failing
PSUs and other interference causing explicit disk errors (and once the
error rate goes up, some corruption will make it past the checksum)

Note this is NOT what I was originally asking about. My interest is
more about when the hardware is working reliably and as you agree, the
error levels are vastly lower. However, it would be incredibly foolish
to claim that it's not trivial to construct a scenario where bad
hardware causes plenty of silent corruption?
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
If the controller simply returns the fastest result, it could be
returning the bad sector, and that doesn't protect the integrity of the
data, right?
I already answered this in a previous post.
Not obviously?!

I will also add my understanding that linux software RAID1,5&6 *DO NOT*
read all disks and hence will not be aware when disks have different
data. In fact with software raid you need to run a regular "scrub" job
to check this consistency.
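(Concretely, the scrub interface on Linux md is the sync_action file;
assuming the array is md0:

$ echo check > /sys/block/md0/md/sync_action
$ cat /sys/block/md0/md/mismatch_cnt

A non-zero mismatch count after the check completes is exactly the
"mirrors hold different data but neither drive reported an error" case.)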

I also believe that most commodity hardware raid implementations work
exactly the same way and a background scrub is needed to detect
inconsistent arrays. However, feel free to correct that understanding?
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
If the controller gets the 1st half from one drive and the 2nd half from
the other drive to speed up performance, we could still get the
corrupted half, and the controller itself still can't tell whether the
sector it got was corrupted, can it?
No, this is not correct.
I definitely think you are wrong and Emmanuel is right?

If the controller gets a good read from the disk then it will trust that
read and will NOT check the result with the other disk (or parity in the
case of RAID5/6). If that read was incorrect for some reason then the
data will be passed as good.
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong, but there isn't any way for it to
know which of the sectors was a good read and which wasn't, or is
there?
Yes it can, and it does.
No it definitely does not!! At least not with linux software raid, and I
don't believe it does on commodity hardware controllers either! (You
would be able to tell because the disk IO would be doubled.)

Linux software raid 1 isn't that smart, but reads only one disk and
trusts the answer if the read did not trigger an error. It does not
check the other disk except during an explicit disk scrub.
Post by Stan Hoeppner
Emmanuel, Ed, we're at a point where I simply don't have the time or the
inclination to continue answering these basic questions about the
base-level functions of storage hardware.
You mean those "answers" like:
"I answered that in another thread"
or
"you need to read 'those' articles again"

Referring to some unknown and hard-to-find previous emails is not the
same as answering?

Also you are wandering off at extreme tangents. The question is simple:

- Disk 1 Read good, checksum = A
- Disk 2 Read good, checksum = B

Disks are a raid 1 pair. How do we know which disk is correct? Please
specify the raid 1 implementation and mechanism used with any answer.
Post by Stan Hoeppner
To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.
I really think not... A simple statement of:

- Each sector on disk has a certain sized checksum
- Controller checks checksum on read
- Sent back over SATA connection, with a certain sized checksum
- After that you are on your own vs corruption

...Should cover it I think?
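(That per-sector checksum layer is even visible from userspace on most
drives, e.g. with smartmontools:

$ smartctl -A /dev/sda

where attributes like Hardware_ECC_Recovered and UDMA_CRC_Error_Count
count on-platter ECC events and SATA link CRC errors respectively - a
reminder that anything above those two hops gets no further protection.)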
Post by Stan Hoeppner
In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry,
and we'd all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
So why are so many people getting excited about it now?

Note, there have been plenty of shoddy disk controller implementations
before today - ie there exists hardware on sale with *known* defects.
Despite that, the industry continues without collapse. Now you claim
that because corruption is silent, and people only tend to notice it
much later and under certain edge conditions, it can't be happening,
because otherwise the industry would have collapsed..???

...Not buying your logic...

Ed W
Maarten Bezemer
2012-04-13 19:10:04 UTC
Permalink
Post by Ed W
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors, but
the drives returning different data that each thinks is correct, or
both sending the correct data but one of the copies getting corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
It quite clearly can??!
I totally agree with Ed here. Drives sure can and sometimes really do
return different data, without reporting errors. Also, data can get
corrupted on any of the buses or chips it passes through.

The math about 10^15 or 10^16 and all that stuff is not only about array
sizes. It's also about data transfer.

I've seen silent corruption on a few systems myself. (Luckily, only 3
times in a couple of years.) Those systems were only in the 2TB-5TB size
category, which is substantially lower than the 67TB claimed elsewhere.
Yet, statistically, it's well within normal probability levels.

Linux mdraid only reads one mirror as long as the drives don't return an
error. Easy to check: the read speeds are way beyond a single drive's
read speed. If the kernel had to read all (possibly more than two)
mirrors, compare them, and make a decision based on this comparison,
things would be horribly slow. Hardware raid typically uses this exact
same approach. This goes for Areca, 3ware, LSI, which cover most of the
regular (i.e. non-SAN) professional hardware raid setups.

If you don't believe it, don't take my word for it but test it for
yourself. Cleanly power down a raid1 array, take the individual drives,
put them into a simple desktop machine, and write different data to both
using some raw disk writing tool like dd. Then put the drives back into
the raid1 array, power it up, and re-read the information. You'll see
that data from both drives is intermixed, as parts of the reads come
from one disk and parts come from the other. Only when you order the
raid array to do a verification pass will it start screaming and
yelling. At least, I hope it will...
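(A concrete recipe for that experiment - device names are illustrative,
and the offset must be chosen to land inside the data area rather than
on the md metadata:

$ dd if=/dev/urandom of=/dev/sdb bs=512 seek=20480 count=8
$ dd if=/dev/zero of=/dev/sdc bs=512 seek=20480 count=8

Reassemble the mirror, read that region repeatedly, and the returned
data will vary with whichever member services each read; nothing
complains until a verification pass is ordered.)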


But as explained elsewhere, silent corruption can occur at numerous
places. If you don't have an explicit checksumming/checking mechanism,
there are indeed cases that will haunt you if you don't do regular
scrubbing or at least do regular verification runs. Heck, that's why Linux
mdadm comes with cron jobs to do just that, and hardware raid controllers
have similar scheduling capabilities.

Of course, scrubbing/verification is not going to magically protect you
from all problems. But you would at least get notifications if it detects
problems.
Post by Ed W
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong, but there isn't any way for it to
know which of the sectors was a good read and which wasn't, or is
there?
Yes it can, and it does.
No it definitely does not!! At least not with linux software raid, and I
don't believe it does on commodity hardware controllers either! (You
would be able to tell because the disk IO would be doubled.)
Obviously there is no way to tell which version of a story is correct if
you are not biased to believe one of the storytellers and distrust the
other. You would have to add a checksum layer for that. (And hope the
checksum isn't the part that got corrupted!)
Post by Ed W
Post by Stan Hoeppner
To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.
I'm quite familiar with the basics of these protocols. I'm also quite
familiar with the flaws in several implementations of "seemingly
straightforward protocols". More often than not, there's a pressing need
to get new devices onto the market before the competition has something
similar and you lose your advantage. More often than not, this results in
suboptimal implementations of all those fine protocols and algorithms. And
let's face it: flaws in error recovery routines often don't surface until
someone actually needs those routines. As long as drives (or any other
device) are functioning as expected, everything is all right. But as soon
as something starts to get flaky, error recovery has to kick in but may
just as well fail to do the right thing.

Just consider the real-world analogy of politicians. They do or say
something stupid every once in a while, and error recovery (a.k.a. damage
control) has to kick in. But even those well-trained professionals,
with decades of experience in the political arena, sometimes simply fail
to do the right thing. They may have overlooked some pesky details, or
they may take actions that don't have the expected outcome because...
indeed, things work differently in damage control mode, and the only law
you can trust is physics: you always go down when you can't stay on your
feet.

With hard drives, raid controllers, mainboards, data buses, it's exactly
the same. If _something_ isn't working as it should, how should we know
which part of it we _can_ trust?
Post by Ed W
Post by Stan Hoeppner
In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry,
and we'd all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
Isn't it just "worked around" by adding more layers of checksuming and
adding more redundancy into the mix? Don't believe this "storage industry"
because they tell you it's OK. It simply is not OK. You might want to talk
to people in the data and computing cluster business about their opinion
on "storage industry professionals"...

Timo's suggestion to add checksums to mailboxes/metadata could help to
(at least) report these types of failures. Re-reading from different
storage when available could also recover the data that got corrupted, but
I'm not sure what would be the best way to handle these situations. If you
know there is a corruption problem on one of your storage locations, you
might want to switch that to read-only asap. Automagically trying to
recover might not be the best thing to do. Given all kinds of different
use cases, I think that should at least be configurable :-P
--
Maarten
Stan Hoeppner
2012-04-14 03:31:04 UTC
Permalink
Post by Ed W
Post by Stan Hoeppner
In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry,
and we'd all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
So why are so many people getting excited about it now?
"So many"? I know of one person "getting excited" about it.

Data densities and overall storage sizes and complexity at the top end
of the spectrum are increasing at a faster rate than the
consistency/validation mechanisms. That's the entire point of the
various academic studies on the issue. Note that the one study required
a sample set of 1.5 million disk drives. If the phenomenon were a
regular occurrence as you would have everyone here believe, they could
have used a much smaller sample set.

Ed, this is an academic exercise. Academia leads industry. Almost
always has. Academia blows the whistle and waves hands, prompting
industry to take action.

There is nothing normal users need to do to address this problem. The
hardware and software communities will make the necessary adjustments to
address this issue before it filters down to the general user community
in a half decade or more--when normal users have a 10-20 drive array of
500TB to 1PB or more.

Having the prestigious degree that you do, you should already understand
the relationship between academic research and industry, and the
considerable lead times involved.
--
Stan
Ed W
2012-04-14 10:22:37 UTC
Permalink
Post by Stan Hoeppner
Post by Ed W
Post by Stan Hoeppner
In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry,
and we'd all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
So why are so many people getting excited about it now?
"So many"? I know of one person "getting excited" about it.
You love being vague, don't you? Go on, I'll bite again, do you mean
yourself?

:-)
Post by Stan Hoeppner
Data densities and overall storage sizes and complexity at the top end
of the spectrum are increasing at a faster rate than the
consistency/validation mechanisms. That's the entire point of the
various academic studies on the issue.
Again, you love being vague. By your dismissive "academic studies"
phrase, do you mean studies done on a major industrial player, ie NetApp
in this case? Or do you mean that it's rubbish because they asked
someone with some background in statistics to do the work, rather than
asking someone sitting nearby in the office to do it?

I don't think the researcher broke into NetApp to do this research, so
we have to conclude that the industrial partner was onboard. NetApp
seem to do a bunch of engineering of their own (got enough patents..)
that I think we can safely assume they very much do their own research
on this and it's not just "academic"... I doubt they publish all their
own internal research, be thankful you got to see some of the results
this way...
Post by Stan Hoeppner
Note that the one study required
a sample set of 1.5 million disk drives. If the phenomenon were a
regular occurrence as you would have everyone here believe, they could
have used a much smaller sample set.
Sigh... You could have criticised the study as under-representative if
it had a small number of drives, and now you criticise a large study for
having too many observations...

You cannot have "too many" observations when measuring a small and
unpredictable phenomenon...

Where does it say that they could NOT have reproduced this study with
just 10 drives? If you have 1.5 million available, why not use all the
results??
Post by Stan Hoeppner
Ed, this is an academic exercise. Academia leads industry. Almost
always has. Academia blows the whistle and waves hands, prompting
industry to take action.
Sigh... We are back to the start of the email thread again... Gosh you
seem to love arguing and muddying the water for no reason other than to have
the last word?

It's *trivial* to do a google search and hit *lots* of reports of
corruption in various parts of the system, from corrupting drivers, to
hardware which writes incorrectly, to operating system flaws. I just
found a bunch more in the Red Hat database today while looking for
something else. You yourself are very vocal on avoiding certain brands
of HD controller which have been rumoured to cause corrupted data...
(and thank you for revealing that kind of thing - it's very helpful)

Don't veer off at a tangent now: The *original* email this has spawned
is about a VERY specific point. RAID1 appears to offer less protection
against a class of error conditions than does RAID6. Nothing more,
nothing less. Don't veer off and talk about the minutiae of testing
studies at universities; this is a straightforward claim that you have
been jumping around and avoiding answering, with claims of needing to
educate me on SCSI protocols and other fatuous responses. Nor should you
deviate into discussing that RAID6 is inappropriate for many situations -
we all get that...
Post by Stan Hoeppner
There is nothing normal users need to do to address this problem.
...except sit tight and hope they don't lose anything important!

:-)
Post by Stan Hoeppner
Having the prestigious degree that you do, you should already understand
the relationship between academic research and industry, and the
considerable lead times involved.
I'm guessing you haven't attended higher education then? You are
confusing graduate and post-graduate systems...

Byee

Ed W
Stan Hoeppner
2012-04-14 03:48:07 UTC
Permalink
Post by Ed W
"you need to read 'those' articles again"
Referring to some unknown and hard to find previous emails is not the
same as answering?
The claim by ZFS/BTRFS authors and others is that data silently "bit
rots" on it's own.
Is it not a correct assumption that you read this in articles? If you
read this in books, scrolls, or chiseled tablets, my apologies for
assuming it was articles.
--
Stan
Ed W
2012-04-14 10:00:40 UTC
Permalink
Post by Stan Hoeppner
Post by Ed W
"you need to read 'those' articles again"
Referring to some unknown and hard to find previous emails is not the
same as answering?
The claim by ZFS/BTRFS authors and others is that data silently "bit
rots" on it's own.
Is it not a correct assumption that you read this in articles? If you
read this in books, scrolls, or chiseled tablets, my apologies for
assuming it was articles.
WHAT?!! The original context was that you wanted me to learn some very
specific thing that you accused me of misunderstanding, and then it
turns out that the thing I'm supposed to learn comes from re-reading
every email, every blog post, every video, every slashdot post, every
wiki, every ... that mentions ZFS's reason for including end to end
checksumming?!!

Please stop wasting our time and get specific

You have taken my email, which contained a specific question that has
been asked of you multiple times now, and yet you insist on answering
only irrelevant details, with a pointed and personal dig in each answer.
The rudeness is unnecessary, and your evasiveness does not fill
me with confidence that you actually know the answer...

For the benefit of anyone reading this via email archives or whatever, I
think the conclusion we have reached is that: modern systems are now a)
a complex sum of pieces, any of which can cause an error to be injected,
b) the level of error correction which was originally specified as being
sufficient is now starting to be reached in real systems, possibly even
consumer systems. There is no "solution"; however, the first step is to
enhance "detection". Various solutions have been proposed; all increase
cost or computation, or have some other disadvantage - however, one of the more
promising detection mechanisms is an end to end checksum, which will
then have the effect of augmenting ALL the steps in the chain, not just
one specific step. As of today, only a few filesystems offer this, roll
on more adopting it
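
As a concrete illustration -- a minimal sketch only, assuming a ZFS pool
named "tank" and a btrfs filesystem mounted at /srv/mail, both
hypothetical names -- the scrub commands of such filesystems re-read
every block and verify it against the stored checksums:

$ zpool scrub tank
$ zpool status -v tank           # lists files with unrecoverable checksum errors

$ btrfs scrub start /srv/mail
$ btrfs scrub status /srv/mail   # reports checksum errors found/corrected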

Regards

Ed W
Stan Hoeppner
2012-04-15 00:05:19 UTC
Permalink
Post by Ed W
Post by Stan Hoeppner
Post by Ed W
"you need to read 'those' articles again"
Referring to some unknown and hard to find previous emails is not the
same as answering?
The claim by ZFS/BTRFS authors and others is that data silently "bit
rots" on it's own.
Is it not a correct assumption that you read this in articles? If you
read this in books, scrolls, or chiseled tablets, my apologies for
assuming it was articles.
WHAT?!! The original context was that you wanted me to learn some very
specific thing that you accused me of misunderstanding, and then it
turns out that the thing I'm supposed to learn comes from re-reading
every email, every blog post, every video, every slashdot post, every
wiki, every ... that mentions ZFS's reason for including end to end
checksumming?!!
No, the original context was your town crier statement that the sky is
falling due to silent data corruption. I pointed out that this is not
the case, currently, that most wouldn't see this until quite a few years
down the road. I provided facts to back my statement, which you didn't
seem to grasp or comprehend. I pointed this out and your top popped
with a cloud of steam.
Post by Ed W
Please stop wasting our time and get specific
Whose time am I wasting, Ed? You're the primary person on this list
who wastes everyone's time with these drawn out threads, usually
unrelated to Dovecot. I have been plenty specific. The problem is you
lack the knowledge and understanding of hardware communication. You're
upset because I'm not pointing out the knowledge you seem to lack? Is
that not a waste of everyone's time? Would that not be even "more
insulting"? Causing even more excited/heated emails from you?
Post by Ed W
You have taken my email, which contained a specific question that has
been asked of you multiple times now, and yet you insist on answering
only irrelevant details, with a pointed and personal dig in each answer.
The rudeness is unnecessary, and your evasiveness does not fill
me with confidence that you actually know the answer...
Ed, I have not been rude. I've been attempting to prevent you dragging
us into the mud, which you've done, as you often do. How specific would
you like me to get? This is what you seem to be missing:

Drives perform per sector CRC before transmitting data to the HBA. ATA,
SATA, SCSI, SAS, fiber channel devices and HBAs all perform CRC on wire
data. The PCI/PCI-X/PCIe buses/channels and Southbridge all perform CRC
on wire data. HyperTransport, and Intel's proprietary links also
perform CRC on wire transmissions. Server memory is protected by ECC,
some by ChipKill which can tolerate double bit errors.
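
One place where that wire-level CRC becomes visible to an admin -- a
hedged example, assuming a SATA drive at /dev/sda (hypothetical) with
smartmontools installed -- is SMART attribute 199, UDMA_CRC_Error_Count,
which increments whenever the link-level CRC catches a transfer error,
the classic bad-cable symptom:

$ smartctl -A /dev/sda | grep -i crc   # attribute 199 counts link CRC errors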

With today's systems and storage densities, with error correcting code
on all data paths within the system, and on the drives themselves,
"silent data corruption" is not an issue--in absence of defective
hardware or a bug, which are not relevant to the discussion.
Post by Ed W
For the benefit of anyone reading this via email archives or whatever, I
think the conclusion we have reached is that: modern systems are now a)
a complex sum of pieces, any of which can cause an error to be injected,
Errors occur all the time. And they're corrected nearly all of the
time, on modern complex systems. Silent errors do not occur frequently,
usually not at all, on most modern systems.
Post by Ed W
b) the level of error correction which was originally specified as being
sufficient is now starting to be reached in real systems,
FSVO 'real systems'. The few occurrences of "silent data corruption"
I'm aware of have been documented in academic papers published by
researchers working at taxpayer-funded institutions. In the case of
CERN, the problem was a firmware bug in the Western Digital drives that
caused an issue with the 3Ware controllers. This kind of thing happens
when using COTS DIY hardware in the absence of proper load validation
testing. So this case doesn't really fit the Henny-penny silent data
corruption scenario, as a firmware bug caused it--one that should have
been caught and corrected during testing.

In the other cases I'm aware of, all were HPC systems which generated
SDC under extended high loads, and these SDCs nearly all occurred
somewhere other than the storage systems--CPUs, RAM, interconnect, etc.
HPC apps tend to run the CPUs, interconnects, storage, etc, at full
bandwidth for hours at a time, across tens of thousands of nodes, so the
probability of SDC is much higher simply due to scale.
Post by Ed W
possibly even
consumer systems.
Possibly? If you're going to post pure conjecture why not say "possibly
even iPhones or Androids"? There's no data to back either claim. Stick
to the facts.
Post by Ed W
There is no "solution", however, the first step is to
enhance "detection". Various solutions have been proposed, all increase
cost, computation or have some disadvantage - however, one of the more
promising detection mechanisms is an end to end checksum, which will
then have the effect of augmenting ALL the steps in the chain, not just
one specific step. As of today, only a few filesystems offer this, roll
on more adopting it
So after all the steam blowing, we're back to where we started. I
disagree with your assertion that this is an issue that we--meaning
"average" users not possessing 1PB storage systems or massive
clusters--need to be worried about TODAY. I gave sound reasons as to
why this is the case. You've given us 'a couple of academic papers say
the sky is falling so I'm repeating the sky is falling', apparently
without truly understanding the issue.

The data available and the experience of the vast majority of IT folks
backs my position--which is why that's my position. There is little to
no data supporting your position.

I say this isn't going to be an issue for average users, if at all, for
a few years to come. You say it's here now. That's a fairly minor
point of disagreement to cause such a heated (on your part) lengthy
exchange.

BTW, if you see anything I've stated as rude you've apparently not been
on the Interwebs long. ;)
--
Stan
Maarten Bezemer
2012-04-13 19:10:04 UTC
Permalink
Post by Ed W
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors but
the drives are returning different data that each think is correct or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
It quite clearly can??!
I totally agree with Ed here. Drives sure can and sometimes really do
return different data, without reporting errors. Also, data can get
corrupted on any of the buses or chips it passes through.

The math about 10^15 or 10^16 and all that stuff is not only about array
sizes. It's also about data transfer.
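
A rough back-of-envelope, assuming those exponents refer to the usual
unrecoverable-read-error specs of one error per 10^14 bits (consumer) or
10^15 bits (enterprise) -- an assumption, as the figures aren't spelled
out here: 10^14 bits is about 12.5TB, so a single full read of a 12TB
array already has an expected error count near 1, and at 10^15 bits
(~125TB) you'd still expect one error per roughly ten full scrubs of
that array. Those errors are at least reported; silent errors are far
rarer, but the same scaling with size and transfer volume applies.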

I've seen silent corruption on a few systems myself. (Luckily, only 3
times in a couple years.) Those systems were only in the 2TB-5TB size
category, which is substantially lower than the 67TB claimed elsewhere.
Yet, statistically, it's well within normal probability levels.

Linux mdraid only reads one mirror as long as the drives don't return an
error. Easy to check: the read speeds are way beyond a single drive's read
speed. When the kernel would have to read all (possibly more than two)
mirrors, and compare them, and make a decision based on this comparison,
things would be horribly slow. Hardware raid typically uses this exact
same approach. This goes for Areca, 3ware, LSI, which cover most of the
regular (i.e. non-SAN) professional hardware raid setups.

If you don't believe it, just don't take my word for it but test it for
yourself. Cleanly power down a raid1 array, take the individual drives,
put them into a simple desktop machine, and write different data to both,
using some raw disk writing tool like dd. Then put the drives back into
the raid1 array, power it up, and re-read the information. You'll see data
from both drives will be intermixed as parts of the reads come from one
disk, and parts come from the other. Only when you order the raid array to
do a verification pass will it start screaming and yelling. At least, I
hope it will...
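
A minimal sketch of that experiment on Linux md -- hypothetical device
names, scratch disks only:

$ mdadm --stop /dev/md0                       # cleanly stop the RAID1 array
$ dd if=/dev/urandom of=/dev/sdb bs=1M count=1 seek=4096 oflag=direct
                                              # clobber a region on ONE member
$ mdadm --assemble /dev/md0 /dev/sdb /dev/sdc
$ echo check > /sys/block/md0/md/sync_action  # start a verification pass
$ cat /sys/block/md0/md/mismatch_cnt          # non-zero: the mirrors disagree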


But as explained elsewhere, silent corruption can occur at numerous
places. If you don't have an explicit checksumming/checking mechanism,
there are indeed cases that will haunt you if you don't do regular
scrubbing or at least do regular verification runs. Heck, that's why Linux
mdadm comes with cron jobs to do just that, and hardware raid controllers
have similar scheduling capabilities.
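
On Debian-family systems, for instance, the mdadm package installs a
monthly cron job for exactly this -- paths may differ on other distros:

$ cat /etc/cron.d/mdadm                # the distro-provided schedule
$ /usr/share/mdadm/checkarray --all    # or kick off a check by hand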

Of course, scrubbing/verification is not going to magically protect you
from all problems. But you would at least get notifications if it detects
problems.
Post by Ed W
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong but there isn't any way for it to
know which one of the sectors was a good read and which isn't, or is
there?
Yes it can, and it does.
No it definitely does not!! At least not with linux software raid and I don't
believe on commodity hardware controllers either! (You would be able to tell
because the disk IO would be doubled)
Obviously there is no way to tell which version of a story is correct if
you are not biased to believe one of the storytellers and distrust the
other. You would have to add a checksum layer for that. (And hope the
checksum isn't the part that got corrupted!)
Post by Ed W
Post by Stan Hoeppner
To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.
I'm quite familiar with the basics of these protocols. I'm also quite
familiar with the flaws in several implementations of "seemingly
straightforward protocols". More often than not, there's a pressing need
to get new devices onto the market before the competition has something
similar and you lose your advantage. More often than not, this results in
suboptimal implementations of all those fine protocols and algorithms. And
let's face it: flaws in error recovery routines often don't surface until
someone actually needs those routines. As long as drives (or any other
device) are functioning as expected, everything is all right. But as soon
as something starts to get flaky, error recovery has to kick in but may
just as well fail to do the right thing.

Just consider the real-world analogy of politicians. They do or say
something stupid every once in a while, and error recovery (a.k.a. damage
control) has to kick in. But even though those well trained professionals,
having decades of experience in the political arena, sometimes simply fail
to do the right thing. They may have overlooked some pesky details, or
they may take actions that don't have the expected outcome because...
indeed, things work differently in damage control mode, and the only law
you can trust is physics: you always go down when you can't stay on your
feet.

With hard drives, raid controllers, mainboards, data buses, it's exactly
the same. If _something_ isn't working as it should, how should we know
which part of it we _can_ trust?
Post by Ed W
Post by Stan Hoeppner
In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry, and we'd
all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
Isn't it just "worked around" by adding more layers of checksuming and
adding more redundancy into the mix? Don't believe this "storage industry"
because they tell you it's OK. It simply is not OK. You might want to talk
to people in the data and computing cluster business about their opinion
on "storage industry professionals"...

Timo's suggestion to add checksums to mailboxes/metadata could help to
(at least) report these types of failures. Re-reading from different
storage when available could also recover the data that got corrupted, but
I'm not sure what would be the best way to handle these situations. If you
know there is a corruption problem on one of your storage locations, you
might want to switch that to read-only asap. Automagically trying to
recover might not be the best thing to do. Given all kinds of different
use cases, I think that should at least be configurable :-P
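
Until the application grows such checksums, a crude approximation can be
cron'd by hand -- a sketch only, assuming offloaded mdbox storage under
/var/mail/alt (a hypothetical path), and only meaningful for files that
are expected to be immutable:

$ find /var/mail/alt -type f | xargs sha256sum > /root/mail.sums
$ sha256sum --quiet -c /root/mail.sums   # later: prints only changed files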
--
Maarten
Jan-Frode Myklebust
2012-04-14 10:04:22 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors but
the drives are returning different data that each think is correct or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
It has happened to me, with RAID5 not RAID1. It was a firmware bug
in the raid controller that caused the RAID array to go silently
corrupted. The HW reported everything green -- but the filesystem was
reporting lots of strange errors.. This LUN was part of a larger
filesystem striped over multiple LUNs, so parts of the fs were OK, while
other parts were corrupt.

It was this bug:

http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_anyos_anycpu.chg
- Fix 432525 - CR139339 Data corruption found on drive after
reconstruct from GHSP (Global Hot Spare)


<snip>
Post by Stan Hoeppner
In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry, and we'd
all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
Look at the plans for your favorite fs:

http://youtu.be/FegjLbCnoBw

They're planning on doing metadata checksumming to be sure they don't
receive corrupted metadata from the backend storage, and say that data
validation is a storage subsystem *or* application problem.

Hardly a solved problem..
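
For the record, that metadata checksumming eventually shipped as the
XFS v5 on-disk format; with a sufficiently new xfsprogs -- not available
at the time of this thread -- it is enabled at mkfs time:

$ mkfs.xfs -m crc=1 /dev/md0     # v5 format, CRCs on all metadata
$ xfs_repair -n /dev/md0         # no-modify check; CRC failures get reported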


-jf
Stan Hoeppner
2012-04-14 22:39:55 UTC
Permalink
Post by Jan-Frode Myklebust
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors but
the drives are returning different data that each think is correct or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
It has happened to me, with RAID5 not RAID1. It was a firmware bug
in the raid controller that caused the RAID array to go silently
corrupted. The HW reported everything green -- but the filesystem was
reporting lots of strange errors.. This LUN was part of a larger
filesystem striped over multiple LUNs, so parts of the fs was OK, while
other parts was corrupt.
http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_anyos_anycpu.chg
- Fix 432525 - CR139339 Data corruption found on drive after
reconstruct from GHSP (Global Hot Spare)
Note my comments were specific to the RAID1 case, or a concatenated set
of RAID1 devices. And note the discussion was framed around silent
corruption in the absence of bugs and hardware failure, or should I say,
where no bugs or hardware failures can be identified.
Post by Jan-Frode Myklebust
<snip>
Post by Stan Hoeppner
In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry, and we'd
all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
http://youtu.be/FegjLbCnoBw
They're planning on doing metadata checksumming to be sure they don't
receive corrupted metadata from the backend storage, and say that data
validation is a storage subsystem *or* application problem.
You can't make sure you don't receive corrupted data. You take steps to
mitigate the negative effects of it if and when it happens. The XFS
devs are planning this for the future. If the problem was here now,
this work would have already been done.
Post by Jan-Frode Myklebust
Hardly a solved problem..
It has been up to this point. The issue going forward is that current
devices don't employ sufficient consistency checking to meet future
needs. And the disk drive makers apparently don't want to consume the
additional bits required to properly do this in the drives.

If they'd dedicate far more bits to ECC we may not have this issue. But
since it appears this isn't going to change, kernel, filesystem and
application developers are taking steps to mitigate it. Again, this
"silent corruption" issue as described in the various academic papers is
a future problem for most, not a current problem. It's only a current
problem for those at the bleeding edge of large scale storage. Note
that firmware bugs in individual products aren't part of this issue.
Those will be with us forever in various products because humans make
mistakes. No amount of filesystem or application code can mitigate
those. The solution to that is standard best practices: snapshots,
backups, or even mirroring all your storage across different vendor
hardware.
--
Stan
Jim Lawson
2012-04-13 13:12:02 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors but
the drives are returning different data that each think is correct or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
?! Stan, are you really saying that silent data corruption "simply
can't happen"? People who have been studying this have been talking
about it for years now. It can happen in the same way that Emmanuel
describes.

USENIX FAST08:

http://static.usenix.org/event/fast08/tech/bairavasundaram.html

CERN:

http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf

LANL:

http://institute.lanl.gov/resilience/conferences/2009/HPCResilience09_Michalak.pdf

There are others if you search for it. This problem has been well-known
in large (petabyte+) data storage systems for some time.

Jim
Ed W
2012-04-13 15:31:35 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
What I meant wasn't the drive throwing uncorrectable read errors but
the drives are returning different data that each think is correct or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
It quite clearly can??!

Just grab your drive, lever the connector off a little bit until it's a
bit flaky and off you go? *THIS* type of problem I have heard of and
you can find easy examples with a quick google search of any hobbyist
storage board. Other very common examples are problems due to failing
PSUs and other interference, causing explicit disk errors (and once the
error rate goes up, some errors will make it past the checksum)

Note this is NOT what I was originally asking about. My interest is
more about when the hardware is working reliably and as you agree, the
error levels are vastly lower. However, it would be incredibly foolish
to claim that it's not trivial to construct a scenario where bad
hardware causes plenty of silent corruption?
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
If the controller simply returns the fastest result, it could be the
bad sector and that doesn't protect the integrity of the data right?
I already answered this in a previous post.
Not obviously?!

I will also add my understanding that linux software RAID1,5&6 *DO NOT*
read all disks and hence will not be aware when disks have different
data. In fact with software raid you need to run a regular "scrub" job
to check this consistency.

I also believe that most commodity hardware raid implementations work
exactly the same way and a background scrub is needed to detect
inconsistent arrays. However, feel free to correct that understanding?
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
if the controller gets 1st half from one drive and 2nd half from the
other drive to speed up performance, we could still get the corrupted
half and the controller itself still can't tell if the sector it got
was corrupted isn't it?
No, this is not correct.
I definitely think you are wrong and Emmanuel is right?

If the controller gets a good read from the disk then it will trust that
read and will NOT check the result with the other disk (or parity in the
case of RAID5/6). If that read was incorrect for some reason then the
data will be passed as good.
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong but there isn't any way for it to
know which one of the sectors was a good read and which isn't, or is
there?
Yes it can, and it does.
No it definitely does not!! At least not with linux software raid and I
don't believe on commodity hardware controllers either! (You would be
able to tell because the disk IO would be doubled)

Linux software raid 1 isn't that smart: it reads only one disk and
trusts the answer if the read did not trigger an error. It does not
check the other disk except during an explicit disk scrub.
Post by Stan Hoeppner
Emmanuel, Ed, we're at a point where I simply don't have the time nor
inclination to continue answering these basic questions about the base
level functions of storage hardware.
You mean those "answers" like:
"I answered that in another thread"
or
"you need to read 'those' articles again"

Referring to some unknown and hard to find previous emails is not the
same as answering?

Also you are wandering off at extreme tangents. The question is simple:

- Disk 1 Read good, checksum = A
- Disk 2 Read good, checksum = B

Disks are a raid 1 pair. How do we know which disk is correct? Please
specify the raid 1 implementation and mechanism used with any answer.
Post by Stan Hoeppner
To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.
I really think not... A simple statement of:

- Each sector on disk has a certain sized checksum
- Controller checks checksum on read
- Sent back over SATA connection, with a certain sized checksum
- After that you are on your own vs corruption

...Should cover it I think?
Post by Stan Hoeppner
In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry, and we'd
all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
So why are so many people getting excited about it now?

Note, there have been plenty of shoddy disk controller implementations
before today - ie there exists hardware on sale with *known* defects.
Despite that the industry continues without collapse. Now you claim
that if corruption is silent and people only tend to notice it much
later and under certain edge conditions that this can't be possible
because it should cause the industry to collapse..???

...Not buying your logic...

Ed W
Stan Hoeppner
2012-04-13 12:33:19 UTC
Permalink
Post by Emmanuel Noobadmin
Post by Emmanuel Noobadmin
I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other but it
wouldn't know which is the accurate copy but that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention I would think.
When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it. Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years.
What I meant wasn't the drive throwing uncorrectable read errors but
the drives are returning different data that each think is correct or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
bad cable on one.
This simply can't happen. What articles are you referring to? If the
author is stating what you say above, he simply doesn't know what he's
talking about.
Post by Emmanuel Noobadmin
If the controller simply returns the fastest result, it could be the
bad sector and that doesn't protect the integrity of the data right?
I already answered this in a previous post.
Post by Emmanuel Noobadmin
if the controller gets 1st half from one drive and 2nd half from the
other drive to speed up performance, we could still get the corrupted
half and the controller itself still can't tell if the sector it got
was corrupted isn't it?
No, this is not correct.
Post by Emmanuel Noobadmin
If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong but there isn't any way for it to
know which one of the sectors was a good read and which isn't, or is
there?
Yes it can, and it does.

Emmanuel, Ed, we're at a point where I simply have neither the time nor
the inclination to continue answering these basic questions about the base
level functions of storage hardware. You both have serious
misconceptions about how many things work. To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.

I don't mind, and actually enjoy, passing knowledge. But the amount
that seems to be required here to bring you up to speed is about 2^15
times above and beyond the scope of a mailing list conversation.

In closing, I'll simply say this: If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
in storage or transmission to occur, there would be no storage industry, and we'd
all still be using pen and paper. The questions you're asking were
solved by hardware and software engineers decades ago. You're fretting
and asking about things that were solved decades ago.
--
Stan
Ed W
2012-04-12 10:58:52 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the
controller takes care of sector integrity. RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.
How would the controller ensure sector integrity unless it is writing
additional checksum information to disk? I thought only a few
filesystems like ZFS does the sector checksum to detect if any data
corruption occurred. I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other but it
wouldn't know which is the accurate copy but that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention I would think.
When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it. Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years. The mindset here is that anyone would rather spend $150-$2500
on a replacement drive than take a chance with his/her valuable
data.
I'm asking a subtly different question.

The claim by ZFS/BTRFS authors and others is that data silently "bit
rots" on it's own. The claim is therefore that you can have a raid1 pair
where neither drive reports a hardware failure, but each gives you
different data? I can't personally claim to have observed this, so it
remains someone else's theory... (for background my experience is
simply: RAID10 for high performance arrays and RAID6 for all my personal
data - I intend to investigate your linear raid idea in the future though)
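
A minimal sketch of that linear-over-mirrors layout with md alone --
hypothetical devices, with software RAID1 standing in for the hardware
RAID1 pairs discussed earlier:

$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
$ mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/md1 /dev/md2
$ mkfs.xfs /dev/md0              # format with the defaults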

I do agree that if one drive reports a read error, then it's quite easy
to guess which pair of the array is wrong...

Just as an aside, I don't have a lot of failure experience. However,
the few I have had (perhaps 6-8 events now) suggest that there is a massive
correlation in failure time with RAID1, eg one pair I had lasted perhaps
2 years and then both failed within 6 hours of each other. I also had a
bad experience with RAID 5 that wasn't being scrubbed regularly and when
one drive started reporting errors (ie lack of monitoring meant it had
been bad for a while), the rest of the array turned out to be a
patchwork of read errors - linux raid then turns out to be quite fragile
in the presence of a small number of read failures and it's extremely
difficult to salvage the 99% of the array which is ok due to the disks
getting kicked out... (of course regular scrubs would have prevented
getting so deep into that situation - it was a small cheap nas box
without such features)

Ed W
Emmanuel Noobadmin
2012-04-13 06:12:48 UTC
Permalink
Post by Emmanuel Noobadmin
I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other but it
wouldn't know which is the accurate copy but that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention I would think.
When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it. Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years.
What I meant wasn't the drive throwing uncorrectable read errors, but
the drives returning different data that each thinks is correct, or
both sending the correct data but one copy getting corrupted on the
fly. After reading the articles posted, maybe the correct term would be
the controller receiving silently corrupted data, say due to a bad
cable on one.

If the controller simply returns the fastest result, it could be the
bad copy, and that doesn't protect the integrity of the data, right?

If the controller gets the 1st half from one drive and the 2nd half
from the other drive to speed up performance, we could still get the
corrupted half, and the controller itself still can't tell whether the
sector it got was corrupted, can it?

If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong, but there isn't any way for it to
know which sector was a good read and which wasn't, or is there?
Stan Hoeppner
2012-04-12 10:20:31 UTC
Permalink
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the
controller takes care of sector integrity. RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.
How would the controller ensure sector integrity unless it is writing
additional checksum information to disk? I thought only a few
filesystems like ZFS do sector checksums to detect whether any data
corruption occurred. I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other, but it
wouldn't know which is the accurate copy, so that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention, I would think.
When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it. Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years. The mindset here is that anyone would rather spend $150-$2500
on a replacement drive than take a chance with his/her valuable data.

Yes, I typed $2500. EMC charges over $2000 for a single Seagate disk
drive with an EMC label and serial# on it. The serial number is what
prevents one from taking the same off-the-shelf Seagate drive at $300
and mounting it in a $250,000 EMC array chassis. The controller
firmware reads the S/N from each connected drive and will not allow
foreign drives to be used. HP, IBM, Oracle/Sun, etc. do this as well.
That is why they make lots of profit, and why I prefer open storage
systems.
--
Stan
Ed W
2012-04-12 11:45:51 UTC
Permalink
Post by Stan Hoeppner
Post by Ed W
Re XFS. Have you been watching BTRFS recently?
I will concede that despite the authors considering it production ready,
I won't be using it for my servers just yet. However, it benchmarks
fairly similarly to XFS on single disks, and in certain cases
(multi-threaded performance) can be somewhat better. I haven't seen any
benchmarks on larger disk arrays yet, e.g. 6+ disks, so no idea how it
scales up. Basically what I have seen seems "competitive"
Links?
http://btrfs.ipv5.de/index.php?title=Main_Page#Benchmarking

See the regular Phoronix benchmarks in particular. However, I believe
these are all single-disk?
Post by Stan Hoeppner
Post by Ed W
I don't have such hardware spare to benchmark, but I would be interested
to hear from someone who benchmarks your RAID1+linear+XFS suggestion,
especially if they have compared a cutting edge btrfs kernel on the same
array?
http://btrfs.boxacle.net/repository/raid/history/History_Mail_server_simulation._num_threads=128.html
This is with an 8-wide LVM stripe over 8 17-drive hardware RAID0 arrays.
If the disks had been set up as a concat of 68 RAID1 pairs, XFS would
have turned in significantly higher numbers, anywhere from a 100%
increase to 500%.
My instinct is that this is an irrelevant benchmark for BTRFS because
its performance characteristics for these workloads have changed so
significantly? I would be far more interested in a 3.2 and then a
3.6/3.7 benchmark in a year's time

In particular, recent benchmarks on Phoronix show btrfs exceeding XFS
performance on heavily threaded benchmarks - however, I doubt this is
representative of performance on a multi-disk benchmark?
Post by Stan Hoeppner
It would be nice to see these folks update these
results with a 3.2.6 kernel, as both BTRFS and XFS have improved
significantly since 2.6.35. EXT4 and JFS have seen little performance
work since.
My understanding is that there was a significant multi-thread
performance boost for EXT4 within the last year or so? I don't
have a link to hand, but someone did some work to reduce lock contention
(??) which I seem to recall made a very large difference on multi-user
or multi-cpu workloads? I seem to recall that the summary was that it
allowed Ext4 to scale up to a good fraction of XFS performance on
"medium sized" systems? (I believe that XFS still continues to scale far
better than anything else on large systems)

The point is that I think it's a bit unfair to say that little has
changed on Ext4? It still seems to be developing faster than
"maintenance only"

However, we're well OT... The original question was: has anyone tried
very recent BTRFS on a multi-disk system? Seems like the answer is no.
My proposal is that it may be worth watching in the future

Cheers

Ed W

P.S. I have always been intrigued by the idea that a COW based
filesystem could potentially implement much faster "RAID" parity,
because it can avoid reading the whole stripe. The idea is that you
treat unallocated space as "zero", which means you can compute the
incremental parity with only a read/write of the parity value: since
the old data reads as zero, the new parity is simply the old parity
XORed with the new data (and with a COW filesystem you only ever update
by rewriting to new "zeroed" space). I had in mind something like a
fixed parity disk (RAID4?) and allowing the parity disk to be "write
behind" cached in RAM (i.e. exposed to the risk of: power fails AND a
data disk fails at the same time). My code may not be following along
for a while though...
Emmanuel Noobadmin
2012-04-12 02:23:19 UTC
Permalink
Post by Stan Hoeppner
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the
controller takes care of sector integrity. RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.
How would the controller ensure sector integrity unless it is writing
additional checksum information to disk? I thought only a few
filesystems like ZFS do sector checksums to detect whether any data
corruption occurred. I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other, but it
wouldn't know which is the accurate copy, so that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention, I would think.
Adrian Minta
2012-04-11 20:48:00 UTC
Permalink
On 04/11/12 19:50, Ed W wrote:

...
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in
the event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
RAID6 is very slow for write operations. That's why it's the worst
choice for maildir.
Stan Hoeppner
2012-04-12 01:18:08 UTC
Permalink
Post by Ed W
Re XFS. Have you been watching BTRFS recently?
I will concede that despite the authors considering it production ready,
I won't be using it for my servers just yet. However, it benchmarks
fairly similarly to XFS on single disks, and in certain cases
(multi-threaded performance) can be somewhat better. I haven't seen any
benchmarks on larger disk arrays yet, e.g. 6+ disks, so no idea how it
scales up. Basically what I have seen seems "competitive"
Links?
Post by Ed W
I don't have such hardware spare to benchmark, but I would be interested
to hear from someone who benchmarks your RAID1+linear+XFS suggestion,
especially if they have compared a cutting edge btrfs kernel on the same
array?
http://btrfs.boxacle.net/repository/raid/history/History_Mail_server_simulation._num_threads=128.html

This is with an 8-wide LVM stripe over 8 17-drive hardware RAID0 arrays.
If the disks had been set up as a concat of 68 RAID1 pairs, XFS would
have turned in significantly higher numbers, anywhere from a 100%
increase to 500%. It's hard to say because the Boxacle folks didn't
show the XFS AG config they used. The concat+RAID1 setup can decrease
disk seeks by many orders of magnitude vs striping. Everyone knows that
as seeks go down, IOPS go up. Even with this very suboptimal disk setup,
XFS still trounces everything but JFS, which is a close 2nd. BTRFS is
way down in the pack. It would be nice to see these folks update these
results with a 3.2.6 kernel, as both BTRFS and XFS have improved
significantly since 2.6.35. EXT4 and JFS have seen little performance
work since. In fact JFS has seen no commits except bug fixes and changes
to allow compiling with recent kernels.
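
For illustration, a minimal sketch of such a concat, assuming four
hardware RAID1 pairs exposed to Linux as /dev/sdb through /dev/sde
(device names are placeholders; scale the device list to suit):

$ mdadm --create /dev/md0 --level=linear --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde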
Post by Ed W
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the
controller takes care of sector integrity. RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.
--
Stan
Ed W
2012-04-11 16:50:09 UTC
Permalink
Re XFS. Have you been watching BTRFS recently?

I will concede that despite the authors considering it production ready,
I won't be using it for my servers just yet. However, it benchmarks
fairly similarly to XFS on single disks, and in certain cases
(multi-threaded performance) can be somewhat better. I haven't seen any
benchmarks on larger disk arrays yet, e.g. 6+ disks, so no idea how it
scales up. Basically what I have seen seems "competitive"

I don't have such hardware spare to benchmark, but I would be interested
to hear from someone who benchmarks your RAID1+linear+XFS suggestion,
especially if they have compared a cutting edge btrfs kernel on the same
array?

One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...

Regards

Ed W
Stan Hoeppner
2012-04-11 07:18:49 UTC
Permalink
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
All other required parts are in the Wish List. I've not written
assembly instructions. I figure anyone who would build this knows what
s/he is doing.
Price today: $5,376.62
This price looks like something I might be able to push through
It's pretty phenomenally low considering what all you get, especially 20
enterprise class drives.
Post by Emmanuel Noobadmin
although I'll probably have to go SATA instead of SAS due to the cost of
keeping spares.
The 10K drives I mentioned are SATA not SAS. WD's 7.2K RE and 10K
Raptor series drives are both SATA but have RAID-specific firmware,
better reliability, longer warranties, etc. The RAID-specific firmware
is why both are tested and certified by LSI with their RAID cards.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
you a 10TB net Linux device and 10 stripe spindles of IOPS and
bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read
throughput, however parallel write throughput will be at least 3-6x
slower than RAID10, which is why nobody uses RAID6 for transactional
workloads.
Not likely to go with RAID 5 or 6 due to concerns about the
uncorrectable read error risks on rebuild with large arrays. Is the
Not to mention rebuild times for large width RAID5/6.
Post by Emmanuel Noobadmin
MegaRAID being used as the actual RAID controller or just as an HBA?
It's a top shelf RAID controller, 512MB cache, up to 240 drives, SSD
support, the works. It's an LSI "Feature Line" card:
http://www.lsi.com/products/storagecomponents/Pages/6GBSATA_SASRAIDCards.aspx

The specs:
http://www.lsi.com/products/storagecomponents/Pages/MegaRAIDSAS9280-4i4e.aspx

You'll need the cache battery module for safe write caching, which I
forgot in the wish list (now added), $160:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118163&Tpk=LSIiBBU08

With your workload and RAID10 you should run with all 512MB configured
as write cache. Linux caches all reads so using any controller cache
for reads is a waste. Using all 512MB for write cache will increase
random write IOPS.
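
A hedged sketch of that cache policy using LSI's MegaCli tool (flags
from memory, so verify against your version's help output): write-back
on, controller read-ahead off since the Linux page cache handles reads:

$ MegaCli64 -LDSetProp WB -LAll -aAll
$ MegaCli64 -LDSetProp NORA -LAll -aAll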

Note the 9280 allows up to 64 LUNs, so you can do tiered storage within
this 20 bay chassis. For spares management you'd probably not want to
bother with two different sized drives.

I didn't mention the 300GB 10K Raptors previously due to their limited
capacity. Note they're only $15 more apiece than the 1TB RE4 drives in
the original parts list. For a total of $300 more you get the same 40%
increase in IOPS as the 600GB model, but you'll only have 3TB net space
after RAID10. If 3TB is sufficient space for your needs, that extra 40%
IOPS makes this config a no-brainer. The decreased latency of the 10K
drives will give a nice boost to VM read performance, especially when
using NFS. Write performance probably won't be much different due to
the generous 512MB write cache on the controller. I also forgot to
mention that with BBWC enabled you can turn off XFS barriers, which will
dramatically speed up Exim queues and Dovecot writes, all writes actually.
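
Turning barriers off is just a mount option; a minimal sketch, assuming
the filesystem is mounted at /srv/mail (placeholder path), and only
safe with a working BBU or flash-backed cache:

$ mount -o remount,nobarrier /srv/mail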

Again, you probably don't want the spares management overhead of two
different disk types on the shelf, but you could stick these 10K 300s in
the first 16 slots, and put the 2TB RE4 drives in the last 4 slots:
RAID10 on the 10K drives, RAID5 on the 2TB drives. This yields an
8-spindle high-IOPS RAID10 of 2.4TB and a lower-performance RAID5 of 6TB
for near-line storage such as your Dovecot alt storage, VM templates,
etc, 8.4TB net, 1.6TB less than the original 10TB setup. Total
additional cost is $920 for this setup. You'd have two XFS filesystems
(with quite different mkfs parameters).
Post by Emmanuel Noobadmin
I have been avoiding hardware RAID because of a really bad experience
with RAID 5 on an obsolete controller that eventually died without
replacement and couldn't be recovered. Since then, it's always been
RAID 1 and, after I discovered mdraid, using them purely as HBAs with
mdraid for the flexibility of being able to just pull the drives into
a new system if necessary without having to worry about the
controller.
Assuming you have the right connector configuration for your
drive/enclosure on the replacement card, you can usually swap out one
LSI RAID card with any other LSI RAID card in the same, or newer,
generation. It'll read the configuration metadata from the disks and be
up and running in minutes. This feature has been around all the way back
to the AMI/Mylex cards of the late 1990s. LSI acquired both companies,
who were #1 and #2 in RAID, which is why LSI is so successful today.
Back in those days LSI simply supplied the ASICs to AMI and Mylex. I
have an AMI MegaRAID 428, top of the line in 1998, lying around
somewhere. Still working when I retired it many years ago.

FYI, LSI is the OEM provider of RAID and SAS/SATA HBA ASIC silicon for
the tier 1 HBA and mobo markets on down. Dell, HP, IBM, Intel, Oracle
(Sun), Siemens/Fujitsu, all use LSI silicon and firmware. Some simply
rebadge OEM LSI cards with their own model and part numbers. IBM and
Dell specifically have been doing this rebadging for well over a decade,
long before LSI acquired Mylex and AMI. The Dell PERC/2 is a rebadged
AMI MegaRAID 428.

Software and hardware RAID each have their pros and cons. I prefer
hardware RAID for write cache performance and many administrative
reasons, including SAF-TE enclosure management (fault LEDs, alarms, etc)
so you know at a glance which drive has failed and needs replacing,
email and SNMP notification of events, automatic rebuild, configurable
rebuild priority, etc, etc, and good performance with striping and
mirroring. Parity RAID performance often lags behind md with heavy
workloads but not with light/medium. FWIW I rarely use parity RAID, due
to the myriad performance downsides.

For ultra high random IOPS workloads, or when I need a single filesystem
space larger than the drive limit or practical limit for one RAID HBA,
I'll stitch hardware RAID1 or small stripe width RAID 10 arrays (4-8
drives, 2-4 spindles) together with md RAID 0 or 1.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Both of the drives I've mentioned here are enterprise class drives,
feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
list. The price of the 600GB Raptor has come down considerably since I
designed this system, or I'd have used them instead.
Anyway, lots of options out there. But $6,500 is pretty damn cheap for a
quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
drives.
The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional
$6,410 you could add an external Norco SAS expander JBOD chassis and 24
more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
10k spindles of IOPS performance from 44 total drives. That's $13K for
a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, IBM
solution of similar specs, each of which will set you back at least 20
large.
Would this setup work well too for serving up VM images? I've been
trying to find a solution for the virtualized app server images as
well, but the distributed FSes currently are all bad with random
reads/writes it seems. XFS seems to be good with large files like db
and vm images with random internal write/read, so given my time
constraints, it would be nice to have a single configuration that
works generally well for all the needs I have to oversee.
Absolutely. If you set up these 20 drives as a single RAID10, soft/hard
or hybrid, with the LSI cache set to 100% write-back, and a single XFS
filesystem with 10 allocation groups and proper stripe alignment, you'll
get maximum performance for pretty much any conceivable workload.
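
As a rough sketch of that alignment, assuming a 64KB controller strip
size (check yours), 10 data spindles, and the RAID10 LUN at /dev/sdb:

$ mkfs.xfs -d agcount=10,su=64k,sw=10 /dev/sdb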

Your only limitations will be possible NFS or TCP tuning issues, and
maybe having only two GbE ports. For small random IOPS such as Exim
queues, Dovecot store, VM image IO, etc, the two GbE ports are plenty.
But if you add any large NFS file copies into the mix, such as copying
new VM templates or ISO images over, etc, or do backups over NFS instead
of directly on the host machine at the XFS level, then two bonded GbE
ports might prove a bottleneck.

The mobo has 2 PCIe x8 slots and one x4 slot. One of the x8 slots is an
x16 physical connector. You'll put the LSI card in the x16 slot. If
you mount the Intel SAS expander to the chassis as I do instead of in a
slot, you have one free x8 and one free x4 slot. Given the $250 price,
I'd simply add an Intel quad-port GbE NIC to the order. Link aggregate
all 4 ports on day one and use one IP address for the NFS traffic. Use
the two on board ports for management etc. This should give you a
theoretical 400MB/s of peak NFS throughput, which should be plenty no
matter what workload you throw at it.
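
A minimal Debian-style bonding sketch for those four ports, assuming
they show up as eth2-eth5 and the switch speaks 802.3ad (interface
names and addresses are placeholders):

# /etc/network/interfaces
auto bond0
iface bond0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bond-slaves eth2 eth3 eth4 eth5
    bond-mode 802.3ad
    bond-miimon 100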
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Note the chassis I've spec'd have single PSUs, not the dual or triple
redundant supplies you'll see on branded hardware. With a relatively
stable climate controlled environment and a good UPS with filtering,
quality single supplies are fine. In fact, in the 4U form factor single
supplies are usually more reliable due to superior IC packaging and
airflow through the heatsinks, not to mention much quieter.
Same reason I do my best to avoid 1U servers, the space/heat issues
worry me. Yes, I'm guilty of worrying too much but that has saved me
on several occasions.
Just about every 1U server I've seen that's been racked for 3 or more
years has warped under its own weight. I even saw an HPQ 2U that was
warped this way, badly warped. In this instance the slide rail bolts
had never been tightened down to the rack--could spin them by hand.
Since the chassis side panels weren't secured, and there was lateral
play, the weight of the 6 drives caused the side walls of the case to
fold into a mild trapezoid, which allowed the bottom and top panels to
bow. Let this be a lesson boys and girls: always tighten your rack
bolts. :)
--
Stan
Adrian Minta
2012-04-10 10:22:18 UTC
Permalink
Post by Stan Hoeppner
Interestingly, I designed a COTS server back in January to handle at
least 5k concurrent IMAP users, using best of breed components. If you
or someone there has the necessary hardware skills, you could assemble
this system and simply use it for NFS instead of Dovecot. The parts
secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985
Don't forget the Battery Backup Unit for the RAID card!!!
Stan Hoeppner
2012-04-11 22:46:44 UTC
Permalink
Post by Adrian Minta
Post by Stan Hoeppner
Interestingly, I designed a COTS server back in January to handle at
least 5k concurrent IMAP users, using best of breed components. If you
or someone there has the necessary hardware skills, you could assemble
this system and simply use it for NFS instead of Dovecot. The parts
secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985
Don't forget the Battery Backup Unit for the RAID card!!!
Heh, thanks for the reminder Adrian. :)

I got to your email a little late--already corrected the omission. Yes,
battery or flash backup for the RAID cache is always a necessity when
doing write-back caching.
--
Stan
Emmanuel Noobadmin
2012-04-10 06:09:18 UTC
Permalink
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
So I have to make do with OTS commodity parts and free software for
the most part.
OTS meaning you build your own systems from components? Too few in the
business realm do so today. :(
For the in-house stuff and budget customers, yes; in fact both email
servers are on seconded hardware that started life as something else.
I spec HP servers for our app servers for customers who are willing to
pay for their own colocated or onsite servers, but there are still
customers who balk at the cost and so go OTS or virtualized.
Post by Stan Hoeppner
SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
All other required parts are in the Wish List. I've not written
assembly instructions. I figure anyone who would build this knows what
s/he is doing.
Price today: $5,376.62
This price looks like something I might be able to push through,
although I'll probably have to go SATA instead of SAS due to the cost
of keeping spares.
Post by Stan Hoeppner
Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
you a 10TB net Linux device and 10 stripe spindles of IOPS and
bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read
throughput, however parallel write throughput will be at least 3-6x
slower than RAID10, which is why nobody uses RAID6 for transactional
workloads.
Not likely to go with RAID 5 or 6 due to concerns about the
uncorrectable read error risks on rebuild with large arrays. Is the
MegaRAID being used as the actual RAID controller or just as an HBA?

I have been avoiding hardware RAID because of a really bad experience
with RAID 5 on an obsolete controller that eventually died without
replacement and couldn't be recovered. Since then, it's always been
RAID 1 and, after I discovered mdraid, using them purely as HBAs with
mdraid for the flexibility of being able to just pull the drives into
a new system if necessary without having to worry about the
controller.
Post by Stan Hoeppner
Both of the drives I've mentioned here are enterprise class drives,
feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
list. The price of the 600GB Raptor has come down considerably since I
designed this system, or I'd have used them instead.
Anyway, lots of options out there. But $6,500 is pretty damn cheap for a
quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
drives.
The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional
$6,410 you could add an external Norco SAS expander JBOD chassis and 24
more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
10k spindles of IOPS performance from 44 total drives. That's $13K for
a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, IBM
solution of similar specs, each of which will set you back at least 20
large.
Would this setup work well too for serving up VM images? I've been
trying to find a solution for the virtualized app server images as
well, but the distributed FSes currently are all bad with random
reads/writes it seems. XFS seems to be good with large files like db
and vm images with random internal write/read, so given my time
constraints, it would be nice to have a single configuration that
works generally well for all the needs I have to oversee.
Post by Stan Hoeppner
Note the chassis I've spec'd have single PSUs, not the dual or triple
redundant supplies you'll see on branded hardware. With a relatively
stable climate controlled environment and a good UPS with filtering,
quality single supplies are fine. In fact, in the 4U form factor single
supplies are usually more reliable due to superior IC packaging and
airflow through the heatsinks, not to mention much quieter.
Same reason I do my best to avoid 1U servers, the space/heat issues
worry me. Yes, I'm guilty of worrying too much but that has saved me
on several occasions.
Stan Hoeppner
2012-04-10 05:00:19 UTC
Permalink
Post by Emmanuel Noobadmin
Unfortunately, with the usual kind of customers we have here, spending
that kind of budget isn't justifiable. The only reason we're providing
email services is because customers wanted freebies and they felt
there was no reason why we can't give them email on our servers; they
are all "servers" after all.
So I have to make do with OTS commodity parts and free software for
the most part.
OTS meaning you build your own systems from components? Too few in the
business realm do so today. :(

It sounds like budget overrides redundancy then. You can do an NFS
cluster with SAN and GFS2, or two servers with their own storage and
DRBD mirroring. Here's how to do the latter:
http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat
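
The heart of that setup is a single DRBD resource mirroring the mail
volume between the two boxes; a hedged sketch in DRBD 8.3-era config
syntax (hostnames, devices and addresses are placeholders):

resource r0 {
  protocol C;
  on nfs1 {
    device    /dev/drbd0;
    disk      /dev/md0;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on nfs2 {
    device    /dev/drbd0;
    disk      /dev/md0;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}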

The total cost is about the same for each solution, as an iSCSI SAN
array of drive count X costs about the same as two JBOD disk arrays
totaling X*2 drives. Redundancy in this case is expensive no matter the
method. Given how infrequent host failures are, and the fact your
storage is redundant, it may make more sense to simply keep spare
components on hand and swap what fails--PSU, mobo, etc.

Interestingly, I designed a COTS server back in January to handle at
least 5k concurrent IMAP users, using best of breed components. If you
or someone there has the necessary hardware skills, you could assemble
this system and simply use it for NFS instead of Dovecot. The parts list:
secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985

In case the link doesn't work, the core components are:

SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
All other required parts are in the Wish List. I've not written
assembly instructions. I figure anyone who would build this knows what
s/he is doing.

Price today: $5,376.62

Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
you a 10TB net Linux device and 10 stripe spindles of IOPS and
bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read
throughput, however parallel write throughput will be at least 3-6x
slower than RAID10, which is why nobody uses RAID6 for transactional
workloads.

If you need more transactional throughput you could use 20 WD6000HLHX
600GB 10K RPM WD Raptor drives. You'll get 40% more throughput and 6TB
net space with RAID10. They'll cost you $1200 more, or $6,576.62 total.
Well worth the $1200 for 40% more throughput, if 6TB is enough.

Both of the drives I've mentioned here are enterprise class drives,
feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
list. The price of the 600GB Raptor has come down considerably since I
designed this system, or I'd have used them instead.

Anyway, lots of options out there. But $6,500 is pretty damn cheap for a
quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
drives.

The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional
$6,410 you could add an external Norco SAS expander JBOD chassis and 24
more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
10k spindles of IOPS performance from 44 total drives. That's $13K for
a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, IBM
solution of similar specs, each of which will set you back at least 20
large.

Note the chassis I've spec'd have single PSUs, not the dual or triple
redundant supplies you'll see on branded hardware. With a relatively
stable climate controlled environment and a good UPS with filtering,
quality single supplies are fine. In fact, in the 4U form factor single
supplies are usually more reliable due to superior IC packaging and
airflow through the heatsinks, not to mention much quieter.
--
Stan
Emmanuel Noobadmin
2012-04-09 19:15:02 UTC
Permalink
Post by Stan Hoeppner
1. Identify individual current choke points and add individual systems
and storage to eliminate those choke points.
2. Analyze your entire workflow and all systems, identifying all choke
points, then design a completely new well integrated storage
architecture that solves all current problems and addresses future needs
I started to do this and realized I have a serious mess on hand that
makes delving into other people's uncommented source code seem like a
joy :D

Management added to this by deciding that if we're going to offload
the email storage to network storage, we might as well consolidate
everything into that shared storage system so we don't have TBs of
un-utilized space. So I might not even be able to use your tested XFS
+ concat solution since it may not be optimal for VM images and
databases.

As the requirements have changed, I'll stop asking here as it's no longer
really relevant just for Dovecot purposes.
Post by Stan Hoeppner
You are a perfect candidate for VMware ESX. The HA feature will do
exactly what you want. If one physical node in the cluster dies, HA
automatically restarts the dead VMs on other nodes, transparently.
Clients will have to reestablish connections, but everything else
will pretty much be intact. Worst-case scenario will possibly be a few
corrupted mailboxes that were being written when the hardware crashed.
A SAN is required for such a setup.
Thanks for the suggestion, I will need to find some time to look into
this although we've mostly been using KVM for virtualization so far.
Although the "SAN" part will probably prevent this from being accepted
due to cost.
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.
Baptism by fire. Ouch. What doesn't kill you makes you stronger. ;)
True, but I'd hate to be the customer who gets to pick up the pieces
when things explode due to unintended negligence by a dev trying to
level up by multi-classing as an admin.
Post by Stan Hoeppner
physical network interface. You can do some of these things with free
Linux hypervisors, but AFAIK the poor management interfaces for them
make the price of ESX seem like a bargain.
Unfortunately, the usual kind of customers we have here, spending that
kind of budget isn't justifiable. The only reason we're providing
email services is because customers wanted freebies and they felt
there was no reason why we can't give them emails on our servers, they
are all "servers" after all.

So I have to make do with OTS commodity parts and free software for
the most part.
Stan Hoeppner
2012-04-08 18:21:47 UTC
Permalink
Post by Emmanuel Noobadmin
Firstly, thanks for the comprehensive reply. :)
Post by Stan Hoeppner
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in
which case you'd have said "SAN".
I haven't decided on that but it would either be NFS or iSCSI over
Gigabit. I don't exactly get a big budget for this. iSCSI because I
planned to do md/mpath over two separate switches so that if one
switch explodes, the email service would still work.
So it seems you have two courses of action:

1. Identify individual current choke points and add individual systems
and storage to eliminate those choke points.

2. Analyze your entire workflow and all systems, identifying all choke
points, then design a completely new well integrated storage
architecture that solves all current problems and addresses future needs

Adding an NFS server and moving infrequently accessed (old) files to
alternate storage will alleviate your space problems. But it will
probably not fix some of the other problems you mention, such as servers
bogging down and becoming unresponsive, as that's not a space issue.
The cause of that would likely be an IOPS issue, meaning you don't have
enough storage spindles to service requests in a timely manner.
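
A quick way to confirm that on a live box, assuming the sysstat tools
are installed, is to watch the extended device stats--%util pegged
near 100 and await climbing into tens of milliseconds mean the
spindles are saturated:

$ iostat -x 5
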
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Less complexity and cost is always better. CPU throughput isn't a
factor in mail workloads--it's all about IO latency. A 1U NFS server
with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
My worry is that if that one server dies, everything is dead. With at
least a pair of servers, I could keep it running, or if necessary,
restore the accounts on the dead servers from backup, make some config
changes and have everything back running while waiting for replacement
hardware.
You are a perfect candidate for VMware ESX. The HA feature will do
exactly what you want. If one physical node in the cluster dies, HA
automatically restarts the dead VMs on other nodes, transparently.
Clients will have to reestablish connections, but everything else
will pretty much be intact. Worst-case scenario will possibly be a few
corrupted mailboxes that were being written when the hardware crashed.

A SAN is required for such a setup. I had extensive experience with ESX
and HA about 5 years ago and it works as advertised. After 5 years it
can only have improved. It's not "cheap" but usually pays for itself
due to being able to consolidate the workload of dozens of physical
servers into just 2 or 3 boxes.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I'm embarrassed to admit I don't have hard numbers on the user load
except the rapidly dwindling disk space count and the fact that when the
web-based mail application tries to list and check disk quota, it can
bring the servers to a crawl.
Maybe just starting with a description of your current hardware setup
and number of total users/mailboxes would be a good starting point. How
many servers do you have, what storage is connected to each, percent of
MUA POP/IMAP connections from user PCs versus those from webmail
applications, etc, etc.

Probably the single most important piece of information would be the
hardware specs of your current Dovecot server, CPUs/RAM/disk array, etc,
and what version of Dovecot you're running.
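
For example, something like the following on each box would capture
most of it (the doveadm command assumes Dovecot 2.x and a userdb that
supports iteration):

$ dovecot --version
$ free -m; df -h
$ doveadm user '*' | wc -l    # rough account count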

The focus of your email is building a storage server strictly to offload
old mail and free up space on the Dovecot server. From the sound of
things, this may not be sufficient to solve all your problems.
Post by Emmanuel Noobadmin
My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.
Baptism by fire. Ouch. What doesn't kill you makes you stronger. ;)
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage. Ok, so here's the baseline config I threw
One of my concerns is that heavy IO on the same server slows the overall
performance even though the theoretical IOPS of the total drives are
the same on 1 and on X servers. Right now, the servers are usually
screeching to a halt, to the point of even locking out SSH access due
to IOWait sending the load in top to triple digits.
If multiple servers are screeching to a halt due to iowait, either all
of your servers' individual disks are overloaded, or you already have
shared storage. We really need more info on your current architecture.
Right now we don't know if we're talking about 4 servers or 40, 100
users or 10,000.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Some host failure redundancy is about all you'd gain from the farm
setup. Dovecot shouldn't barf due to one NFS node being down, only
hiccup. I.e. only imap processes accessing files on the downed node would
have trouble.
But if I only have one big storage node and that went down, Dovecot
would barf wouldn't it?
Or would the mdbox format mean Dovecot would still use the local
storage, just that users can't access the offloaded messages?
If the big storage node is strictly alt storage, and it dies, Dovecot
will still access its main mdbox storage just fine. It simply wouldn't
be able to access the alt storage and would log errors for those requests.
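
For reference, the alt storage split is just the ALT part of
mail_location--a sketch, with illustrative paths:

mail_location = mdbox:~/mdbox:ALT=/mnt/nfsalt/%u/mdbox

and old mail gets migrated with something like:

$ doveadm altmove -A savedbefore 30d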

If you design a whole new architecture from scratch, going with ESX and
an iSCSI SAN, this whole line of thinking is moot, as there is no SPOF.
This can be done with as little as two physical servers and one iSCSI
SAN array which has dual redundant controllers in the base config.
Depending on your actual IOPS needs, you could possibly consolidate
everything you have now into two physical servers and one iSCSI SAN
array, for between $30-40K USD in hardware and $8-10K in ESX licenses.
That's just a guess on ESX as I don't know the current pricing. Even if
it's that "high" it's far more than worth the price due to the capability.

Such a setup allows you to run all of your Exim, webmail, Dovecot, etc
servers on two machines, and you usually get much better performance
than with individual boxes, especially if you manually place the VMs on
the nodes for lowest network latency. For instance, if you place your
webmail server VM on the same host as the Dovecot VM, TCP packet latency
drops from the high microsecond/low millisecond range into the mid nanosecond
range--a 1000x decrease in latency. Why? The packet transfer is now a
memory-to-memory copy through the hypervisor. The packets never reach a
physical network interface. You can do some of these things with free
Linux hypervisors, but AFAIK the poor management interfaces for them
make the price of ESX seem like a bargain.
Post by Emmanuel Noobadmin
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
Also, I could possibly arrange them in a sort
of network raid 1 to gain redundancy over single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you
are, and brush your hair away from your forehead. I'm coming over with
my branding iron that says "K.I.S.S"
Lol, I have no idea who Charles is, but I always feel safer if there
was some kind of backup. Especially since I don't have the time to
dedicate myself to server administration, by the time I notice
something is bad, it might be too late for anything but the backup.
Search the list archives for Charles' thread about bringing up a 2nd
office site. His desire was/is to duplicate machines at the 2nd site
for redundancy, when the proper thing to do is duplicate them at the
primary site, and simply duplicate the network links between sites. My
point to you and Charles is that you never add complexity for the sake
of adding complexity.
Post by Emmanuel Noobadmin
Of course management and clients don't agree with me since
backup/redundancy costs money. :)
So does gasoline, but even as the price has more than doubled in 3 years
in the States, people haven't stopped buying it. Why? They have to
have it. The case is the same for certain levels of redundancy. You
simply have to have it. Your job is properly explaining that need. Ask
the CEO/CFO how much money the company will lose in productivity if
nobody has email for 1 workday, as that is how long it will take to
rebuild it from scratch and restore all the data when it fails. Then
ask what the cost is if all the email is completely lost because they
were too cheap to pay for a backup solution?

To executives, money in the bank is like the family jewels in their
trousers. Kicking the family jewels and generating that level of pain
seriously gets their attention. Likewise, if a failed server plus
rebuild/restore costs $50K in lost productivity, spending $20K on a
solution to prevent that from happening is a good investment. Explain
it in terms execs understand. Have industry data to back your position.
There's plenty of it available.
--
Stan
Robin
2012-04-07 20:45:08 UTC
Permalink
Post by Stan Hoeppner
Putting XFS on a single RAID1 pair, as you seem to be describing above
for the multiple "thin" node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created. The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key. By
spreading AGs 'horizontally' across the disks in a concat, instead of
1. You dramatically reduce disk head seeking by using the concat array.
With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern. Each user mailbox is stored in a different directory.
Each directory was created in a different AG. So if you have 96 users
writing their dovecot index concurrently, you have at worst case a
minimum 192 head movements occurring back and forth across the entire
platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192
instead of 96? The modification time in the directory metadata must be
updated for each index file, among other things.
Does the XFS allocator automatically distribute AGs in this way even
when disk usage is extremely light, i.e, a freshly formatted system with
user directories initially created, and then the actual mailbox contents
copied into them?

If this is indeed the case, then what you describe is a wondrous
revelation, since you're scaling out the number of simultaneous metadata
reads+writes/second as you add RAID1 pairs, if my understanding of this
is correct. I'm assuming of course, but should look at the code, that
metadata locks imposed by the filesystem "distribute" as the number of
pairs increases - if it's all just one Big Lock, then that wouldn't be
the case.

Forgive my laziness, as I could just experiment and take a look at the
on-disk structures myself, but I don't have four empty drives handy to
experiment.

The bandwidth improvements due to striping (RAID0/5/6 style) are no help
for metadata-intensive IO loads, and probably of little value for even
mdbox loads too, I suspect, unless the mdbox max size is set to
something pretty large, no?

Have you tried other filesystems and seen if they distribute metadata in
a similarly efficient and scalable manner across concatenated drive sets?

Is there ANY point to using striping at all, a la "RAID10" in this? I'd
have thought just making as many RAID1 pairs out of your drives as
possible would be the ideal strategy - is this not the case?

=R=
Stan Hoeppner
2012-04-08 00:46:20 UTC
Permalink
Post by Robin
Post by Stan Hoeppner
Putting XFS on a singe RAID1 pair, as you seem to be describing above
for the multiple "thin" node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created. The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key. By
spreading AGs 'horizontally' across the disks in a concat, instead of
1. You dramatically reduce disk head seeking by using the concat array.
With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern. Each user mailbox is stored in a different directory.
Each directory was created in a different AG. So if you have 96 users
writing their dovecot index concurrently, you have at worst case a
minimum 192 head movements occurring back and forth across the entire
platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192
instead of 96? The modification time in the directory metadata must be
updated for each index file, among other things.
Does the XFS allocator automatically distribute AGs in this way even
when disk usage is extremely light, i.e, a freshly formatted system with
user directories initially created, and then the actual mailbox contents
copied into them?
It doesn't distribute AGs. A static number of AGs is created during
mkfs.xfs. The inode64 allocator round-robins new directory creation
across the AGs, and does the same with files created in those
directories. Having the directory metadata and file extents in the same
AG decreases head movement and thus seek latency for mixed
metadata/extent high IOPS workloads.
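
If you want the AG count pinned to the number of mirror pairs rather
than taking the size-based default, it's a mkfs-time knob, e.g. for a
12-pair concat (count illustrative):

$ mkfs.xfs -d agcount=12 /dev/md0
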
Post by Robin
If this is indeed the case, then what you describe is a wondrous
revelation, since you're scaling out the number of simultaneous metadata
reads+writes/second as you add RAID1 pairs, if my understanding of this
is correct.
Correct. And adding more space and IOPS is uncomplicated. No chunk
calculations, no restriping of the array. You simply grow the md linear
array adding the new disk device. Then grow XFS to add the new free
space to the filesystem. AFAIK this can be done infinitely,
theoretically. I'm guessing md has a device count limit somewhere. If
not, your bash line buffer might. ;)
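
The whole operation is two commands--assuming /dev/md0 is the linear
array, /dev/md9 is the new RAID1 pair, and the filesystem is mounted
at /srv/mail (names invented for illustration):

$ mdadm --grow /dev/md0 --add /dev/md9
$ xfs_growfs /srv/mail
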
Post by Robin
I'm assuming of course, but should look at the code, that
metadata locks imposed by the filesystem "distribute" as the number of
pairs increase - if it's all just one Big Lock, then that wouldn't be
the case.
XFS locking is done as minimally as possible and is insanely fast. I've
not come across any reported performance issues relating to it. And
yes, any single metadata lock will occur in a single AG on one mirror
pair using the concat setup.
Post by Robin
Forgive my laziness, as I could just experiment and take a look at the
on-disk structures myself, but I don't have four empty drives handy to
experiment.
Don't sweat it. All of this stuff is covered in the XFS Filesystem
Structure Guide, exciting reading if you enjoy a root canal while
watching snails race:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html
Post by Robin
The bandwidth improvements due to striping (RAID0/5/6 style) are no help
for metadata-intensive IO loads, and probably of little value for even
mdbox loads too, I suspect, unless the mdbox max size is set to
something pretty large, no?
The problem with striped parity RAID is not allocation, which takes
place in free space and is pretty fast. The problem is the extra read
seeks and bandwidth of the RMW cycle when you modify an existing stripe.
Updating a single flag in a Dovecot index causes md or the hardware
RAID controller to read the entire stripe into buffer space or RAID
cache, modify the flag byte, recalculate parity, then write the whole
stripe and parity block back out across all the disks.

With a linear concat of RAID1 pairs we're simply rewriting a single 4KB
filesystem block, maybe only a single 512B sector. I'm at the edge of
my knowledge here. I don't know exactly how Timo does the index
updates. Regardless of the method, the index update is light years
faster with the concat setup as there is no RMW and full stripe
writeback as with the RAID5/6 case.
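
To put rough numbers on it, with an illustrative 12-drive RAID6 and
64KB chunk:

  full stripe     = 10 x 64KB data + 2 x 64KB parity = 768KB
  one flag update = read 768KB + write 768KB = ~1.5MB of disk I/O
  concat + RAID1  = one 4KB block written to each of 2 mirror disks
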
Post by Robin
Have you tried other filesystems and seen if they distribute metadata in
a similarly efficient and scalable manner across concatenated drive sets?
EXT, any version, does not. ReiserFS does not. Both require disk
striping to achieve any parallelism. With concat they both simply start
writing at the beginning sectors of the first RAID1 pair and 4 years
later maybe reach the last pair as they fill up the volume. ;) JFS has
a more advanced allocation strategy than EXT or ReiserFS, not as
advanced as XFS. I've never read of a concat example with JFS and I've
never tested it. It's all but a dead filesystem at this point anyway,
less than 2 dozen commits in 8 years last I checked, and these were
simple bug fixes and changes to keep it building on new kernels. If
it's not suffering bit rot now I'm sure it will be in the near future.
Post by Robin
Is there ANY point to using striping at all, a la "RAID10" in this? I'd
have thought just making as many RAID1 pairs out of your drives as
possible would be the ideal strategy - is this not the case?
If you're using XFS, and your workload is overwhelmingly mail,
RAID1+concat is the only way to fly, and it flies. If the workload is
not mail, say large file streaming writes, then you're limited to
100-200MB/s, a single drive of throughput, as each file is written to a
single directory on a single AG on a single disk. For streaming write
performance you'll need striping. If you have many concurrent large
streaming writes, you'll want to concat multiple striped arrays.
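
For the archives, a minimal sketch of building such a setup from six
RAID1 pairs (device names and counts are illustrative):

$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  ...repeat for /dev/md2 through /dev/md6...
$ mdadm --create /dev/md0 --level=linear --raid-devices=6 \
    /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6
$ mkfs.xfs /dev/md0
$ mount -o inode64 /dev/md0 /srv/mail
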
--
Stan
Emmanuel Noobadmin
2012-04-07 14:43:09 UTC
Permalink
On 4/7/12, Stan Hoeppner <stan at hardwarefreak.com> wrote:

Firstly, thanks for the comprehensive reply. :)
Post by Stan Hoeppner
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in
which case you'd have said "SAN".
I haven't decided on that but it would either be NFS or iSCSI over
Gigabit. I don't exactly get a big budget for this. iSCSI because I
planned to do md/mpath over two separate switches so that if one
switch explodes, the email service would still work.
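
Something like this is what I had in mind--two portals, one reached
through each switch, mirrored with md (addresses and resulting device
names made up):

$ iscsiadm -m discovery -t sendtargets -p 10.0.1.10
$ iscsiadm -m discovery -t sendtargets -p 10.0.2.10
$ iscsiadm -m node --login
$ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
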
Post by Stan Hoeppner
Less complexity and cost is always better. CPU throughput isn't a
factor in mail workloads--it's all about IO latency. A 1U NFS server
with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
My worry is that if that one server dies, everything is dead. With at
least a pair of servers, I could keep it running, or if necessary,
restore the accounts on the dead servers from backup, make some config
changes and have everything back running while waiting for replacement
hardware.
Post by Stan Hoeppner
I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I'm embarrassed to admit I don't have hard numbers on the user load
except the rapidly dwindling disk space count and the fact that when the
web-based mail application tries to list and check disk quota, it can
bring the servers to a crawl. My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.
Post by Stan Hoeppner
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage. Ok, so here's the baseline config I threw
One of my concerns is that heavy IO on the same server slows the overall
performance even though the theoretical IOPS of the total drives are
the same on 1 and on X servers. Right now, the servers are usually
screeching to a halt, to the point of even locking out SSH access due
to IOWait sending the load in top to triple digits.
Post by Stan Hoeppner
Some host failure redundancy is about all you'd gain from the farm
setup. Dovecot shouldn't barf due to one NFS node being down, only
hiccup. I.e. only imap processes accessing files on the downed node would
have trouble.
But if I only have one big storage node and that went down, Dovecot
would barf wouldn't it?
Or would the mdbox format mean Dovecot would still use the local
storage, just that users can't access the offloaded messages?
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
Also, I could possibly arrange them in a sort
of network raid 1 to gain redundancy over single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you
are, and brush your hair away from your forehead. I'm coming over with
my branding iron that says "K.I.S.S"
Lol, I have no idea who Charles is, but I always feel safer if there
was some kind of backup. Especially since I don't have the time to
dedicate myself to server administration, by the time I notice
something is bad, it might be too late for anything but the backup.

Of course management and clients don't agree with me since
backup/redundancy costs money. :)
Robin
2012-04-07 20:45:08 UTC
Permalink
Post by Stan Hoeppner
Putting XFS on a singe RAID1 pair, as you seem to be describing above
for the multiple "thin" node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created. The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key. By
spreading AGs 'horizontally' across the disks in a concat, instead of
1. You dramatically reduce disk head seeking by using the concat array.
With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern. Each user mailbox is stored in a different directory.
Each directory was created in a different AG. So if you have 96 users
writing their dovecot index concurrently, you have at worst case a
minimum 192 head movements occurring back and forth across the entire
platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192
instead of 96? The modification time in the directory metadata must be
updated for each index file, among other things.
Does the XFS allocator automatically distribute AGs in this way even
when disk usage is extremely light, i.e, a freshly formatted system with
user directories initially created, and then the actual mailbox contents
copied into them?

If this is indeed the case, then what you describe is a wondrous
revelation, since you're scaling out the number of simultaneous metadata
reads+writes/second as you add RAID1 pairs, if my understanding of this
is correct. I'm assuming of course, but should look at the code, that
metadata locks imposed by the filesystem "distribute" as the number of
pairs increase - if it's all just one Big Lock, then that wouldn't be
the case.

Forgive my laziness, as I could just experiment and take a look at the
on-disk structures myself, but I don't have four empty drives handy to
experiment.

The bandwidth improvements due to striping (RAID0/5/6 style) are no help
for metadata-intensive IO loads, and probably of little value for even
mdbox loads too, I suspect, unless the mdbox max size is set to
something pretty large, no?

Have you tried other filesystems and seen if they distribute metadata in
a similarly efficient and scalable manner across concatenated drive sets?

Is there ANY point to using striping at all, a la "RAID10" in this? I'd
have thought just making as many RAID1 pairs out of your drives as
possible would be the ideal strategy - is this not the case?

=R=
Emmanuel Noobadmin
2012-04-07 14:43:09 UTC
Permalink
On 4/7/12, Stan Hoeppner <stan at hardwarefreak.com> wrote:

Firstly, thanks for the comprehensive reply. :)
Post by Stan Hoeppner
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in
which case you'd have said "SAN".
I haven't decided on that but it would either be NFS or iSCSI over
Gigabit. I don't exactly get a big budget for this. iSCSI because I
planned to do md/mpath over two separate switches so that if one
switch explodes, the email service would still work.
Post by Stan Hoeppner
Less complexity and cost is always better. CPU throughput isn't a
factor in mail workloads--it's all about IO latency. A 1U NFS server
with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
My worry is that if that one server dies, everything is dead. With at
least a pair of servers, I could keep it running, or if necessary,
restore the accounts on the dead servers from backup, make some config
changes and have everything back running while waiting for replacement
hardware.
Post by Stan Hoeppner
I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I'm embarrassed to admit I don't have hard numbers on the user load
except the rapidly dwindling disk space count and the fact when the
web-based mail application try to list and check disk quota, it can
bring the servers to a crawl. My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.
Post by Stan Hoeppner
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage. Ok, so here's the baseline config I threw
One of my concern is that heavy IO on the same server slow the overall
performance even though the theoretical IOPS of the total drives are
the same on 1 and on X servers. Right now, the servers are usually
screeching to a halt, to the point of even locking out SSH access due
to IOWait sending the load in top to triple digits.
Post by Stan Hoeppner
Some host failure redundancy is about all you'd gain from the farm
setup. Dovecot shouldn't barf due to one NFS node being down, only
hiccup. I.e. only imap process accessing files on the downed node would
have trouble.
But if I only have one big storage node and that went down, Dovecot
would barf wouldn't it?
Or would the mdbox format mean Dovecot would still use the local
storage, just that users can't access the offloaded messages?
Post by Stan Hoeppner
Post by Emmanuel Noobadmin
Also, I could possibly arrange them in a sort
of network raid 1 to gain redundancy over single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you
are, and brush your hair away from your forehead. I'm coming over with
my branding iron that says "K.I.S.S"
Lol, I have no idea who Charles is, but I always feel safer if there
was some kind of backup. Especially since I don't have the time to
dedicate myself to server administration, by the time I notice
something is bad, it might be too late for anything but the backup.

Of course management and clients don't agree with me since
backup/redundancy costs money. :)
Robin
2012-04-07 20:45:08 UTC
Permalink
Post by Stan Hoeppner
Putting XFS on a singe RAID1 pair, as you seem to be describing above
for the multiple "thin" node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created. The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key. By
spreading AGs 'horizontally' across the disks in a concat, instead of
'vertically' down a striped array, you accomplish two important things:
1. You dramatically reduce disk head seeking by using the concat array.
With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern. Each user mailbox is stored in a different directory.
Each directory was created in a different AG. So if you have 96 users
writing their dovecot index concurrently, you have at worst case a
minimum 192 head movements occurring back and forth across the entire
platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192
instead of 96? The modification time in the directory metadata must be
updated for each index file, among other things.
Does the XFS allocator automatically distribute AGs in this way even
when disk usage is extremely light, i.e., on a freshly formatted system
with the user directories initially created and the actual mailbox
contents then copied into them?
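
I suppose one could at least inspect an existing filesystem without
four spare drives -- if I have the tools right, something like:

# agcount and agsize of the filesystem
$ xfs_info /srv/mail
# the AG column shows which allocation group each extent of a file sits in
$ xfs_bmap -v /srv/mail/someuser/dovecot.index

(paths being whatever is actually mounted, of course).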

If this is indeed the case, then what you describe is a wondrous
revelation, since you're scaling out the number of simultaneous
metadata reads+writes per second as you add RAID1 pairs, if my
understanding is correct. I'm assuming, of course -- though I should
look at the code -- that the metadata locks imposed by the filesystem
'distribute' as the number of pairs increases; if it's all just one Big
Lock, then that wouldn't be the case.

Forgive my laziness, as I could just experiment and take a look at the
on-disk structures myself, but I don't have four empty drives handy to
experiment.

The bandwidth improvements due to striping (RAID0/5/6 style) are no
help for metadata-intensive IO loads, and probably of little value even
for mdbox loads, I suspect, unless the mdbox max size is set to
something pretty large, no?
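
(By "mdbox max size" I mean what I understand to be the rotation
setting -- assuming I have the name right:

# dovecot.conf: mdbox files are rotated once they grow to this size
mdbox_rotate_size = 32M

where I believe the default is only a couple of MB.)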

Have you tried other filesystems and seen if they distribute metadata in
a similarly efficient and scalable manner across concatenated drive sets?

Is there ANY point to using striping at all, a la "RAID10", in this
workload? I'd have thought just making as many RAID1 pairs out of your
drives as possible would be the ideal strategy -- is this not the case?
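
If I did have the drives, I assume the experiment would look roughly
like this (device names made up):

# two md RAID1 pairs...
$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
# ...concatenated with a linear array, then XFS with the default agcount
$ mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/md1 /dev/md2
$ mkfs.xfs /dev/md0
$ mount -o inode64 /dev/md0 /srv/mail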

=R=