[ceph-users] Failure probability with largish deployments

Discussion:

Christian Balzer

2013-12-19 08:39:54 UTC

Hello,

In my "Sanity check" thread I postulated yesterday that to get the same
redundancy and resilience for disk failures (excluding other factors) as
my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
global hotspares, thus 4 OSDs) the "Ceph way" one would need something
like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
dual disk failures) to get similar capacity and a 7th identical node to
allow for node failure/maintenance.

That was basically based on me thinking "must not get caught by a dual
disk failure ever again", as that happened twice to me, once with a RAID5
and the expected consequences, once with a RAID10 where I got lucky (8
disks total each time).

However something was nagging me at the back of my brain and turned out to
be my long forgotten statistics classes in school. ^o^

So after reading some articles that basically said the same things, I found
this: https://www.memset.com/tools/raid-calculator/

Now this is based on assumptions, onto which I will add some more, but the
last sentence on that page still is quite valid.

So let's compare the 2 configurations above. I assumed 75MB/s recovery
speed for the RAID6 configuration, something I've seen in practice.
Basically that's half speed, something that will be lower during busy hours
and higher during off peak hours. I made the same assumption for Ceph with
a 10Gb/s network, assuming 500MB/s recovery/rebalancing speeds.
The rebalancing would have to compete with other replication traffic
(likely not much of an issue) and the actual speed/load of the individual
drives involved. Note that if we assume a totally quiet setup, where 100%
of all resources would be available for recovery, the numbers would of
course change, but NOT their ratios.
I went with the default disk lifetime of 3 years and 0 day replacement
time. The latter of course gives very unrealistic results for anything w/o
hotspare drive, but we're comparing 2 different beasts here.
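
As a rough illustration of what those recovery speeds mean for the exposure
window, here is a minimal sketch. It assumes the rates are meant in MB/s,
decimal units (1 TB = 10^12 bytes), and that a full 3TB drive has to be
rewritten; the function name rebuild_hours is made up for this example.

```python
def rebuild_hours(capacity_tb: float, speed_mb_s: float) -> float:
    """Hours needed to rewrite one drive's worth of data at a steady rate.

    Assumes decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes) and a
    constant recovery rate, which real arrays and clusters will not hold.
    """
    seconds = (capacity_tb * 1e12) / (speed_mb_s * 1e6)
    return seconds / 3600.0


# The two cases from the comparison: RAID6 rebuild and Ceph rebalancing.
for label, speed in (("RAID6 rebuild", 75), ("Ceph recovery", 500)):
    print(f"{label}: {rebuild_hours(3, speed):.1f} hours at {speed} MB/s")
```

The shorter the window, the less time a further failure has to land in it,
which is why the recovery speed matters so much in the numbers that follow.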

So that all said, the results of that page that make sense in this
comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
drive failure in the time before recovery is complete. The replacement
setting of 0 gives us the best possible number, and since one would deploy
a Ceph cluster with sufficient extra capacity, that's what we shall use.

For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
1 in 58497.9 ratio of data loss per year.
Alas for the 70 HDs in the comparable Ceph configuration we wind up with
just a 1 in 13094.31 ratio, which while still quite acceptable clearly
shows where this is going.
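
To see why the ratio drops as the disk count grows, the following is a
simplified sketch and NOT the formula the calculator uses, so it will not
reproduce the figures above. The assumptions are mine: independent drive
failures with a constant rate, the 3 year lifetime treated as mean time to
failure, data loss whenever 2 further drives fail within one recovery
window after a first failure, and (pessimistically for Ceph) any such
triple failure counting as a loss regardless of data placement. The name
annual_loss_probability is hypothetical.

```python
import math

HOURS_PER_YEAR = 8766.0  # average year, including leap years


def annual_loss_probability(n_disks: int, rebuild_hours: float,
                            mttf_years: float = 3.0,
                            extra_failures: int = 2) -> float:
    """Approximate yearly chance of losing data in a group of n_disks.

    Model (an assumption, see text): after any first failure, data is lost
    if at least `extra_failures` of the remaining drives also fail before
    the recovery window of `rebuild_hours` has passed.
    """
    rate = 1.0 / mttf_years                      # failures per drive-year
    window_years = rebuild_hours / HOURS_PER_YEAR
    # Chance that one given drive fails inside a single recovery window.
    p = 1.0 - math.exp(-rate * window_years)
    others = n_disks - 1
    # Binomial tail: at least `extra_failures` of the other drives fail.
    p_fewer = sum(math.comb(others, k) * p**k * (1.0 - p)**(others - k)
                  for k in range(extra_failures))
    p_overlap = 1.0 - p_fewer
    # First failures arrive at n_disks * rate per year; each one opens a
    # window that ends in data loss with probability p_overlap.
    loss_rate = n_disks * rate * p_overlap
    return 1.0 - math.exp(-loss_rate)


# Disk counts from this post; windows follow from 3TB at 75 and 500 MB/s.
for disks, hours in ((12, 11.1), (70, 1.67), (600, 1.67)):
    prob = annual_loss_probability(disks, hours)
    print(f"{disks} disks, {hours} h window: about 1 in {1.0 / prob:.0f} per year")
```

In this model the loss rate grows roughly with the cube of the number of
disks for a fixed window, so the faster recovery only buys back part of
what the larger disk count costs.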

So am I completely off my wagon here?
How do people deal with this when potentially deploying hundreds of disks
in a single cluster/pool?

I mean, when we get to 600 disks (and that's just one rack full, OK,
maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
servers (or 72 disks per 4U if you're happy with killing another drive when
replacing a faulty one in that Supermicro contraption), that ratio is down
to 1 in 21.6, which is way worse than that 8-disk RAID5 I mentioned up there.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/

Wolfgang Hennerbichler

2013-12-19 08:53:58 UTC