Discussion:
[ceph-users] Failure probability with largish deployments
Christian Balzer
2013-12-19 08:39:54 UTC
Hello,

In my "Sanity check" thread I postulated yesterday that to get the same
redundancy and resilience for disk failures (excluding other factors) as
my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
global hotspares, thus 4 OSDs) the "Ceph way" one would need something
like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
dual disk failures) to get similar capacity and a 7th identical node to
allow for node failure/maintenance.

That was basically based on me thinking "must not get caught by a dual
disk failure ever again", as that happened twice to me, once with a RAID5
and the expected consequences, once with a RAID10 where I got lucky (8
disks total each time).

However something was nagging me at the back of my brain and turned out to
be my long forgotten statistics classes in school. ^o^

So after reading some articles that basically all say the same thing, I
found this: https://www.memset.com/tools/raid-calculator/

Now this is based on assumptions, onto which I will add some more, but the
last sentence on that page still is quite valid.

So let's compare the 2 configurations above. I assumed 75MB/s recovery
speed for the RAID6 configuration, something I've seen in practice.
Basically that's half speed, something that will be lower during busy hours
and higher during off peak hours. I made the same assumption for Ceph with
a 10Gb/s network, assuming 500MB/s recovery/rebalancing speeds.
The rebalancing would have to compete with other replication traffic
(likely not much of an issue) and the actual speed/load of the individual
drives involved. Note that if we assume a totally quiet setup, where 100%
of all resources would be available for recovery, the numbers would of
course change, but NOT their ratios.
I went with the default disk lifetime of 3 years and 0 day replacement
time. The latter of course gives very unrealistic results for anything w/o
a hotspare drive, but we're comparing 2 different beasts here.

So that all said, the results of that page that make sense in this
comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
drive failure in the time before recovery is complete, the replacement
setting of 0 giving us the best possible number and since one would deploy
a Ceph cluster with sufficient extra capacity that's what we shall use.

For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
1 in 58497.9 ratio of data loss per year.
Alas for the 70 HDs in the comparable Ceph configuration we wind up with
just a 1 in 13094.31 ratio, which while still quite acceptable clearly
shows where this is going.
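
For readers who want to play with the numbers, here is a rough sketch of this kind of estimate. It is NOT the formula used by the Memset calculator (which is not given in this thread), so it will not reproduce the ratios above; it is a simplified model that assumes independent disk failures at a constant rate and counts any three overlapping failures as data loss. The function names are made up for illustration.

```python
import math

def rebuild_hours(disk_tb, speed_mb_s):
    """Hours needed to rewrite one disk's worth of data at a sustained speed."""
    return disk_tb * 1e12 / (speed_mb_s * 1e6) / 3600

def annual_triple_failure_odds(n_disks, lifetime_years, window_hours):
    """Return X in '1 in X per year' for three overlapping disk failures.

    Assumptions (not from the thread): failures are independent with a
    constant rate of 1/lifetime per disk, and the second and third
    failures must both fall within one recovery window of the first.
    """
    rate_per_hour = 1.0 / (lifetime_years * 365 * 24)
    first_per_year = n_disks / lifetime_years
    p_second = 1 - math.exp(-(n_disks - 1) * rate_per_hour * window_hours)
    p_third = 1 - math.exp(-(n_disks - 2) * rate_per_hour * window_hours)
    return 1 / (first_per_year * p_second * p_third)

# The two setups discussed above: 3TB disks, 3 year lifetime.
raid6 = annual_triple_failure_odds(12, 3, rebuild_hours(3, 75))
ceph = annual_triple_failure_odds(70, 3, rebuild_hours(3, 500))
print(f"RAID6, 12 disks: about 1 in {raid6:.0f} per year")
print(f"Ceph, 70 disks:  about 1 in {ceph:.0f} per year")
```

For Ceph this model is pessimistic, since three failed disks only lose data if they happen to hold all replicas of the same placement group, but it shows the same trend: the odds get worse as the disk count grows, even with a much shorter recovery window.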

So am I completely off my wagon here?
How do people deal with this when potentially deploying hundreds of disks
in a single cluster/pool?

I mean, when we get to 600 disks (and that's just one rack full, OK,
maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
servers (or 72 disks per 4U if you're happy with killing another drive when
replacing a faulty one in that Supermicro contraption), that ratio is down
to 1 in 21.6, which is way worse than that 8-disk RAID5 I mentioned up there.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Wolfgang Hennerbichler
2013-12-19 08:53:58 UTC
Hello,

although I don't know much about this topic, I believe that ceph erasure
encoding will probably solve a lot of these issues with some speed
tradeoff. With erasure encoding the replicated data eats way less disk
capacity, so you could use a higher replication factor with a lower disk
usage tradeoff.

Wolfgang

On 12/19/2013 09:39 AM, Christian Balzer wrote:
>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the same
> redundancy and resilience for disk failures (excluding other factors) as
> my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
> global hotspares, thus 4 OSDs) the "Ceph way" one would need need something
> like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
> dual disk failures) to get the similar capacity and a 7th identical node to
> allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught be a dual
> disk failure ever again", as that happened twice to me, once with a RAID5
> and the expected consequences, once with a RAID10 where I got lucky (8
> disks total each time).
>
> However something was nagging me at the back of my brain and turned out to
> be my long forgotten statistics classes in school. ^o^
>
> So I after reading some articles basically telling the same things I found
> this: https://www.memset.com/tools/raid-calculator/
>
> Now this is based on assumptions, onto which I will add some more, but the
> last sentence on that page still is quite valid.
>
> So lets compare these 2 configurations above, I assumed 75MB/s recovery
> speed for the RAID6 configuration something I've seen in practice.
> Basically that's half speed, something that will be lower during busy hours
> and higher during off peak hours. I made the same assumption for Ceph with
> a 10Gb/s network, assuming 500MB/s recovery/rebalancing speeds.
> The rebalancing would have to compete with other replication traffic
> (likely not much of an issue) and the actual speed/load of the individual
> drives involved. Note that if we assume a totally quiet setup, were 100%
> of all resources would be available for recovery the numbers would of
> course change, but NOT their ratios.
> I went with the default disk lifetime of 3 years and 0 day replacement
> time. The latter of course gives very unrealistic results for anything w/o
> hotspare drive, but we're comparing 2 different beasts here.
>
> So that all said, the results of that page that make sense in this
> comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
> drive failure in the time before recovery is complete, the replacement
> setting of 0 giving us the best possible number and since one would deploy
> a Ceph cluster with sufficient extra capacity that's what we shall use.
>
> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
> 1 in 58497.9 ratio of data loss per year.
> Alas for the 70 HDs in the comparable Ceph configuration we wind up with
> just a 1 in 13094.31 ratio, which while still quite acceptable clearly
> shows where this is going.
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
>
> I mean, when we get too 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disk per 4U if you're happy with killing another drive when
> replacing a faulty one in that Supermicro contraption), that ratio is down
> to 1 in 21.6 which is way worse than that 8disk RAID5 I mentioned up there.
>
> Regards,
>
> Christian
>


--
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbichler at risc-software.at
http://www.risc-software.at
Christian Balzer
2013-12-19 09:11:40 UTC
Hello,

On Thu, 19 Dec 2013 09:53:58 +0100 Wolfgang Hennerbichler wrote:

> Hello,
>
> although I don't know much about this topic, I believe that ceph erasure
> encoding will probably solve a lot of these issues with some speed
> tradeoff. With erasure encoding the replicated data eats way less disk
> capacity, so you could use a higher replication factor with a lower disk
> usage tradeoff.
>
Yeah, I saw erasure encoding mentioned a little while ago, but that's
likely not to be around by the time I'm going to deploy things.
Nevermind that super bleeding edge isn't my style when it comes to
production systems. ^o^

And at something like 600 disks, that would still have to be a mighty high
level of replication to combat failure statistics...

> Wolfgang
>
> On 12/19/2013 09:39 AM, Christian Balzer wrote:
> >
[snip]

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Robert van Leeuwen
2013-12-19 09:50:24 UTC
> Yeah, I saw erasure encoding mentioned a little while ago, but that's
> likely not to be around by the time I'm going to deploy things.
> Nevermind that super bleeding edge isn't my style when it comes to
> production systems. ^o^

> And at something like 600 disks, that would still have to be a mighty high
> level of replication to combat failure statistics...

Not sure if I understand correctly but:
It looks like it currently is a raid 01 kind of solution
So failure domain is a raid 0 and mirror the failure domain to X replicas.
When you have a rep count of 3 you could be unlucky with 3 disks failing in three failure domains at the same time.
If you have enough disks in the cluster the chances are this will happen at some point.

It would make sense that you would be able to create a raid 10 kind of solution:
Where disk1 in failure domain 1 has the same content as disk1 in failure domain 2 and domain 3.
So the PGs that are on one OSD will be exactly mirrored to another OSD in another failure domain.
This would require more uniform hardware and you lose flexibility but you win a lot of reliability.
Without knowing anything about the code base I *think* it should be pretty trivial to change the code to support this and would be a very small change compared to erasure code.

( I looked a bit at crush map Bucket Types but it *seems* that all Bucket types will still stripe the PGs across all nodes within a failure domain )

Cheers,
Robert van Leeuwen
Mariusz Gronczewski
2013-12-19 11:12:13 UTC
On 2013-12-19, at 17:39:54,
Christian Balzer <chibi at gol.com> wrote:

>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the
> same redundancy and resilience for disk failures (excluding other
> factors) as my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node,
> 2 global hotspares, thus 4 OSDs) the "Ceph way" one would need need
> something like 6 nodes with 10 3TB HDs each, 3 way replication (to
> protect against dual disk failures) to get the similar capacity and a
> 7th identical node to allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught be a dual
> disk failure ever again", as that happened twice to me, once with a
> RAID5 and the expected consequences, once with a RAID10 where I got
> lucky (8 disks total each time).



The thing is, in the default config each copy of the data is on a different
physical machine, to allow for maintenance and hardware failures.

In that case, losing 3 disks in one node is much better in a 6 node
cluster than in a 2 node cluster, as the data transfer needed for recovery
is only 1/6th of your dataset, and the time to recovery is also much
shorter as you need to read only 3TB of data from the whole cluster, not
3TB * 9 disks as in RAID6.

The first setup saves you from "3 disks in different machines are dead" at
the cost of much of your IO and a long recovery time.

The second setup has the potential to recover much quicker, as it only needs
to transfer 3TB of data per disk failure to recover to a clean state,
compared to 3TBx9 per RAID disk. Also the impact of one dead node is vastly
lower.
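
To put rough numbers on that comparison, here is a small sketch. The "9 disks read" figure is taken from the text above; the even spread of a failed disk's data across the 59 surviving OSDs of a 6x10 disk cluster is an assumption for illustration, and the function names are made up.

```python
def raid6_rebuild_read_tb(disk_tb, disks_read):
    """Data read to rebuild one failed RAID6 member (9 disks, per the post)."""
    return disk_tb * disks_read

def ceph_per_osd_share_tb(disk_tb, surviving_osds):
    """Average data each surviving OSD handles, assuming the failed
    disk's contents were spread evenly across them (an assumption)."""
    return disk_tb / surviving_osds

raid_tb = raid6_rebuild_read_tb(3, 9)
ceph_share = ceph_per_osd_share_tb(3, 59)
print(f"RAID6 rebuild reads {raid_tb} TB in total")
print(f"Ceph: 3 TB in total, about {ceph_share * 1000:.0f} GB per surviving OSD")
```

The point is not the exact figures but that the Ceph recovery load is both smaller in total and shared by many disks, while the RAID6 rebuild funnels everything into one replacement drive.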

Basically, the first case is better when disks drop dead at exactly the same
time, the second one is better when disks drop within a few hours of
each other.


> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of
> disks in a single cluster/pool?
>
> I mean, when we get too 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disk per 4U if you're happy with killing another drive
> when replacing a faulty one in that Supermicro contraption), that
> ratio is down to 1 in 21.6 which is way worse than that 8disk RAID5 I
> mentioned up there.
>

That problem will only occur if you really want to have all those 600
disks in one pool and it so happens that 3 drives in different servers
unrecoverably die in the same very short time interval, which is unlikely.
But with 60 disks per enclosure, RAIDing them into 4-5 groups probably
makes more sense than running 60 OSDs, just from a memory/CPU usage
standpoint.

From my experience disks rarely "just die"; often a disk either starts
to have bad blocks and write errors, or its performance degrades and it
starts spewing media errors (which usually means you can recover 90%+ of
the data from it if you need to, without using anything more than
ddrescue). Which means Ceph can access most of the data for recovery
and recover just those few missing blocks.

Each pool consists of many PGs, and to make a PG fail all of its disks have
to be hit, so in the worst case you will most likely just lose access to a
small part of the data (the PGs that out of 600 disks happened to be on
those 3), not everything that is on a given array.
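
A rough way to see how likely it is that three particular dead disks share a PG at all: the sketch below assumes every PG is placed on a uniformly random set of 3 OSDs. Real CRUSH placement is constrained by failure domains and weights, so treat this as an illustration only; the PG count of 20000 is a made-up example value, not a recommendation.

```python
import math

def p_triple_hits_a_pg(n_osds, n_pgs, replicas=3):
    """Probability that one given set of `replicas` OSDs holds all
    copies of at least one PG, assuming uniformly random placement."""
    if n_osds < replicas:
        raise ValueError("need at least as many OSDs as replicas")
    combos = math.comb(n_osds, replicas)
    if combos == 1:
        # Only one possible OSD set, so every PG lives on it.
        return 1.0 if n_pgs > 0 else 0.0
    # 1 - (1 - 1/combos) ** n_pgs, computed in a numerically stable way.
    return -math.expm1(n_pgs * math.log1p(-1.0 / combos))

# 600 OSDs, 3 replicas, hypothetical 20000 PGs in the pool.
print(math.comb(600, 3))              # number of possible OSD triples
print(p_triple_hits_a_pg(600, 20000))
```

With few OSDs almost every triple hosts some PG, so a triple failure nearly always loses data; with hundreds of OSDs most triples host none, and a triple that does host one loses only that PG's share of the data.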

And again, that's only in case those disks die at exactly the same moment,
with no time for recovery. Even 60 min between failures will let most of
the data replicate. And in the worst case, there is always a data recovery
service. And backups.




--
Mariusz Gronczewski, Administrator

Efigence Sp. z o. o.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski at efigence.com
<mailto:mariusz.gronczewski at efigence.com>
Christian Balzer
2013-12-20 03:14:04 UTC
Hello,

On Thu, 19 Dec 2013 12:12:13 +0100 Mariusz Gronczewski wrote:

> On 2013-12-19, at 17:39:54,
> Christian Balzer <chibi at gol.com> wrote:
[snip]
>
>
> > So am I completely off my wagon here?
> > How do people deal with this when potentially deploying hundreds of
> > disks in a single cluster/pool?
> >
> > I mean, when we get too 600 disks (and that's just one rack full, OK,
> > maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> > servers (or 72 disk per 4U if you're happy with killing another drive
> > when replacing a faulty one in that Supermicro contraption), that
> > ratio is down to 1 in 21.6 which is way worse than that 8disk RAID5 I
> > mentioned up there.
> >
>
> That problem will only occur if you really want to have all those 600
> disks in one pool and it so happens that 3 drives in different servers
> unrecoverably die in same very short time interval, which is unlikely.
The likelihood of that is in that calculator, some more refined studies
and formulas can be found on the web as well.
And as Gregory acknowledged, with large pools that probability becomes
significant.

> But with 60 disks per enclosue, RAIDing them into 4-5 groups probably
> makes more sense than running 60 OSDs just from memory/cpu usage
> standpoint
>
Yup, I would do that as well, if I were to deploy such a massive system.

> From my experience disks rarely "just die" often it's either starts
> to have bad blocks and write errors or performance degrades and it
> starts spewing media errors (which usually means you can
> recover 90%+ data from it if you need to without using anything more
> than ddrescue). Which means ceph can access most of data for recovery
> and recover just those few missing blocks.
>
I would certainly agree with the fact that with most disks you can see
them becoming marginal by watching smart output, unfortunately this is
quite a job to monitor with many disks and most of the time the on-disk
SMART algorithm will NOT trigger an impending failure status at all or with
sufficient warning time. And some disks really just drop dead, w/o any
warning at all.

> each pool consist of many PGs and to make PG fail all disks had to be
> hit so in worst case you will most likely just lose access to small part
> (that pg that out of 600 disks happened to be on those 3) of data, not
> everything that is on given array.
>
Yes, this is the one thing I'm still not 100% sure about in my understanding
of Ceph.
In my scenario (VM volumes/images of 50GB to 2TB size), I would assume
them to be striped in such a way that there is more than a small impact
from a triple failure.

> And again, that's only in case those disks die exactly at same moment,
> with no time to recovery. even 60 min between failures will let most of
> the data replicate. And in worst case, there is always data recovery
> service. And backups
>
My design goal for all my systems is that backups are for times when
people do stupid things, as in deleted things they shouldn't have. The
actual storage should be reliable enough to survive anything reasonably
expected to happen.
Also on what kind of storage server would you put backups for 60TB, my
planned initial capacity, another Ceph cluster? ^o^

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Wido den Hollander
2013-12-19 11:42:15 UTC
Permalink
On 12/19/2013 09:39 AM, Christian Balzer wrote:
>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the same
> redundancy and resilience for disk failures (excluding other factors) as
> my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
> global hotspares, thus 4 OSDs) the "Ceph way" one would need need something
> like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
> dual disk failures) to get the similar capacity and a 7th identical node to
> allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught be a dual
> disk failure ever again", as that happened twice to me, once with a RAID5
> and the expected consequences, once with a RAID10 where I got lucky (8
> disks total each time).
>
> However something was nagging me at the back of my brain and turned out to
> be my long forgotten statistics classes in school. ^o^
>
> So I after reading some articles basically telling the same things I found
> this: https://www.memset.com/tools/raid-calculator/
>
> Now this is based on assumptions, onto which I will add some more, but the
> last sentence on that page still is quite valid.
>
> So lets compare these 2 configurations above, I assumed 75GB/s recovery
> speed for the RAID6 configuration something I've seen in practice.
> Basically that's half speed, something that will be lower during busy hours
> and higher during off peak hours. I made the same assumption for Ceph with
> a 10Gb/s network, assuming 500GB/s recovery/rebalancing speeds.
> The rebalancing would have to compete with other replication traffic
> (likely not much of an issue) and the actual speed/load of the individual
> drives involved. Note that if we assume a totally quiet setup, were 100%
> of all resources would be available for recovery the numbers would of
> course change, but NOT their ratios.
> I went with the default disk lifetime of 3 years and 0 day replacement
> time. The latter of course gives very unrealistic results for anything w/o
> hotspare drive, but we're comparing 2 different beasts here.
>
> So that all said, the results of that page that make sense in this
> comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
> drive failure in the time before recovery is complete, the replacement
> setting of 0 giving us the best possible number and since one would deploy
> a Ceph cluster with sufficient extra capacity that's what we shall use.
>
> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
> 1 in 58497.9 ratio of data loss per year.
> Alas for the 70 HDs in the comparable Ceph configuration we wind up with
> just a 1 in 13094.31 ratio, which while still quite acceptable clearly
> shows where this is going.
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
>

I'd suggest using different vendors for the disks, which means you'll
probably be mixing Seagate and Western Digital in such a setup.

That way you can also rule out batch issues with disks, and the
likelihood of identical disks failing together becomes smaller as well.

Also, make sure that you define your crushmap so that replicas never end
up on the same physical host and, if possible, not in the same
cabinet/rack.

I would never run with 60 drives in a single machine in a Ceph cluster,
I'd suggest you use more machines with fewer disks per machine.

> I mean, when we get too 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disk per 4U if you're happy with killing another drive when
> replacing a faulty one in that Supermicro contraption), that ratio is down
> to 1 in 21.6 which is way worse than that 8disk RAID5 I mentioned up there.
>
> Regards,
>
> Christian
>


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
Christian Balzer
2013-12-20 03:33:22 UTC
Permalink
Hello,

On Thu, 19 Dec 2013 12:42:15 +0100 Wido den Hollander wrote:

> On 12/19/2013 09:39 AM, Christian Balzer wrote:
[snip]
> >
>
> I'd suggest to use different vendors for the disks, so that means you'll
> probably be mixing Seagate and Western Digital in such a setup.
>
That's funny, because I wouldn't use either of these vendors these days,
in fact it is likely years before I will consider Seagate again, if ever.
WD are in comparison overpriced and of lower performance (I do love the
Velociraptors for certain situations though).
Since WD also bought Hitachi I am currently pretty much stuck with
Toshiba drives. ^.^
That all said, I know where you're coming from and on principle I'd agree.

Also, buying the "same" 3TB disk from different vendors for vastly
differing prices is going to mean a battle with the people paying for the
hardware. ^o^

> In this case you can also rule out batch issues with disks, but the
> likelihood of the same disks failing becomes smaller as well.
>
> Also, make sure that you define your crushmap that replicas never and up
> on the same physical host and if possible not in the same cabinet/rack.
>
One would hope that to be a given, provided the correct input was made.
People seem to be obsessed by rack failures here. In my case everything
(switches, PDUs, dual PSUs per server) is redundant per rack, so no SPOF,
no particular likelihood for a rack to fail in its entirety.

> I would never run with 60 drives in a single machine in a Ceph cluster,
> I'd suggest you use more machines with less disks per machine.
>
This was given as an example to show how quickly and in how little space
you can reach a number of disks that pushes up failure probabilities to
near certainty.
And I would deploy such a cluster if I had the need, simply using n+1 or
n+2 machines. So to be on the safe side, instead of 10 servers deploy 12,
six per rack, and set your full ratio to 80%.
And of course reduce the overhead and failure likelihood further by doing
local RAIDs, for example 4x 14HD RAID6 and 4 global hotspares per node.
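
As a rough sketch of that sizing (disk counts as in the example above,
everything else assumed): with 4x 14HD RAID6 per node, 12 of the 14 disks
in each array hold data, and with n+2 nodes the cluster has to stay below
the full ratio even after two nodes are gone. The helper names are made up
for illustration.

```python
DISK_TB = 3          # 3TB drives, as elsewhere in this thread
ARRAYS_PER_NODE = 4  # 4x 14HD RAID6 per node
DISKS_PER_ARRAY = 14
RAID6_PARITY = 2     # RAID6 spends two disks' worth of space on parity


def node_capacity_tb():
    """Capacity one node presents to Ceph (hotspares hold no data)."""
    return ARRAYS_PER_NODE * (DISKS_PER_ARRAY - RAID6_PARITY) * DISK_TB


def max_safe_fill(nodes, spare_nodes, full_ratio=0.80):
    """Highest cluster-wide fill level at which losing `spare_nodes`
    nodes still leaves the survivors at or below `full_ratio`.
    Assumes data is spread evenly over all nodes."""
    return full_ratio * (nodes - spare_nodes) / nodes


print(node_capacity_tb())              # TB per node, before replication
print(round(max_safe_fill(12, 2), 3))  # 12 nodes, 2 of them "spare"
```

With these numbers each node offers 144TB to Ceph, and a 12 node cluster
would have to stay at or below about two thirds full for the remaining 10
nodes to absorb two failed ones without crossing 80%.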

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Gregory Farnum
2013-12-19 15:20:16 UTC
Permalink
On Thu, Dec 19, 2013 at 12:39 AM, Christian Balzer <chibi at gol.com> wrote:
>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the same
> redundancy and resilience for disk failures (excluding other factors) as
> my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
> global hotspares, thus 4 OSDs) the "Ceph way" one would need need something
> like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
> dual disk failures) to get the similar capacity and a 7th identical node to
> allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught be a dual
> disk failure ever again", as that happened twice to me, once with a RAID5
> and the expected consequences, once with a RAID10 where I got lucky (8
> disks total each time).
>
> However something was nagging me at the back of my brain and turned out to
> be my long forgotten statistics classes in school. ^o^
>
> So I after reading some articles basically telling the same things I found
> this: https://www.memset.com/tools/raid-calculator/
>
> Now this is based on assumptions, onto which I will add some more, but the
> last sentence on that page still is quite valid.
>
> So lets compare these 2 configurations above, I assumed 75GB/s recovery
> speed for the RAID6 configuration something I've seen in practice.
> Basically that's half speed, something that will be lower during busy hours
> and higher during off peak hours. I made the same assumption for Ceph with
> a 10Gb/s network, assuming 500GB/s recovery/rebalancing speeds.
> The rebalancing would have to compete with other replication traffic
> (likely not much of an issue) and the actual speed/load of the individual
> drives involved. Note that if we assume a totally quiet setup, were 100%
> of all resources would be available for recovery the numbers would of
> course change, but NOT their ratios.
> I went with the default disk lifetime of 3 years and 0 day replacement
> time. The latter of course gives very unrealistic results for anything w/o
> hotspare drive, but we're comparing 2 different beasts here.
>
> So that all said, the results of that page that make sense in this
> comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
> drive failure in the time before recovery is complete, the replacement
> setting of 0 giving us the best possible number and since one would deploy
> a Ceph cluster with sufficient extra capacity that's what we shall use.
>
> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
> 1 in 58497.9 ratio of data loss per year.
> Alas for the 70 HDs in the comparable Ceph configuration we wind up with
> just a 1 in 13094.31 ratio, which while still quite acceptable clearly
> shows where this is going.
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
>
> I mean, when we get too 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disk per 4U if you're happy with killing another drive when
> replacing a faulty one in that Supermicro contraption), that ratio is down
> to 1 in 21.6 which is way worse than that 8disk RAID5 I mentioned up there.

I don't know what assumptions that probability calculator is making
(and I think they're overly aggressive about the 3x replication; at
least, if you're seeing 1 in 21.6, that doesn't match previous numbers
I've seen), but yes: as you get larger and larger numbers of disks,
your probabilities of failure go way up. This is a thing that people
with large systems deal with. With the tradeoffs that Ceph makes, you get
about the same mean-time-to-failure as a collection of RAID systems of
equivalent size (recovery times are much shorter, but more disks are
involved whose failure can cause data loss), but you lose much less
data in any given incident.
As Wolfgang mentioned, erasure coded pools will handle this better
because they can tolerate much larger failure counts at a reasonable
disk overhead.
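
To illustrate the overhead argument with a small sketch: a replicated pool
with n copies survives n-1 lost disks per object at n times the raw space,
while a k+m erasure coded layout survives m losses at (k+m)/k times the
raw space. The k=10, m=4 profile below is just an assumed example, not a
recommendation, and the function names are made up.

```python
def replication(copies):
    """Return (failures tolerated, raw bytes stored per byte of data)."""
    return copies - 1, float(copies)


def erasure(k, m):
    """k data chunks plus m coding chunks: any m chunks may be lost."""
    return m, (k + m) / k


print(replication(3))  # 3-way replication
print(erasure(10, 4))  # assumed example profile
```

In this example the erasure coded layout tolerates twice as many failures
as 3-way replication while using less than half the raw space.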
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Gruher, Joseph R
2013-12-19 15:43:16 UTC
Permalink
>-----Original Message-----
>From: ceph-users-bounces at lists.ceph.com [mailto:ceph-users-
>bounces at lists.ceph.com] On Behalf Of Gregory Farnum
>Sent: Thursday, December 19, 2013 7:20 AM
>To: Christian Balzer
>Cc: ceph-users at lists.ceph.com
>Subject: Re: [ceph-users] Failure probability with largish deployments
>
>On Thu, Dec 19, 2013 at 12:39 AM, Christian Balzer <chibi at gol.com> wrote:
>>
>> Hello,
>>
>> In my "Sanity check" thread I postulated yesterday that to get the
>> same redundancy and resilience for disk failures (excluding other
>> factors) as my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node,
>> 2 global hotspares, thus 4 OSDs) the "Ceph way" one would need need
>> something like 6 nodes with 10 3TB HDs each, 3 way replication (to
>> protect against dual disk failures) to get the similar capacity and a
>> 7th identical node to allow for node failure/maintenance.
>>
>> That was basically based on me thinking "must not get caught be a dual
>> disk failure ever again", as that happened twice to me, once with a
>> RAID5 and the expected consequences, once with a RAID10 where I got
>> lucky (8 disks total each time).
>>
>> However something was nagging me at the back of my brain and turned
>> out to be my long forgotten statistics classes in school. ^o^
>>
>> So I after reading some articles basically telling the same things I
>> found
>> this: https://www.memset.com/tools/raid-calculator/
>>
>> Now this is based on assumptions, onto which I will add some more, but
>> the last sentence on that page still is quite valid.
>>
>> So lets compare these 2 configurations above, I assumed 75GB/s
>> recovery speed for the RAID6 configuration something I've seen in practice.
>> Basically that's half speed, something that will be lower during busy
>> hours and higher during off peak hours. I made the same assumption for
>> Ceph with a 10Gb/s network, assuming 500GB/s recovery/rebalancing
>speeds.
>> The rebalancing would have to compete with other replication traffic
>> (likely not much of an issue) and the actual speed/load of the
>> individual drives involved. Note that if we assume a totally quiet
>> setup, were 100% of all resources would be available for recovery the
>> numbers would of course change, but NOT their ratios.
>> I went with the default disk lifetime of 3 years and 0 day replacement
>> time. The latter of course gives very unrealistic results for anything
>> w/o hotspare drive, but we're comparing 2 different beasts here.
>>
>> So that all said, the results of that page that make sense in this
>> comparison are the RAID6 +1 hotspare numbers. As in, how likely is a
>> 3rd drive failure in the time before recovery is complete, the
>> replacement setting of 0 giving us the best possible number and since
>> one would deploy a Ceph cluster with sufficient extra capacity that's what
>we shall use.
>>
>> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
>> 1 in 58497.9 ratio of data loss per year.
>> Alas for the 70 HDs in the comparable Ceph configuration we wind up
>> with just a 1 in 13094.31 ratio, which while still quite acceptable
>> clearly shows where this is going.
>>
>> So am I completely off my wagon here?
>> How do people deal with this when potentially deploying hundreds of
>> disks in a single cluster/pool?
>>
>> I mean, when we get too 600 disks (and that's just one rack full, OK,
>> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
>> servers (or 72 disk per 4U if you're happy with killing another drive
>> when replacing a faulty one in that Supermicro contraption), that
>> ratio is down to 1 in 21.6 which is way worse than that 8disk RAID5 I
>mentioned up there.
>
>I don't know what assumptions that probability calculator is making (and I
>think they're overly aggressive about the 3x replication, at least if you're
>seeing 1 in 21.6 that doesn't match previous numbers I've seen), but yes: as
>you get larger and larger numbers of disks, your probabilities of failure go way
>up. This is a thing that people with large systems deal with. The tradeoffs that
>Ceph makes, you get about the same mean-time-to-failure as a collection of
>RAID systems of equivalent size (recovery times are much shorter, but more
>disks are involved whose failure can cause data loss), but you lose much less
>data in any given incident.
>As Wolfgang mentioned, erasure coded pools will handle this better because
>they can provide much larger failure counts in a reasonable disk overhead.
>-Greg

It seems like this calculation ignores that in a large Ceph cluster with triple replication having three drive failures doesn't automatically guarantee data loss (unlike a RAID6 array)? If your data is triple replicated and a copy of a given piece of data exists on three separate disks in the cluster, and you have three disks fail, the odds of them being the only three disks with copies of that data should be pretty low for a very large number of disks.

For the 600 disk cluster, after the first disk fails you'd have a 2 in 599 chance of losing the second copy when the second disk fails, then a 1 in 598 chance of losing the third copy when the third disk fails, so even assuming a triple disk failure has already happened don't you still have something like a 99.9994% chance that you didn't lose all copies of your data? And then if there's only a 1 in 21.6 chance of having a triple disk failure outpace recovery in the first place that gets you to something like 99.99997% reliability?
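
Spelling that arithmetic out as a quick sketch, under the same
simplification (one piece of data with exactly three copies on three
random disks out of 600, and the 1 in 21.6 figure quoted above taken at
face value):

```python
from fractions import Fraction

DISKS = 600

# Given that one failed disk held a copy: the second failure must hit one of
# the 2 remaining copies (out of 599 disks), the third the last copy (of 598).
p_all_copies = Fraction(2, DISKS - 1) * Fraction(1, DISKS - 2)

# Probability of a triple failure outpacing recovery, as quoted in the thread.
p_triple = Fraction(10, 216)  # 1 in 21.6

p_loss = p_triple * p_all_copies

print(float(p_all_copies))
print(float(p_loss))
```

That comes to about 5.6e-6 for the conditional part and about 2.6e-7
overall, but only for this one piece of data; it says nothing yet about
the many other PGs living on the same three disks.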
Wolfgang Hennerbichler
2013-12-19 19:39:07 UTC
Permalink
On 19 Dec 2013, at 16:43, Gruher, Joseph R <joseph.r.gruher at intel.com> wrote:

> It seems like this calculation ignores that in a large Ceph cluster with triple replication having three drive failures doesn't automatically guarantee data loss (unlike a RAID6 array)?

not true with RBD images, which are potentially striped across the whole cluster.
Wido den Hollander
2013-12-19 20:01:47 UTC
Permalink
On 12/19/2013 08:39 PM, Wolfgang Hennerbichler wrote:
> On 19 Dec 2013, at 16:43, Gruher, Joseph R <joseph.r.gruher at intel.com> wrote:
>
>> It seems like this calculation ignores that in a large Ceph cluster with triple replication having three drive failures doesn't automatically guarantee data loss (unlike a RAID6 array)?
>
> not true with RBD images, which are potentially striped across the whole cluster.

You could always increase the order so that you end up with 64MB objects
instead of 4MB objects.

This would be order 26 instead of 22. You would have fewer objects, but
you will lose more data if you lose one PG holding such an object.
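
For reference, the RBD order is the base-2 logarithm of the object size,
so a minimal sketch of the numbers involved looks like this (the 2TB image
size is just the largest volume size mentioned in this thread, and the
function names are made up):

```python
def object_size(order):
    """Object size in bytes for a given RBD order (2**order)."""
    return 1 << order


def object_count(image_bytes, order):
    """Number of objects needed to back a fully written image."""
    size = object_size(order)
    return (image_bytes + size - 1) // size


MB, TB = 2**20, 2**40
for order in (22, 26):
    print(order, object_size(order) // MB, "MB",
          object_count(2 * TB, order), "objects")
```

Going from order 22 to 26 cuts the object count of a 2TB image by a
factor of 16, and each lost object then takes 64MB with it instead of 4MB.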

> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
Christian Balzer
2013-12-20 03:50:54 UTC
Permalink
On Thu, 19 Dec 2013 21:01:47 +0100 Wido den Hollander wrote:

> On 12/19/2013 08:39 PM, Wolfgang Hennerbichler wrote:
> > On 19 Dec 2013, at 16:43, Gruher, Joseph R <joseph.r.gruher at intel.com>
> > wrote:
> >
> >> It seems like this calculation ignores that in a large Ceph cluster
> >> with triple replication having three drive failures doesn't
> >> automatically guarantee data loss (unlike a RAID6 array)?
> >
> > not true with RBD images, which are potentially striped across the
> > whole cluster.
>
> You could always increase the order so that you end up with 64MB objects
> instead of 4MB objects.
>
It is my understanding that this would likely significantly impact
performance of those RBD images...

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Johannes Formann
2013-12-19 21:42:37 UTC
Permalink
On 19.12.2013 at 20:39, Wolfgang Hennerbichler <wogri at wogri.com> wrote:
> On 19 Dec 2013, at 16:43, Gruher, Joseph R <joseph.r.gruher at intel.com> wrote:
>
>> It seems like this calculation ignores that in a large Ceph cluster with triple replication having three drive failures doesn't automatically guarantee data loss (unlike a RAID6 array)?
>
> not true with RBD images, which are potentially striped across the whole cluster.

Even in this case it only results in data loss if all three existing copies of the data are affected. In a large cluster this is very unlikely with only three disks.
Three successive disk failures become "common" if the cluster size is large enough, but they will usually only affect disks serving disjoint sets of PGs, so for the data on each disk there are still enough copies.
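
A back-of-the-envelope sketch of that argument, assuming each PG sits on a
uniformly random set of three distinct disks (real CRUSH placement is
constrained by hosts and weights, so this is only an approximation) and an
assumed 20000 PGs on 600 disks:

```python
from math import comb


def p_triple_shares_pg(disks, pgs, replicas=3):
    """Probability that one specific set of `replicas` failed disks
    holds every copy of at least one PG, with random PG placement."""
    p_one_pg = 1 / comb(disks, replicas)
    return 1 - (1 - p_one_pg) ** pgs


print(p_triple_shares_pg(600, 20000))
```

With these assumed numbers only about 1 triple failure in 1800 would take
out a complete PG, but the more PGs the pool has relative to the number of
possible disk triples, the closer that gets to certainty.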

regards

Johannes
Wido den Hollander
2013-12-19 20:01:47 UTC
Permalink
On 12/19/2013 08:39 PM, Wolfgang Hennerbichler wrote:
> On 19 Dec 2013, at 16:43, Gruher, Joseph R <joseph.r.gruher at intel.com> wrote:
>
>> It seems like this calculation ignores that in a large Ceph cluster with triple replication having three drive failures doesn't automatically guarantee data loss (unlike a RAID6 array)?
>
> not true with RBD images, which are potentially striped across the whole cluster.

You could always increase the order so that you end up with 64MB objects
instead of 4MB objects.

This would be order 26 instead of 22. You would have fewer objects, but
you will lose more data if you lose one PG holding that object.
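
The relation between order and object size is a power of two (size = 2^order bytes), so the two orders above can be checked with a short sketch. The helper names and the 50 GiB image size are made up for illustration.

```python
def rbd_object_size(order: int) -> int:
    """Object size in bytes for a given RBD order (size = 2**order)."""
    return 1 << order

def objects_for_image(image_bytes: int, order: int) -> int:
    """Number of objects needed to back an image, rounding up."""
    size = rbd_object_size(order)
    return -(-image_bytes // size)   # ceiling division

image = 50 * 1024**3  # hypothetical 50 GiB image
for order in (22, 26):
    size_mib = rbd_object_size(order) // 2**20
    print(f"order {order}: {size_mib} MiB objects, {objects_for_image(image, order)} objects")
```

Going from order 22 to order 26 cuts the object count by a factor of 16, and each lost object then takes 16 times as much data with it.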

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
Christian Balzer
2013-12-20 03:48:38 UTC
Permalink
Hello,

On Thu, 19 Dec 2013 15:43:16 +0000 Gruher, Joseph R wrote:

[snip]
>
> It seems like this calculation ignores that in a large Ceph cluster with
> triple replication having three drive failures doesn't automatically
> guarantee data loss (unlike a RAID6 array)? If your data is triple
> replicated and a copy of a given piece of data exists on three
> separate disks in the cluster, and you have three disks fail, the odds
> of it being the only three disks with copies of that data should be
> pretty low for a very large number of disks. For the 600 disk cluster,
> after the first disk fails you'd have a 2 in 599 chance of losing the
> second copy when the second disk fails, then a 1 in 598 chance of losing
> the third copy when the third disk fails, so even assuming a triple disk
> failure has already happened don't you still have something like a
> 99.94% chance that you didn't lose all copies of your data? And then if
> there's only a 1 in 21.6 chance of having a triple disk failure outpace
> recovery in the first place that gets you to something like 99.997%
> reliability?
>
I think putting that number into perspective with a real event unfolding
right now, in a data center that's not local and where no monkeys are
available, might help.
A 24-disk server, RAID6, one hotspare. It is 4 years old now, its crappy
Seagates are failing, and 6 have already been replaced.
One drive failed 2 days ago; yesterday nobody was available to go there and
swap a fresh one in; last night the next drive failed, and now somebody is
dashing there with 2 spares. ^o^
More often than not the additional strain of recovery will push disks over
the edge, on top of the increased likelihood of clustered failures with
certain drives or when they reach certain ages.

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Wolfgang Hennerbichler
2013-12-19 19:39:07 UTC
Permalink
On 19 Dec 2013, at 16:43, Gruher, Joseph R <joseph.r.gruher at intel.com> wrote:

> It seems like this calculation ignores that in a large Ceph cluster with triple replication having three drive failures doesn't automatically guarantee data loss (unlike a RAID6 array)?

not true with RBD images, which are potentially striped across the whole cluster.
Gruher, Joseph R
2013-12-19 15:43:16 UTC
Permalink
>-----Original Message-----
>From: ceph-users-bounces at lists.ceph.com [mailto:ceph-users-
>bounces at lists.ceph.com] On Behalf Of Gregory Farnum
>Sent: Thursday, December 19, 2013 7:20 AM
>To: Christian Balzer
>Cc: ceph-users at lists.ceph.com
>Subject: Re: [ceph-users] Failure probability with largish deployments
>
>On Thu, Dec 19, 2013 at 12:39 AM, Christian Balzer <chibi at gol.com> wrote:
>>
>> Hello,
>>
>> In my "Sanity check" thread I postulated yesterday that to get the
>> same redundancy and resilience for disk failures (excluding other
>> factors) as my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node,
>> 2 global hotspares, thus 4 OSDs) the "Ceph way" one would need
>> something like 6 nodes with 10 3TB HDs each, 3 way replication (to
>> protect against dual disk failures) to get the similar capacity and a
>> 7th identical node to allow for node failure/maintenance.
>>
>> That was basically based on me thinking "must not get caught by a dual
>> disk failure ever again", as that happened twice to me, once with a
>> RAID5 and the expected consequences, once with a RAID10 where I got
>> lucky (8 disks total each time).
>>
>> However something was nagging me at the back of my brain and turned
>> out to be my long forgotten statistics classes in school. ^o^
>>
>> So after reading some articles basically telling the same things I
>> found this: https://www.memset.com/tools/raid-calculator/
>>
>> Now this is based on assumptions, onto which I will add some more, but
>> the last sentence on that page still is quite valid.
>>
>> So let's compare these 2 configurations above. I assumed 75MB/s
>> recovery speed for the RAID6 configuration, something I've seen in practice.
>> Basically that's half speed, something that will be lower during busy
>> hours and higher during off peak hours. I made the same assumption for
>> Ceph with a 10Gb/s network, assuming 500MB/s recovery/rebalancing
>> speeds.
>> The rebalancing would have to compete with other replication traffic
>> (likely not much of an issue) and the actual speed/load of the
>> individual drives involved. Note that if we assume a totally quiet
>> setup, where 100% of all resources would be available for recovery, the
>> numbers would of course change, but NOT their ratios.
>> I went with the default disk lifetime of 3 years and 0 day replacement
>> time. The latter of course gives very unrealistic results for anything
>> w/o hotspare drive, but we're comparing 2 different beasts here.
>>
>> So that all said, the results of that page that make sense in this
>> comparison are the RAID6 +1 hotspare numbers. As in, how likely is a
>> 3rd drive failure in the time before recovery is complete, the
>> replacement setting of 0 giving us the best possible number and since
>> one would deploy a Ceph cluster with sufficient extra capacity, that's
>> what we shall use.
>>
>> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
>> 1 in 58497.9 ratio of data loss per year.
>> Alas for the 70 HDs in the comparable Ceph configuration we wind up
>> with just a 1 in 13094.31 ratio, which while still quite acceptable
>> clearly shows where this is going.
>>
>> So am I completely off my wagon here?
>> How do people deal with this when potentially deploying hundreds of
>> disks in a single cluster/pool?
>>
>> I mean, when we get to 600 disks (and that's just one rack full, OK,
>> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
>> servers (or 72 disk per 4U if you're happy with killing another drive
>> when replacing a faulty one in that Supermicro contraption), that
>> ratio is down to 1 in 21.6 which is way worse than that 8-disk RAID5 I
>> mentioned up there.
>
>I don't know what assumptions that probability calculator is making (and I
>think they're overly aggressive about the 3x replication, at least if you're
>seeing 1 in 21.6 that doesn't match previous numbers I've seen), but yes: as
>you get larger and larger numbers of disks, your probabilities of failure go way
>up. This is a thing that people with large systems deal with. With the tradeoffs
>that Ceph makes, you get about the same mean-time-to-failure as a collection of
>RAID systems of equivalent size (recovery times are much shorter, but more
>disks are involved whose failure can cause data loss), but you lose much less
>data in any given incident.
>As Wolfgang mentioned, erasure coded pools will handle this better because
>they can tolerate much larger failure counts at a reasonable disk overhead.
>-Greg

It seems like this calculation ignores that in a large Ceph cluster with triple replication, having three drive failures doesn't automatically guarantee data loss (unlike a RAID6 array)? If your data is triple replicated and a copy of a given piece of data exists on three separate disks in the cluster, and you have three disks fail, the odds of them being the only three disks with copies of that data should be pretty low for a very large number of disks. For the 600 disk cluster, after the first disk fails you'd have a 2 in 599 chance of losing the second copy when the second disk fails, then a 1 in 598 chance of losing the third copy when the third disk fails, so even assuming a triple disk failure has already happened don't you still have something like a 99.94% chance that you didn't lose all copies of your data? And then if there's only a 1 in 21.6 chance of having a triple disk failure outpace recovery in the first place that gets you to something like 99.997% reliability?
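
The arithmetic in this argument can be checked with a few lines. This is only a sketch of the reasoning as stated: it looks at a single piece of data with one copy on the first failed disk, treats the later failures as uniformly random picks among the remaining disks, and takes the 1 in 21.6 figure from the calculator at face value. The function name is made up for illustration.

```python
def p_lose_all_copies(num_disks: int) -> float:
    """Chance that the 2nd and 3rd failures hit the two remaining replicas
    of a piece of data whose first replica was on the first failed disk."""
    p_second = 2 / (num_disks - 1)   # second failure hits one of 2 remaining copies
    p_third = 1 / (num_disks - 2)    # third failure hits the last copy
    return p_second * p_third

p_triple_failure = 1 / 21.6          # calculator figure quoted in the thread
p_data_lost = p_lose_all_copies(600)
print(f"given a triple failure: {p_data_lost:.3e}")
print(f"combined per year:      {p_triple_failure * p_data_lost:.3e}")
```

Note that this covers one piece of data only; an RBD image spread over many PGs multiplies the number of replica triples that could be hit.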
Kyle Bader
2013-12-20 21:37:18 UTC
Permalink
Using your data as inputs to the Ceph reliability calculator [1]
results in the following:

Disk Modeling Parameters
size: 3TiB
FIT rate: 826 (MTBF = 138.1 years)
NRE rate: 1.0E-16
RAID parameters
replace: 6 hours
recovery rate: 500MiB/s (100 minutes)
NRE model: fail
object size: 4MiB

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage      durability  PL(site)   PL(copies)  PL(NRE)    PL(rep)    loss/PiB
-----------  ----------  ---------  ----------  ---------  ---------  ---------
RAID-6: 9+2  6-nines     0.000e+00  2.763e-10   0.000011%  0.000e+00  9.317e+07


Disk Modeling Parameters
size: 3TiB
FIT rate: 826 (MTBF = 138.1 years)
NRE rate: 1.0E-16
RADOS parameters
auto mark-out: 10 minutes
recovery rate: 50MiB/s (40 seconds/drive)
osd fullness: 75%
declustering: 1100 PG/OSD
NRE model: fail
object size: 4MB
stripe length: 1100

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage       durability  PL(site)    PL(copies)  PL(NRE)     PL(rep)     loss/PiB
------------  ----------  ----------  ----------  ----------  ----------  ----------
RADOS: 3 cp   10-nines    0.000e+00   5.232e-08   0.000116%   0.000e+00   6.486e+03

[1] https://github.com/ceph/ceph-tools/tree/master/models/reliability

--

Kyle
Christian Balzer
2013-12-22 14:03:15 UTC
Permalink
Hello Kyle,

On Fri, 20 Dec 2013 13:37:18 -0800 Kyle Bader wrote:

> Using your data as inputs to the Ceph reliability calculator [1]
> results in the following:
>
I shall have to (literally, as in GIT) check that out next week...

However before that, some questions to help me understand what we are
measuring here and how.

For starters, I do really have a hard time figuring out what an "object" in
Ceph terminology is and I read the Architecture section of the
documentation page at least twice, along with many other resources.

Is an object a CephFS file or a RBD image or is it the 4MB blob on the
actual OSD FS?

In my case, I'm only looking at RBD images for KVM volume storage, even
given the default striping configuration I would assume that those 12500
OSD objects for a 50GB image would not be in the same PG and thus just on
3 (with 3 replicas set) OSDs total?

More questions inline below:
> Disk Modeling Parameters
> size: 3TiB
> FIT rate: 826 (MTBF = 138.1 years)
> NRE rate: 1.0E-16
> RAID parameters
> replace: 6 hours
> recovery rate: 500MiB/s (100 minutes)
> NRE model: fail
> object size: 4MiB
>
> Column legends
> 1 storage unit/configuration being modeled
> 2 probability of object survival (per 1 years)
> 3 probability of loss due to site failures (per 1 years)
> 4 probability of loss due to drive failures (per 1 years)
> 5 probability of loss due to NREs during recovery (per 1 years)
> 6 probability of loss due to replication failure (per 1 years)
> 7 expected data loss per Petabyte (per 1 years)
>
> storage       durability  PL(site)    PL(copies)  PL(NRE)     PL(rep)     loss/PiB
> ------------  ----------  ----------  ----------  ----------  ----------  ----------
> RAID-6: 9+2   6-nines     0.000e+00   2.763e-10   0.000011%   0.000e+00   9.317e+07
>
>

What amount of disks (OSDs) did you punch in for the following run?
> Disk Modeling Parameters
> size: 3TiB
> FIT rate: 826 (MTBF = 138.1 years)
> NRE rate: 1.0E-16
> RADOS parameters
> auto mark-out: 10 minutes
> recovery rate: 50MiB/s (40 seconds/drive)
Blink???
I guess that goes back to the number of disks, but to restore 2.25GB at
50MB/s with 40 seconds per drive...

> osd fullness: 75%
> declustering: 1100 PG/OSD
> NRE model: fail
> object size: 4MB
> stripe length: 1100
I take it that is to mean that any RBD volume of sufficient size is indeed
spread over all disks?

>
> Column legends
> 1 storage unit/configuration being modeled
> 2 probability of object survival (per 1 years)
> 3 probability of loss due to site failures (per 1 years)
> 4 probability of loss due to drive failures (per 1 years)
> 5 probability of loss due to NREs during recovery (per 1 years)
> 6 probability of loss due to replication failure (per 1 years)
> 7 expected data loss per Petabyte (per 1 years)
>
> storage       durability  PL(site)    PL(copies)  PL(NRE)     PL(rep)     loss/PiB
> ------------  ----------  ----------  ----------  ----------  ----------  ----------
> RADOS: 3 cp   10-nines    0.000e+00   5.232e-08   0.000116%   0.000e+00   6.486e+03
>
> [1] https://github.com/ceph/ceph-tools/tree/master/models/reliability
>

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Kyle Bader
2013-12-22 15:44:31 UTC
Permalink
> Is an object a CephFS file or a RBD image or is it the 4MB blob on the
> actual OSD FS?

Objects are at the RADOS level; CephFS filesystems, RBD images and RGW
objects are all composed by striping RADOS objects - default is 4MB.

> In my case, I'm only looking at RBD images for KVM volume storage, even
> given the default striping configuration I would assume that those 12500
> OSD objects for a 50GB image would not be in the same PG and thus just on
> 3 (with 3 replicas set) OSDs total?

Objects are striped across placement groups, so you take your RBD size
/ 4 MB and cap it at the total number of placement groups in your
cluster.

> What amount of disks (OSDs) did you punch in for the following run?
>> Disk Modeling Parameters
>> size: 3TiB
>> FIT rate: 826 (MTBF = 138.1 years)
>> NRE rate: 1.0E-16
>> RADOS parameters
>> auto mark-out: 10 minutes
>> recovery rate: 50MiB/s (40 seconds/drive)
> Blink???
> I guess that goes back to the number of disks, but to restore 2.25GB at
> 50MB/s with 40 seconds per drive...

The surviving replicas for placement groups that the failed OSDs
participated in will naturally be distributed across many OSDs in the
cluster. When the failed OSD is marked out, its replicas will be
remapped to many OSDs. It's not a 1:1 replacement like you might find
in a RAID array.

>> osd fullness: 75%
>> declustering: 1100 PG/OSD
>> NRE model: fail
>> object size: 4MB
>> stripe length: 1100
> I take it that is to mean that any RBD volume of sufficient size is indeed
> spread over all disks?

Spread over all placement groups, the difference is subtle but there
is a difference.

--

Kyle
Christian Balzer
2013-12-22 16:46:11 UTC
Permalink
Hello,

On Sun, 22 Dec 2013 07:44:31 -0800 Kyle Bader wrote:

> > Is an object a CephFS file or a RBD image or is it the 4MB blob on the
> > actual OSD FS?
>
> Objects are at the RADOS level, CephFS filesystems, RBD images and RGW
> objects are all composed by striping RADOS objects - default is 4MB.
>
Good, that clears that up and confirms how I figured it worked.

> > In my case, I'm only looking at RBD images for KVM volume storage, even
> > given the default striping configuration I would assume that those
> > 12500 OSD objects for a 50GB image would not be in the same PG and
> > thus just on 3 (with 3 replicas set) OSDs total?
>
> Objects are striped across placement groups, so you take your RBD size
> / 4 MB and cap it at the total number of placement groups in your
> cluster.
>

Yes, that also makes perfect sense, so the aforementioned 12500 objects
for a 50GB image, at a 60 TB cluster/pool with 72 disk/OSDs and 3 way
replication that makes 2400 PGs, following the recommended formula.
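
The numbers in this paragraph follow from two small calculations, sketched below. The sketch assumes the commonly cited rule of thumb of about 100 PGs per OSD divided by the replica count, decimal units for the 50GB image and 4MB objects (which is what yields 12500), and hypothetical function names.

```python
def object_count(image_bytes: int, object_bytes: int = 4 * 10**6) -> int:
    """Number of RADOS objects needed to hold an image (ceiling division)."""
    return -(-image_bytes // object_bytes)


def recommended_pgs(osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count: (OSDs * 100) / replicas."""
    return osds * pgs_per_osd // replicas


def pgs_touched(objects: int, total_pgs: int) -> int:
    """Upper bound on distinct PGs an image can land in."""
    return min(objects, total_pgs)


objects = object_count(50 * 10**9)           # 50GB image
total_pgs = recommended_pgs(osds=72, replicas=3)
print(objects, total_pgs, pgs_touched(objects, total_pgs))
```

The Ceph documentation also suggests rounding the PG count to a power of two, which this sketch leaves out so that it matches the 2400 quoted here.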


> > What amount of disks (OSDs) did you punch in for the following run?
> >> Disk Modeling Parameters
> >> size: 3TiB
> >> FIT rate: 826 (MTBF = 138.1 years)
> >> NRE rate: 1.0E-16
> >> RADOS parameters
> >> auto mark-out: 10 minutes
> >> recovery rate: 50MiB/s (40 seconds/drive)
> > Blink???
> > I guess that goes back to the number of disks, but to restore 2.25GB at
> > 50MB/s with 40 seconds per drive...
>
> The surviving replicas for placement groups that the failed OSDs
> participated in will naturally be distributed across many OSDs in the
> cluster. When the failed OSD is marked out, its replicas will be
> remapped to many OSDs. It's not a 1:1 replacement like you might find
> in a RAID array.
>
I completely get that part, however the total amount of data to be
rebalanced after a single disk/OSD failure to fully restore redundancy is
still 2.25TB (mistyped that as GB earlier) at the 75% utilization you
assumed.
What I'm still missing in this picture is how many disks (OSDs) you
calculated this with. Maybe I'm just misreading the 40 seconds per drive
bit there. Because if that means each drive is only required to be
active for 40 seconds to do its bit of recovery, we're talking 1100
drives. ^o^ 1100 PGs would be another story.

> >> osd fullness: 75%
> >> declustering: 1100 PG/OSD
> >> NRE model: fail
> >> object size: 4MB
> >> stripe length: 1100
> > I take it that is to mean that any RBD volume of sufficient size is
> > indeed spread over all disks?
>
> Spread over all placement groups, the difference is subtle but there
> is a difference.
>
Right, it isn't exactly a 1:1 match from what I saw/read.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Kyle Bader
2013-12-26 22:17:06 UTC
Permalink
> Yes, that also makes perfect sense, so the aforementioned 12500 objects
> for a 50GB image, at a 60 TB cluster/pool with 72 disk/OSDs and 3 way
> replication that makes 2400 PGs, following the recommended formula.
>
>> > What amount of disks (OSDs) did you punch in for the following run?
>> >> Disk Modeling Parameters
>> >> size: 3TiB
>> >> FIT rate: 826 (MTBF = 138.1 years)
>> >> NRE rate: 1.0E-16
>> >> RADOS parameters
>> >> auto mark-out: 10 minutes
>> >> recovery rate: 50MiB/s (40 seconds/drive)
>> > Blink???
>> > I guess that goes back to the number of disks, but to restore 2.25GB at
>> > 50MB/s with 40 seconds per drive...
>>
>> The surviving replicas for placement groups that the failed OSDs
>> participated in will naturally be distributed across many OSDs in the
>> cluster. When the failed OSD is marked out, its replicas will be
>> remapped to many OSDs. It's not a 1:1 replacement like you might find
>> in a RAID array.
>>
> I completely get that part, however the total amount of data to be
> rebalanced after a single disk/OSD failure to fully restore redundancy is
> still 2.25TB (mistyped that as GB earlier) at the 75% utilization you
> assumed.
> What I'm still missing in this picture is how many disks (OSDs) you
> calculated this with. Maybe I'm just misreading the 40 seconds per drive
> bit there. Because if that means each drive is only required to be
> active for 40 seconds to do its bit of recovery, we're talking 1100
> drives. ^o^ 1100 PGs would be another story.

To recreate the modeling:

git clone https://github.com/ceph/ceph-tools.git
cd ceph-tools/models/reliability/
python main.py -g

I used the following values:

Disk Type: Enterprise
Size: 3000 GiB
Primary FITs: 826
Secondary FITS: 826
NRE Rate: 1.0E-16

RAID Type: RAID6
Replace (hours): 6
Rebuild (MiB/s): 500
Volumes: 11

RADOS Copies: 3
Mark-out (min): 10
Recovery (MiB/s): 50
Space Usage: 75%
Declustering (pg): 1100
Stripe length: 1100 (limited by pgs anyway)

RADOS sites: 1
Rep Latency (s): 0
Recovery (MiB/s): 10
Disaster (years): 1000
Site Recovery (days): 30

NRE Model: Fail
Period (years): 1
Object Size: 4MB

It seems that the number of disks is not considered when calculating
the recovery window, only the number of PGs:

https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L68
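
One reading that reproduces the reported "40 seconds/drive" figure: take the data on a 75% full 3TiB drive, copy it at 50MiB/s, and divide the work evenly across 1100 placement groups recovering in parallel. This is an assumption about what the model does, not a transcription of RadosRely.py, and the function name is hypothetical.

```python
MIB_PER_TIB = 1024 * 1024


def recovery_seconds(size_tib: float, fullness: float, rate_mib_s: float,
                     parallel_streams: int = 1) -> float:
    """Time to re-replicate one drive's data, split over parallel streams."""
    data_mib = size_tib * MIB_PER_TIB * fullness
    return data_mib / rate_mib_s / parallel_streams


# One stream copying all 2.25TiB at 50MiB/s
serial = recovery_seconds(3, 0.75, 50)
# Same data divided across 1100 PGs, each recovering at 50MiB/s
declustered = recovery_seconds(3, 0.75, 50, parallel_streams=1100)

print(f"serial:      {serial / 3600:.1f} hours")
print(f"declustered: {declustered:.0f} seconds")
```

Under this reading the result depends only on the declustering factor, so a 72-OSD cluster and an 1100-OSD cluster would get the same recovery window, which is the limitation noted above.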

I could also see the recovery rates varying based on the max osd
backfill tunable.

http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling

Doing both would improve the quality of models generated by the tool.

--

Kyle
Loic Dachary
2013-12-22 19:01:27 UTC
Permalink
Hi Kyle,

It would be great if you could share how you invoked the tool. I'm tempted to play with it and an example would help a great deal :-)

Cheers

On 20/12/2013 22:37, Kyle Bader wrote:
> Using your data as inputs to in the Ceph reliability calculator [1]
> results in the following:
>
> Disk Modeling Parameters
> size: 3TiB
> FIT rate: 826 (MTBF = 138.1 years)
> NRE rate: 1.0E-16
> RAID parameters
> replace: 6 hours
> recovery rate: 500MiB/s (100 minutes)
> NRE model: fail
> object size: 4MiB
>
> Column legends
> 1 storage unit/configuration being modeled
> 2 probability of object survival (per 1 years)
> 3 probability of loss due to site failures (per 1 years)
> 4 probability of loss due to drive failures (per 1 years)
> 5 probability of loss due to NREs during recovery (per 1 years)
> 6 probability of loss due to replication failure (per 1 years)
> 7 expected data loss per Petabyte (per 1 years)
>
> storage durability PL(site) PL(copies)
> PL(NRE) PL(rep) loss/PiB
> ---------- ---------- ---------- ----------
> ---------- ---------- ----------
> RAID-6: 9+2 6-nines 0.000e+00 2.763e-10
> 0.000011% 0.000e+00 9.317e+07
>
>
> Disk Modeling Parameters
> size: 3TiB
> FIT rate: 826 (MTBF = 138.1 years)
> NRE rate: 1.0E-16
> RADOS parameters
> auto mark-out: 10 minutes
> recovery rate: 50MiB/s (40 seconds/drive)
> osd fullness: 75%
> declustering: 1100 PG/OSD
> NRE model: fail
> object size: 4MB
> stripe length: 1100
>
> Column legends
> 1 storage unit/configuration being modeled
> 2 probability of object survival (per 1 years)
> 3 probability of loss due to site failures (per 1 years)
> 4 probability of loss due to drive failures (per 1 years)
> 5 probability of loss due to NREs during recovery (per 1 years)
> 6 probability of loss due to replication failure (per 1 years)
> 7 expected data loss per Petabyte (per 1 years)
>
> storage durability PL(site) PL(copies)
> PL(NRE) PL(rep) loss/PiB
> ---------- ---------- ---------- ----------
> ---------- ---------- ----------
> RADOS: 3 cp 10-nines 0.000e+00 5.232e-08
> 0.000116% 0.000e+00 6.486e+03
>
> [1] https://github.com/ceph/ceph-tools/tree/master/models/reliability
>

--
Loïc Dachary, Artisan Logiciel Libre

Christian Balzer
2013-12-22 14:03:15 UTC
Permalink
Hello Kyle,

On Fri, 20 Dec 2013 13:37:18 -0800 Kyle Bader wrote:

> Using your data as inputs to in the Ceph reliability calculator [1]
> results in the following:
>
I shall have to (literally, as in GIT) check that out next week...

However before that, some questions to help me understand what we are
measuring here and how.

For starters, I really have a hard time figuring out what an "object" in
Ceph terminology is, and I have read the Architecture section of the
documentation at least twice, along with many other resources.

Is an object a CephFS file or an RBD image, or is it the 4MB blob on the
actual OSD FS?

In my case, I'm only looking at RBD images for KVM volume storage. Even
given the default striping configuration, I would assume that those 12500
OSD objects for a 50GB image would not all be in the same PG, and thus not
on just 3 (with 3 replicas set) OSDs in total?

More questions inline below:
> Disk Modeling Parameters
> size: 3TiB
> FIT rate: 826 (MTBF = 138.1 years)
> NRE rate: 1.0E-16
> RAID parameters
> replace: 6 hours
> recovery rate: 500MiB/s (100 minutes)
> NRE model: fail
> object size: 4MiB
>
> Column legends
> 1 storage unit/configuration being modeled
> 2 probability of object survival (per 1 years)
> 3 probability of loss due to site failures (per 1 years)
> 4 probability of loss due to drive failures (per 1 years)
> 5 probability of loss due to NREs during recovery (per 1 years)
> 6 probability of loss due to replication failure (per 1 years)
> 7 expected data loss per Petabyte (per 1 years)
>
> storage durability PL(site) PL(copies)
> PL(NRE) PL(rep) loss/PiB
> ---------- ---------- ---------- ----------
> ---------- ---------- ----------
> RAID-6: 9+2 6-nines 0.000e+00 2.763e-10
> 0.000011% 0.000e+00 9.317e+07
>
>

How many disks (OSDs) did you punch in for the following run?
> Disk Modeling Parameters
> size: 3TiB
> FIT rate: 826 (MTBF = 138.1 years)
> NRE rate: 1.0E-16
> RADOS parameters
> auto mark-out: 10 minutes
> recovery rate: 50MiB/s (40 seconds/drive)
Blink???
I guess that goes back to the number of disks, but to restore 2.25TB at
50MB/s with 40 seconds per drive...

> osd fullness: 75%
> declustering: 1100 PG/OSD
> NRE model: fail
> object size: 4MB
> stripe length: 1100
I take it that is to mean that any RBD volume of sufficient size is indeed
spread over all disks?

>
> Column legends
> 1 storage unit/configuration being modeled
> 2 probability of object survival (per 1 years)
> 3 probability of loss due to site failures (per 1 years)
> 4 probability of loss due to drive failures (per 1 years)
> 5 probability of loss due to NREs during recovery (per 1 years)
> 6 probability of loss due to replication failure (per 1 years)
> 7 expected data loss per Petabyte (per 1 years)
>
> storage durability PL(site) PL(copies)
> PL(NRE) PL(rep) loss/PiB
> ---------- ---------- ---------- ----------
> ---------- ---------- ----------
> RADOS: 3 cp 10-nines 0.000e+00 5.232e-08
> 0.000116% 0.000e+00 6.486e+03
>
> [1] https://github.com/ceph/ceph-tools/tree/master/models/reliability
>

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Wolfgang Hennerbichler
2013-12-19 08:53:58 UTC
Permalink
Hello,

although I don't know much about this topic, I believe that Ceph erasure
coding will probably solve a lot of these issues, with some speed
tradeoff. With erasure coding the redundant data eats far less disk
capacity, so you could tolerate more simultaneous disk failures while
using less raw disk space than replication.
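
To put rough numbers on that tradeoff, here is a small sketch comparing raw-space overhead. The k/m values are hypothetical examples chosen for illustration, not profiles recommended anywhere in this thread:

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data with N-way replication."""
    return float(copies)


def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with k data + m coding chunks."""
    return (k + m) / k


def tolerated_failures_replication(copies: int) -> int:
    """Copies that can be lost while one copy still survives."""
    return copies - 1


if __name__ == "__main__":
    print("3 copies:", replication_overhead(3), "x raw, survives",
          tolerated_failures_replication(3), "losses")
    # Hypothetical erasure-code layouts; m is the number of tolerated losses.
    for k, m in [(4, 2), (10, 4)]:
        print(f"k={k}, m={m}:", erasure_overhead(k, m), "x raw, survives",
              m, "losses")
```

With these example values, k=10/m=4 survives four chunk losses at 1.4x raw space, where 3-way replication survives two losses at 3x.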

Wolfgang

On 12/19/2013 09:39 AM, Christian Balzer wrote:
>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the same
> redundancy and resilience for disk failures (excluding other factors) as
> my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
> global hotspares, thus 4 OSDs) the "Ceph way" one would need need something
> like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
> dual disk failures) to get the similar capacity and a 7th identical node to
> allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught be a dual
> disk failure ever again", as that happened twice to me, once with a RAID5
> and the expected consequences, once with a RAID10 where I got lucky (8
> disks total each time).
>
> However something was nagging me at the back of my brain and turned out to
> be my long forgotten statistics classes in school. ^o^
>
> So I after reading some articles basically telling the same things I found
> this: https://www.memset.com/tools/raid-calculator/
>
> Now this is based on assumptions, onto which I will add some more, but the
> last sentence on that page still is quite valid.
>
> So lets compare these 2 configurations above, I assumed 75GB/s recovery
> speed for the RAID6 configuration something I've seen in practice.
> Basically that's half speed, something that will be lower during busy hours
> and higher during off peak hours. I made the same assumption for Ceph with
> a 10Gb/s network, assuming 500GB/s recovery/rebalancing speeds.
> The rebalancing would have to compete with other replication traffic
> (likely not much of an issue) and the actual speed/load of the individual
> drives involved. Note that if we assume a totally quiet setup, were 100%
> of all resources would be available for recovery the numbers would of
> course change, but NOT their ratios.
> I went with the default disk lifetime of 3 years and 0 day replacement
> time. The latter of course gives very unrealistic results for anything w/o
> hotspare drive, but we're comparing 2 different beasts here.
>
> So that all said, the results of that page that make sense in this
> comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
> drive failure in the time before recovery is complete, the replacement
> setting of 0 giving us the best possible number and since one would deploy
> a Ceph cluster with sufficient extra capacity that's what we shall use.
>
> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
> 1 in 58497.9 ratio of data loss per year.
> Alas for the 70 HDs in the comparable Ceph configuration we wind up with
> just a 1 in 13094.31 ratio, which while still quite acceptable clearly
> shows where this is going.
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
>
> I mean, when we get too 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disk per 4U if you're happy with killing another drive when
> replacing a faulty one in that Supermicro contraption), that ratio is down
> to 1 in 21.6 which is way worse than that 8disk RAID5 I mentioned up there.
>
> Regards,
>
> Christian
>


--
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbichler at risc-software.at
http://www.risc-software.at
Mariusz Gronczewski
2013-12-19 11:12:13 UTC
Permalink
On 2013-12-19 at 17:39:54,
Christian Balzer <chibi at gol.com> wrote:

>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the
> same redundancy and resilience for disk failures (excluding other
> factors) as my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node,
> 2 global hotspares, thus 4 OSDs) the "Ceph way" one would need need
> something like 6 nodes with 10 3TB HDs each, 3 way replication (to
> protect against dual disk failures) to get the similar capacity and a
> 7th identical node to allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught be a dual
> disk failure ever again", as that happened twice to me, once with a
> RAID5 and the expected consequences, once with a RAID10 where I got
> lucky (8 disks total each time).



The thing is, in the default config each copy of the data is on a
different physical machine, to allow for maintenance and hardware
failures.

In that case, losing 3 disks in one node is much better in a 6-node
cluster than in a 2-node cluster, as the data transfer needed for
recovery is only 1/6th of your dataset. Time to recovery is also much
shorter, as you only need to read 3TB of data from the whole cluster,
not 3TB * 9 disks as with RAID6.

The first setup saves you from "3 disks in different machines are dead"
at the cost of much of your IO and a long recovery time.

The second setup has the potential to recover much quicker, as it only
needs to transfer 3TB of data per disk failure to get back to a clean
state, compared to 3TB x 9 per RAID disk. The impact of one dead node
is also vastly lower.

Basically, the first case is better when disks drop dead at exactly the
same time; the second is better when disks drop within a few hours of
each other.
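
The 3TB versus 3TB * 9 comparison can be written out as a quick sketch. It assumes an 11-disk RAID6 (9 data + 2 parity) and full 3TB disks; the figure of 50 source OSDs is made up to show how the read load thins out when it is spread evenly:

```python
def raid6_rebuild_read_tb(disk_tb: float, data_disks: int) -> float:
    """Data read from surviving members to rebuild one failed RAID6 disk."""
    return disk_tb * data_disks


def replica_recovery_read_tb(disk_tb: float, fullness: float = 1.0) -> float:
    """Data read cluster-wide to re-replicate one failed OSD's contents."""
    return disk_tb * fullness


def per_source_read_tb(total_tb: float, sources: int) -> float:
    """Read load per OSD if recovery reads are spread evenly (assumption)."""
    return total_tb / sources


if __name__ == "__main__":
    raid = raid6_rebuild_read_tb(3.0, 9)
    ceph = replica_recovery_read_tb(3.0)
    print(f"RAID6 rebuild reads {raid} TB, replica recovery reads {ceph} TB")
    # 50 source OSDs is a hypothetical figure for illustration only.
    print(f"per source OSD: {per_source_read_tb(ceph, 50):.2f} TB")
```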


> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of
> disks in a single cluster/pool?
>
> I mean, when we get too 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disk per 4U if you're happy with killing another drive
> when replacing a faulty one in that Supermicro contraption), that
> ratio is down to 1 in 21.6 which is way worse than that 8disk RAID5 I
> mentioned up there.
>

That problem will only occur if you really want to have all those 600
disks in one pool and it so happens that 3 drives in different servers
die unrecoverably within the same very short time interval, which is
unlikely. But with 60 disks per enclosure, RAIDing them into 4-5 groups
probably makes more sense than running 60 OSDs, just from a memory/CPU
usage standpoint.

In my experience disks rarely "just die". Usually a disk either starts
to have bad blocks and write errors, or its performance degrades and it
starts spewing media errors (which usually means you can recover 90%+
of the data from it if you need to, without using anything more than
ddrescue). That means Ceph can access most of the data for recovery and
only has to recover those few missing blocks.

Each pool consists of many PGs, and for a PG to fail all of its disks
have to be hit. So in the worst case you will most likely just lose
access to a small part of the data (the PGs that, out of 600 disks,
happened to be on those 3), not everything that is on a given array.
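
A very simplified sketch of that last point follows. It assumes every PG is placed on a uniformly random set of 3 distinct OSDs, which ignores CRUSH rules and host-level failure domains; the PG count of 32768 is a made-up example, so treat the numbers as an order-of-magnitude illustration only:

```python
from math import comb


def osd_triples(n_osds: int, replicas: int = 3) -> int:
    """Number of distinct OSD sets a PG could be placed on."""
    return comb(n_osds, replicas)


def prob_any_pg_on_failed_set(n_osds: int, n_pgs: int, replicas: int = 3) -> float:
    """Chance that at least one PG lives entirely on one specific failed set."""
    p_single = 1 / osd_triples(n_osds, replicas)
    return 1 - (1 - p_single) ** n_pgs


if __name__ == "__main__":
    n_osds = 600
    hypothetical_pgs = 32768   # illustrative value, not from the thread
    print("possible OSD triples:", osd_triples(n_osds))
    print("P(some PG lost | 3 specific disks die):",
          prob_any_pg_on_failed_set(n_osds, hypothetical_pgs))
```

Under these assumptions, three simultaneous failures among 600 disks usually hit no complete PG at all, and when they do, only the PGs on exactly that set are affected.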

And again, that's only the case if those disks die at exactly the same
moment, with no time for recovery. Even 60 minutes between failures will
let most of the data replicate. And in the worst case, there is always a
data recovery service. And backups.




--
Mariusz Gronczewski, Administrator

Efigence Sp. z o. o.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski at efigence.com
<mailto:mariusz.gronczewski at efigence.com>
Wido den Hollander
2013-12-19 11:42:15 UTC
Permalink
On 12/19/2013 09:39 AM, Christian Balzer wrote:
>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the same
> redundancy and resilience for disk failures (excluding other factors) as
> my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
> global hotspares, thus 4 OSDs) the "Ceph way" one would need need something
> like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
> dual disk failures) to get the similar capacity and a 7th identical node to
> allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught be a dual
> disk failure ever again", as that happened twice to me, once with a RAID5
> and the expected consequences, once with a RAID10 where I got lucky (8
> disks total each time).
>
> However something was nagging me at the back of my brain and turned out to
> be my long forgotten statistics classes in school. ^o^
>
> So I after reading some articles basically telling the same things I found
> this: https://www.memset.com/tools/raid-calculator/
>
> Now this is based on assumptions, onto which I will add some more, but the
> last sentence on that page still is quite valid.
>
> So lets compare these 2 configurations above, I assumed 75GB/s recovery
> speed for the RAID6 configuration something I've seen in practice.
> Basically that's half speed, something that will be lower during busy hours
> and higher during off peak hours. I made the same assumption for Ceph with
> a 10Gb/s network, assuming 500GB/s recovery/rebalancing speeds.
> The rebalancing would have to compete with other replication traffic
> (likely not much of an issue) and the actual speed/load of the individual
> drives involved. Note that if we assume a totally quiet setup, were 100%
> of all resources would be available for recovery the numbers would of
> course change, but NOT their ratios.
> I went with the default disk lifetime of 3 years and 0 day replacement
> time. The latter of course gives very unrealistic results for anything w/o
> hotspare drive, but we're comparing 2 different beasts here.
>
> So that all said, the results of that page that make sense in this
> comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
> drive failure in the time before recovery is complete, the replacement
> setting of 0 giving us the best possible number and since one would deploy
> a Ceph cluster with sufficient extra capacity that's what we shall use.
>
> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
> 1 in 58497.9 ratio of data loss per year.
> Alas for the 70 HDs in the comparable Ceph configuration we wind up with
> just a 1 in 13094.31 ratio, which while still quite acceptable clearly
> shows where this is going.
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
>

I'd suggest using disks from different vendors, so that means you'll
probably be mixing Seagate and Western Digital in such a setup.

That way you can rule out batch issues with disks, and the likelihood
of identical disks failing at the same time becomes smaller as well.

Also, make sure that you define your crushmap so that replicas never
end up on the same physical host and, if possible, not in the same
cabinet/rack.

I would never run with 60 drives in a single machine in a Ceph cluster;
I'd suggest you use more machines with fewer disks per machine.

> I mean, when we get too 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disk per 4U if you're happy with killing another drive when
> replacing a faulty one in that Supermicro contraption), that ratio is down
> to 1 in 21.6 which is way worse than that 8disk RAID5 I mentioned up there.
>
> Regards,
>
> Christian
>


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
Gregory Farnum
2013-12-19 15:20:16 UTC
Permalink
On Thu, Dec 19, 2013 at 12:39 AM, Christian Balzer <chibi at gol.com> wrote:
>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the same
> redundancy and resilience for disk failures (excluding other factors) as
> my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
> global hotspares, thus 4 OSDs) the "Ceph way" one would need need something
> like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
> dual disk failures) to get the similar capacity and a 7th identical node to
> allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught be a dual
> disk failure ever again", as that happened twice to me, once with a RAID5
> and the expected consequences, once with a RAID10 where I got lucky (8
> disks total each time).
>
> However something was nagging me at the back of my brain and turned out to
> be my long forgotten statistics classes in school. ^o^
>
> So I after reading some articles basically telling the same things I found
> this: https://www.memset.com/tools/raid-calculator/
>
> Now this is based on assumptions, onto which I will add some more, but the
> last sentence on that page still is quite valid.
>
> So lets compare these 2 configurations above, I assumed 75GB/s recovery
> speed for the RAID6 configuration something I've seen in practice.
> Basically that's half speed, something that will be lower during busy hours
> and higher during off peak hours. I made the same assumption for Ceph with
> a 10Gb/s network, assuming 500GB/s recovery/rebalancing speeds.
> The rebalancing would have to compete with other replication traffic
> (likely not much of an issue) and the actual speed/load of the individual
> drives involved. Note that if we assume a totally quiet setup, were 100%
> of all resources would be available for recovery the numbers would of
> course change, but NOT their ratios.
> I went with the default disk lifetime of 3 years and 0 day replacement
> time. The latter of course gives very unrealistic results for anything w/o
> hotspare drive, but we're comparing 2 different beasts here.
>
> So that all said, the results of that page that make sense in this
> comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
> drive failure in the time before recovery is complete, the replacement
> setting of 0 giving us the best possible number and since one would deploy
> a Ceph cluster with sufficient extra capacity that's what we shall use.
>
> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
> 1 in 58497.9 ratio of data loss per year.
> Alas for the 70 HDs in the comparable Ceph configuration we wind up with
> just a 1 in 13094.31 ratio, which while still quite acceptable clearly
> shows where this is going.
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
>
> I mean, when we get too 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disk per 4U if you're happy with killing another drive when
> replacing a faulty one in that Supermicro contraption), that ratio is down
> to 1 in 21.6 which is way worse than that 8disk RAID5 I mentioned up there.

I don't know what assumptions that probability calculator is making
(and I think they're overly aggressive about the 3x replication; at
least, your 1 in 21.6 doesn't match previous numbers I've seen), but
yes: as you get larger and larger numbers of disks, your probabilities
of failure go way up. This is a thing that people with large systems
deal with. With the tradeoffs that Ceph makes, you get about the same
mean-time-to-failure as a collection of RAID systems of equivalent
size (recovery times are much shorter, but more disks are involved
whose failure can cause data loss), but you lose much less data in any
given incident.
As Wolfgang mentioned, erasure coded pools will handle this better
because they can tolerate much larger failure counts at a reasonable
disk overhead.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
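
The overhead argument for erasure coding can be illustrated with simple arithmetic. This is only a sketch: the k=10, m=4 profile is a hypothetical example, not a recommendation from the thread, and the function names are made up for illustration.

```python
def replication(copies: int) -> tuple[float, int]:
    """Return (raw bytes stored per usable byte, disk failures tolerated)."""
    return float(copies), copies - 1


def erasure_coding(k: int, m: int) -> tuple[float, int]:
    """k data chunks plus m coding chunks: any m chunks may be lost."""
    return (k + m) / k, m


if __name__ == "__main__":
    schemes = {
        "3x replication": replication(3),
        "EC k=10, m=4 (hypothetical)": erasure_coding(10, 4),
    }
    for name, (overhead, tolerated) in schemes.items():
        print(f"{name}: {overhead:.2f}x raw space, survives {tolerated} failures")
```

In this example the erasure coded layout survives twice as many failures as 3x replication while using less than half the raw space, at the cost of more work per read and during recovery.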
Kyle Bader
2013-12-20 21:37:18 UTC
Permalink
Using your data as inputs to the Ceph reliability calculator [1]
results in the following:

Disk Modeling Parameters
size: 3TiB
FIT rate: 826 (MTBF = 138.1 years)
NRE rate: 1.0E-16
RAID parameters
replace: 6 hours
recovery rate: 500MiB/s (100 minutes)
NRE model: fail
object size: 4MiB

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage      durability  PL(site)   PL(copies)  PL(NRE)    PL(rep)    loss/PiB
-----------  ----------  ---------  ----------  ---------  ---------  ---------
RAID-6: 9+2  6-nines     0.000e+00  2.763e-10   0.000011%  0.000e+00  9.317e+07


Disk Modeling Parameters
size: 3TiB
FIT rate: 826 (MTBF = 138.1 years)
NRE rate: 1.0E-16
RADOS parameters
auto mark-out: 10 minutes
recovery rate: 50MiB/s (40 seconds/drive)
osd fullness: 75%
declustering: 1100 PG/OSD
NRE model: fail
object size: 4MB
stripe length: 1100

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage      durability  PL(site)   PL(copies)  PL(NRE)    PL(rep)    loss/PiB
-----------  ----------  ---------  ----------  ---------  ---------  ---------
RADOS: 3 cp  10-nines    0.000e+00  5.232e-08   0.000116%  0.000e+00  6.486e+03

[1] https://github.com/ceph/ceph-tools/tree/master/models/reliability
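
Some of the derived figures in the parameter blocks above can be checked by hand. The sketch below uses the standard definition of FIT (failures per 10^9 device hours) and plain size/rate arithmetic; the function names are made up here, and the per-PG reading of the "40 seconds/drive" figure is an assumption about how the tool arrives at it, not something confirmed by its documentation.

```python
HOURS_PER_YEAR = 24 * 365.25


def fit_to_mtbf_years(fit: float) -> float:
    """FIT is failures per 1e9 device hours, so MTBF in hours is 1e9 / FIT."""
    return 1e9 / fit / HOURS_PER_YEAR


def rebuild_minutes(size_tib: float, rate_mib_s: float) -> float:
    """Time to copy a whole drive at a fixed recovery rate."""
    return size_tib * 1024 * 1024 / rate_mib_s / 60


def per_pg_seconds(size_tib: float, fullness: float, pgs: int,
                   rate_mib_s: float) -> float:
    """Assumed reading: each PG's share of the drive is recovered in parallel."""
    return size_tib * 1024 * 1024 * fullness / pgs / rate_mib_s


if __name__ == "__main__":
    print(f"MTBF: {fit_to_mtbf_years(826):.1f} years")               # 138.1
    print(f"RAID rebuild: {rebuild_minutes(3, 500):.1f} min")         # 104.9
    print(f"RADOS per PG: {per_pg_seconds(3, 0.75, 1100, 50):.1f} s")  # 42.9
```

The first value matches the reported MTBF, and the other two land close to the rounded "100 minutes" and "40 seconds/drive" shown in the output.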

--

Kyle
Christian Balzer
2013-12-19 08:39:54 UTC
Permalink
Hello,

In my "Sanity check" thread I postulated yesterday that to get the same
redundancy and resilience for disk failures (excluding other factors) as
my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node, 2
global hotspares, thus 4 OSDs) the "Ceph way" one would need need something
like 6 nodes with 10 3TB HDs each, 3 way replication (to protect against
dual disk failures) to get the similar capacity and a 7th identical node to
allow for node failure/maintenance.

That was basically based on me thinking "must not get caught by a dual
disk failure ever again", as that happened twice to me, once with a RAID5
and the expected consequences, once with a RAID10 where I got lucky (8
disks total each time).

However something was nagging me at the back of my brain and turned out to
be my long forgotten statistics classes in school. ^o^

So after reading some articles basically telling the same things, I found
this: https://www.memset.com/tools/raid-calculator/

Now this is based on assumptions, onto which I will add some more, but the
last sentence on that page still is quite valid.

So let's compare the 2 configurations above. I assumed 75MB/s recovery
speed for the RAID6 configuration, something I've seen in practice.
Basically that's half speed, something that will be lower during busy hours
and higher during off peak hours. I made the same assumption for Ceph with
a 10Gb/s network, assuming 500MB/s recovery/rebalancing speeds.
The rebalancing would have to compete with other replication traffic
(likely not much of an issue) and the actual speed/load of the individual
drives involved. Note that if we assume a totally quiet setup, where 100%
of all resources would be available for recovery, the numbers would of
course change, but NOT their ratios.
I went with the default disk lifetime of 3 years and 0 day replacement
time. The latter of course gives very unrealistic results for anything w/o
hotspare drive, but we're comparing 2 different beasts here.

So that all said, the results of that page that make sense in this
comparison are the RAID6 +1 hotspare numbers. As in, how likely is a 3rd
drive failure in the time before recovery is complete, the replacement
setting of 0 giving us the best possible number and since one would deploy
a Ceph cluster with sufficient extra capacity that's what we shall use.

For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
1 in 58497.9 ratio of data loss per year.
Alas for the 70 HDs in the comparable Ceph configuration we wind up with
just a 1 in 13094.31 ratio, which while still quite acceptable clearly
shows where this is going.

So am I completely off my wagon here?
How do people deal with this when potentially deploying hundreds of disks
in a single cluster/pool?

I mean, when we get to 600 disks (and that's just one rack full, OK,
maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
servers (or 72 disks per 4U if you're happy with killing another drive when
replacing a faulty one in that Supermicro contraption), that ratio is down
to 1 in 21.6 which is way worse than that 8-disk RAID5 I mentioned up there.
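
A rough sketch of why the odds drop so quickly with disk count. The model is an assumption made for illustration, not the one used by the memset calculator: disks fail independently at a constant annual failure rate (`afr`, a hypothetical 5%), recovery takes a fixed window (`window_hours`, a hypothetical 12 hours), and any three overlapping failures are counted as data loss. With a fixed window, the rate grows roughly with the cube of the disk count.

```python
HOURS_PER_YEAR = 24 * 365.25


def annual_loss_rate(n_disks, afr, window_hours):
    """Approximate yearly rate of three overlapping disk failures.

    Assumes independent failures at a constant rate and a fixed
    recovery window; both are simplifications for illustration.
    """
    window_years = window_hours / HOURS_PER_YEAR
    first = n_disks * afr                          # any disk fails during the year
    second = (n_disks - 1) * afr * window_years    # another fails inside the window
    third = (n_disks - 2) * afr * window_years     # and a third one as well
    return first * second * third


if __name__ == "__main__":
    # Disk counts from the thread; afr and window_hours are hypothetical.
    for n in (12, 70, 600):
        rate = annual_loss_rate(n, afr=0.05, window_hours=12)
        print(f"{n:4d} disks: roughly 1 in {1 / rate:,.0f} per year")
```

The absolute numbers mean little here. The sketch also ignores that a larger Ceph cluster recovers faster, and that three failed disks only cause loss if they actually share data.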

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Wolfgang Hennerbichler
2013-12-19 08:53:58 UTC
Hello,

Although I don't know much about this topic, I believe that Ceph erasure
coding will probably solve a lot of these issues, with some speed
tradeoff. With erasure coding the redundant data eats way less disk
capacity, so you could use a higher level of redundancy at a lower cost
in disk usage.

Wolfgang



--
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbichler at risc-software.at
http://www.risc-software.at
Mariusz Gronczewski
2013-12-19 11:12:13 UTC
On 2013-12-19 at 17:39:54,
Christian Balzer <chibi at gol.com> wrote:

>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the
> same redundancy and resilience for disk failures (excluding other
> factors) as my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node,
> 2 global hotspares, thus 4 OSDs) the "Ceph way" one would need
> something like 6 nodes with 10 3TB HDs each, 3 way replication (to
> protect against dual disk failures) to get the similar capacity and a
> 7th identical node to allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught by a dual
> disk failure ever again", as that happened twice to me, once with a
> RAID5 and the expected consequences, once with a RAID10 where I got
> lucky (8 disks total each time).



The thing is, in the default config each copy of the data is on a different
physical machine, to allow for maintenance and hardware failures.

In that case, losing 3 disks in one node is much better in a 6-node
cluster than in a 2-node cluster, as the data transfer needed for recovery
is only 1/6th of your dataset. Time to recovery is also much
shorter, as you need to read only 3TB of data from the whole cluster, not
3TB * 9 disks as with RAID6.

The first setup (RAID6 nodes) saves you from "3 disks in different machines
are dead" at the cost of much of your IO and a long recovery time.

The second setup (plain Ceph) has the potential to recover much quicker, as
it only needs to transfer 3TB of data per disk failure to recover to a clean
state, compared to 3TB x 9 per RAID disk. Also, the impact of one dead node
is vastly lower.

Basically, the first case is better when disks drop dead at exactly the same
time, and the second one is better when disks drop within a few hours of
each other.
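
The arithmetic behind that comparison can be sketched as follows. The 3TB disk size and the 9 surviving data disks come from the figures above; the 59 surviving OSDs (6 nodes x 10 disks minus the failed one) and the perfectly even spread of replicas are assumptions made for illustration.

```python
DISK_TB = 3  # disk size used throughout the thread


def raid6_rebuild_read_tb(disks_read, disk_tb=DISK_TB):
    """Data read to rebuild one RAID6 member: every disk read in full."""
    return disks_read * disk_tb


def ceph_read_per_osd_tb(surviving_osds, disk_tb=DISK_TB, fullness=1.0):
    """Data each surviving OSD contributes when one OSD is lost.

    Assumes the lost OSD's replicas are spread evenly over the others,
    which real CRUSH placement only approximates.
    """
    return disk_tb * fullness / surviving_osds


if __name__ == "__main__":
    print(f"RAID6 rebuild reads {raid6_rebuild_read_tb(9)} TB in total")
    print(f"Ceph recovery reads {DISK_TB} TB in total, "
          f"about {ceph_read_per_osd_tb(59):.3f} TB per surviving OSD")
```

The per-OSD share is what makes the shorter recovery window plausible: the work is done in parallel instead of funnelling through one array.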


> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of
> disks in a single cluster/pool?
>
> I mean, when we get to 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disks per 4U if you're happy with killing another drive
> when replacing a faulty one in that Supermicro contraption), that
> ratio is down to 1 in 21.6 which is way worse than that 8-disk RAID5 I
> mentioned up there.
>

That problem will only occur if you really want to have all those 600
disks in one pool and it so happens that 3 drives in different servers
die unrecoverably within the same very short time interval, which is
unlikely. But with 60 disks per enclosure, RAIDing them into 4-5 groups
probably makes more sense than running 60 OSDs, just from a memory/CPU
usage standpoint.

From my experience disks rarely "just die". More often a disk either starts
to have bad blocks and write errors, or its performance degrades and it
starts spewing media errors (which usually means you can
recover 90%+ of the data from it if you need to, without using anything more
than ddrescue). That means Ceph can access most of the data for recovery
and recover just those few missing blocks.

Each pool consists of many PGs, and for a PG to fail all disks holding it
have to be hit. So in the worst case you will most likely just lose access
to a small part of the data (the PGs that, out of 600 disks, happened to be
on those 3), not everything that is on a given array.
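
To put a rough number on "small part", here is a sketch that assumes each PG lands on a uniformly random set of 3 distinct OSDs. That is a simplification: real CRUSH placement is constrained by hosts and racks. The PG count of 32768 is a hypothetical value chosen for the example.

```python
from math import comb


def pgs_on_failed_triple(n_osds, n_pgs, replicas=3):
    """Return (expected PGs fully on the failed OSDs, chance of at least one).

    Assumes uniformly random placement of each PG on `replicas` distinct
    OSDs, ignoring CRUSH failure-domain rules.
    """
    triples = comb(n_osds, replicas)           # possible OSD sets for one PG
    expected = n_pgs / triples                 # mean number of PGs on one given set
    p_any = 1 - (1 - 1 / triples) ** n_pgs     # at least one PG on that set
    return expected, p_any


if __name__ == "__main__":
    expected, p_any = pgs_on_failed_triple(n_osds=600, n_pgs=32768)
    print(f"expected PGs lost: {expected:.6f}")
    print(f"chance that any PG is lost: {p_any:.6f}")
```

Under these assumptions, three simultaneous failures out of 600 disks usually hit no PG at all, and when they do, only the objects in the affected PGs are lost.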

And again, that's only in case those disks die at exactly the same moment,
with no time for recovery. Even 60 minutes between failures will let most of
the data replicate. And in the worst case, there is always a data recovery
service. And backups.




--
Mariusz Gronczewski, Administrator

Efigence Sp. z o. o.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski at efigence.com
<mailto:mariusz.gronczewski at efigence.com>
Wido den Hollander
2013-12-19 11:42:15 UTC
On 12/19/2013 09:39 AM, Christian Balzer wrote:
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
>

I'd suggest using different vendors for the disks, so that means you'll
probably be mixing Seagate and Western Digital in such a setup.

In this case you can rule out batch issues with disks, and the
likelihood of the same disks failing at the same time becomes smaller as
well.

Also, make sure that you define your crushmap so that replicas never end up
on the same physical host and, if possible, not in the same cabinet/rack.

I would never run with 60 drives in a single machine in a Ceph cluster;
I'd suggest you use more machines with fewer disks per machine.

> I mean, when we get to 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disks per 4U if you're happy with killing another drive when
> replacing a faulty one in that Supermicro contraption), that ratio is down
> to 1 in 21.6 which is way worse than that 8-disk RAID5 I mentioned up there.
>
> Regards,
>
> Christian
>


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
Gregory Farnum
2013-12-19 15:20:16 UTC
On Thu, Dec 19, 2013 at 12:39 AM, Christian Balzer <chibi at gol.com> wrote:
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
>
> I mean, when we get to 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disks per 4U if you're happy with killing another drive when
> replacing a faulty one in that Supermicro contraption), that ratio is down
> to 1 in 21.6 which is way worse than that 8-disk RAID5 I mentioned up there.

I don't know what assumptions that probability calculator is making
(and I think they're overly aggressive about the 3x replication; at
least, if you're seeing 1 in 21.6, that doesn't match previous numbers
I've seen), but yes: as you get larger and larger numbers of disks,
your probabilities of failure go way up. This is a thing that people
with large systems deal with. With the tradeoffs that Ceph makes, you get
about the same mean-time-to-failure as a collection of RAID systems of
equivalent size (recovery times are much shorter, but more disks are
involved whose failure can cause data loss), but you lose much less
data in any given incident.
As Wolfgang mentioned, erasure coded pools will handle this better
because they can tolerate much larger failure counts at a reasonable
disk overhead.
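
A small sketch of that overhead argument. The k=10, m=4 profile is a hypothetical example, not a recommendation from this thread; the formulas are the standard ones for replication and for k+m erasure codes.

```python
def replication(copies):
    """Failures tolerated and raw space per usable byte for N-way replication."""
    return {"tolerated_failures": copies - 1, "raw_per_usable": float(copies)}


def erasure_code(k, m):
    """Same figures for an erasure code with k data and m coding chunks."""
    return {"tolerated_failures": m, "raw_per_usable": (k + m) / k}


if __name__ == "__main__":
    for name, scheme in (("3x replication", replication(3)),
                         ("EC k=10, m=4", erasure_code(10, 4))):
        print(f"{name}: survives {scheme['tolerated_failures']} failures, "
              f"{scheme['raw_per_usable']:.1f}x raw space")
```

The price for the lower overhead is the speed tradeoff mentioned earlier in the thread: reads and recovery have to touch k chunks instead of one replica.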
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Kyle Bader
2013-12-20 21:37:18 UTC
Using your data as inputs to the Ceph reliability calculator [1]
results in the following:

Disk Modeling Parameters
size: 3TiB
FIT rate: 826 (MTBF = 138.1 years)
NRE rate: 1.0E-16
RAID parameters
replace: 6 hours
recovery rate: 500MiB/s (100 minutes)
NRE model: fail
object size: 4MiB

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage      durability  PL(site)    PL(copies)  PL(NRE)     PL(rep)     loss/PiB
-----------  ----------  ----------  ----------  ----------  ----------  ----------
RAID-6: 9+2  6-nines     0.000e+00   2.763e-10   0.000011%   0.000e+00   9.317e+07


Disk Modeling Parameters
size: 3TiB
FIT rate: 826 (MTBF = 138.1 years)
NRE rate: 1.0E-16
RADOS parameters
auto mark-out: 10 minutes
recovery rate: 50MiB/s (40 seconds/drive)
osd fullness: 75%
declustering: 1100 PG/OSD
NRE model: fail
object size: 4MB
stripe length: 1100

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage      durability  PL(site)    PL(copies)  PL(NRE)     PL(rep)     loss/PiB
-----------  ----------  ----------  ----------  ----------  ----------  ----------
RADOS: 3 cp  10-nines    0.000e+00   5.232e-08   0.000116%   0.000e+00   6.486e+03

[1] https://github.com/ceph/ceph-tools/tree/master/models/reliability
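
For readers unfamiliar with FIT rates, the conversion behind the "FIT rate: 826 (MTBF = 138.1 years)" line can be sketched like this. FIT is failures per 10^9 device-hours; the 365.25-day year is an assumption that happens to reproduce the 138.1 figure, and the tool itself may define it differently.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumed year length
FIT_HOURS = 1e9               # FIT = failures per 10**9 device-hours


def fit_to_mtbf_years(fit):
    """Mean time between failures, in years, for a given FIT rate."""
    return FIT_HOURS / fit / HOURS_PER_YEAR


def fit_to_afr(fit):
    """Approximate annual failure rate (fraction of drives per year)."""
    return fit * HOURS_PER_YEAR / FIT_HOURS


if __name__ == "__main__":
    print(f"MTBF: {fit_to_mtbf_years(826):.1f} years")
    print(f"AFR:  {fit_to_afr(826):.2%}")
```

Note that this failure rate is far more optimistic than the 3-year disk lifetime used with the RAID calculator earlier in the thread, so the two sets of results are not directly comparable.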

--

Kyle