Discussion:
[ceph-users] degraded PGs when adding OSDs
Simon Ironside
2018-02-08 22:38:43 UTC
Hi Everyone,

I recently added an OSD to an active+clean Jewel (10.2.3) cluster and
was surprised to see a peak of 23% objects degraded. Surely this should
be at or near zero and the objects should show as misplaced?

I've searched and found Chad William Seys' thread from 2015 but didn't
see any conclusion that explains this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003355.html
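
In case the numbers are useful, this is roughly how I've been watching it.
It's a quick Python sketch around the ceph CLI; the degraded_ratio and
misplaced_ratio field names under pgmap in the JSON status output are from
memory, so treat them as a guess (Ceph also omits them when they're zero,
hence the defaults):

#!/usr/bin/env python
# Quick sketch: poll 'ceph -s' JSON output and print the degraded vs
# misplaced object ratios while recovery/backfill runs. Ctrl-C to stop.
# Assumes the 'ceph' CLI is on PATH; the pgmap field names are a guess
# and are omitted by Ceph when zero, hence the .get() defaults.
import json
import subprocess
import time

def ratios():
    out = subprocess.check_output(["ceph", "-s", "-f", "json"]).decode()
    pgmap = json.loads(out).get("pgmap", {})
    return pgmap.get("degraded_ratio", 0.0), pgmap.get("misplaced_ratio", 0.0)

while True:
    degraded, misplaced = ratios()
    print("degraded: %.2f%%  misplaced: %.2f%%" % (degraded * 100, misplaced * 100))
    time.sleep(10)
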

Thanks,
Simon.
Janne Johansson
2018-02-09 09:05:33 UTC
Post by Simon Ironside
Hi Everyone,
I recently added an OSD to an active+clean Jewel (10.2.3) cluster and was
surprised to see a peak of 23% objects degraded. Surely this should be at
or near zero and the objects should show as misplaced?
I've searched and found Chad William Seys' thread from 2015 but didn't
see any conclusion that explains this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003355.html
I agree. I've always viewed it like this: you have three copies of a PG,
you add a new OSD, and CRUSH decides one of those copies should now live
on the new OSD instead of on one of the three older ones. Ceph simply
stops caring about the old copy, creates a new, empty copy of the PG on
the new OSD, and while data is synced into it that copy is "behind" until
the sync completes, but it (and the two remaining copies) are correctly
placed for the new CRUSH map. Misplaced would probably be a more natural
way of reporting it, at least if the now-abandoned copy were still being
updated while the sync runs, but I don't think it is; it gets orphaned
rather quickly once the new OSD kicks in.

I guess this design choice boils down to "being able to handle someone
adding more OSDs to a cluster that is close to getting full", at the
expense of "discarding one or more of the old copies and scaring the
admin as if there were a huge issue when they're just adding one or more
shiny new OSDs".
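
Something like this shows it while it lasts. It's a rough sketch, assuming
'ceph pg dump -f json' still exposes a pg_stats list with pgid, state, up
and acting fields: any PG whose "up" set (where CRUSH now wants the copies)
differs from its "acting" set (where complete copies currently live) is one
of those remapped but not yet synced PGs:

#!/usr/bin/env python
# Rough sketch: list PGs whose 'up' set (where CRUSH wants the copies after
# the map change) differs from their 'acting' set (where complete copies
# actually are right now). Field names are assumptions from memory.
import json
import subprocess

out = subprocess.check_output(["ceph", "pg", "dump", "-f", "json"]).decode()
for pg in json.loads(out).get("pg_stats", []):
    if pg.get("up") != pg.get("acting"):
        print("%s %s up=%s acting=%s"
              % (pg.get("pgid"), pg.get("state"), pg.get("up"), pg.get("acting")))
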
--
May the most significant bit of your life be positive.
Simon Ironside
2018-02-11 22:51:08 UTC
Post by Simon Ironside
Hi Everyone,
I recently added an OSD to an active+clean Jewel (10.2.3) cluster
and was surprised to see a peak of 23% objects degraded. Surely this
should be at or near zero and the objects should show as misplaced?
I've searched and found Chad William Seys' thread from 2015 but didn't
see any conclusion that explains this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003355.html
Post by Janne Johansson
I agree. I've always viewed it like this: you have three copies of a PG,
you add a new OSD, and CRUSH decides one of those copies should now live
on the new OSD instead of on one of the three older ones. Ceph simply
stops caring about the old copy, creates a new, empty copy of the PG on
the new OSD, and while data is synced into it that copy is "behind" until
the sync completes, but it (and the two remaining copies) are correctly
placed for the new CRUSH map. Misplaced would probably be a more natural
way of reporting it, at least if the now-abandoned copy were still being
updated while the sync runs, but I don't think it is; it gets orphaned
rather quickly once the new OSD kicks in.
I guess this design choice boils down to "being able to handle someone
adding more OSDs to a cluster that is close to getting full", at the
expense of "discarding one or more of the old copies and scaring the
admin as if there were a huge issue when they're just adding one or more
shiny new OSDs".
It certainly does scare me, especially as this particular cluster is
size=2, min_size=1.

My worry is that I could experience a disk failure while adding a new
OSD and potentially lose data, whereas if the same disk failed when the
cluster was active+clean I wouldn't. That doesn't seem like a very safe
design choice, but perhaps the real answer is to use size=3.
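
For what it's worth, this is the sort of check I mean before deciding: a
rough sketch that just reports size/min_size for every pool, assuming
'ceph osd pool ls' and 'ceph osd pool get <pool> <key> -f json' behave the
way I remember (a JSON dict keyed by <key>):

#!/usr/bin/env python
# Rough sketch: report size/min_size for every pool so it's obvious which
# ones are still running at 2/1. Assumes the 'ceph' CLI is on PATH.
import json
import subprocess

def pool_get(pool, key):
    out = subprocess.check_output(
        ["ceph", "osd", "pool", "get", pool, key, "-f", "json"]).decode()
    return json.loads(out)[key]

pools = [p for p in subprocess.check_output(
    ["ceph", "osd", "pool", "ls"]).decode().splitlines() if p]
for pool in pools:
    print("%-24s size=%s min_size=%s"
          % (pool, pool_get(pool, "size"), pool_get(pool, "min_size")))
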

Reweighting an active OSD to 0 does the same thing on my cluster: the
objects go degraded when I'd expect them to show as misplaced.
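
To be concrete about what I tested, it was roughly the following. The use
of the CRUSH weight (rather than 'ceph osd reweight') is just how I
happened to do it, and osd.5 is a placeholder for whichever OSD you drain:

#!/usr/bin/env python
# Sketch of the test: set one OSD's CRUSH weight to 0 and then keep
# printing cluster status to see whether the moved objects are reported
# as degraded or misplaced. 'osd.5' is a placeholder.
import subprocess
import time

subprocess.check_call(["ceph", "osd", "crush", "reweight", "osd.5", "0"])
for _ in range(30):
    print(subprocess.check_output(["ceph", "-s"]).decode())
    time.sleep(10)
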

Thanks,
Simon.
Brad Hubbard
2018-02-11 23:21:53 UTC
Post by Simon Ironside
Hi Everyone,
I recently added an OSD to an active+clean Jewel (10.2.3) cluster
and was surprised to see a peak of 23% objects degraded. Surely this
should be at or near zero and the objects should show as misplaced?
I've searched and found Chad William Seys' thread from 2015 but didn't
see any conclusion that explains this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003355.html
Post by Janne Johansson
I agree. I've always viewed it like this: you have three copies of a PG,
you add a new OSD, and CRUSH decides one of those copies should now live
on the new OSD instead of on one of the three older ones. Ceph simply
stops caring about the old copy, creates a new, empty copy of the PG on
the new OSD, and while data is synced into it that copy is "behind" until
the sync completes, but it (and the two remaining copies) are correctly
placed for the new CRUSH map. Misplaced would probably be a more natural
way of reporting it, at least if the now-abandoned copy were still being
updated while the sync runs, but I don't think it is; it gets orphaned
rather quickly once the new OSD kicks in.
I guess this design choice boils down to "being able to handle someone
adding more OSDs to a cluster that is close to getting full", at the
expense of "discarding one or more of the old copies and scaring the
admin as if there were a huge issue when they're just adding one or more
shiny new OSDs".
Post by Simon Ironside
It certainly does scare me, especially as this particular cluster is
size=2, min_size=1.
My worry is that I could experience a disk failure while adding a new OSD
and potentially lose data
You've already indicated you are willing to accept data loss by
configuring size=2, min_size=1.

Search for "2x replication: A BIG warning"
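
If you do move to three copies, the change itself is simple enough. A
rough sketch follows, not something to run blindly: it touches every pool,
it will trigger a lot of backfill, and the timing and pool list are yours
to choose:

#!/usr/bin/env python
# Rough sketch: raise every pool to size=3, min_size=2.
# This triggers significant data movement; run it deliberately.
# Assumes the 'ceph' CLI is on PATH.
import subprocess

pools = [p for p in subprocess.check_output(
    ["ceph", "osd", "pool", "ls"]).decode().splitlines() if p]
for pool in pools:
    subprocess.check_call(["ceph", "osd", "pool", "set", pool, "size", "3"])
    subprocess.check_call(["ceph", "osd", "pool", "set", pool, "min_size", "2"])
    print("updated %s to size=3, min_size=2" % pool)
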
whereas if the same disk failed when the cluster was active+clean I
wouldn't. That doesn't seem like a very safe design choice, but perhaps
the real answer is to use size=3.
Reweighting an active OSD to 0 does the same thing on my cluster: the
objects go degraded when I'd expect them to show as misplaced.
Thanks,
Simon.
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Cheers,
Brad